1
Molotkov I, Artomov M. Detecting biased validation of predictive models in the positive-unlabeled setting: disease gene prioritization case study. Bioinformatics Advances 2023; 3:vbad128. [PMID: 37745001 PMCID: PMC10517638 DOI: 10.1093/bioadv/vbad128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2023] [Revised: 08/13/2023] [Accepted: 09/12/2023] [Indexed: 09/26/2023]
Abstract
Motivation: Positive-unlabeled data consists of points with either positive or unknown labels. It is widespread in medical, genetic, and biological settings, creating a high demand for predictive positive-unlabeled models. The performance of such models is usually estimated using validation sets, assumed to be selected completely at random (SCAR) from known positive examples. For certain metrics, this assumption enables unbiased performance estimation when treating positive-unlabeled data as positive/negative. However, the SCAR assumption is often adopted without proper justification, simply for the sake of convenience. Results: We provide an algorithm that, under the weak assumption of a lower bound on the number of positive examples, can test for violation of the SCAR assumption. Applying it to the problem of gene prioritization for complex genetic traits, we illustrate that the SCAR assumption is often violated there, causing the inflation of performance estimates, which we refer to as validation bias. We estimate the potential impact of validation bias on performance estimation. Our analysis reveals that validation bias is widespread in gene prioritization data and can lead to significant overestimation of model performance. This finding elucidates the discrepancy between the reported good performance of models and their limited practical applications. Availability and implementation: Python code with examples of application of the validation bias detection algorithm is available at github.com/ArtomovLab/ValidationBias.
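The effect described in this abstract can be illustrated with a small simulation. The sketch below is not the authors' detection algorithm (that code lives in the repository named above); it only shows, under assumed score distributions and labeling probabilities, why recall measured on labeled positives matches the true recall under SCAR and is inflated when labeling favors high-scoring positives. The function names, the Beta(2, 2) score distribution, and the labeling rates are all hypothetical.

```python
import random

def recall_at_threshold(scores, threshold):
    """Fraction of the given positive scores at or above the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

def simulate(n_pos=20000, threshold=0.5, seed=0):
    """Compare true recall with recall estimated from labeled positives.

    Assumptions (illustrative only): model scores of true positives follow
    Beta(2, 2); 20% of positives are labeled under SCAR; under the biased
    scheme the labeling probability grows linearly with the model score.
    """
    rng = random.Random(seed)
    pos_scores = [rng.betavariate(2, 2) for _ in range(n_pos)]
    true_recall = recall_at_threshold(pos_scores, threshold)

    # SCAR: every positive has the same chance of being labeled.
    scar_labeled = [s for s in pos_scores if rng.random() < 0.2]

    # SCAR violated: positives that score higher are labeled more often.
    biased_labeled = [s for s in pos_scores if rng.random() < 0.4 * s]

    return (true_recall,
            recall_at_threshold(scar_labeled, threshold),
            recall_at_threshold(biased_labeled, threshold))

true_recall, scar_recall, biased_recall = simulate()
print(f"true={true_recall:.3f} scar={scar_recall:.3f} biased={biased_recall:.3f}")
```

In this toy setup the SCAR estimate stays close to the true recall, while the biased validation set reports a noticeably higher value, which is the kind of inflation the abstract calls validation bias.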
Affiliation(s)
- Ivan Molotkov
  - The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children’s Hospital, Columbus, OH, United States
  - Department of Pediatrics, The Ohio State University, Columbus, OH, United States
  - ITMO University, Saint Petersburg, Russia
- Mykyta Artomov
  - The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children’s Hospital, Columbus, OH, United States
  - Department of Pediatrics, The Ohio State University, Columbus, OH, United States
2
A co-training method based on parameter-free and single-step unlabeled data selection strategy with natural neighbors. Int J Mach Learn Cyb 2023. [DOI: 10.1007/s13042-023-01805-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/05/2023]
3
The Analysis of Relevant Gene Networks Based on Driver Genes in Breast Cancer. Diagnostics (Basel) 2022; 12:diagnostics12112882. [PMID: 36428940 PMCID: PMC9689550 DOI: 10.3390/diagnostics12112882] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/24/2022] [Revised: 11/08/2022] [Accepted: 11/14/2022] [Indexed: 11/22/2022] Open
Abstract
BACKGROUND: The occurrence and development of breast cancer has a strong correlation with a person's genetics. Therefore, it is important to analyze the genetic factors of breast cancer for future development of potential targeted therapies at the genetic level. METHODS: In this study, we complete an analysis of the relevant protein-protein interaction network relating to breast cancer. This includes three steps: selection of breast cancer-relevant genes using the mutual information method, reconstruction of the protein-protein interaction network based on the STRING database, and identification of vital genes by node centrality analysis. RESULTS: In gene selection, 230 breast cancer-relevant genes were chosen to reconstruct the protein-protein interaction network, and vital genes were identified by node centrality analyses. Node centrality analyses conducted with the top 10 and top 20 values of each metric found 19 and 39 statistically vital genes, respectively. To demonstrate the biological significance of these vital genes, we carried out survival analysis and DNA methylation analysis, and examined the prognosis in other cancer tissues and the RNA expression level in breast cancer. The results all supported the validity of the selected genes. CONCLUSIONS: These genes could provide a valuable reference in clinical treatment among breast cancer patients.
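The "top-k per centrality metric" step can be sketched in a few lines. The abstract does not name the centrality metrics or say how the per-metric lists were combined, so this example assumes degree and closeness centrality and takes the union of the top-k sets. The gene names G1 to G6 and the edge list are hypothetical placeholders, not data from the paper or from STRING.

```python
from collections import deque

def degree_centrality(adj):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}

def closeness_centrality(adj):
    """Reachable-node count divided by the sum of BFS distances to them."""
    result = {}
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[src] = (len(dist) - 1) / total if total else 0.0
    return result

def top_k(scores, k):
    """Names of the k highest-scoring nodes; ties are broken alphabetically."""
    return set(sorted(scores, key=lambda v: (-scores[v], v))[:k])

# Hypothetical toy interaction network (undirected edges).
edges = [("G1", "G2"), ("G1", "G3"), ("G1", "G4"),
         ("G2", "G3"), ("G4", "G5"), ("G5", "G6")]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

# Assumed combination rule: union of the per-metric top-k sets.
vital = top_k(degree_centrality(adj), 2) | top_k(closeness_centrality(adj), 2)
print(sorted(vital))
```

Taking the union explains how a top-10 cut over several metrics can yield more than 10 genes, although the paper's exact procedure may differ.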
4
Kikkawa R, Kajita H, Imanishi N, Aiso S, Bise R. Unsupervised Body Hair Detection by Positive-Unlabeled Learning in Photoacoustic Image. Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference 2021; 2021:3349-3352. [PMID: 34891957 DOI: 10.1109/embc46164.2021.9630720] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
Photoacoustic (PA) imaging is a new imaging technology that can non-invasively visualize blood vessels and body hair in 3D. It is useful in cosmetic surgery for detecting body hair and computing metrics such as the number and thicknesses of hairs. Previous supervised body hair detection methods often do not work if the imaging conditions change from training data. We propose an unsupervised hair detection method. Hair samples were automatically extracted from unlabeled samples using prior knowledge about spatial structure. If hair (positive) samples and unlabeled samples are obtained, Positive Unlabeled (PU) learning becomes possible. PU methods can learn a binary classifier from positive samples and unlabeled samples. The advantage of the proposed method is that it can estimate an appropriate decision boundary in accordance with the distribution of the test data. Experimental results using real PA data demonstrate that the proposed approach effectively detects body hairs.
5
Saikia SJ, Nirmala SR. Identification of disease genes and assessment of eye-related diseases caused by disease genes using JMFC and GDLNN. Comput Methods Biomech Biomed Engin 2021; 25:359-370. [PMID: 34384296 DOI: 10.1080/10255842.2021.1955358] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/09/2023]
Abstract
Early detection of disease genes helps humans to recover from certain gene-related diseases, like genetic eye diseases. This work identifies the possibility of eye diseases for the disease genes utilizing a Gaussian-activation function (G)-centric deep learning neural network (GDLNN) model. In this work, human genes are selected by computing structural similarity, and genes are clustered as disease genes and normal genes by using the JMFC clustering algorithm. Levy flight and Crossover and Mutation (LCM) centric Chicken Swarm Optimization (LCM-CSO) is employed for feature selection, and GDLNN classifies the eye-related diseases for the input genes using the selected features.
Affiliation(s)
- Samar Jyoti Saikia
  - Department of Electronics and Communication Engineering, Gauhati University, Guwahati, Assam, India
  - Department of Electronics and Communication Engineering, Assam Don Bosco University, Guwahati, Assam, India
- S R Nirmala
  - Department of Electronics and Communication Engineering, Gauhati University, Guwahati, Assam, India
  - School of Electronics and Communication Engineering, KLE Technological University, Hubli, Karnataka, India
6
Moturi S, Rao SNT, Vemuru S. Grey wolf assisted dragonfly-based weighted rule generation for predicting heart disease and breast cancer. Comput Med Imaging Graph 2021; 91:101936. [PMID: 34218121 DOI: 10.1016/j.compmedimag.2021.101936] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/05/2020] [Revised: 01/06/2021] [Accepted: 05/07/2021] [Indexed: 11/29/2022]
Abstract
Disease prediction plays a significant role in the life of people, as predicting the threat of diseases is necessary for citizens to live life in a healthy manner. The current development of data mining schemes has offered several systems that concern disease prediction. Even though disease prediction systems offer many advantages, there are still many challenges that might limit their realistic use, such as the efficiency of prediction and information protection. This paper intends to develop an improved disease prediction model, which includes three phases: Weighted Coalesce rule generation, Optimized feature extraction, and Classification. At first, Coalesce rule generation is carried out after data transformation that involves normalization and sequential labeling. Here, rule generation is done based on the weights (priority level) assigned for each attribute by the expert. The support of each rule is multiplied with the proposed weighted function, and the resultant weighted support is compared with the minimum support for selecting the rules. Further, the obtained rule is subject to the optimal feature selection process. The hybrid classifier that merges Support Vector Machine (SVM) and Deep Belief Network (DBN) takes the role of classification, which characterizes whether the patient is affected with the disease or not. In fact, the optimized feature selection process depends on a new hybrid optimization algorithm linking the Grey Wolf Optimization (GWO) with Dragonfly Algorithm (DA), and hence the presented model is termed Grey Wolf Levy Updated-DA (GWU-DA). Here, the heart disease and breast cancer data are taken, where the efficiency of the proposed model is validated by comparing over the state-of-the-art models. From the analysis, the accuracy of the proposed GWU-DA model is 65.98%, 53.61%, 42.27%, 35.05%, 34.02%, 11.34%, 13.4%, 10.31%, 9.28% and 9.89% better than the CBA + CPAR, MKL + ANFIS, RF + EA, WCBA, IQR + KNN + PSO, NL-DA + SVM + DBN, AWFS-RA, HCS-RFRS, ADS-SM-DNN and OSSVM-HGSA models, respectively, at the 60th learning percentage.
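The weighted-support rule selection described in this abstract can be sketched as follows. The abstract does not give the weighting function, so this example assumes the mean of the expert weights of the attributes in a rule; the attribute names, weights, records, and threshold are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def support(rule, records):
    """Fraction of records that contain every (attribute, value) pair in the rule."""
    return sum(1 for r in records if rule <= r) / len(records)

def weighted_support(rule, records, weights):
    """Support scaled by an assumed weighting function (mean attribute weight)."""
    mean_weight = sum(weights[attr] for attr, _ in rule) / len(rule)
    return mean_weight * support(rule, records)

def select_rules(candidates, records, weights, min_support):
    """Keep the rules whose weighted support reaches the minimum support."""
    return [r for r in candidates
            if weighted_support(r, records, weights) >= min_support]

# Hypothetical records, each a set of (attribute, value) pairs.
records = [
    {("age", "high"), ("chol", "high")},
    {("age", "high"), ("chol", "low")},
    {("age", "high"), ("chol", "high")},
    {("age", "low"), ("chol", "high")},
]
# Hypothetical expert-assigned priority weights per attribute.
weights = {"age": 0.9, "chol": 0.6}

rule_a = frozenset({("age", "high"), ("chol", "high")})
rule_b = frozenset({("chol", "low")})
selected = select_rules([rule_a, rule_b], records, weights, min_support=0.3)
print(selected)
```

Here rule_a has support 2/4 and mean weight 0.75, giving a weighted support of 0.375, so it passes the 0.3 threshold; rule_b has support 1/4 and weight 0.6, giving 0.15, so it is dropped.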
Affiliation(s)
- Sireesha Moturi
  - Research Scholar, Computer Science and Engineering, KLEF, Green Fields, Vaddeswaram, Andhra Pradesh, 522502, India
- S N Tirumala Rao
  - Professor, Computer Science and Engineering, Narasaraopeta Engineering College, Narasaraopet, Guntur(Dt), Andhra Pradesh, India
- Srikanth Vemuru
  - Professor, Computer Science and Engineering, KLEF, Green Fields, Vaddeswaram, Andhra Pradesh, 522502, India
7
He M, Huang C, Liu B, Wang Y, Li J. Factor graph-aggregated heterogeneous network embedding for disease-gene association prediction. BMC Bioinformatics 2021; 22:165. [PMID: 33781206 PMCID: PMC8006390 DOI: 10.1186/s12859-021-04099-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 12/09/2020] [Accepted: 03/23/2021] [Indexed: 11/18/2022] Open
Abstract
Background: Exploring the relationship between disease and gene is of great significance for understanding the pathogenesis of disease and developing corresponding therapeutic measures. The prediction of disease-gene association by computational methods accelerates the process. Results: Many existing methods cannot fully utilize the multi-dimensional biological entity relationship to predict disease-gene association due to multi-source heterogeneous data. This paper proposes FactorHNE, a factor graph-aggregated heterogeneous network embedding method for disease-gene association prediction, which captures a variety of semantic relationships between the heterogeneous nodes by factorization. It produces different semantic factor graphs and effectively aggregates a variety of semantic relationships, using an end-to-end multi-perspective loss function to optimize the model. It then produces good node embeddings to predict disease-gene associations. Conclusions: Experimental verification and analysis show FactorHNE has better performance and scalability than the existing models. It also has good interpretability and can be extended to large-scale biomedical network data analysis.
Affiliation(s)
- Ming He
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, 518055, Guangdong, China
- Chen Huang
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, 518055, Guangdong, China
- Bo Liu
  - Center for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, Heilongjiang, China
- Yadong Wang
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, 518055, Guangdong, China
  - Center for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, Heilongjiang, China
- Junyi Li
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, 518055, Guangdong, China
8
Zhang L, Hu J, Xu Q, Li F, Rao G, Tao C. A semantic relationship mining method among disorders, genes, and drugs from different biomedical datasets. BMC Med Inform Decis Mak 2020; 20:283. [PMID: 33317518 PMCID: PMC7734713 DOI: 10.1186/s12911-020-01274-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 09/21/2020] [Accepted: 09/22/2020] [Indexed: 12/16/2022] Open
Abstract
BACKGROUND: Semantic web technology has been applied widely in the biomedical informatics field. Large numbers of biomedical datasets are available online in the resource description framework (RDF) format. Semantic relationship mining among genes, disorders, and drugs is widely used in, for example, precision medicine and drug repositioning. However, most of the existing studies focused on a single dataset. It is not easy to find the most current disorder-gene-drug relationships since they are distributed across heterogeneous datasets. How to mine their semantic relationships from different biomedical datasets is an important issue. METHODS: First, a variety of biomedical datasets were converted into RDF triple data; then, multisource biomedical datasets were integrated into a storage system using a data integration algorithm. Second, nine query patterns among genes, disorders, and drugs from different biomedical datasets were designed. Third, the gene-disorder-drug semantic relationship mining algorithm is presented. This algorithm can query the relationships among various entities from different datasets. RESULTS AND CONCLUSIONS: We focused on mining the putative and the most current disorder-gene-drug relationships about Parkinson's disease (PD). The results demonstrate that our method has significant advantages in mining and integrating multisource heterogeneous biomedical datasets. Twenty-five new relationships among the genes, disorders, and drugs were mined from four different datasets. The query results showed that most of them came from different datasets. The precision of the method increased by 2.51% compared to that of the multisource linked open data fusion method presented in the 4th International Workshop on Semantics-Powered Data Mining and Analytics (SEPDA 2019). Moreover, the number of query results increased by 7.7%, and the number of correct queries increased by 9.5%.
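A cross-dataset query pattern of the kind this abstract describes can be sketched with plain tuples standing in for RDF triples. This is an illustration only, not the paper's mining algorithm or any of its nine query patterns. The `ex:` identifiers, the predicate names `ex:associatedGene` and `ex:targets`, and both toy datasets are hypothetical.

```python
def match(triples, s=None, p=None, o=None):
    """Return triples that fit a (subject, predicate, object) pattern.

    None acts as a wildcard for that position.
    """
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

def disorder_gene_drug(disorder, gene_triples, drug_triples):
    """Join disorder->gene triples from one dataset with drug->gene triples from another."""
    results = []
    for _, _, gene in match(gene_triples, s=disorder, p="ex:associatedGene"):
        for drug, _, _ in match(drug_triples, p="ex:targets", o=gene):
            results.append((disorder, gene, drug))
    return results

# Hypothetical dataset A: disorder-gene associations.
dataset_a = [
    ("ex:DisorderX", "ex:associatedGene", "ex:GeneA"),
    ("ex:DisorderX", "ex:associatedGene", "ex:GeneB"),
]
# Hypothetical dataset B: drug-target relations.
dataset_b = [
    ("ex:Drug1", "ex:targets", "ex:GeneA"),
    ("ex:Drug2", "ex:targets", "ex:GeneC"),
]

links = disorder_gene_drug("ex:DisorderX", dataset_a, dataset_b)
print(links)
```

The join on the shared gene identifier is what lets a relationship surface even though neither dataset contains it on its own. A real system would typically express the same pattern as a SPARQL query over an integrated triple store.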
Affiliation(s)
- Li Zhang
  - School of Economics and Management, Tianjin University of Science and Technology, Tianjin, 300457 China
- Jiamei Hu
  - School of Economics and Management, Tianjin University of Science and Technology, Tianjin, 300457 China
- Qianzhi Xu
  - School of Economics and Management, Tianjin University of Science and Technology, Tianjin, 300457 China
- Fang Li
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin St Suite 600, Houston, TX 77030 USA
- Guozheng Rao
  - College of Intelligence and Computing, Tianjin University, Tianjin, 300350 China
  - Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin, 300350 China
- Cui Tao
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin St Suite 600, Houston, TX 77030 USA
9
Xiao R, Yu X, Shi R, Zhang Z, Yu W, Li Y, Chen G, Gao J. Ecosystem health monitoring in the Shanghai-Hangzhou Bay Metropolitan Area: A hidden Markov modeling approach. Environment International 2019; 133:105170. [PMID: 31629171 DOI: 10.1016/j.envint.2019.105170] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 04/29/2019] [Revised: 08/12/2019] [Accepted: 09/06/2019] [Indexed: 06/10/2023]
Abstract
Ecosystem health assessment is an important method for obtaining information on ecosystem conditions, and it plays a vital role in preserving and enhancing ecosystem health status. In addition, it provides useful information and knowledge for urban agglomeration development decision makers. However, ecological phenomena often vary considerably from one observation to the next, which makes it difficult to distinguish different states of ecosystem health. In this study, a hidden Markov model (HMM) was employed to simulate the internal-external correlations of ecosystem status by establishing the relationships between the internal ecological health level and the combined state of external observations. Based on the statistics and land use data in 2001, 2007 and 2013, the Vigor-Organization-Resilience (VOR) framework was employed to identify the ecosystem health in Shanghai-Hangzhou Bay Metropolitan (SHBM), in which the ecosystem health state was considered as a hidden state that could be estimated according to the conditions of vigor, organization and resilience. In addition, two parameter learning cases, including mathematical statistics and the extensible sequence method, were employed to solve the iterative convergence problem of parameters in short-time series of ecosystem health simulation. Results show that the HMM not only provides a descriptive ability comparable to that of the VOR model, but also can monitor ecosystem health at the optimal grid scale in SHBM. The combination of HMM and VOR greatly expands the spatiotemporal characteristics and provides a new research approach for the study of ecosystem health assessment of urban agglomerations.
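The idea of treating ecosystem health as a hidden state inferred from observed indicator states can be sketched with standard Viterbi decoding. This covers only the decoding step with hand-set parameters; the paper's parameter learning and its VOR discretization are not reproduced. The state names, the "high"/"low" observation symbols, and all probabilities below are hypothetical.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for an observation sequence."""
    # best[t][s]: probability of the best path that ends in state s at time t.
    best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for o in obs[1:]:
        scores, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] * trans_p[r][s])
            scores[s] = best[-1][prev] * trans_p[prev][s] * emit_p[s][o]
            ptr[s] = prev
        best.append(scores)
        back.append(ptr)
    # Trace back from the most probable final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for ptr in reversed(back[1:]):
        path.append(ptr[path[-1]])
    return path[::-1]

# Hypothetical two-level health model for a single grid cell.
states = ("healthy", "degraded")
start_p = {"healthy": 0.6, "degraded": 0.4}
trans_p = {"healthy": {"healthy": 0.7, "degraded": 0.3},
           "degraded": {"healthy": 0.4, "degraded": 0.6}}
emit_p = {"healthy": {"high": 0.8, "low": 0.2},
          "degraded": {"high": 0.3, "low": 0.7}}

# One combined indicator reading per survey year (2001, 2007, 2013).
observations = ["high", "low", "low"]
decoded = viterbi(observations, states, start_p, trans_p, emit_p)
print(decoded)
```

With only three time points per cell, as in the study period named in the abstract, the decoded path depends heavily on the assumed transition and emission probabilities, which is presumably why the abstract emphasizes parameter learning for short series.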
Affiliation(s)
- Rui Xiao
  - School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China
- Xiaoyu Yu
  - School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China
- Ruixing Shi
  - School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China
- Zhonghao Zhang
  - Institute of Urban Studies, School of Environmental and Geographical Sciences, Shanghai Normal University, Shanghai 200234, China
- Weixuan Yu
  - School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China
- Yansheng Li
  - School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China
- Guang Chen
  - Chongqing Survey Institute, Chongqing 401121, China
- Jun Gao
  - Institute of Urban Studies, School of Environmental and Geographical Sciences, Shanghai Normal University, Shanghai 200234, China