1
Miao Y, Hunter A, Georgilas I. Parameter Reduction and Optimisation for Point Cloud and Occupancy Mapping Algorithms. SENSORS 2021; 21:s21217004. [PMID: 34770311 PMCID: PMC8588047 DOI: 10.3390/s21217004] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/12/2021] [Revised: 10/13/2021] [Accepted: 10/17/2021] [Indexed: 12/03/2022]
Abstract
Occupancy mapping is widely used to generate volumetric 3D environment models from point clouds, informing a robotic platform which parts of the environment are free and which are not. The selection of the parameters that govern the point cloud generation and mapping algorithms affects both the process and the quality of the final map. Although previous studies have reported on optimising major parameter configurations, research on the process of identifying optimal parameter sets that achieve the best occupancy mapping performance remains limited. The current work aims to fill this gap with a two-step principled methodology that first identifies the most significant parameters by conducting Neighbourhood Component Analysis on all parameters and then optimises them using grid search with the area under the Receiver Operating Characteristic curve. This study is conducted on 20 data sets with specially designed targets, providing precise ground truths for evaluation purposes. The methodology is tested on OctoMap with point clouds created by applying StereoSGBM to the images from a stereo camera. The results clearly indicate that mapping parameters are more important than point cloud generation parameters. Moreover, an improvement of up to 15% in mapping performance can be achieved over the default parameters.
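The second step described in the abstract (grid search scored by the area under the ROC curve) can be sketched as follows. This is a minimal illustration only, not the authors' pipeline: `toy_occupancy_scores`, the parameter names `p_hit` and `p_miss`, and the toy voxel data are all hypothetical stand-ins for a real mapper and real ground truth, and the Neighbourhood Component Analysis step is not shown.

```python
from itertools import product
from math import log


def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney statistic (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def toy_occupancy_scores(observations, p_hit, p_miss):
    """Hypothetical stand-in for a mapper: sum log-odds updates per voxel."""
    l_hit = log(p_hit / (1.0 - p_hit))
    l_miss = log(p_miss / (1.0 - p_miss))
    return [hits * l_hit + misses * l_miss for hits, misses in observations]


def grid_search(grid, evaluate):
    """Evaluate every parameter combination and return (best_params, best_score)."""
    names = sorted(grid)
    best = None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if best is None or score > best[1]:
            best = (params, score)
    return best


# Toy data (assumed): (hits, misses) per voxel and ground-truth occupancy labels.
observations = [(3, 1), (1, 2), (2, 3), (1, 4), (0, 2), (2, 1)]
truth = [1, 1, 0, 0, 0, 1]
grid = {"p_hit": [0.6, 0.7, 0.9], "p_miss": [0.1, 0.3, 0.4]}


def evaluate(p_hit, p_miss):
    return auc(toy_occupancy_scores(observations, p_hit, p_miss), truth)


best_params, best_auc = grid_search(grid, evaluate)
```

Because the AUC depends only on how the voxels are ranked, it compares parameter settings without fixing an occupancy threshold first.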
2
Sudarshan M, Puli A, Subramanian L, Sankararaman S, Ranganath R. Contra: Contrarian statistics for controlled variable selection. PROCEEDINGS OF MACHINE LEARNING RESEARCH 2021; 130:1900-1908. [PMID: 34522887 PMCID: PMC8436172] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
The holdout randomization test (HRT) discovers a set of covariates most predictive of a response. Given the covariate distribution, HRTs can explicitly control the false discovery rate (FDR). However, if this distribution is unknown and must be estimated from data, HRTs can inflate the FDR. To alleviate the inflation of FDR, we propose the contrarian randomization test (CONTRA), which is designed explicitly for scenarios where the covariate distribution must be estimated from data and may even be misspecified. Our key insight is to use an equal mixture of two "contrarian" probabilistic models in determining the importance of a covariate. One model is fit with the real data, while the other is fit using the same data, but with the covariate being tested replaced with samples from an estimate of the covariate distribution. CONTRA is flexible enough to achieve a power of 1 asymptotically, can reduce the FDR compared to state-of-the-art CVS methods when the covariate distribution is misspecified, and is computationally efficient in high dimensions and large sample sizes. We further demonstrate the effectiveness of CONTRA on numerous synthetic benchmarks, and highlight its capabilities on a genetic dataset.
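The baseline that the abstract starts from, the holdout randomization test, can be sketched as below. This illustrates the HRT only, not CONTRA's mixture of two models. Several pieces are assumptions made for the toy example: the covariates are independent standard normals, so the "covariate distribution" is known exactly and null samples can be drawn without looking at the other covariates; `fit_independent_ols` is a simplified fit that is adequate only under that independence; and mean squared error on the holdout set is used as the test statistic.

```python
import random


def fit_independent_ols(X, y):
    """Per-covariate least-squares slopes (adequate only for independent covariates)."""
    d = len(X[0])
    return [
        sum(row[j] * t for row, t in zip(X, y)) / sum(row[j] ** 2 for row in X)
        for j in range(d)
    ]


def mse(coefs, X, y):
    """Mean squared error of the linear predictor on (X, y)."""
    return sum(
        (t - sum(c * v for c, v in zip(coefs, row))) ** 2 for row, t in zip(X, y)
    ) / len(y)


def hrt_p_value(coefs, X_hold, y_hold, j, sample_covariate, n_null, rng):
    """p-value for covariate j: how often resampled copies of it do at least as well."""
    observed = mse(coefs, X_hold, y_hold)
    at_least_as_good = 0
    for _ in range(n_null):
        # Replace column j with draws from the (assumed known) covariate distribution.
        X_null = [row[:j] + [sample_covariate(rng)] + row[j + 1:] for row in X_hold]
        if mse(coefs, X_null, y_hold) <= observed:
            at_least_as_good += 1
    return (1 + at_least_as_good) / (1 + n_null)


rng = random.Random(0)


def draw(n):
    """Toy data: covariate 0 drives the response, covariate 1 is irrelevant."""
    X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n)]
    y = [3.0 * row[0] + rng.gauss(0, 0.1) for row in X]
    return X, y


X_train, y_train = draw(200)
X_hold, y_hold = draw(200)
coefs = fit_independent_ols(X_train, y_train)
p_values = [
    hrt_p_value(coefs, X_hold, y_hold, j, lambda r: r.gauss(0, 1), 99, rng)
    for j in range(2)
]
```

In this sketch the null samples come from the true covariate distribution. The abstract's concern is the case where that distribution has to be estimated and may be misspecified, which is where the FDR guarantee of the plain HRT can break down.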
Affiliation(s)
- Rajesh Ranganath
- Courant Institute, New York University
- Center for Data Science, New York University
3
Zakaria MN. Validity and Reliability Aspects of a Newly Developed Questionnaire for Auditory Localization. J Int Adv Otol 2019; 15:182-183. [PMID: 30924773 PMCID: PMC6483440 DOI: 10.5152/iao.2019.5959] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/05/2023] Open
Affiliation(s)
- Mohd Normani Zakaria
- Audiology and Speech Pathology Programme, School of Health Sciences, Universiti Sains Malaysia, Kelantan, Malaysia
4
Gałan W, Bąk M, Jakubowska M. Host Taxon Predictor - A Tool for Predicting Taxon of the Host of a Newly Discovered Virus. Sci Rep 2019; 9:3436. [PMID: 30837511 PMCID: PMC6400966 DOI: 10.1038/s41598-019-39847-2] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 07/17/2017] [Accepted: 01/30/2019] [Indexed: 12/04/2022] Open
Abstract
Recent advances in metagenomics have provided a valuable alternative to culture-based approaches for better sampling of viral diversity. However, some newly identified viruses lack sequence similarity to any previously sequenced ones and cannot be easily assigned to their hosts. Here we present a bioinformatic approach to this problem. We developed classifiers capable of distinguishing eukaryotic viruses from phages, achieving almost 95% prediction accuracy. The classifiers are wrapped in the Host Taxon Predictor (HTP) software, written in Python and freely available at https://github.com/wojciech-galan/viruses_classifier. HTP's performance was then demonstrated on a collection of newly identified viral genomes and genome fragments. In summary, HTP is a culture- and alignment-free approach for distinguishing between phages and eukaryotic viruses. We have also shown that it is possible to extend our method further to go up the evolutionary tree and predict whether a virus can infect narrower taxa.
Affiliation(s)
- Wojciech Gałan
- Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University in Kraków, ul. Gronostajowa 7, 30-387, Kraków, Poland.
- Maciej Bąk
- Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University in Kraków, ul. Gronostajowa 7, 30-387, Kraków, Poland
- Małgorzata Jakubowska
- AGH University of Science and Technology, Faculty of Materials Science and Ceramics, al. Mickiewicza 30, 30-059, Kraków, Poland
5
Jung Y, El-Manzalawy Y, Dobbs D, Honavar VG. Partner-specific prediction of RNA-binding residues in proteins: A critical assessment. Proteins 2018; 87:198-211. [PMID: 30536635 PMCID: PMC6389706 DOI: 10.1002/prot.25639] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 05/01/2018] [Revised: 10/10/2018] [Accepted: 11/29/2018] [Indexed: 01/06/2023]
Abstract
RNA-protein interactions play essential roles in regulating gene expression. While some RNA-protein interactions are "specific", that is, the RNA-binding proteins preferentially bind to particular RNA sequence or structural motifs, others are "non-RNA specific." Deciphering the protein-RNA recognition code is essential for comprehending the functional implications of these interactions and for developing new therapies for many diseases. Because of the high cost of experimental determination of protein-RNA interfaces, there is a need for computational methods to identify RNA-binding residues in proteins. While most of the existing computational methods for predicting RNA-binding residues in RNA-binding proteins are oblivious to the characteristics of the partner RNA, there is growing interest in methods for partner-specific prediction of RNA binding sites in proteins. In this work, we assess the performance of two recently published partner-specific protein-RNA interface prediction tools, PS-PRIP, and PRIdictor, along with our own new tools. Specifically, we introduce a novel metric, RNA-specificity metric (RSM), for quantifying the RNA-specificity of the RNA binding residues predicted by such tools. Our results show that the RNA-binding residues predicted by previously published methods are oblivious to the characteristics of the putative RNA binding partner. Moreover, when evaluated using partner-agnostic metrics, RNA partner-specific methods are outperformed by the state-of-the-art partner-agnostic methods. We conjecture that either (a) the protein-RNA complexes in PDB are not representative of the protein-RNA interactions in nature, or (b) the current methods for partner-specific prediction of RNA-binding residues in proteins fail to account for the differences in RNA partner-specific versus partner-agnostic protein-RNA interactions, or both.
Affiliation(s)
- Yong Jung
- Bioinformatics and Genomics Graduate Program, Pennsylvania State University, University Park, Pennsylvania; Artificial Intelligence Research Laboratory, Pennsylvania State University, University Park, Pennsylvania; The Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania
- Yasser El-Manzalawy
- Artificial Intelligence Research Laboratory, Pennsylvania State University, University Park, Pennsylvania; Clinical and Translational Sciences Institute, Pennsylvania State University, University Park, Pennsylvania; College of Information Sciences and Technology, Pennsylvania State University, Pennsylvania
- Drena Dobbs
- Bioinformatics and Computational Biology Program, Iowa State University, Ames, Iowa; Department of Genetics, Development, and Cell Biology, Iowa State University, Ames, Iowa
- Vasant G Honavar
- Bioinformatics and Genomics Graduate Program, Pennsylvania State University, University Park, Pennsylvania; Artificial Intelligence Research Laboratory, Pennsylvania State University, University Park, Pennsylvania; Institute for Cyberscience, Pennsylvania State University, University Park, Pennsylvania; Clinical and Translational Sciences Institute, Pennsylvania State University, University Park, Pennsylvania; The Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania; College of Information Sciences and Technology, Pennsylvania State University, Pennsylvania
6
Inácio de Carvalho V, Carvalho M, Branscum A. Bayesian bootstrap inference for the receiver operating characteristic surface. Stat (Int Stat Inst) 2018. [DOI: 10.1002/sta4.211] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022]
Affiliation(s)
- Adam Branscum
- College of Public Health and Human Sciences, Oregon State University, Corvallis, Oregon
7
Fei T, Zhang T, Shi W, Yu T. Mitigating the adverse impact of batch effects in sample pattern detection. Bioinformatics 2018; 34:2634-2641. [PMID: 29506177 PMCID: PMC6061843 DOI: 10.1093/bioinformatics/bty117] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 09/29/2017] [Revised: 02/14/2018] [Accepted: 02/27/2018] [Indexed: 11/14/2022] Open
Abstract
Motivation: It is well known that batch effects exist in RNA-seq data and other profiling data. Although some methods do a good job adjusting for batch effects by modifying the data matrices, it is still difficult to remove the batch effects entirely. The remaining batch effect can cause artifacts in the detection of patterns in the data. Results: In this study, we consider the batch effect issue in the pattern detection among the samples, such as clustering, dimension reduction and construction of networks between subjects. Instead of adjusting the original data matrices, we design an adaptive method to directly adjust the dissimilarity matrix between samples. In simulation studies, the method achieved better results recovering true underlying clusters, compared to the leading batch effect adjustment method ComBat. In real data analysis, the method effectively corrected distance matrices and improved the performance of clustering algorithms. Availability and implementation: The R package is available at https://github.com/tengfei-emory/QuantNorm. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Teng Fei
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, USA
- Tengjiao Zhang
- School of Life Sciences and Technology, Tongji University, Shanghai, China
- Weiyang Shi
- Ministry of Education Key Laboratory of Marine Genetics and Breeding, College of Marine Life Sciences, Ocean University of China, Qingdao, China
- Tianwei Yu
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, USA
8
To Duc K. bcROCsurface: an R package for correcting verification bias in estimation of the ROC surface and its volume for continuous diagnostic tests. BMC Bioinformatics 2017; 18:503. [PMID: 29151019 PMCID: PMC5694622 DOI: 10.1186/s12859-017-1914-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 12/21/2016] [Accepted: 11/01/2017] [Indexed: 11/11/2022] Open
Abstract
Background: Receiver operating characteristic (ROC) surface analysis is usually employed to assess the accuracy of a medical diagnostic test when there are three ordered disease statuses (e.g. non-diseased, intermediate, diseased). In practice, verification bias can occur due to missingness of the true disease status and can lead to a distorted conclusion on diagnostic accuracy. In such situations, bias-corrected inference tools are required. Results: This paper introduces an R package, named bcROCsurface, which provides utility functions for verification bias-corrected ROC surface analysis. A shiny web application for verification bias correction in the estimation of the ROC surface has also been developed. Conclusion: bcROCsurface may become an important tool for the statistical evaluation of three-class diagnostic markers in the presence of verification bias. The R package, readme and example data are available on CRAN. The web interface enables users less familiar with R to evaluate the accuracy of diagnostic tests, and can be found at http://khanhtoduc.shinyapps.io/bcROCsurface_shiny/.
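To make "the ROC surface and its volume" concrete, the sketch below computes the usual empirical volume under the ROC surface (VUS) for three ordered classes when every subject's disease status is known: the fraction of triples, one test result per class, that come out in the correct order. It is written in Python rather than R and does not reproduce the package's verification-bias corrections; the function name and example values are illustrative assumptions, and ties are simply counted as incorrectly ordered.

```python
from itertools import product


def empirical_vus(class1, class2, class3):
    """Fraction of (class1, class2, class3) triples with strictly increasing results."""
    ordered = sum(1 for a, b, c in product(class1, class2, class3) if a < b < c)
    return ordered / (len(class1) * len(class2) * len(class3))


# Hypothetical test results for non-diseased, intermediate and diseased subjects.
perfect = empirical_vus([1, 2], [3, 4], [5, 6])
overlapping = empirical_vus([1, 4], [2, 5], [3, 6])
```

A marker that carries no information about the three classes gives a VUS near 1/6, the chance that three independent draws from one continuous distribution fall in a given order, so 1/6 plays the role that 0.5 plays for the two-class AUC.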
Affiliation(s)
- Khanh To Duc
- Department of Statistical Sciences, University of Padova, via C. Battisti, 241, Padova, 35121, Italy.
9
Liao P, Wu H, Yu T. ROC Curve Analysis in the Presence of Imperfect Reference Standards. STATISTICS IN BIOSCIENCES 2017; 9:91-104. [PMID: 28694878 PMCID: PMC5501420 DOI: 10.1007/s12561-016-9159-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 02/11/2016] [Revised: 05/27/2016] [Accepted: 07/04/2016] [Indexed: 10/21/2022]
Abstract
The receiver operating characteristic (ROC) curve is an important tool for the evaluation and comparison of predictive models when the outcome is binary. If the class membership of the outcomes is known, an ROC curve can be constructed for a model, and the ROC curve with the greater area under the curve (AUC) indicates better performance. However, in practice, imperfect reference standards often exist, in which the class membership of the data points is not fully determined. This situation is especially prevalent in high-throughput biomedical data because obtaining perfect reference standards for all data points is either too costly or technically impractical. To construct ROC curves for these data, the common practice is either to ignore the uncertainties in the references or to remove data points with high uncertainties. Such approaches may bias the ROC curves and generate misleading results in method evaluation. Here we present a framework to incorporate membership uncertainties into the construction of the ROC curve, termed the expected ROC or "eROC" curve. We develop an efficient procedure for the estimation of the eROC curve. The advantages of using eROC are demonstrated using simulated and real data.
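One simple way to fold membership uncertainty into an ROC curve is to let each data point count as a positive with weight p and as a negative with weight 1 - p, where p is its probability of being a true positive. The sketch below does exactly that. It is an illustrative assumption about what an "expected" ROC curve could look like, not the estimation procedure developed in the paper, and the function names and example numbers are hypothetical.

```python
def expected_roc(scores, prob_positive):
    """ROC points where each sample is positive with weight p and negative with 1 - p."""
    total_pos = sum(prob_positive)
    total_neg = len(prob_positive) - total_pos
    ranked = sorted(zip(scores, prob_positive), key=lambda sp: -sp[0])
    points = [(0.0, 0.0)]
    tp = fp = 0.0
    i = 0
    while i < len(ranked):
        j = i
        # Samples sharing a score cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1.0 - ranked[j][1]
            j += 1
        points.append((fp / total_neg, tp / total_pos))
        i = j
    return points


def area_under(points):
    """Trapezoidal area under a list of (fpr, tpr) points ordered by fpr."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0 for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


scores = [0.9, 0.8, 0.3, 0.1]
hard_auc = area_under(expected_roc(scores, [1, 1, 0, 0]))
soft_auc = area_under(expected_roc(scores, [0.9, 0.6, 0.4, 0.1]))
```

With probabilities of exactly 0 or 1 this reduces to the ordinary empirical ROC curve, so the weighted version neither discards uncertain points nor treats them as certain.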
Affiliation(s)
- Peizhou Liao
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, Georgia 30322, Tel.: +1(404)747-8400
- Hao Wu
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, Georgia 30322, Tel.: +1(404)727-8633
- Tianwei Yu
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, Georgia 30322, Tel.: +1(404)727-7671
10
MEG Connectivity and Power Detections with Minimum Norm Estimates Require Different Regularization Parameters. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2016; 2016:3979547. [PMID: 27092179 PMCID: PMC4820599 DOI: 10.1155/2016/3979547] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 10/15/2015] [Revised: 01/19/2016] [Accepted: 02/14/2016] [Indexed: 11/24/2022]
Abstract
Minimum Norm Estimation (MNE) is an inverse solution method widely used to reconstruct the source time series that underlie magnetoencephalography (MEG) data. MNE addresses the ill-posed nature of MEG source estimation through regularization (e.g., Tikhonov regularization). Selecting the best regularization parameter is a critical step. Generally, once set, it is common practice to keep the same coefficient throughout a study. However, it is yet to be known whether the optimal lambda for spectral power analysis of MEG source data coincides with the optimal regularization for source-level oscillatory coupling analysis. We addressed this question via extensive Monte-Carlo simulations of MEG data, where we generated 21,600 configurations of pairs of coupled sources with varying sizes, signal-to-noise ratio (SNR), and coupling strengths. Then, we searched for the Tikhonov regularization coefficients (lambda) that maximize detection performance for (a) power and (b) coherence. For coherence, the optimal lambda was two orders of magnitude smaller than the best lambda for power. Moreover, we found that the spatial extent of the interacting sources and SNR, but not the extent of coupling, were the main parameters affecting the best choice for lambda. Our findings suggest using less regularization when measuring oscillatory coupling compared to power estimation.
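The role of the regularization coefficient can be illustrated with the textbook Tikhonov-regularised minimum norm formula, x = Lᵀ(LLᵀ + λI)⁻¹b, where L is the lead field, b the sensor data and λ the coefficient. The sketch below is a toy under stated assumptions: a made-up 2-sensor, 3-source lead field, a single time sample, and no noise covariance or depth weighting. It is not the simulation pipeline used in the study.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def minimum_norm_estimate(leadfield, data, lam):
    """Tikhonov-regularised minimum norm: x = L^T (L L^T + lam * I)^-1 b."""
    m = len(leadfield)
    gram = [
        [
            sum(a * b for a, b in zip(leadfield[i], leadfield[j]))
            + (lam if i == j else 0.0)
            for j in range(m)
        ]
        for i in range(m)
    ]
    weights = solve(gram, data)
    n_sources = len(leadfield[0])
    return [sum(leadfield[i][k] * weights[i] for i in range(m)) for k in range(n_sources)]


def norm(v):
    return sum(x * x for x in v) ** 0.5


# Hypothetical lead field (2 sensors x 3 sources) and one sample of sensor data.
leadfield = [[1.0, 0.5, 0.0], [0.0, 0.5, 1.0]]
data = [1.0, 2.0]
weak = minimum_norm_estimate(leadfield, data, 0.01)
strong = minimum_norm_estimate(leadfield, data, 1.0)
```

Larger values of `lam` shrink the source estimate towards zero, which is the trade-off behind the abstract's finding that coherence and power detection favour different amounts of regularization.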
11
Schubert CM, Guennel T. Comparing Performance of Multiclass Classification Systems with ROC Manifolds: When Volume and Correct Classification Fails. COMMUN STAT-SIMUL C 2014. [DOI: 10.1080/03610918.2013.794284] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/25/2022]
12
Yu T, Jones DP. Improving peak detection in high-resolution LC/MS metabolomics data using preexisting knowledge and machine learning approach. Bioinformatics 2014; 30:2941-8. [PMID: 25005748 DOI: 10.1093/bioinformatics/btu430] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.7] [Indexed: 12/30/2022]
Abstract
MOTIVATION: Peak detection is a key step in the preprocessing of untargeted metabolomics data generated from high-resolution liquid chromatography-mass spectrometry (LC/MS). The common practice is to use filters with predetermined parameters to select peaks in the LC/MS profile. This rigid approach can cause suboptimal performance when the choice of peak model and parameters does not suit the data characteristics. RESULTS: Here we present a method that learns directly from various data features of the extracted ion chromatograms (EICs) to differentiate true peak regions from noise regions in the LC/MS profile. It utilizes the knowledge of known metabolites, as well as robust machine learning approaches. Unlike currently available methods, this new approach does not assume a parametric peak shape model and allows maximum flexibility. We demonstrate the superiority of the new approach using real data. Because matching to known metabolites entails uncertainties and cannot be considered a gold standard, we also developed a probabilistic receiver-operating characteristic (pROC) approach that can incorporate uncertainties. AVAILABILITY AND IMPLEMENTATION: The new peak detection approach is implemented as part of the apLCMS package available at http://web1.sph.emory.edu/apLCMS/. CONTACT: tyu8@emory.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Tianwei Yu
- Department of Biostatistics and Bioinformatics, Rollins School of Public Health and Department of Medicine, School of Medicine, Emory University, Atlanta, GA 30322, USA
- Dean P Jones
- Department of Biostatistics and Bioinformatics, Rollins School of Public Health and Department of Medicine, School of Medicine, Emory University, Atlanta, GA 30322, USA
13
Gou J, Zhao Y, Wei Y, Wu C, Zhang R, Qiu Y, Zeng P, Tan W, Yu D, Wu T, Hu Z, Lin D, Shen H, Chen F. Stability SCAD: a powerful approach to detect interactions in large-scale genomic study. BMC Bioinformatics 2014; 15:62. [PMID: 24580776 PMCID: PMC3984751 DOI: 10.1186/1471-2105-15-62] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/30/2013] [Accepted: 02/18/2014] [Indexed: 11/25/2022] Open
Abstract
Background: Evidence suggests that common complex diseases may be partially due to SNP-SNP interactions, but the detection of such interactions is yet to be fully established in high-dimensional small-sample (small-n-large-p) studies. A number of penalized regression techniques are gaining popularity within the statistical community, and are now being applied to detect interactions. These techniques tend to over-fit and are prone to false positives. The recently developed stability least absolute shrinkage and selection operator (SLASSO) has been used to control the family-wise error rate, but often at the expense of power (and thus false negative results). Results: Here, we propose an alternative stability selection procedure known as stability smoothly clipped absolute deviation (SSCAD). Briefly, this method applies a smoothly clipped absolute deviation (SCAD) algorithm to multiple sub-samples, and then identifies a cluster ensemble of interactions across the sub-samples. The proposed method was compared with SLASSO and two kinds of traditional penalized methods by intensive simulation. The simulation revealed higher power and a lower false discovery rate (FDR) with SSCAD. An analysis using the new method on the previously published GWAS of lung cancer confirmed all significant interactions identified with SLASSO, and identified two additional interactions not reported with the SLASSO analysis. Conclusions: Based on the results obtained in this study, SSCAD appears to be a powerful procedure for the detection of SNP-SNP interactions in large-scale genomic data.
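The sub-sampling idea behind stability selection can be sketched as follows: run a base selector on many random half-samples and keep the features that are chosen often enough. The sketch is deliberately simplified and rests on assumptions. The base selector `marginal_selector` is a stand-in that ranks features by their covariance with the response, not the SCAD penalty used in the paper. The final step is a plain frequency threshold rather than the cluster ensemble step the abstract describes, and the data are synthetic.

```python
import random


def marginal_selector(X, y, top_k):
    """Stand-in base selector (NOT SCAD): keep the top_k features by |covariance with y|."""
    n = len(y)
    y_mean = sum(y) / n
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        c_mean = sum(col) / n
        scores.append(abs(sum((c - c_mean) * (t - y_mean) for c, t in zip(col, y))))
    ranked = sorted(range(len(scores)), key=lambda j: -scores[j])
    return set(ranked[:top_k])


def stability_selection(X, y, selector, n_subsamples, threshold, rng):
    """Return (stable feature indices, selection frequencies) over random half-samples."""
    n = len(y)
    counts = [0] * len(X[0])
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), n // 2)
        chosen = selector([X[i] for i in idx], [y[i] for i in idx])
        for j in chosen:
            counts[j] += 1
    freqs = [c / n_subsamples for c in counts]
    return [j for j, f in enumerate(freqs) if f >= threshold], freqs


# Synthetic data (assumed): features 0 and 3 drive the response, the rest are noise.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(300)]
y = [5.0 * row[0] - 5.0 * row[3] + rng.gauss(0, 0.1) for row in X]
stable, freqs = stability_selection(
    X, y, lambda Xs, ys: marginal_selector(Xs, ys, 2), 50, 0.6, rng
)
```

Requiring a feature to be selected in most sub-samples filters out features that enter a penalized model only by chance in one particular sample, which is how the approach limits false positives.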
Affiliation(s)
- Feng Chen
- Department of Epidemiology and Biostatistics and Ministry of Education (MOE) Key Lab for Modern Toxicology, School of Public Health, Nanjing Medical University, Nanjing, China.
14
Yu T, Peng H. Hierarchical clustering of high-throughput expression data based on general dependences. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013; 10:1080-1085. [PMID: 24334400 PMCID: PMC3905248 DOI: 10.1109/tcbb.2013.99] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Indexed: 06/03/2023]
Abstract
High-throughput expression technologies, such as gene expression arrays and liquid chromatography-mass spectrometry (LC-MS), measure thousands of features, i.e., genes or metabolites, on a continuous scale. In such data, both linear and nonlinear relations exist between features. Nonlinear relations can reflect critical regulation patterns in the biological system. However, they are not identified and utilized by traditional clustering methods based on linear associations. Clustering based on general dependences, i.e., both linear and nonlinear relations, is hampered by the high dimensionality and high noise level of the data. We developed a sensitive nonparametric measure of general dependence between (groups of) random variables in high dimensions. Based on this dependence measure, we developed a hierarchical clustering method. In simulation studies, the method outperformed correlation- and mutual information (MI)-based hierarchical clustering methods in clustering features with nonlinear dependences. We applied the method to a microarray data set measuring gene expression in a cell-cycle time series to show that it generates biologically relevant results. The R code is available at http://userwww.service.emory.edu/~tyu8/GDHC.
Affiliation(s)
- Tianwei Yu
- Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA
- Hesen Peng
- Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA