1
Fazli Khalaf Z, Liow JW, Nalliah S, Foong ALS. When Health Intersects with Gender and Sexual Diversity: Medical Students' Attitudes Towards LGBTQ Patients. Journal of Homosexuality 2023; 70:1763-1786. [PMID: 35285780 DOI: 10.1080/00918369.2022.2042662] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 06/03/2023]
Abstract
A central tenet of the health professions is that of equitable access to health care. However, disparities in equitable healthcare provision continue to be a challenge in many societies due to prejudices against the LGBTQ community. This study was aimed at exploring the attitudes of medical students toward LGBTQ patients in Malaysia. A qualitative approach was adopted to seek depth of understanding of clinical year medical students' perceptions and attitudes toward LGBTQ patients. Data were collected in 2018 through individual interviews and focus group discussions with a total of 29 participants, using a semi-structured question guideline. Purposive sampling comprised representation from the three major ethnic groups in Malaysia. Thematic analysis using NVivo highlighted three main themes: neutrality, in compliance with the Professional Code of Conduct; implicit biases and tolerance of an Odd Identity; and explicit biases with prejudices and stereotyping. The lack of knowledge and understanding of the nature and issues of sexuality is problematic as found in this study. They are primarily biases and prejudices projected onto marginalized LGBTQ patients who must contend with multiple jeopardies in conservative societies such as in Malaysia. With some state policies framed around Islam, the concern is with the belief among Malay/Islamic students that LGBTQ individuals should go through conversion 'therapies' to become cisgender and heterosexual.
Affiliation(s)
- Zahra Fazli Khalaf
  - Department of Psychology, College of Health and Human Sciences, North Carolina A&T State University, Greensboro, North Carolina, USA
- Jun Wei Liow
  - Department of Social Work and Social Administration, The University of Hong Kong, Pokfulam, Hong Kong
- Sivalingam Nalliah
  - School of Medicine, International Medical University, Kuala Lumpur, Malaysia
- Andrew L S Foong
  - College of Health & Medicine, University of Tasmania, Hobart, Australia
  - Faculty of Social Sciences, Quest International University, Perak Darul Ridzuan, Ipoh, Malaysia

2
Moss J, De Bin R. Modelling publication bias and p-hacking. Biometrics 2023; 79:319-331. [PMID: 34510407 DOI: 10.1111/biom.13560] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 02/26/2020] [Accepted: 08/26/2021] [Indexed: 11/27/2022]
Abstract
Publication bias and p-hacking are two well-known phenomena that strongly affect the scientific literature and cause severe problems in meta-analyses. Due to these phenomena, the assumptions of meta-analyses are seriously violated and the results of the studies cannot be trusted. While publication bias is very often captured well by the weighting function selection model, p-hacking is much harder to model and no definitive solution has been found yet. In this paper, we advocate the selection model approach to model publication bias and propose a mixture model for p-hacking. We derive some properties for these models, and we compare them formally and through simulations. Finally, two real data examples are used to show how the models work in practice.
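To make the idea of a weighting-function selection model concrete, here is a minimal simulation sketch. It assumes a step weight function (significant results are always published, non-significant ones with probability `rho`) and a one-sided z-test; these choices, and all names such as `simulate_published`, are illustrative assumptions and are not taken from the paper.

```python
import random
from statistics import NormalDist, mean


def simulate_published(n_studies, true_effect, se, rho, alpha=0.05, seed=0):
    """Simulate study estimates and apply a step-function publication filter.

    Assumption: a study with one-sided p < alpha is always published;
    otherwise it is published with probability rho.
    """
    rng = random.Random(seed)
    norm = NormalDist()
    all_estimates, published = [], []
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, se)
        p_value = 1.0 - norm.cdf(estimate / se)  # one-sided p-value
        all_estimates.append(estimate)
        if p_value < alpha or rng.random() < rho:
            published.append(estimate)
    return all_estimates, published


all_estimates, published = simulate_published(
    n_studies=5000, true_effect=0.1, se=0.2, rho=0.1
)
print("mean of all studies:      ", round(mean(all_estimates), 3))
print("mean of published studies:", round(mean(published), 3))
```

Under these assumptions the published subset overstates the true effect, which is the distortion a selection model tries to correct for when fitted to a meta-analysis.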
Affiliation(s)
- Jonas Moss
  - Department of Mathematics, University of Oslo, Oslo, Norway

3
Deyneko IV. Guidelines on the performance evaluation of motif recognition methods in bioinformatics. Front Genet 2023; 14:1135320. [PMID: 36824436 PMCID: PMC9941176 DOI: 10.3389/fgene.2023.1135320] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 12/31/2022] [Accepted: 01/19/2023] [Indexed: 02/09/2023] [Open Access]

4
Kebschull M, Kroeger AT, Papapanou PN. Differential Expression, Functional and Machine Learning Analysis of High-Throughput -Omics Data Using Open-Source Tools. Methods Mol Biol 2023; 2588:317-351. [PMID: 36418696 DOI: 10.1007/978-1-0716-2780-8_19] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
Abstract
Today, -omics analyses, including the systematic cataloging of messenger RNA and microRNA sequences or DNA methylation patterns in a cell population, organ or tissue sample, allow for an unbiased, comprehensive genome-level analysis of complex diseases, offering a large advantage over earlier "candidate" gene or pathway analyses. A primary goal in the analysis of these high-throughput assays is the detection of those features among several thousand that differ between different groups of samples. In the context of oral biology, our group has successfully utilized -omics technology to identify key molecules and pathways in different diagnostic entities of periodontal disease. A major issue when inferring biological information from high-throughput -omics studies is the fact that the sheer volume of high-dimensional data generated by contemporary technology is not appropriately analyzed using common statistical methods employed in the biomedical sciences. Furthermore, machine learning methods facilitate the detection of additional patterns, beyond the mere identification of lists of features that differ between groups. Herein, we outline a robust and well-accepted bioinformatics workflow for the initial analysis of -omics data using open-source tools. We outline a differential expression analysis pipeline that can be used for data from both arrays and sequencing experiments, and offers the possibility to account for random or fixed effects. Furthermore, we present an overview of the possibilities for a functional analysis of the obtained data including subsequent machine learning approaches in the form of (i) supervised classification algorithms in class validation and (ii) unsupervised clustering in class discovery.
Affiliation(s)
- Moritz Kebschull
  - Periodontal Research Group, Institute of Clinical Sciences, College of Medical & Dental Sciences, The University of Birmingham, Birmingham, UK
  - Division of Periodontics, Section of Oral, Diagnostic and Rehabilitation Sciences, Columbia University College of Dental Medicine, New York, NY, USA
  - Birmingham Community Healthcare NHS Trust, Birmingham, UK
- Annika Therese Kroeger
  - Birmingham Community Healthcare NHS Trust, Birmingham, UK
  - Department of Oral Surgery, School of Dentistry, University of Birmingham, Birmingham, UK
- Panos N Papapanou
  - Division of Periodontics, Section of Oral, Diagnostic and Rehabilitation Sciences, Columbia University College of Dental Medicine, New York, NY, USA

5
Gardner PP, Paterson JM, McGimpsey S, Ashari-Ghomi F, Umu SU, Pawlik A, Gavryushkin A, Black MA. Sustained software development, not number of citations or journal choice, is indicative of accurate bioinformatic software. Genome Biol 2022; 23:56. [PMID: 35172880 PMCID: PMC8851831 DOI: 10.1186/s13059-022-02625-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/05/2021] [Accepted: 02/06/2022] [Indexed: 11/29/2022] [Open Access]
Abstract
Background: Computational biology provides software tools for testing and making inferences about biological data. In the face of increasing volumes of data, heuristic methods that trade software speed for accuracy may be employed. We have studied these trade-offs using the results of a large number of independent software benchmarks, and evaluated whether external factors, including speed, author reputation, journal impact, recency and developer efforts, are indicative of accurate software. Results: We find that software speed, author reputation, journal impact, number of citations and age are unreliable predictors of software accuracy. This is unfortunate because these are frequently cited reasons for selecting software tools. However, GitHub-derived statistics and high version numbers show that accurate bioinformatic software tools are generally the product of many improvements over time. We also find an excess of slow and inaccurate bioinformatic software tools, and this is consistent across many sub-disciplines. There are few tools that are middle-of-road in terms of accuracy and speed trade-offs. Conclusions: Our findings indicate that accurate bioinformatic software is primarily the product of long-term commitments to software development. In addition, we hypothesise that bioinformatics software suffers from publication bias. Software that is intermediate in terms of both speed and accuracy may be difficult to publish, possibly due to author, editor and reviewer practises. This leaves an unfortunate hole in the literature, as ideal tools may fall into this gap. High accuracy tools are not always useful if they are slow, while high speed is not useful if the results are also inaccurate. Supplementary Information: The online version contains supplementary material available at 10.1186/s13059-022-02625-x.
Affiliation(s)
- Paul P Gardner
  - Department of Biochemistry, University of Otago, Dunedin, New Zealand
  - Biomolecular Interaction Centre, University of Canterbury, Christchurch, New Zealand
- James M Paterson
  - Department of Civil and Natural Resources Engineering, University of Canterbury, Christchurch, New Zealand
- Fatemeh Ashari-Ghomi
  - Research Group for Genomic Epidemiology, Technical University of Denmark, Kongens Lyngby, Denmark
- Sinan U Umu
  - Department of Research, Cancer Registry of Norway, Oslo, Norway
- Alex Gavryushkin
  - Department of Computer Science, University of Otago, Dunedin, New Zealand
  - School of Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand
- Michael A Black
  - Department of Biochemistry, University of Otago, Dunedin, New Zealand

6
Westphal M, Zapf A, Brannath W. A multiple testing framework for diagnostic accuracy studies with co-primary endpoints. Stat Med 2022; 41:891-909. [PMID: 35075684 DOI: 10.1002/sim.9308] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/22/2020] [Revised: 12/12/2021] [Accepted: 12/17/2021] [Indexed: 11/08/2022]
Abstract
Major advances have been made regarding the utilization of machine learning techniques for disease diagnosis and prognosis based on complex and high-dimensional data. Despite all justified enthusiasm, overoptimistic assessments of predictive performance are still common in this area. However, predictive models and medical devices based on such models should undergo a thorough evaluation before being implemented into clinical practice. In this work, we propose a multiple testing framework for (comparative) phase III diagnostic accuracy studies with sensitivity and specificity as co-primary endpoints. Our approach challenges the frequent recommendation to strictly separate model selection and evaluation, that is, to only assess a single diagnostic model in the evaluation study. We show that our parametric simultaneous test procedure asymptotically allows strong control of the family-wise error rate. A multiplicity correction is also available for point and interval estimates. Moreover, we demonstrate in an extensive simulation study that our multiple testing strategy on average leads to a better final diagnostic model and increased statistical power. To plan such studies, we propose a Bayesian approach to determine the optimal number of models to evaluate simultaneously. For this purpose, our algorithm optimizes the expected final model performance given previous (hold-out) data from the model development phase. We conclude that an assessment of multiple promising diagnostic models in the same evaluation study has several advantages when suitable adjustments for multiple comparisons are employed.
Affiliation(s)
- Max Westphal
  - Institute for Statistics, University of Bremen, Bremen, Germany
  - Max Westphal, Fraunhofer Institute for Digital Medicine MEVIS, Max-Von-Laue-Straße 2, 28359, Bremen, Germany
- Antonia Zapf
  - Institute of Medical Biometry and Epidemiology, UKE Hamburg, Hamburg, Germany
- Werner Brannath
  - Institute for Statistics, University of Bremen, Bremen, Germany
  - Competence Center for Clinical Trials Bremen, University of Bremen, Bremen, Germany

7
Plyusnin I, Holm L, Törönen P. Novel comparison of evaluation metrics for gene ontology classifiers reveals drastic performance differences. PLoS Comput Biol 2019; 15:e1007419. [PMID: 31682632 PMCID: PMC6855565 DOI: 10.1371/journal.pcbi.1007419] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 10/08/2018] [Revised: 11/14/2019] [Accepted: 09/24/2019] [Indexed: 11/18/2022] [Open Access]
Abstract
Automated protein annotation using the Gene Ontology (GO) plays an important role in the biosciences. Evaluation has always been considered central to developing novel annotation methods, but little attention has been paid to the evaluation metrics themselves. Evaluation metrics define how well an annotation method performs and allow methods to be ranked against one another. Unfortunately, most of these metrics were adopted from the machine learning literature without establishing whether they were appropriate for GO annotations. We propose a novel approach for comparing GO evaluation metrics called Artificial Dilution Series (ADS). Our approach uses existing annotation data to generate a series of annotation sets with different levels of correctness (referred to as their signal level). We calculate the evaluation metric being tested for each annotation set in the series, allowing us to identify whether it can separate different signal levels. Finally, we contrast these results with several false positive annotation sets, which are designed to expose systematic weaknesses in GO assessment. We compared 37 evaluation metrics for GO annotation using ADS and identified drastic differences between metrics. We show that some metrics struggle to differentiate between different signal levels, while others give erroneously high scores to the false positive data sets. Based on our findings, we provide guidelines on which evaluation metrics perform well with the Gene Ontology and propose improvements to several well-known evaluation metrics. In general, we argue that evaluation metrics should be tested for their performance and we provide software for this purpose (https://bitbucket.org/plyusnin/ads/). ADS is applicable to other areas of science where the evaluation of prediction results is non-trivial.
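The dilution idea in this abstract can be illustrated with a small sketch. The abstract does not describe the noise model, so the one used here is an assumption: each correct (gene, term) pair is kept with probability equal to the signal level and otherwise replaced by a randomly chosen term. The Jaccard index stands in for "the evaluation metric being tested", and every function name is hypothetical; the authors' actual software is at the Bitbucket URL above.

```python
import random


def dilute(true_annotations, vocabulary, signal, rng):
    """Keep each (gene, term) pair with probability `signal`;
    otherwise swap in a random term (assumed noise model)."""
    diluted = set()
    # Sort for a reproducible iteration order regardless of string hashing.
    for gene, term in sorted(true_annotations):
        if rng.random() < signal:
            diluted.add((gene, term))
        else:
            diluted.add((gene, rng.choice(vocabulary)))
    return diluted


def jaccard(predicted, truth):
    """Example metric under test: overlap between predicted and true pairs."""
    union = predicted | truth
    return len(predicted & truth) / len(union) if union else 1.0


def dilution_series(true_annotations, vocabulary, signals, metric=jaccard, seed=0):
    """Score one diluted annotation set per signal level."""
    rng = random.Random(seed)
    truth = set(true_annotations)
    return [(s, metric(dilute(truth, vocabulary, s, rng), truth)) for s in signals]


truth = {("geneA", "GO:0000001"), ("geneB", "GO:0000002"), ("geneC", "GO:0000003")}
decoys = ["GO:9999991", "GO:9999992"]  # made-up terms used only as noise
for signal, score in dilution_series(truth, decoys, [1.0, 0.5, 0.0]):
    print(signal, round(score, 2))
```

A metric that separates signal levels well should score the series in the same order as the signal levels; a metric that stays flat across the series would be flagged as uninformative.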
Affiliation(s)
- Ilya Plyusnin
  - Institute of Biotechnology, Helsinki Institute of Life Sciences, University of Helsinki, Helsinki, Finland
- Liisa Holm
  - Institute of Biotechnology, Helsinki Institute of Life Sciences, University of Helsinki, Helsinki, Finland
  - Research Programme in Organismal and Evolutionary Biology, Faculty of Biosciences, University of Helsinki, Helsinki, Finland
- Petri Törönen
  - Institute of Biotechnology, Helsinki Institute of Life Sciences, University of Helsinki, Helsinki, Finland

8
Gardner PP, Watson RJ, Morgan XC, Draper JL, Finn RD, Morales SE, Stott MB. Identifying accurate metagenome and amplicon software via a meta-analysis of sequence to taxonomy benchmarking studies. PeerJ 2019; 7:e6160. [PMID: 30631651 PMCID: PMC6322486 DOI: 10.7717/peerj.6160] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 11/08/2017] [Accepted: 11/14/2018] [Indexed: 01/26/2023] [Open Access]
Abstract
Metagenomic and meta-barcode DNA sequencing has rapidly become a widely-used technique for investigating a range of questions, particularly related to health and environmental monitoring. There has also been a proliferation of bioinformatic tools for analysing metagenomic and amplicon datasets, which makes selecting adequate tools a significant challenge. A number of benchmark studies have been undertaken; however, these can present conflicting results. In order to address this issue we have applied a robust Z-score ranking procedure and a network meta-analysis method to identify software tools that are consistently accurate for mapping DNA sequences to taxonomic hierarchies. Based upon these results we have identified some tools and computational strategies that produce robust predictions.
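As an illustration of what a robust Z-score ranking can look like, the sketch below uses the common median/MAD definition. The abstract does not spell out the exact procedure, so this definition, the example accuracy values, and the function names are assumptions made for illustration.

```python
from statistics import median


def robust_z_scores(values):
    """Median/MAD-based z-scores.

    The factor 1.4826 rescales the MAD so that it estimates the standard
    deviation for normally distributed data.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; robust z-scores are undefined")
    return [(v - med) / (1.4826 * mad) for v in values]


def rank_tools(accuracy_by_tool):
    """Return (tool, robust z) pairs, best-scoring tool first."""
    tools = list(accuracy_by_tool)
    z = robust_z_scores([accuracy_by_tool[t] for t in tools])
    return sorted(zip(tools, z), key=lambda pair: pair[1], reverse=True)


# Hypothetical accuracy values from a single benchmark.
benchmark = {"tool_a": 0.9, "tool_b": 0.7, "tool_c": 0.5, "tool_d": 0.1}
for tool, z in rank_tools(benchmark):
    print(tool, round(z, 2))
```

Because the median and MAD are insensitive to a few extreme values, one very poor tool does not compress the scores of the others, which helps when scores from differently scaled benchmarks are compared.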
Affiliation(s)
- Paul P Gardner
  - Biomolecular Interactions Centre, School of Biological Sciences, University of Canterbury, Christchurch, New Zealand
  - Department of Biochemistry, University of Otago, Dunedin, New Zealand
- Renee J Watson
  - Biomolecular Interactions Centre, School of Biological Sciences, University of Canterbury, Christchurch, New Zealand
- Xochitl C Morgan
  - Department of Microbiology and Immunology, University of Otago, Dunedin, New Zealand
- Jenny L Draper
  - Institute of Environmental Science and Research, Porirua, New Zealand
- Robert D Finn
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
- Sergio E Morales
  - Department of Microbiology and Immunology, University of Otago, Dunedin, New Zealand
- Matthew B Stott
  - Biomolecular Interactions Centre, School of Biological Sciences, University of Canterbury, Christchurch, New Zealand

9
Kebschull M, Papapanou PN. Exploring Genome-Wide Expression Profiles Using Machine Learning Techniques. Methods Mol Biol 2017; 1537:347-364. [PMID: 27924604 DOI: 10.1007/978-1-4939-6685-1_20] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 01/01/2023]
Abstract
Although contemporary high-throughput -omics methods produce high-dimensional data, the resulting wealth of information is difficult to assess using traditional statistical procedures. Machine learning methods facilitate the detection of additional patterns, beyond the mere identification of lists of features that differ between groups. Here, we demonstrate the utility of (1) supervised classification algorithms in class validation, and (2) unsupervised clustering in class discovery. We use data from our previous work that described the transcriptional profiles of gingival tissue samples obtained from subjects suffering from chronic or aggressive periodontitis (1) to test whether the two diagnostic entities were also characterized by differences on the molecular level, and (2) to search for a novel, alternative classification of periodontitis based on the tissue transcriptomes. Using machine learning technology, we provide evidence for diagnostic imprecision in the currently accepted classification of periodontitis, and demonstrate that a novel, alternative classification based on differences in gingival tissue transcriptomes is feasible. The outlined procedures allow for the unbiased interrogation of high-dimensional datasets for characteristic underlying classes, and are applicable to a broad range of -omics data.
Affiliation(s)
- Moritz Kebschull
  - Department of Periodontology, Operative and Preventive Dentistry, Faculty of Medicine, University of Bonn, Welschnonnenstr. 17, Bonn, D-53111, Germany
  - Division of Periodontics, Section of Oral, Diagnostic and Rehabilitation Sciences, Columbia University College of Dental Medicine, New York, NY, USA
- Panos N Papapanou
  - Division of Periodontics, Section of Oral, Diagnostic and Rehabilitation Sciences, Columbia University College of Dental Medicine, New York, NY, USA

10
Qian X, Dougherty ER. Bayesian Regression with Network Prior: Optimal Bayesian Filtering Perspective. IEEE Transactions on Signal Processing: A Publication of the IEEE Signal Processing Society 2016; 64:6243-6253. [PMID: 28824268 PMCID: PMC5560447 DOI: 10.1109/tsp.2016.2605072] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 06/07/2023]
Abstract
The recently introduced intrinsically Bayesian robust filter (IBRF) provides fully optimal filtering relative to a prior distribution over an uncertainty class of joint random process models, whereas formerly the theory was limited to model-constrained Bayesian robust filters, for which optimization was limited to the filters that are optimal for models in the uncertainty class. This paper extends the IBRF theory to the situation where there are both a prior on the uncertainty class and sample data. The result is optimal Bayesian filtering (OBF), where optimality is relative to the posterior distribution derived from the prior and the data. The IBRF theories for effective characteristics and canonical expansions extend to the OBF setting. A salient focus of the present work is to demonstrate the advantages of Bayesian regression within the OBF setting over the classical Bayesian approach in the context of linear Gaussian models.
Affiliation(s)
- Xiaoning Qian
  - Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843 USA
- Edward R Dougherty
  - Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843 USA, and the Computational Biology Division of the Translational Genomics Research Institute, Phoenix, AZ 85004 USA

11
Zhang L, Wang L, Tian P, Tian S. Identification of Genes Discriminating Multiple Sclerosis Patients from Controls by Adapting a Pathway Analysis Method. PLoS One 2016; 11:e0165543. [PMID: 27846233 PMCID: PMC5112852 DOI: 10.1371/journal.pone.0165543] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 02/17/2016] [Accepted: 09/13/2016] [Indexed: 11/18/2022] [Open Access]
Abstract
The focus of analyzing data from microarray experiments has shifted from the identification of associated individual genes to that of associated biological pathways or gene sets. In bioinformatics, a feature selection algorithm is usually used to cope with the high dimensionality of microarray data. In addition to those algorithms that use the biological information contained within a gene set as a priori knowledge to facilitate the process of feature selection, various gene set analysis methods can be applied directly or modified readily for the purpose of feature selection. The significance analysis of microarray to gene-set reduction analysis (SAM-GSR) algorithm, a novel direction of gene set analysis, is one such method. Here, we explore the feature selection property of SAM-GSR and provide a modification to better achieve the goal of feature selection. In a multiple sclerosis (MS) microarray data application, both SAM-GSR and our modification of SAM-GSR perform well. Our results show that SAM-GSR can indeed carry out feature selection, and that modified SAM-GSR outperforms SAM-GSR. Given that pathway information is far from complete, a statistical method capable of constructing biologically meaningful gene networks is of interest. Consequently, both SAM-GSR algorithms will be continuously re-evaluated in our future work, and thus better characterized.
Affiliation(s)
- Lei Zhang
  - College of Life Science, Jilin University, 2699 Qianjin Street, Changchun, Jilin, China, 130012
  - Department of Neurology, The Second Hospital of Jilin University, 218 Ziqiang Street, Changchun, Jilin, China, 130041
- Linlin Wang
  - College of Life Science, Jilin University, 2699 Qianjin Street, Changchun, Jilin, China, 130012
- Pu Tian
  - College of Life Science, Jilin University, 2699 Qianjin Street, Changchun, Jilin, China, 130012
- Suyan Tian
  - Division of Clinical Research, The First Hospital of Jilin University, 71 Xinmin Street, Changchun, Jilin, China, 130021

12
Wheeler NE, Barquist L, Kingsley RA, Gardner PP. A profile-based method for identifying functional divergence of orthologous genes in bacterial genomes. Bioinformatics 2016; 32:3566-3574. [PMID: 27503221 PMCID: PMC5181535 DOI: 10.1093/bioinformatics/btw518] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Received: 05/03/2016] [Revised: 07/17/2016] [Accepted: 08/02/2016] [Indexed: 02/04/2023] [Open Access]
Abstract
Motivation: Next generation sequencing technologies have provided us with a wealth of information on genetic variation, but predicting the functional significance of this variation is a difficult task. While many comparative genomics studies have focused on gene flux and large scale changes, relatively little attention has been paid to quantifying the effects of single nucleotide polymorphisms and indels on protein function, particularly in bacterial genomics. Results: We present a hidden Markov model based approach we call delta-bitscore (DBS) for identifying orthologous proteins that have diverged at the amino acid sequence level in a way that is likely to impact biological function. We benchmark this approach with several widely used datasets and apply it to a proof-of-concept study of orthologous proteomes in an investigation of host adaptation in Salmonella enterica. We highlight the value of the method in identifying functional divergence of genes, and suggest that this tool may be a better approach than the commonly used dN/dS metric for identifying functionally significant genetic changes occurring in recently diverged organisms. Availability and Implementation: A program implementing DBS for pairwise genome comparisons is freely available at: https://github.com/UCanCompBio/deltaBS. Contact: nicole.wheeler@pg.canterbury.ac.nz or lars.barquist@uni-wuerzburg.de Supplementary information: Supplementary data are available at Bioinformatics online.
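The core arithmetic suggested by the name delta-bitscore can be sketched as follows. This is not the deltaBS program: it assumes that each ortholog has already been scored against the same profile HMM by some external tool, and the sign convention (genome A minus genome B), the cutoff, and the function names are all illustrative assumptions.

```python
def delta_bitscores(scores_a, scores_b):
    """Per-ortholog bitscore difference (genome A minus genome B).

    Inputs map an ortholog identifier to the bitscore of that genome's
    protein against a shared profile HMM. Only orthologs present in
    both genomes are compared.
    """
    shared = scores_a.keys() & scores_b.keys()
    return {gene: scores_a[gene] - scores_b[gene] for gene in sorted(shared)}


def flag_divergent(dbs, threshold):
    """Orthologs whose absolute bitscore difference reaches the cutoff."""
    return [gene for gene, delta in dbs.items() if abs(delta) >= threshold]


# Hypothetical bitscores for two genomes.
genome_a = {"orth1": 310.2, "orth2": 145.0, "orth3": 88.5}
genome_b = {"orth1": 309.8, "orth2": 101.3}

dbs = delta_bitscores(genome_a, genome_b)
print(dbs)
print(flag_divergent(dbs, threshold=10.0))
```

A large difference for one ortholog pair suggests that one copy fits the family profile noticeably worse than the other, which is the kind of signal the abstract describes as likely to affect function.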
Affiliation(s)
- Nicole E Wheeler
  - School of Biological Sciences, University of Canterbury, Christchurch, New Zealand
  - Biomolecular Interaction Centre, University of Canterbury, Christchurch, New Zealand
- Lars Barquist
  - Institute for Molecular Infection Biology, University of Wuerzburg, Wuerzburg, Germany
- Robert A Kingsley
  - Institute of Food Research, Norwich Research Park, Norwich, UK
  - Wellcome Trust Sanger Institute, Hinxton, UK
- Paul P Gardner
  - School of Biological Sciences, University of Canterbury, Christchurch, New Zealand
  - Biomolecular Interaction Centre, University of Canterbury, Christchurch, New Zealand
  - Bio-protection Research Centre, University of Canterbury, Christchurch, New Zealand

13
Dalton LA, Yousefi MR. Data Requirements for Model-Based Cancer Prognosis Prediction. Cancer Inform 2016; 14:123-38. [PMID: 27127404 PMCID: PMC4844301 DOI: 10.4137/cin.s30801] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 12/09/2015] [Revised: 02/02/2016] [Accepted: 02/07/2016] [Indexed: 11/20/2022] [Open Access]
Abstract
Cancer prognosis prediction is typically carried out without integrating scientific knowledge available on genomic pathways, the effect of drugs on cell dynamics, or modeling mutations in the population. Recent work addresses some of these problems by formulating an uncertainty class of Boolean regulatory models for abnormal gene regulation, assigning prognosis scores to each network based on intervention outcomes, and partitioning networks in the uncertainty class into prognosis classes based on these scores. For a new patient, the probability distribution of the prognosis class was evaluated using optimal Bayesian classification, given patient data. It was assumed that (1) disease is the result of several mutations of a known healthy network and that these mutations and their probability distribution in the population are known and (2) only a single snapshot of the patient's gene activity profile is observed. It was shown that, even in ideal settings where cancer in the population and the effect of a drug are fully modeled, a single static measurement is typically not sufficient. Here, we study what measurements are sufficient to predict prognosis. In particular, we relax assumption (1) by addressing how population data may be used to estimate network probabilities, and extend assumption (2) to include static and time-series measurements of both population and patient data. Furthermore, we extend the prediction of prognosis classes to optimal Bayesian regression of prognosis metrics. Even when time-series data is preferable to infer a stochastic dynamical network, we show that static data can be superior for prognosis prediction when constrained to small samples. Furthermore, although population data is helpful, performance is not sensitive to inaccuracies in the estimated network probabilities.
Affiliation(s)
- Lori A. Dalton
  - Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH, USA
  - Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA

14
Hu J, Li Y, Yang JY, Shen HB, Yu DJ. GPCR–drug interactions prediction using random forest with drug-association-matrix-based post-processing procedure. Comput Biol Chem 2016; 60:59-71. [DOI: 10.1016/j.compbiolchem.2015.11.007] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 10/31/2014] [Revised: 08/04/2015] [Accepted: 11/10/2015] [Indexed: 12/21/2022]

15
Lawrence TJ, Kauffman KT, Amrine KCH, Carper DL, Lee RS, Becich PJ, Canales CJ, Ardell DH. FAST: FAST Analysis of Sequences Toolbox. Front Genet 2015; 6:172. [PMID: 26042145 PMCID: PMC4437040 DOI: 10.3389/fgene.2015.00172] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Received: 02/09/2015] [Accepted: 04/20/2015] [Indexed: 11/13/2022] [Open Access]
Abstract
FAST (FAST Analysis of Sequences Toolbox) provides simple, powerful open source command-line tools to filter, transform, annotate and analyze biological sequence data. Modeled after the GNU (GNU's Not Unix) Textutils such as grep, cut, and tr, FAST tools such as fasgrep, fascut, and fastr make it easy to rapidly prototype expressive bioinformatic workflows in a compact and generic command vocabulary. Compact combinatorial encoding of data workflows with FAST commands can simplify the documentation and reproducibility of bioinformatic protocols, supporting better transparency in biological data science. Interface self-consistency and conformity with conventions of GNU, Matlab, Perl, BioPerl, R, and GenBank help make FAST easy and rewarding to learn. FAST automates numerical, taxonomic, and text-based sorting, selection and transformation of sequence records and alignment sites based on content, index ranges, descriptive tags, annotated features, and in-line calculated analytics, including composition and codon usage. Automated content- and feature-based extraction of sites and support for molecular population genetic statistics make FAST useful for molecular evolutionary analysis. FAST is portable, easy to install and secure thanks to the relative maturity of its Perl and BioPerl foundations, with stable releases posted to CPAN. Development as well as a publicly accessible Cookbook and Wiki are available on the FAST GitHub repository at https://github.com/tlawrence3/FAST. The default data exchange format in FAST is Multi-FastA (specifically, a restriction of BioPerl FastA format). Sanger and Illumina 1.8+ FastQ formatted files are also supported. FAST makes it easier for non-programmer biologists to interactively investigate and control biological data at the speed of thought.
Affiliation(s)
- Travis J Lawrence
- Quantitative and Systems Biology Program, University of California, Merced, Merced, CA, USA
- Kyle T Kauffman
- Molecular Cell Biology Unit, School of Natural Sciences, University of California, Merced, Merced, CA, USA
- Katherine C H Amrine
- Quantitative and Systems Biology Program, University of California, Merced, Merced, CA, USA; Department of Viticulture and Enology, University of California, Davis, Davis, CA, USA
- Dana L Carper
- Quantitative and Systems Biology Program, University of California, Merced, Merced, CA, USA
- Raymond S Lee
- School of Engineering, University of California, Merced, Merced, CA, USA
- Peter J Becich
- Molecular Cell Biology Unit, School of Natural Sciences, University of California, Merced, Merced, CA, USA
- Claudia J Canales
- School of Engineering, University of California, Merced, Merced, CA, USA
- David H Ardell
- Quantitative and Systems Biology Program, University of California, Merced, Merced, CA, USA; Molecular Cell Biology Unit, School of Natural Sciences, University of California, Merced, Merced, CA, USA
|
16
|
Yu DJ, Li Y, Hu J, Yang X, Yang JY, Shen HB. Disulfide Connectivity Prediction Based on Modelled Protein 3D Structural Information and Random Forest Regression. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:611-621. [PMID: 26357272 DOI: 10.1109/tcbb.2014.2359451] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 06/05/2023]
Abstract
Disulfide connectivity is an important protein structural characteristic. Accurately predicting disulfide connectivity solely from protein sequence helps to improve the intrinsic understanding of protein structure and function, especially in the post-genome era, in which large volumes of sequenced proteins are accumulating quickly without being functionally annotated. In this study, a new feature extracted from the predicted protein 3D structural information is proposed and integrated with traditional features to form discriminative features. Based on the extracted features, a random forest regression model is trained to predict protein disulfide connectivity. We compare the proposed method with popular existing predictors by performing both cross-validation and independent validation tests on benchmark datasets. The experimental results demonstrate the superiority of the proposed method over existing predictors. We believe the superiority of the proposed method benefits from both the good discriminative capability of the newly developed features and the powerful modelling capability of the random forest. The web server implementation, called TargetDisulfide, and the benchmark datasets are freely available at: http://csbio.njust.edu.cn/bioinf/TargetDisulfide for academic use.
|
17
|
Hu J, He X, Yu DJ, Yang XB, Yang JY, Shen HB. A new supervised over-sampling algorithm with application to protein-nucleotide binding residue prediction. PLoS One 2014; 9:e107676. [PMID: 25229688 PMCID: PMC4168127 DOI: 10.1371/journal.pone.0107676] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.2] [Received: 05/09/2014] [Accepted: 08/09/2014] [Indexed: 12/21/2022] Open
Abstract
Protein-nucleotide interactions are ubiquitous in a wide variety of biological processes. Accurately identifying interaction residues solely from protein sequences is useful for both protein function annotation and drug design, especially in the post-genomic era, as large volumes of protein data have not been functionally annotated. Protein-nucleotide binding residue prediction is a typical imbalanced learning problem, where binding residues are far fewer in number than non-binding residues. Alleviating the severity of class imbalance has been demonstrated to be a promising means of improving the prediction performance of a machine-learning-based predictor for class imbalance problems. However, little attention has been paid to the negative impact of class imbalance on protein-nucleotide binding residue prediction. In this study, we propose a new supervised over-sampling algorithm that synthesizes additional minority class samples to address class imbalance. The experimental results from protein-nucleotide interaction datasets demonstrate that the proposed supervised over-sampling algorithm can relieve the severity of class imbalance and help to improve prediction performance. Based on the proposed over-sampling algorithm, a predictor, called TargetSOS, is implemented for protein-nucleotide binding residue prediction. Cross-validation tests and independent validation tests demonstrate the effectiveness of TargetSOS. The web-server and datasets used in this study are freely available at http://www.csbio.sjtu.edu.cn/bioinf/TargetSOS/.
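As a rough illustration of the general idea of over-sampling a minority class, the following minimal Python sketch synthesizes extra minority points by interpolating between random pairs of existing minority samples. It is not the supervised algorithm proposed by Hu et al., whose details are not given in this listing; the function name `oversample_minority` and the toy feature vectors are assumptions made purely for illustration.

```python
import random


def oversample_minority(minority, n_new, seed=0):
    """Create n_new synthetic points by interpolating random pairs of minority samples.

    Generic interpolation-based over-sampling (an assumption for illustration),
    not the supervised method described in the abstract.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)   # two distinct minority samples
        t = rng.random()                 # interpolation weight in [0, 1)
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic


# Hypothetical feature vectors for the minority (binding) class.
binding = [[0.9, 0.8], [0.7, 0.95], [0.85, 0.6]]
non_binding_count = 10  # hypothetical size of the majority class

extra = oversample_minority(binding, non_binding_count - len(binding))
balanced_minority = binding + extra
print(len(balanced_minority))  # minority class now matches the majority count
```

Each synthetic point lies on the segment between two real minority samples, so the class is enlarged without leaving the region the real samples already occupy.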
Affiliation(s)
- Jun Hu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, Jiangsu, China
- Xue He
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, Jiangsu, China
- Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, Jiangsu, China
- Changshu Institute, Nanjing University of Science and Technology, Changshu, Jiangsu, China
- * E-mail: (DJY); (HBS)
- Xi-Bei Yang
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, Jiangsu, China
- School of Computer Science and Engineering, Jiangsu University of Science and Technology, Zhenjiang, Jiangsu, China
- Jing-Yu Yang
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, Jiangsu, China
- Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, China
- * E-mail: (DJY); (HBS)
|
18
|
Yu DJ, Hu J, Yan H, Yang XB, Yang JY, Shen HB. Enhancing protein-vitamin binding residues prediction by multiple heterogeneous subspace SVMs ensemble. BMC Bioinformatics 2014; 15:297. [PMID: 25189131 PMCID: PMC4261549 DOI: 10.1186/1471-2105-15-297] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Received: 03/23/2014] [Accepted: 08/18/2014] [Indexed: 11/10/2022] Open
Abstract
Background: Vitamins are typical ligands that play critical roles in various metabolic processes. The accurate identification of the vitamin-binding residues solely based on a protein sequence is of significant importance for the functional annotation of proteins, especially in the post-genomic era, when large volumes of protein sequences are accumulating quickly without being functionally annotated. Results: In this paper, a new predictor called TargetVita is designed and implemented for predicting protein-vitamin binding residues using protein sequences. In TargetVita, features derived from the position-specific scoring matrix (PSSM), predicted protein secondary structure, and vitamin binding propensity are combined to form the original feature space; then, several feature subspaces are selected by performing different feature selection methods. Finally, based on the selected feature subspaces, heterogeneous SVMs are trained and then ensembled for performing prediction. Conclusions: The experimental results obtained with four separate vitamin-binding benchmark datasets demonstrate that the proposed TargetVita is superior to the state-of-the-art vitamin-specific predictor, and an average improvement of 10% in terms of the Matthews correlation coefficient (MCC) was achieved over independent validation tests. The TargetVita web server and the datasets used are freely available for academic use at http://csbio.njust.edu.cn/bioinf/TargetVita or http://www.csbio.sjtu.edu.cn/bioinf/TargetVita. Electronic supplementary material: The online version of this article (doi:10.1186/1471-2105-15-297) contains supplementary material, which is available to authorized users.
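For readers unfamiliar with the Matthews correlation coefficient (MCC) used as the evaluation metric above, the short Python sketch below computes it from confusion-matrix counts using the standard formula MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The function name and the example counts are hypothetical and are not taken from the TargetVita paper.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: MCC is set to 0 when any marginal total is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for an imbalanced binding-residue task
# (few binding residues, many non-binding residues).
print(matthews_corrcoef(tp=40, tn=900, fp=30, fn=30))
```

Because MCC uses all four cells of the confusion matrix, it stays informative when binding residues are rare, which is why it is often preferred over plain accuracy in this setting.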
Affiliation(s)
- Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Xiaolingwei 200, Nanjing 210094, China.
|
19
|
Darewicz M, Borawska J, Vegarud GE, Minkiewicz P, Iwaniak A. Angiotensin I-converting enzyme (ACE) inhibitory activity and ACE inhibitory peptides of salmon (Salmo salar) protein hydrolysates obtained by human and porcine gastrointestinal enzymes. Int J Mol Sci 2014; 15:14077-101. [PMID: 25123137 PMCID: PMC4159840 DOI: 10.3390/ijms150814077] [Citation(s) in RCA: 53] [Impact Index Per Article: 5.3] [Received: 05/21/2014] [Revised: 06/24/2014] [Accepted: 07/16/2014] [Indexed: 01/21/2023] Open
Abstract
The objectives of the present study were two-fold: first, to detect whether salmon protein fractions possess angiotensin I-converting enzyme (ACE) inhibitory properties and whether salmon proteins can release ACE inhibitory peptides during a sequential in vitro hydrolysis (with commercial porcine enzymes) and ex vivo digestion (with human gastrointestinal enzymes); and second, to evaluate the ACE inhibitory activity of the generated hydrolysates. A two-step ex vivo and in vitro model digestion was performed to simulate the human digestion process. Salmon proteins were degraded more efficiently by porcine enzymes than by human gastrointestinal juices, and sarcoplasmic proteins were digested/hydrolyzed more easily than myofibrillar proteins. The ex vivo digested myofibrillar and sarcoplasmic duodenal samples showed IC50 values (concentration required to decrease the ACE activity by 50%) of 1.06 and 2.16 mg/mL, respectively. The in vitro hydrolyzed myofibrillar and sarcoplasmic samples showed IC50 values of 0.91 and 1.04 mg/mL, respectively. Based on the results of in silico studies, it was possible to identify 9 peptides in the ex vivo hydrolysates and 7 peptides in the in vitro hydrolysates of salmon proteins, out of 11 selected peptides. In both types of salmon hydrolysates, the ACE-inhibitory peptides IW, IY, TVY and VW were identified. In the in vitro salmon protein hydrolysates, the ACE-inhibitory peptides VPW and VY were also detected, while the ACE-inhibitory peptides ALPHA, IVY and IWHHT were identified in the hydrolysates generated with ex vivo digestion. In our studies, we documented ACE inhibitory in vitro effects of salmon protein hydrolysates obtained by human as well as porcine gastrointestinal enzymes.
Affiliation(s)
- Małgorzata Darewicz
- Department of Food Biochemistry, Faculty of Food Science, University of Warmia and Mazury in Olsztyn, Olsztyn 10-726, Poland.
- Justyna Borawska
- Department of Food Biochemistry, Faculty of Food Science, University of Warmia and Mazury in Olsztyn, Olsztyn 10-726, Poland.
- Gerd E Vegarud
- Department of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences, Ås NO-1432, Norway.
- Piotr Minkiewicz
- Department of Food Biochemistry, Faculty of Food Science, University of Warmia and Mazury in Olsztyn, Olsztyn 10-726, Poland.
- Anna Iwaniak
- Department of Food Biochemistry, Faculty of Food Science, University of Warmia and Mazury in Olsztyn, Olsztyn 10-726, Poland.
|
20
|
Dougherty ER. On the impoverishment of scientific education. EURASIP JOURNAL ON BIOINFORMATICS & SYSTEMS BIOLOGY 2013; 2013:15. [PMID: 24215841 PMCID: PMC3826847 DOI: 10.1186/1687-4153-2013-15] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/18/2013] [Accepted: 10/16/2013] [Indexed: 12/02/2022]
Abstract
Hannah Arendt, one of the foremost political philosophers of the twentieth century, has argued that it is the responsibility of educators not to leave children in their own world but instead to bring them into the adult world so that, as adults, they can carry civilization forward to whatever challenges it will face by bringing to bear the learning of the past. In the same collection of essays, she discusses the recognition by modern science that Nature is inconceivable in terms of ordinary human conceptual categories - as she writes, ‘unthinkable in terms of pure reason’. Together, these views on scientific education lead to an educational process that transforms children into adults, with a scientific adult being one who has the ability to conceptualize scientific systems independent of ordinary physical intuition. This article begins with Arendt’s basic educational and scientific points and develops from them a critique of current scientific education in conjunction with an appeal to educate young scientists in a manner that allows them to fulfill their potential ‘on the shoulders of giants’. While the article takes a general philosophical perspective, its specifics tend to be directed at biomedical education, in particular, how such education pertains to translational science.
Affiliation(s)
- Edward R Dougherty
- Center for Bioinformatics and Genomic Systems Engineering, Department of Electrical and Computer Engineering, Texas A&M University, 3128 TAMU, College Station, TX 77843-3128, USA.
|
21
|
Minkiewicz P, Miciński J, Darewicz M, Bucholska J. Biological and Chemical Databases for Research into the Composition of Animal Source Foods. FOOD REVIEWS INTERNATIONAL 2013. [DOI: 10.1080/87559129.2013.818011] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 10/26/2022]
|
22
|
Boulesteix AL. On representative and illustrative comparisons with real data in bioinformatics: response to the letter to the editor by Smith et al. Bioinformatics 2013; 29:2664-6. [PMID: 23929033 DOI: 10.1093/bioinformatics/btt458] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Indexed: 11/14/2022] Open
Affiliation(s)
- Anne-Laure Boulesteix
- Department of Medical Informatics, Biometry and Epidemiology, University of Munich, 81377 Munich, Germany
|
23
|
Boulesteix AL, Lauer S, Eugster MJA. A plea for neutral comparison studies in computational sciences. PLoS One 2013; 8:e61562. [PMID: 23637855 PMCID: PMC3634809 DOI: 10.1371/journal.pone.0061562] [Citation(s) in RCA: 70] [Impact Index Per Article: 6.4] [Received: 12/27/2012] [Accepted: 03/11/2013] [Indexed: 12/04/2022] Open
Abstract
In computational science literature including, e.g., bioinformatics, computational statistics or machine learning, most published articles are devoted to the development of "new methods", while comparison studies are generally appreciated by readers but surprisingly given poor consideration by many journals. This paper stresses the importance of neutral comparison studies for the objective evaluation of existing methods and the establishment of standards by drawing parallels with clinical research. The goal of the paper is twofold. Firstly, we present a survey of recent computational papers on supervised classification published in seven high-ranking computational science journals. The aim is to provide an up-to-date picture of current scientific practice with respect to the comparison of methods in both articles presenting new methods and articles focusing on the comparison study itself. Secondly, based on the results of our survey we critically discuss the necessity, impact and limitations of neutral comparison studies in computational sciences. We define three reasonable criteria a comparison study has to fulfill in order to be considered as neutral, and explicate general considerations on the individual components of a "tidy neutral comparison study". R codes for completely replicating our statistical analyses and figures are available from the companion website http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/020_professuren/boulesteix/plea2013.
Affiliation(s)
- Anne-Laure Boulesteix
- Department of Medical Informatics, Biometry and Epidemiology, Ludwig-Maximilians-University of Munich, Munich, Germany.
|
24
|
Quo CF, Kaddi C, Phan JH, Zollanvari A, Xu M, Wang MD, Alterovitz G. Reverse engineering biomolecular systems using -omic data: challenges, progress and opportunities. Brief Bioinform 2012; 13:430-45. [PMID: 22833495 DOI: 10.1093/bib/bbs026] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Indexed: 12/20/2022] Open
Abstract
Recent advances in high-throughput biotechnologies have led to rapidly growing research interest in reverse engineering of biomolecular systems (REBMS). 'Data-driven' approaches, i.e. data mining, can be used to extract patterns from large volumes of biochemical data at molecular-level resolution, while 'design-driven' approaches, i.e. systems modeling, can be used to simulate emergent system properties. Consequently, both data- and design-driven approaches applied to -omic data may lead to novel insights in reverse engineering biological systems that could not be expected before using low-throughput platforms. However, there exist several challenges in this fast-growing field of reverse engineering biomolecular systems: (i) to integrate heterogeneous biochemical data for data mining, (ii) to combine top-down and bottom-up approaches for systems modeling and (iii) to validate system models experimentally. In addition to reviewing progress made by the community and opportunities encountered in addressing these challenges, we explore the emerging field of synthetic biology, which is an exciting approach to validate and analyze theoretical system models directly through experimental synthesis, i.e. analysis-by-synthesis. The ultimate goal is to address the present and future challenges in reverse engineering biomolecular systems (REBMS) using an integrated workflow of data mining, systems modeling and synthetic biology.
Affiliation(s)
- Chang F Quo
- Georgia Institute of Technology, Atlanta, GA 30332, USA
|
25
|
Yu D, Wu X, Shen H, Yang J, Tang Z, Qi Y, Yang J. Enhancing Membrane Protein Subcellular Localization Prediction by Parallel Fusion of Multi-View Features. IEEE Trans Nanobioscience 2012; 11:375-85. [PMID: 22875262 DOI: 10.1109/tnb.2012.2208473] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Indexed: 11/06/2022]
Affiliation(s)
- Dongjun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China.
|
26
|
Abstract
MOTIVATION A common practice in biomarker discovery is to decide whether a large laboratory experiment should be carried out based on the results of a preliminary study on a small set of specimens. Consideration of the efficacy of this approach motivates the introduction of a probabilistic measure, for whether a classifier showing promising results in a small-sample preliminary study will perform similarly on a large independent sample. Given the error estimate from the preliminary study, if the probability of reproducible error is low, then there is really no purpose in substantially allocating more resources to a large follow-on study. Indeed, if the probability of the preliminary study providing likely reproducible results is small, then why even perform the preliminary study? RESULTS This article introduces a reproducibility index for classification, measuring the probability that a sufficiently small error estimate on a small sample will motivate a large follow-on study. We provide a simulation study based on synthetic distribution models that possess known intrinsic classification difficulties and emulate real-world scenarios. We also set up similar simulations on four real datasets to show the consistency of results. The reproducibility indices for different distributional models, real datasets and classification schemes are empirically calculated. The effects of reporting and multiple-rule biases on the reproducibility index are also analyzed. AVAILABILITY We have implemented in C code the synthetic data distribution model, classification rules, feature selection routine and error estimation methods. The source code is available at http://gsp.tamu.edu/Publications/supplementary/yousefi12a/.
Affiliation(s)
- Mohammadmahdi R Yousefi
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA
|
27
|
Dougherty ER, Zollanvari A, Braga-Neto UM. The illusion of distribution-free small-sample classification in genomics. Curr Genomics 2012; 12:333-41. [PMID: 22294876 PMCID: PMC3145263 DOI: 10.2174/138920211796429763] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.2] [Received: 04/10/2011] [Revised: 05/29/2011] [Accepted: 06/07/2011] [Indexed: 01/01/2023] Open
Abstract
Classification has emerged as a major area of investigation in bioinformatics owing to the desire to discriminate phenotypes, in particular, disease conditions, using high-throughput genomic data. While many classification rules have been posed, there is a paucity of error estimation rules and an even greater paucity of theory concerning error estimation accuracy. This is problematic because the worth of a classifier depends mainly on its error rate. It is commonplace in bioinformatics papers to have a classification rule applied to a small labeled data set and the error of the resulting classifier estimated on the same data set, most often via cross-validation, without any assumptions being made on the underlying feature-label distribution. Concomitant with a lack of distributional assumptions is the absence of any statement regarding the accuracy of the error estimate. Without such a measure of accuracy, the most common one being the root-mean-square (RMS), the error estimate is essentially meaningless and the worth of the entire paper is questionable. The concomitance of an absence of distributional assumptions and of a measure of error estimation accuracy is assured in small-sample settings because even when distribution-free bounds exist (and that is rare), the sample sizes required under the bounds are so large as to make them useless for small samples. Thus, distributional bounds are necessary and the distributional assumptions need to be stated. Owing to the epistemological dependence of classifiers on the accuracy of their estimated errors, scientifically meaningful distribution-free classification in high-throughput, small-sample biology is an illusion.
|
28
|
|
29
|
Abstract
The full genomes of several closely related species are now available, opening an emerging field of investigation borrowing both from population genetics and phylogenetics. Provided we can properly model sequence evolution within populations undergoing speciation events, this resource enables us to estimate key population genetics parameters, such as ancestral population sizes and split times. Furthermore, we can enhance our understanding of the recombination process and investigate various selective forces. We discuss the basic speciation models for closely related species, including the isolation and isolation-with-migration models. A major point in our discussion is that only a few complete genomes contain much information about the whole population. The reason is that recombination unlinks genomic regions, and therefore a few genomes contain many segments with distinct histories. The challenge of population genomics is to decode this mosaic of histories in order to infer scenarios of demography and selection. We survey different approaches for understanding ancestral species from analyses of genomic data from closely related species. In particular, we emphasize core assumptions and working hypotheses. Finally, we discuss computational and statistical challenges that arise in the analysis of population genomics data sets.
Affiliation(s)
- Julien Y Dutheil
- Institut des Sciences de l'Évolution Montpellier (ISE-M), UMR 5554, CNRS, Université Montpellier, Montpellier, France.
|
30
|
Abstract
A review of 2010 research in translational bioinformatics provides much to marvel at. We have seen notable advances in personal genomics, pharmacogenetics, and sequencing. At the same time, the infrastructure for the field has burgeoned. While acknowledging that, according to researchers, the members of this field tend to be overly optimistic, the authors predict a bright future.
Affiliation(s)
- Russ B Altman
- Department of Bioengineering, Stanford University School of Medicine, Stanford, California 94305-5444, USA.
|
31
|
Talley NJ, Fodor AA. Bugs, stool, and the irritable bowel syndrome: too much is as bad as too little? Gastroenterology 2011; 141:1555-9. [PMID: 21945058 DOI: 10.1053/j.gastro.2011.09.019] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Indexed: 02/08/2023]
|
32
|
Sima C, Braga-Neto UM, Dougherty ER. High-dimensional bolstered error estimation. Bioinformatics 2011; 27:3056-64. [PMID: 21914630 DOI: 10.1093/bioinformatics/btr518] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION In small-sample settings, bolstered error estimation has been shown to perform better than cross-validation and competitively with bootstrap with regard to various criteria. The key issue for bolstering performance is the variance setting for the bolstering kernel. Heretofore, this variance has been determined in a non-parametric manner from the data. Although bolstering based on this variance setting works well for small feature sets, results can deteriorate for high-dimensional feature spaces. RESULTS This article computes an optimal kernel variance depending on the classification rule, sample size, model and feature space, both the original number and the number remaining after feature selection. A key point is that the optimal variance is robust relative to the model. This allows us to develop a method for selecting a suitable variance to use in real-world applications where the model is not known, but the other factors in determining the optimal kernel are known. AVAILABILITY Companion website at http://compbio.tgen.org/paper_supp/high_dim_bolstering. CONTACT edward@mail.ece.tamu.edu.
Affiliation(s)
- Chao Sima
- Computational Biology Division, Translational Genomics Research Institute, Phoenix, AZ, USA
|
33
|
Capriotti E, Altman RB. A new disease-specific machine learning approach for the prediction of cancer-causing missense variants. Genomics 2011; 98:310-7. [PMID: 21763417 DOI: 10.1016/j.ygeno.2011.06.010] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.8] [Received: 02/23/2011] [Revised: 06/26/2011] [Accepted: 06/28/2011] [Indexed: 12/20/2022]
Abstract
High-throughput genotyping and sequencing techniques are rapidly and inexpensively providing large amounts of human genetic variation data. Single Nucleotide Polymorphisms (SNPs) are an important source of human genome variability and have been implicated in several human diseases, including cancer. Amino acid mutations resulting from non-synonymous SNPs in coding regions may generate protein functional changes that affect cell proliferation. In this study, we developed a machine learning approach to predict cancer-causing missense variants. We present a Support Vector Machine (SVM) classifier trained on a set of 3163 cancer-causing variants and an equal number of neutral polymorphisms. The method achieves 93% overall accuracy, a correlation coefficient of 0.86, and an area under the ROC curve of 0.98. When compared with other previously developed algorithms such as SIFT and CHASM, our method results in higher prediction accuracy and correlation coefficient in identifying cancer-causing variants.
Affiliation(s)
- Emidio Capriotti
- Department of Bioengineering, Stanford University, Stanford, CA 94305, USA.
|
34
|
Segata N, Izard J, Waldron L, Gevers D, Miropolsky L, Garrett WS, Huttenhower C. Metagenomic biomarker discovery and explanation. Genome Biol 2011; 12:R60. [PMID: 21702898 PMCID: PMC3218848 DOI: 10.1186/gb-2011-12-6-r60] [Citation(s) in RCA: 8946] [Impact Index Per Article: 688.2] [Received: 04/04/2011] [Revised: 05/31/2011] [Accepted: 06/24/2011] [Indexed: 12/11/2022] Open
Abstract
This study describes and validates a new method for metagenomic biomarker discovery by way of class comparison, tests of biological consistency and effect size estimation. This addresses the challenge of finding organisms, genes, or pathways that consistently explain the differences between two or more microbial communities, which is a central problem to the study of metagenomics. We extensively validate our method on several microbiomes and a convenient online interface for the method is provided at http://huttenhower.sph.harvard.edu/lefse/.
Affiliation(s)
- Nicola Segata
- Department of Biostatistics, 677 Huntington Avenue, Harvard School of Public Health, Boston, MA 02115, USA
|
35
|
Yousefi MR, Hua J, Dougherty ER. Multiple-rule bias in the comparison of classification rules. Bioinformatics 2011; 27:1675-83. [PMID: 21546390 DOI: 10.1093/bioinformatics/btr262] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Indexed: 01/11/2023]
Abstract
MOTIVATION There is growing discussion in the bioinformatics community concerning overoptimism of reported results. Two approaches contributing to overoptimism in classification are (i) the reporting of results on datasets for which a proposed classification rule performs well and (ii) the comparison of multiple classification rules on a single dataset that purports to show the advantage of a certain rule. RESULTS This article provides a careful probabilistic analysis of the second issue and the 'multiple-rule bias', resulting from choosing a classification rule having minimum estimated error on the dataset. It quantifies this bias corresponding to estimating the expected true error of the classification rule possessing minimum estimated error and it characterizes the bias from estimating the true comparative advantage of the chosen classification rule relative to the others by the estimated comparative advantage on the dataset. The analysis is applied to both synthetic and real data using a number of classification rules and error estimators. AVAILABILITY We have implemented in C code the synthetic data distribution model, classification rules, feature selection routines and error estimation methods. The code for multiple-rule analysis is implemented in MATLAB. The source code is available at http://gsp.tamu.edu/Publications/supplementary/yousefi11a/. Supplementary simulation results are also included.
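The 'multiple-rule bias' described above can be made concrete with a small simulation. The Python sketch below is an illustration under simplifying assumptions, not the authors' C/MATLAB code: every hypothetical rule has the same true error, each rule's error is estimated on its own independent test sample, and the rule with the smallest estimate is reported. All function names and parameter values are invented for the example.

```python
import random


def min_estimated_error(n_rules, n_samples, true_error, rng):
    """Smallest test-set error estimate among rules that share one true error."""
    estimates = []
    for _ in range(n_rules):
        # Each test point is misclassified with probability true_error.
        mistakes = sum(rng.random() < true_error for _ in range(n_samples))
        estimates.append(mistakes / n_samples)
    return min(estimates)


def mean_min_estimate(n_rules=10, n_samples=30, true_error=0.3,
                      trials=2000, seed=0):
    """Average, over repeated experiments, of the minimum estimated error."""
    rng = random.Random(seed)
    total = sum(min_estimated_error(n_rules, n_samples, true_error, rng)
                for _ in range(trials))
    return total / trials


# Reporting the best of 10 rules: the average reported error is
# optimistically biased below the true error of 0.3.
print(mean_min_estimate(n_rules=10))
# Reporting a single pre-chosen rule: the estimate is unbiased on average.
print(mean_min_estimate(n_rules=1))
```

The gap between the two printed values is the optimistic bias introduced purely by selecting the minimum; in practice the rules are evaluated on the same dataset, so their estimates are correlated and the size of the bias differs from this independent-sample sketch.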
Affiliation(s)
- Mohammadmahdi R Yousefi
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA
|
36
|
Binder H, Porzelius C, Schumacher M. An overview of techniques for linking high-dimensional molecular data to time-to-event endpoints by risk prediction models. Biom J 2011; 53:170-89. [PMID: 21328602 DOI: 10.1002/bimj.201000152] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Received: 07/26/2010] [Revised: 12/22/2010] [Accepted: 12/23/2010] [Indexed: 11/07/2022]
Abstract
Analysis of molecular data promises identification of biomarkers for improving prognostic models, thus potentially enabling better patient management. For identifying such biomarkers, risk prediction models can be employed that link high-dimensional molecular covariate data to a clinical endpoint. In low-dimensional settings, a multitude of statistical techniques already exists for building such models, e.g. allowing for variable selection or for quantifying the added value of a new biomarker. We provide an overview of techniques for regularized estimation that transfer this toward high-dimensional settings, with a focus on models for time-to-event endpoints. Techniques for incorporating specific covariate structure are discussed, as well as techniques for dealing with more complex endpoints. Employing gene expression data from patients with diffuse large B-cell lymphoma, some typical modeling issues from low-dimensional settings are illustrated in a high-dimensional application. First, the performance of classical stepwise regression is compared to stage-wise regression, as implemented by a component-wise likelihood-based boosting approach. A second issue arises when artificially transforming the response into a binary variable. The effects of the resulting loss of efficiency and potential bias in a high-dimensional setting are illustrated, and a link to competing risks models is provided. Finally, we discuss conditions for adequately quantifying the added value of high-dimensional gene expression measurements, both at the stage of model fitting and when performing evaluation.
Affiliation(s)
- Harald Binder
- Institute of Medical Biometry and Medical Informatics, University Medical Center Freiburg, Stefan-Meier-Str. 26, 79104 Freiburg, Germany.
|
37
|
Dougherty ER. Validation of gene regulatory networks: scientific and inferential. Brief Bioinform 2010; 12:245-52. [PMID: 21183477 DOI: 10.1093/bib/bbq078] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.4] [Indexed: 11/15/2022]
Abstract
Gene regulatory network models are a major area of study in systems and computational biology and the construction of network models is among the most important problems in these disciplines. The critical epistemological issue concerns validation. Validity can be approached from two different perspectives: (i) given a hypothesized network model, its scientific validity relates to the ability to make predictions from the model that can be checked against experimental observations; and (ii) the validity of a network inference procedure must be evaluated relative to its ability to infer a network from sample points generated by the network. This article examines both perspectives in the framework of a distance function between two networks. It considers some of the obstacles to validation and provides examples of both validation paradigms.
Affiliation(s)
- Edward R Dougherty
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, USA.
|
38
|
FACT: functional annotation transfer between proteins with similar feature architectures. BMC Bioinformatics 2010; 11:417. [PMID: 20696036 PMCID: PMC2931517 DOI: 10.1186/1471-2105-11-417] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.7] [Received: 03/15/2010] [Accepted: 08/09/2010] [Indexed: 11/24/2022]
Abstract
Background The increasing number of sequenced genomes provides the basis for exploring the genetic and functional diversity within the tree of life. Only a tiny fraction of the encoded proteins undergoes a thorough experimental characterization. For the remainder, bioinformatics annotation tools are the only means to infer their function. Exploiting significant sequence similarities to already characterized proteins, commonly taken as evidence for homology, is the prevalent method to deduce functional equivalence. Such methods fail when homologs are too diverged, or when they have assumed a different function. Finally, due to convergent evolution, functional equivalence is not necessarily linked to common ancestry. Therefore complementary approaches are required to identify functional equivalents. Results We present the Feature Architecture Comparison Tool (http://www.cibiv.at/FACT) to search for functionally equivalent proteins. FACT uses the similarity between feature architectures of two proteins, i.e., the arrangements of functional domains, secondary structure elements and compositional properties, as a proxy for their functional equivalence. A scoring function measures feature architecture similarities, which enables searching for functional equivalents in entire proteomes. Our evaluation of 9,570 EC classified enzymes revealed that FACT, using the full feature set, outperformed the existing architecture-based approaches by identifying significantly more functional equivalents as highest scoring proteins. We show that FACT can identify functional equivalents that share no significant sequence similarity. However, when the highest scoring protein of FACT is also the protein with the highest local sequence similarity, it is in 99% of the cases functionally equivalent to the query. We demonstrate the versatility of FACT by identifying a missing link in the yeast glutathione metabolism and also by searching for the human GolgA5 equivalent in Trypanosoma brucei.
Conclusions FACT facilitates a quick and sensitive search for functionally equivalent proteins in entire proteomes. FACT is complementary to approaches using sequence similarity to identify proteins with the same function. Thus, FACT is particularly useful when functional equivalents need to be identified in evolutionarily distant species, or when functional equivalents are not homologous. The most reliable annotation transfers, however, are achieved when feature architecture similarity and sequence similarity are jointly taken into account.
|
39
|
Kohlmann M, Held L, Grunert VP. Authors' reply. Biom J 2010. [DOI: 10.1002/bimj.201000119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]
|
40
|
Jelizarow M, Guillemot V, Tenenhaus A, Strimmer K, Boulesteix AL. Over-optimism in bioinformatics: an illustration. Bioinformatics 2010; 26:1990-8. [PMID: 20581402 DOI: 10.1093/bioinformatics/btq323] [Citation(s) in RCA: 62] [Impact Index Per Article: 4.4] [Indexed: 12/25/2022]
Abstract
MOTIVATION In statistical bioinformatics research, different optimization mechanisms potentially lead to 'over-optimism' in published papers. So far, however, a systematic critical study concerning the various sources underlying this over-optimism is lacking. RESULTS We present an empirical study on over-optimism using high-dimensional classification as an example. Specifically, we consider a 'promising' new classification algorithm, namely linear discriminant analysis incorporating prior knowledge on gene functional groups through an appropriate shrinkage of the within-group covariance matrix. While this approach yields poor results in terms of error rate, we quantitatively demonstrate that it can artificially seem superior to existing approaches if we 'fish for significance'. The investigated sources of over-optimism include the optimization of datasets, of settings, of competing methods and, most importantly, of the method's characteristics. We conclude that, if the improvement of a quantitative criterion such as the error rate is the main contribution of a paper, the superiority of new algorithms should always be demonstrated on independent validation data. AVAILABILITY The R code and relevant data can be downloaded from http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/020_professuren/boulesteix/overoptimism/, such that the study is completely reproducible.
Affiliation(s)
- Monika Jelizarow
- Department of Medical Informatics, Biometry and Epidemiology, University of Munich, Munich, Germany
|
41
|
Boulesteix AL, Hothorn T. Testing the additional predictive value of high-dimensional molecular data. BMC Bioinformatics 2010; 11:78. [PMID: 20144191 PMCID: PMC2837029 DOI: 10.1186/1471-2105-11-78] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.1] [Received: 09/01/2009] [Accepted: 02/08/2010] [Indexed: 11/17/2022]
Abstract
Background While high-dimensional molecular data such as microarray gene expression data have been used for disease outcome prediction or diagnosis purposes for about ten years in biomedical research, the question of the additional predictive value of such data given that classical predictors are already available has long been under-considered in the bioinformatics literature. Results We suggest an intuitive permutation-based testing procedure for assessing the additional predictive value of high-dimensional molecular data. Our method combines two well-known statistical tools: logistic regression and boosting regression. We give clear advice for the choice of the only method parameter (the number of boosting iterations). In simulations, our novel approach is found to have very good power in different settings, e.g. few strong predictors or many weak predictors. For illustrative purposes, it is applied to two publicly available cancer data sets. Conclusions Our simple and computationally efficient approach can be used to globally assess the additional predictive power of a large number of candidate predictors given that a few clinical covariates or a known prognostic index are already available. It is implemented in the R package "globalboosttest" which is publicly available from R-forge and will be sent to the CRAN as soon as possible.
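The general idea of a permutation test for added predictive value can be sketched in a few lines of Python. This is only an illustration and not the authors' procedure: it assumes a continuous outcome, replaces logistic regression with a one-covariate least-squares fit, and replaces boosting with a simple maximum-association statistic. All function names and the simulated data are hypothetical.

```python
import random

def residuals_after_clinical(y, x):
    """Residuals of y after a least-squares fit on one clinical covariate x."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def max_abs_association(resid, molecular):
    """Largest absolute mean cross-product between residuals and any feature."""
    n = len(resid)
    n_features = len(molecular[0])
    return max(
        abs(sum(resid[i] * molecular[i][j] for i in range(n))) / n
        for j in range(n_features)
    )

def permutation_p_value(y, x, molecular, n_perm=200, seed=0):
    """Permutation p-value for molecular signal beyond the clinical covariate."""
    rng = random.Random(seed)
    resid = residuals_after_clinical(y, x)
    observed = max_abs_association(resid, molecular)
    rows = list(molecular)
    exceed = 0
    for _ in range(n_perm):
        # Shuffling whole rows breaks the link between molecular profiles and
        # patients while keeping the correlation among features intact.
        rng.shuffle(rows)
        if max_abs_association(resid, rows) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)

def simulate(n=60, n_features=20, effect=2.0, seed=1):
    """Hypothetical data: outcome depends on x and on the first molecular feature."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    molecular = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n)]
    y = [x[i] + effect * molecular[i][0] + rng.gauss(0, 1) for i in range(n)]
    return y, x, molecular

y, x, molecular = simulate()
p_signal = permutation_p_value(y, x, molecular)
print("p-value with planted molecular signal:", p_signal)
```

Permuting whole molecular profiles, rather than individual values, is the design choice that matters here: it tests the link to the outcome given the clinical covariate without destroying the dependence structure among the molecular features.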
Affiliation(s)
- Anne-Laure Boulesteix
- Department of Medical Informatics, Biometry and Epidemiology, University of Munich, Marchioninistr 15, D-81377 Munich, Germany.
|
42
|
Boulesteix AL, Strobl C. Optimal classifier selection and negative bias in error rate estimation: an empirical study on high-dimensional prediction. BMC Med Res Methodol 2009; 9:85. [PMID: 20025773 PMCID: PMC2813849 DOI: 10.1186/1471-2288-9-85] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.2] [Received: 10/17/2008] [Accepted: 12/21/2009] [Indexed: 12/21/2022]
Abstract
BACKGROUND In biometric practice, researchers often apply a large number of different methods in a "trial-and-error" strategy to get as much as possible out of their data and, due to publication pressure or pressure from the consulting customer, present only the most favorable results. This strategy may induce a substantial optimistic bias in prediction error estimation, which is quantitatively assessed in the present manuscript. The focus of our work is on class prediction based on high-dimensional data (e.g. microarray data), since such analyses are particularly exposed to this kind of bias. METHODS In our study we consider a total of 124 variants of classifiers (possibly including variable selection or tuning steps) within a cross-validation evaluation scheme. The classifiers are applied to original and modified real microarray data sets, some of which are obtained by randomly permuting the class labels to mimic non-informative predictors while preserving their correlation structure. RESULTS We assess the minimal misclassification rate over the different variants of classifiers in order to quantify the bias arising when the optimal classifier is selected a posteriori in a data-driven manner. The bias resulting from the parameter tuning (including gene selection parameters as a special case) and the bias resulting from the choice of the classification method are examined both separately and jointly. CONCLUSIONS The median minimal error rate over the investigated classifiers was as low as 31% and 41% based on permuted uninformative predictors from studies on colon cancer and prostate cancer, respectively. We conclude that the strategy to present only the optimal result is not acceptable because it yields a substantial bias in error rate estimation, and suggest alternative approaches for properly reporting classification accuracy.
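The optimistic bias from reporting only the best of many classifiers can be illustrated with a small simulation. The sketch below is not the study's design: it assumes every classifier has a true error of 0.5 (uninformative predictors) and that error estimates are independent across classifiers, which real classifiers evaluated on one dataset are not. The function names and parameter values are hypothetical.

```python
import random

def min_estimated_error(n_rules, n_samples, true_error, rng):
    """Smallest estimated error among rules that all share the same true error."""
    estimates = []
    for _ in range(n_rules):
        # Each rule misclassifies each sample independently with prob. true_error.
        errors = sum(rng.random() < true_error for _ in range(n_samples))
        estimates.append(errors / n_samples)
    return min(estimates)

def mean_min_error(n_rules, n_samples=50, true_error=0.5, n_repeats=2000, seed=0):
    """Average, over repeated datasets, of the minimum estimated error."""
    rng = random.Random(seed)
    total = sum(
        min_estimated_error(n_rules, n_samples, true_error, rng)
        for _ in range(n_repeats)
    )
    return total / n_repeats

single = mean_min_error(n_rules=1)         # one pre-specified classifier
best_of_many = mean_min_error(n_rules=20)  # best of 20 chosen after the fact
print("mean estimated error, single rule:", round(single, 3))
print("mean minimum error over 20 rules: ", round(best_of_many, 3))
```

Under these assumptions a single pre-specified rule is estimated at about its true error, while the minimum over 20 rules falls clearly below 0.5 even though no rule carries any information.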
Affiliation(s)
- Anne-Laure Boulesteix
- Department of Statistics, University of Munich, Ludwigstr 33, D-80539 Munich, Germany
- Sylvia Lawry Centre for Multiple Sclerosis Research, Hohenlindenerstr 1, D-81677 Munich, Germany
- Department of Medical Informatics, Biometry and Epidemiology, University of Munich, Marchioninistr 15, D-81377 Munich, Germany
- Carolin Strobl
- Department of Statistics, University of Munich, Ludwigstr 33, D-80539 Munich, Germany
|