1
Unleashing high content screening in hit detection – benchmarking AI workflows including novelty detection. Comput Struct Biotechnol J 2022; 20:5453-5465. [PMID: 36212538 PMCID: PMC9530837 DOI: 10.1016/j.csbj.2022.09.023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2022] [Revised: 09/16/2022] [Accepted: 09/16/2022] [Indexed: 11/22/2022]
Abstract
Complex mixtures containing natural products are still an interesting source of novel drug candidates. High content screening (HCS) is a popular tool to screen for such candidates. In particular, multiplexed HCS assays promise comprehensive bioactivity profiles, but they also generate large amounts of data. Yet only a few machine learning (ML) applications for data analysis are available, and these usually require a profound knowledge of the underlying cell biology. Unfortunately, there are no applications that simply predict whether samples are biologically active or not (any kind of bioactivity). Within this work, we benchmark ML algorithms for binary classification, starting with classical ML models, which are the standard classifiers of the scikit-learn library or ensemble models of these classifiers (a total of 92 models tested), followed by a partial least squares regression (PLSR)-based classification (44 tested models in total) and simple artificial neural networks (ANNs) with dense layers (72 tested models in total). In addition, novelty detection (ND), which is supposed to handle unknown patterns, was examined. For the final analysis, the models, with and without upstream ND, were tested with two independent data sets. In our analysis, a stacking model, an ensemble model of classical ML algorithms, performed best at predicting new and unknown data. ND improved the predictions of the models and was useful for handling unknown patterns. Importantly, the classifier presented here can be easily rebuilt and adapted to the data and demands of other groups. The hit detector (ND + stacking model) is universal and suitable for broader application to support the search for new drug candidates.
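The idea of placing novelty detection upstream of a binary classifier can be sketched as follows. This is only an illustration under stated assumptions, not the authors' hit detector: the distance-based gate stands in for their ND step, `toy_classifier` is a placeholder for a trained stacking model, and the function names, the 0.95 quantile, and the two-feature toy data are all hypothetical.

```python
import math
from statistics import mean, stdev


def fit_novelty_gate(train_rows, quantile=0.95):
    """Learn feature scaling and a distance cutoff from known samples."""
    columns = list(zip(*train_rows))
    mus = [mean(col) for col in columns]
    # Guard against constant features: fall back to a scale of 1.0.
    sigmas = [stdev(col) or 1.0 for col in columns]

    def distance(row):
        # Euclidean distance from the training centroid in z-score units.
        return math.sqrt(
            sum(((x - m) / s) ** 2 for x, m, s in zip(row, mus, sigmas))
        )

    train_dists = sorted(distance(row) for row in train_rows)
    index = min(len(train_dists) - 1, int(quantile * len(train_dists)))
    return distance, train_dists[index]


def detect_hit(row, distance, cutoff, classifier):
    """Return 'novel' for samples outside the training range, else the classifier's call."""
    if distance(row) > cutoff:
        return "novel"
    return "active" if classifier(row) else "inactive"


# Toy data: two hypothetical features per sample (assumption, not from the paper).
train = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0), (0.5, 0.5)]
distance, cutoff = fit_novelty_gate(train)
toy_classifier = lambda row: sum(row) > 0.9  # placeholder for a trained model
```

Samples flagged as "novel" never reach the classifier, which is one plausible way an upstream ND step could keep unknown patterns from being forced into the active/inactive decision.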
2
Chandrasekaran SN, Ceulemans H, Boyd JD, Carpenter AE. Image-based profiling for drug discovery: due for a machine-learning upgrade? Nat Rev Drug Discov 2021; 20:145-159. [PMID: 33353986 PMCID: PMC7754181 DOI: 10.1038/s41573-020-00117-w] [Citation(s) in RCA: 133] [Impact Index Per Article: 44.3] [Accepted: 10/13/2020] [Indexed: 12/20/2022]
Abstract
Image-based profiling is a maturing strategy by which the rich information present in biological images is reduced to a multidimensional profile, a collection of extracted image-based features. These profiles can be mined for relevant patterns, revealing unexpected biological activity that is useful for many steps in the drug discovery process. Such applications include identifying disease-associated screenable phenotypes, understanding disease mechanisms and predicting a drug's activity, toxicity or mechanism of action. Several of these applications have been recently validated and have moved into production mode within academia and the pharmaceutical industry. Some of these have yielded disappointing results in practice but are now of renewed interest due to improved machine-learning strategies that better leverage image-based information. Although challenges remain, novel computational technologies such as deep learning and single-cell methods that better capture the biological information in images hold promise for accelerating drug discovery.
Affiliation(s)
- Hugo Ceulemans
- Discovery Data Sciences, Janssen Pharmaceutica NV, Beerse, Belgium
- Justin D Boyd
- High Content Imaging Technology Center, Internal Medicine Research Unit, Pfizer Inc., Cambridge, MA, USA
- Anne E Carpenter
- Imaging Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA
3
Boyd J, Fennell M, Carpenter A. Harnessing the power of microscopy images to accelerate drug discovery: what are the possibilities? Expert Opin Drug Discov 2020; 15:639-642. [PMID: 32200648 DOI: 10.1080/17460441.2020.1743675] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 10/24/2022]
Affiliation(s)
- Justin Boyd
- Internal Medicines Research Unit, Pfizer Inc., Cambridge, MA, USA
- Myles Fennell
- Neuroscience and Platform Biology, Arvinas, New Haven, CT, USA
- Anne Carpenter
- Imaging Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA
4
Warchal SJ, Dawson JC, Carragher NO. High-Dimensional Profiling: The Theta Comparative Cell Scoring Method. Methods Mol Biol 2018; 1787:171-181. [PMID: 29736718 DOI: 10.1007/978-1-4939-7847-2_13] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 01/22/2023]
Abstract
Principal component analysis enables dimensional reduction of multivariate datasets that are typical in high-content screening. A common analysis utilizing principal components is a distance measurement between a perturbagen (such as a small-molecule treatment or shRNA knockdown) and a negative control. This method works well to identify active perturbagens, though it cannot discern between distinct phenotypic responses. Here, we describe an extension of the principal component analysis approach to multivariate high-content screening data to enable quantification of differences in direction in principal component space. The theta comparative cell scoring method can identify and quantify differential phenotypic responses between panels of cell lines to small-molecule treatment to support in vitro pharmacogenomics and drug mechanism-of-action studies.
Affiliation(s)
- Scott J Warchal
- Cancer Research UK Edinburgh Centre, Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, UK
- John C Dawson
- Cancer Research UK Edinburgh Centre, Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, UK
- Neil O Carragher
- Cancer Research UK Edinburgh Centre, Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, UK
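The contrast drawn in the abstract above, distance from the negative control versus direction in principal component space, can be illustrated with a minimal sketch. It assumes responses are already projected onto principal components and centred on the negative control; it computes the plain angle between two vectors and does not reproduce the published scoring method. The function names and example vectors are hypothetical.

```python
import math


def distance_from_control(u):
    """Magnitude of a response, with the negative control at the origin."""
    return math.sqrt(sum(a * a for a in u))


def theta_degrees(u, v):
    """Angle in degrees between two response vectors in PC space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = distance_from_control(u)
    norm_v = distance_from_control(v)
    if norm_u == 0 or norm_v == 0:
        raise ValueError("a response identical to the control has no direction")
    # Clamp to [-1, 1] to absorb floating-point rounding before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


# Hypothetical responses of one compound in two cell lines (first two PCs).
response_a = (3.0, 0.0)
response_b = (0.0, 3.0)
```

Both example responses lie equally far from the control, so a distance score alone would rate them the same, while the angle separates them as different phenotypic directions.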
5
Marklein RA, Lam J, Guvendiren M, Sung KE, Bauer SR. Functionally-Relevant Morphological Profiling: A Tool to Assess Cellular Heterogeneity. Trends Biotechnol 2018; 36:105-118. [DOI: 10.1016/j.tibtech.2017.10.007] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Received: 08/09/2017] [Revised: 10/11/2017] [Accepted: 10/18/2017] [Indexed: 12/16/2022]
6
Warchal SJ, Dawson JC, Carragher NO. Development of the Theta Comparative Cell Scoring Method to Quantify Diverse Phenotypic Responses Between Distinct Cell Types. Assay Drug Dev Technol 2017; 14:395-406. [PMID: 27552144 PMCID: PMC5015429 DOI: 10.1089/adt.2016.730] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Indexed: 11/12/2022]
Abstract
In this article, we have developed novel data visualization tools and a Theta comparative cell scoring (TCCS) method, which supports high-throughput in vitro pharmacogenomic studies across diverse cellular phenotypes measured by multiparametric high-content analysis. The TCCS method provides a univariate descriptor of divergent compound-induced phenotypic responses between distinct cell types, which can be used for correlation with genetic, epigenetic, and proteomic datasets to support the identification of biomarkers and further elucidate drug mechanism-of-action. Application of these methods to compound profiling across high-content assays incorporating well-characterized cells representing known molecular subtypes of disease supports the development of personalized healthcare strategies without prior knowledge of a drug target. We present proof-of-principle data quantifying distinct phenotypic responses between eight breast cancer cell lines representing four disease subclasses. The TCCS method, together with new advances in next-generation sequencing, induced pluripotent stem cell technology, gene editing, and high-content phenotypic screening, is well placed to advance the identification of predictive biomarkers and personalized medicine approaches across a broader range of disease types and therapeutic classes.
Affiliation(s)
- Scott J Warchal
- Institute of Genetics and Molecular Medicine, Cancer Research UK Edinburgh Centre, University of Edinburgh, Edinburgh, United Kingdom
- John C Dawson
- Institute of Genetics and Molecular Medicine, Cancer Research UK Edinburgh Centre, University of Edinburgh, Edinburgh, United Kingdom
- Neil O Carragher
- Institute of Genetics and Molecular Medicine, Cancer Research UK Edinburgh Centre, University of Edinburgh, Edinburgh, United Kingdom
7
Moutsatsos IK, Hossain I, Agarinis C, Harbinski F, Abraham Y, Dobler L, Zhang X, Wilson CJ, Jenkins JL, Holway N, Tallarico J, Parker CN. Jenkins-CI, an Open-Source Continuous Integration System, as a Scientific Data and Image-Processing Platform. SLAS Discov 2016; 22:238-249. [PMID: 27899692 PMCID: PMC5322829 DOI: 10.1177/1087057116679993] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Indexed: 01/23/2023]
Abstract
High-throughput screening generates large volumes of heterogeneous data that require a diverse set of computational tools for management, processing, and analysis. Building integrated, scalable, and robust computational workflows for such applications is challenging but highly valuable. Scientific data integration and pipelining facilitate standardized data processing, collaboration, and reuse of best practices. We describe how Jenkins-CI, an “off-the-shelf,” open-source, continuous integration system, is used to build pipelines for processing images and associated data from high-content screening (HCS). Jenkins-CI provides numerous plugins for standard compute tasks, and its design allows the quick integration of external scientific applications. Using Jenkins-CI, we integrated CellProfiler, an open-source image-processing platform, with various HCS utilities and a high-performance Linux cluster. The platform is web-accessible, facilitates access and sharing of high-performance compute resources, and automates previously cumbersome data and image-processing tasks. Imaging pipelines developed using the desktop CellProfiler client can be managed and shared through a centralized Jenkins-CI repository. Pipelines and managed data are annotated to facilitate collaboration and reuse. Limitations with Jenkins-CI (primarily around the user interface) were addressed through the selection of helper plugins from the Jenkins-CI community.
Affiliation(s)
- Imtiaz Hossain
- Centre for Proteomic Chemistry, NIBR, Postfach, Basel, Switzerland
- Claudia Agarinis
- Developmental and Molecular Pathways, NIBR, Postfach, Basel, Switzerland
- Yann Abraham
- The Janssen Pharmaceutical Companies of Johnson & Johnson, Beerse, Vlaanderen, Belgium
- Luc Dobler
- République et Canton du Jura, Switzerland
- Xian Zhang
- Centre for Proteomic Chemistry, NIBR, Postfach, Basel, Switzerland
- Jeremy L Jenkins
- Developmental and Molecular Pathways, NIBR, Cambridge, MA, USA
- Nicholas Holway
- Scientific Computing, NIBR Informatics, Novartis, Postfach, Basel, Switzerland
- John Tallarico
- Developmental and Molecular Pathways, NIBR, Cambridge, MA, USA
- Christian N Parker
- Developmental and Molecular Pathways, NIBR, Postfach, Basel, Switzerland
8
Montoya M, Dorval T, Bickle M. SLAS Europe High-Content Screening Conference in Dresden: A Glimpse of the Future? J Biomol Screen 2016; 21:883-6. [PMID: 27650790 DOI: 10.1177/1087057116662825] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Affiliation(s)
- Maria Montoya
- Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
- Thierry Dorval
- Biotechnology Chemical-Biology, Institut de Recherches Servier, Croissy-sur-Seine, France
- Marc Bickle
- Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
9
Gubler H. High-Throughput Screening Data Analysis. Nonclinical Statistics for Pharmaceutical and Biotechnology Industries 2016. [DOI: 10.1007/978-3-319-23558-5_5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 12/13/2022]
10
Smith K, Horvath P. Active Learning Strategies for Phenotypic Profiling of High-Content Screens. J Biomol Screen 2014; 19:685-95. [DOI: 10.1177/1087057114527313] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.6] [Received: 10/30/2013] [Accepted: 02/17/2014] [Indexed: 12/20/2022]
Abstract
High-content screening is a powerful method to discover new drugs and carry out basic biological research. Increasingly, high-content screens have come to rely on supervised machine learning (SML) to perform automatic phenotypic classification as an essential step of the analysis. However, this comes at a cost, namely, the labeled examples required to train the predictive model. Classification performance increases with the number of labeled examples, and because labeling examples demands time from an expert, the training process represents a significant time investment. Active learning strategies attempt to overcome this bottleneck by presenting the most relevant examples to the annotator, thereby achieving high accuracy while minimizing the cost of obtaining labeled data. In this article, we investigate the impact of active learning on single-cell–based phenotype recognition, using data from three large-scale RNA interference high-content screens representing diverse phenotypic profiling problems. We consider several combinations of active learning strategies and popular SML methods. Our results show that active learning significantly reduces the time cost and can be used to reveal the same phenotypic targets identified using SML. We also identify combinations of active learning strategies and SML methods which perform better than others on the phenotypic profiling problems we studied.
Affiliation(s)
- Kevin Smith
- Light Microscopy and Screening Centre, ETH Zurich, Switzerland
- Peter Horvath
- Institute of Biochemistry, ETH Zurich, Switzerland
- Synthetic and Systems Biology Unit, Biological Research Center, Szeged, Hungary
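The core step of active learning described in the abstract above, choosing which examples to show the annotator, can be sketched with uncertainty sampling. This is one common strategy chosen here for illustration; the abstract does not say which strategies the authors compared. The function name and the example probabilities are hypothetical.

```python
def select_for_labeling(probabilities, batch_size):
    """Return indices of the unlabeled cells the model is least sure about.

    `probabilities` holds the model's predicted probability of the positive
    phenotype for each unlabeled cell; values near 0.5 are the most uncertain.
    """
    by_uncertainty = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return by_uncertainty[:batch_size]


# Hypothetical classifier outputs for five unlabeled cells.
predicted = [0.95, 0.52, 0.10, 0.45, 0.70]
queue = select_for_labeling(predicted, batch_size=2)
```

In a full loop, the annotator would label the queued cells, the classifier would be retrained, and the selection would be repeated until accuracy stops improving.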
11
Abraham Y, Zhang X, Parker CN. Multiparametric Analysis of Screening Data: Growing Beyond the Single Dimension to Infinity and Beyond. J Biomol Screen 2014; 19:628-39. [PMID: 24598104 DOI: 10.1177/1087057114524987] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.2] [Received: 08/22/2013] [Accepted: 01/14/2014] [Indexed: 11/16/2022]
Abstract
Advances in instrumentation now allow the development of screening assays that are capable of monitoring multiple readouts such as transcript or protein levels, or even multiple parameters derived from images. Such advances in assay technologies highlight the complex nature of biology and disease. Harnessing this complexity requires integration of all the different parameters that can be measured rather than just monitoring a single dimension as is commonly used. Although some of the methods used to combine multiple measurements, such as principal component analysis, are commonly used for microarray analysis, biologists are not yet using many of the tools that have been developed in other fields to address such issues. Visualization of multiparametric data sets is one of the major challenges in this field, and a depiction of the results in a manner that can be readily interpreted is essential. This article describes a number of assay systems being used to generate such data sets en masse, and the methods being applied to their visualization and analysis. We also discuss some of the challenges of applying methods developed in other fields to biology.
Affiliation(s)
- Yann Abraham
- Novartis Institute for Biomedical Research, Basel, Switzerland
- Xian Zhang
- Novartis Institute for Biomedical Research, Basel, Switzerland
12
Azegrouz H, Karemore G, Torres A, Alaíz CM, Gonzalez AM, Nevado P, Salmerón A, Pellinen T, del Pozo MA, Dorronsoro JR, Montoya MC. Cell-based fuzzy metrics enhance high-content screening (HCS) assay robustness. J Biomol Screen 2013; 18:1270-83. [PMID: 24045580 DOI: 10.1177/1087057113501554] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 01/01/2023]
Abstract
High-content screening (HCS) allows the exploration of complex cellular phenotypes by automated microscopy and is increasingly being adopted for small interfering RNA genomic screening and phenotypic drug discovery. We introduce a series of cell-based evaluation metrics that have been implemented and validated in a mono-parametric HCS for regulators of the membrane trafficking protein caveolin 1 (CAV1) and have also proved useful for the development of a multiparametric phenotypic HCS for regulators of cytoskeletal reorganization. Imaging metrics evaluate imaging quality such as staining and focus, whereas cell biology metrics are fuzzy logic-based evaluators describing complex biological parameters such as sparseness, confluency, and spreading. The evaluation metrics were implemented in a data-mining pipeline, which first filters out cells that do not pass a quality criterion based on imaging metrics and then uses cell biology metrics to stratify cell samples to allow further analysis of homogeneous cell populations. Use of these metrics significantly improved the robustness of the monoparametric assay tested, as revealed by an increase in Z' factor, Kolmogorov-Smirnov distance, and strict standard mean difference. Cell biology evaluation metrics were also implemented in a novel supervised learning classification method that combines them with phenotypic features in a statistical model that exceeded conventional classification methods, thus improving multiparametric phenotypic assay sensitivity.
Affiliation(s)
- Hind Azegrouz
- Cellomics Unit, Department of Vascular Biology and Inflammation, Centro Nacional de Investigaciones Cardiovasculares (CNIC), Madrid, Spain
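Two of the robustness statistics named in the abstract above can be computed from control wells as in the sketch below. It uses the standard Z' factor definition, and it assumes that the abstract's "strict standard mean difference" refers to the strictly standardized mean difference (SSMD) for independent groups. The function names and well values are hypothetical.

```python
import math
from statistics import mean, stdev


def z_prime(positive, negative):
    """Z' factor: 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    separation = abs(mean(positive) - mean(negative))
    return 1.0 - 3.0 * (stdev(positive) + stdev(negative)) / separation


def ssmd(positive, negative):
    """SSMD for independent groups: mean difference over pooled spread."""
    spread = math.sqrt(stdev(positive) ** 2 + stdev(negative) ** 2)
    return (mean(positive) - mean(negative)) / spread


# Hypothetical readouts from positive- and negative-control wells.
positive_wells = [98.0, 100.0, 102.0]
negative_wells = [8.0, 10.0, 12.0]
```

Both statistics grow when the controls move further apart or become less variable, which is why filtering out poor-quality cells before scoring can raise them.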
13
Reisen F, Zhang X, Gabriel D, Selzer P. Benchmarking of Multivariate Similarity Measures for High-Content Screening Fingerprints in Phenotypic Drug Discovery. J Biomol Screen 2013; 18:1284-97. [DOI: 10.1177/1087057113501390] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Indexed: 01/13/2023]
Abstract
High-content screening (HCS) is a powerful tool for drug discovery being capable of measuring cellular responses to chemical disturbance in a high-throughput manner. HCS provides an image-based readout of cellular phenotypes, including features such as shape, intensity, or texture in a highly multiplexed and quantitative manner. The corresponding feature vectors can be used to characterize phenotypes and are thus defined as HCS fingerprints. Systematic analyses of HCS fingerprints allow for objective computational comparisons of cellular responses. Such comparisons therefore facilitate the detection of different compounds with different phenotypic outcomes from high-throughput HCS campaigns. Feature selection methods and similarity measures, as a basis for phenotype identification and clustering, are critical for the quality of such computational analyses. We systematically evaluated 16 different similarity measures in combination with linear and nonlinear feature selection methods for their potential to capture biologically relevant image features. Nonlinear correlation-based similarity measures such as Kendall’s τ and Spearman’s ρ perform well in most evaluation scenarios, outperforming other frequently used metrics (such as the Euclidean distance). We also present four novel modifications of the connectivity map similarity that surpass the original version in our experiments. This study provides a basis for generic phenotypic analysis in future HCS campaigns.
Affiliation(s)
- Felix Reisen
- Novartis Institutes for Biomedical Research, Center for Proteomic Chemistry, Basel, Switzerland
- Xian Zhang
- Novartis Institutes for Biomedical Research, Center for Proteomic Chemistry, Basel, Switzerland
- Daniela Gabriel
- Novartis Institutes for Biomedical Research, Center for Proteomic Chemistry, Basel, Switzerland
- Paul Selzer
- Novartis Institutes for Biomedical Research, Center for Proteomic Chemistry, Basel, Switzerland
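To see why a rank-based measure can behave differently from the Euclidean distance on HCS fingerprints, consider the small sketch below. It implements Spearman's ρ as the Pearson correlation of average ranks and compares it with the Euclidean distance on two made-up fingerprints. It is an illustration only and does not reproduce the paper's benchmark; all names and values are hypothetical.

```python
import math


def ranks(values):
    """Average ranks (1-based), with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Two hypothetical fingerprints with the same feature ordering but different scale.
fingerprint_a = [0.1, 0.4, 0.2, 0.9]
fingerprint_b = [1.0, 8.0, 3.0, 50.0]
```

The two fingerprints rank their features identically, so Spearman's ρ treats them as the same phenotype pattern, whereas the Euclidean distance is dominated by the difference in scale.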
14
Kostrominova TY, Reiner DS, Haas RH, Ingermanson R, McDonough PM. Automated methods for the analysis of skeletal muscle fiber size and metabolic type. Int Rev Cell Mol Biol 2013; 306:275-332. [PMID: 24016528 DOI: 10.1016/b978-0-12-407694-5.00007-9] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Indexed: 01/05/2023]
Abstract
It is of interest to quantify the size, shape, and metabolic subtype of skeletal muscle fibers in many areas of biomedical research. To do so, skeletal muscle samples are sectioned transversely to the length of the muscle and labeled for extracellular or membrane proteins to delineate the fiber boundaries and additionally for biomarkers related to function or metabolism. The samples are digitally photographed and the fibers "outlined" for quantification of fiber cross-sectional area (CSA) using pointing devices interfaced to a computer, which is tedious, prone to error, and can be nonobjective. Here, we review methods for characterizing skeletal muscle fibers and describe new automated techniques, which rapidly quantify CSA and biomarkers. We discuss the applications of these methods to the characterization of mitochondrial dysfunctions, which underlie a variety of human afflictions, and we present a novel approach, utilizing images from the online Human Protein Atlas to predict relationships between fiber-specific protein expression, function, and metabolism.