1
Suzuki Y, Ménager H, Brancotte B, Vernet R, Nerin C, Boetto C, Auvergne A, Linhard C, Torchet R, Lechat P, Troubat L, Cho MH, Bouzigon E, Aschard H, Julienne H. Trait selection strategy in multi-trait GWAS: Boosting SNP discoverability. HGG ADVANCES 2024; 5:100319. [PMID: 38872309 PMCID: PMC11260573 DOI: 10.1016/j.xhgg.2024.100319] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2024] [Revised: 06/11/2024] [Accepted: 06/11/2024] [Indexed: 06/15/2024] Open
Abstract
Since the first genome-wide association studies (GWASs), thousands of variant-trait associations have been discovered. However, comprehensively mapping the genetic determinants of complex traits through univariate testing can require prohibitive sample sizes. Multi-trait GWAS can circumvent this issue and improve statistical power by leveraging the joint genetic architecture of human phenotypes. Although many methodological hurdles of multi-trait testing have been solved, the strategy to select traits has been overlooked. In this study, we conducted multi-trait GWAS on approximately 20,000 combinations of 72 traits using an omnibus test as implemented in the Joint Analysis of Summary Statistics. We assessed which genetic features of the sets of traits analyzed were associated with an increased detection of variants compared with univariate screening. Several features of the set of traits, including the heritability, the number of traits, and the genetic correlation, drive the multi-trait test gain. Using these features jointly in predictive models captures a large fraction of the power gain of the multi-trait test (Pearson's r between the observed and predicted gain equals 0.43, p < 1.6 × 10⁻⁶⁰). Applying an alternative multi-trait approach (Multi-Trait Analysis of GWAS), we identified similar features of interest, but with an overall 70% lower number of new associations. Finally, selecting sets based on our data-driven models systematically outperformed the common strategy of selecting clinically similar traits. This work provides a unique picture of the determinants of multi-trait GWAS statistical power and outlines practical strategies for multi-trait testing.
Affiliation(s)
- Yuka Suzuki
  - Institut Pasteur, Université Paris Cité, Department of Computational Biology, 75015 Paris, France
- Hervé Ménager
  - Institut Pasteur, Université Paris Cité, Bioinformatics of Biostatistics Hub, 75015 Paris, France
- Bryan Brancotte
  - Institut Pasteur, Université Paris Cité, Bioinformatics of Biostatistics Hub, 75015 Paris, France
- Raphaël Vernet
  - Université Paris Cité, Institut National de la Santé et de la Recherche Médicale (INSERM), UMR-1124, Group of Genomic Epidemiology of Multifactorial Diseases, Paris, France
- Cyril Nerin
  - Institut Pasteur, Université Paris Cité, Department of Computational Biology, 75015 Paris, France
- Christophe Boetto
  - Institut Pasteur, Université Paris Cité, Department of Computational Biology, 75015 Paris, France
- Antoine Auvergne
  - Institut Pasteur, Université Paris Cité, Department of Computational Biology, 75015 Paris, France
- Christophe Linhard
  - Université Paris Cité, Institut National de la Santé et de la Recherche Médicale (INSERM), UMR-1124, Group of Genomic Epidemiology of Multifactorial Diseases, Paris, France
- Rachel Torchet
  - Institut Pasteur, Université Paris Cité, Bioinformatics of Biostatistics Hub, 75015 Paris, France
- Pierre Lechat
  - Institut Pasteur, Université Paris Cité, Bioinformatics of Biostatistics Hub, 75015 Paris, France
- Lucie Troubat
  - Institut Pasteur, Université Paris Cité, Department of Computational Biology, 75015 Paris, France
- Michael H Cho
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Harvard Medical School, 181 Longwood Avenue, Boston, MA 02115, USA
  - Division of Pulmonary and Critical Care Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Emmanuelle Bouzigon
  - Université Paris Cité, Institut National de la Santé et de la Recherche Médicale (INSERM), UMR-1124, Group of Genomic Epidemiology of Multifactorial Diseases, Paris, France
- Hugues Aschard
  - Institut Pasteur, Université Paris Cité, Department of Computational Biology, 75015 Paris, France
- Hanna Julienne
  - Institut Pasteur, Université Paris Cité, Department of Computational Biology, 75015 Paris, France
  - Institut Pasteur, Université Paris Cité, Bioinformatics of Biostatistics Hub, 75015 Paris, France
2
Suzuki Y, Ménager H, Brancotte B, Vernet R, Nerin C, Boetto C, Auvergne A, Linhard C, Torchet R, Lechat P, Troubat L, Cho MH, Bouzigon E, Aschard H, Julienne H. Trait selection strategy in multi-trait GWAS: Boosting SNPs discoverability. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.10.27.564319. [PMID: 37961722 PMCID: PMC10634875 DOI: 10.1101/2023.10.27.564319] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2023]
Abstract
Since the first Genome-Wide Association Studies (GWAS), thousands of variant-trait associations have been discovered. However, the sample size required to detect additional variants using standard univariate association screening is increasingly prohibitive. Multi-trait GWAS offers a relevant alternative: it can improve statistical power and lead to new insights about gene function and the joint genetic architecture of human phenotypes. Although many methodological hurdles of multi-trait testing have been discussed, the strategy to select traits, among overwhelming possibilities, has been overlooked. In this study, we conducted extensive multi-trait tests using JASS (Joint Analysis of Summary Statistics) and assessed which genetic features of the analysed sets were associated with an increased detection of variants as compared to univariate screening. Our analyses identified multiple factors associated with the gain in association detection in multi-trait tests. Together, these factors of the analysed sets are predictive of the gain of the multi-trait test (Pearson's ρ equal to 0.43 between the observed and predicted gain, P < 1.6 × 10⁻⁶⁰). Applying an alternative multi-trait approach (MTAG, multi-trait analysis of GWAS), we found that in most scenarios, but particularly those with larger numbers of traits, JASS outperformed MTAG. Finally, we benchmarked several strategies to select sets of traits, including the prevalent strategy of selecting clinically similar traits, which systematically underperformed selecting clinically heterogeneous traits or selecting sets derived from our data-driven models. This work provides a unique picture of the determinants of multi-trait GWAS statistical power and outlines practical strategies for multi-trait testing.
Affiliation(s)
- Yuka Suzuki
  - Institut Pasteur, Université Paris Cité, Department of Computational Biology, Paris, 75015 France
- Hervé Ménager
  - Institut Pasteur, Université Paris Cité, Bioinformatics of Biostatistics Hub, F-75015 Paris, France
- Bryan Brancotte
  - Institut Pasteur, Université Paris Cité, Bioinformatics of Biostatistics Hub, F-75015 Paris, France
- Raphaël Vernet
  - Université Paris Cité, Institut National de la Santé et de la Recherche Médicale (INSERM), UMR-1124, Group of Genomic Epidemiology of Multifactorial Diseases, Paris, France
- Cyril Nerin
  - Institut Pasteur, Université Paris Cité, Department of Computational Biology, Paris, 75015 France
- Christophe Boetto
  - Institut Pasteur, Université Paris Cité, Department of Computational Biology, Paris, 75015 France
- Antoine Auvergne
  - Institut Pasteur, Université Paris Cité, Department of Computational Biology, Paris, 75015 France
- Christophe Linhard
  - Université Paris Cité, Institut National de la Santé et de la Recherche Médicale (INSERM), UMR-1124, Group of Genomic Epidemiology of Multifactorial Diseases, Paris, France
- Rachel Torchet
  - Institut Pasteur, Université Paris Cité, Bioinformatics of Biostatistics Hub, F-75015 Paris, France
- Pierre Lechat
  - Institut Pasteur, Université Paris Cité, Bioinformatics of Biostatistics Hub, F-75015 Paris, France
- Lucie Troubat
  - Institut Pasteur, Université Paris Cité, Department of Computational Biology, Paris, 75015 France
- Michael H. Cho
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, 181 Longwood Ave, Boston, MA, 02115, USA
  - Division of Pulmonary and Critical Care Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA
- Emmanuelle Bouzigon
  - Université Paris Cité, Institut National de la Santé et de la Recherche Médicale (INSERM), UMR-1124, Group of Genomic Epidemiology of Multifactorial Diseases, Paris, France
- Hugues Aschard
  - Institut Pasteur, Université Paris Cité, Department of Computational Biology, Paris, 75015 France
- Hanna Julienne
  - Institut Pasteur, Université Paris Cité, Department of Computational Biology, Paris, 75015 France
  - Institut Pasteur, Université Paris Cité, Bioinformatics of Biostatistics Hub, F-75015 Paris, France
3
Durai P, Lee SJ, Lee JW, Pan CH, Park K. Iterative machine learning-based chemical similarity search to identify novel chemical inhibitors. J Cheminform 2023; 15:86. [PMID: 37742003 PMCID: PMC10517535 DOI: 10.1186/s13321-023-00760-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2023] [Accepted: 09/12/2023] [Indexed: 09/25/2023] Open
Abstract
Machine learning-based chemical screening has made substantial progress in recent years. However, these predictions often have low accuracy and high uncertainty when identifying new active chemical scaffolds. Hence, a high proportion of retrieved compounds are not structurally novel. In this study, we proposed a strategy to address this issue by iteratively optimizing an evolutionary chemical binding similarity (ECBS) model using experimental validation data. Various data update and model retraining schemes were tested to efficiently incorporate new experimental data into ECBS models, resulting in a fine-tuned ECBS model with improved accuracy and coverage. To demonstrate the effectiveness of our approach, we identified the novel hit molecules for the mitogen-activated protein kinase kinase 1 (MEK1). These molecules showed sub-micromolar affinity (Kd 0.1-5.3 μM) to MEKs and were distinct from previously-known MEK1 inhibitors. We also determined the binding specificity of different MEK isoforms and proposed potential docking models. Furthermore, using de novo drug design tools, we utilized one of the new MEK inhibitors to generate additional drug-like molecules with improved binding scores. This resulted in the identification of several potential MEK1 inhibitors with better binding affinity scores. Our results demonstrated the potential of this approach for identifying novel hit molecules and optimizing their binding affinities.
Affiliation(s)
- Prasannavenkatesh Durai
  - Natural Product Informatics Research Center, Korea Institute of Science and Technology, Gangneung, 25451, Republic of Korea
- Sue Jung Lee
  - Natural Product Research Center, Korea Institute of Science and Technology, Gangneung, 25451, Republic of Korea
- Jae Wook Lee
  - Natural Product Research Center, Korea Institute of Science and Technology, Gangneung, 25451, Republic of Korea
- Cheol-Ho Pan
  - Natural Product Informatics Research Center, Korea Institute of Science and Technology, Gangneung, 25451, Republic of Korea
- Keunwan Park
  - Natural Product Informatics Research Center, Korea Institute of Science and Technology, Gangneung, 25451, Republic of Korea
  - Department of YM-KIST Bio-Health Convergence, Yonsei University, Wonju, 26493, Republic of Korea
4
Borges-Miranda A, Silva-Mata FJ, Talavera-Bustamante I, Jiménez-Chacón J, Álvarez-Prieto M, Pérez-Martínez CS. The role of chemosensory relationships to improve raw materials’ selection for Premium cigar manufacture. CHEMICAL PAPERS 2021. [DOI: 10.1007/s11696-021-01577-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
5
Cozzolino D. From consumers' science to food functionality-Challenges and opportunities for vibrational spectroscopy. ADVANCES IN FOOD AND NUTRITION RESEARCH 2021; 97:119-146. [PMID: 34311898 DOI: 10.1016/bs.afnr.2021.03.002] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/29/2022]
Abstract
Currently available methods used to measure or estimate the composition, functionality, and sensory properties of foods and food ingredients are destructive and time consuming. Therefore, new approaches are required by both the food industry and R&D organizations. Recent years have witnessed steady growth in the application of vibrational spectroscopy techniques [near infrared (NIR), mid infrared (MIR), Raman] to analyse or estimate several properties in a wide range of foods and food ingredients. This chapter provides an overview of vibrational spectroscopy techniques, the combination of these techniques with multivariate data analysis, and examples of the use of these techniques to measure composition and functional properties in a wide range of foods.
Affiliation(s)
- Daniel Cozzolino
  - Centre for Nutrition and Food Sciences, Queensland Alliance for Agriculture and Food Innovation (QAAFI), The University of Queensland, Brisbane, QLD, Australia
6
Wang T, Liu M, Huang S, Yuan H, Zhao J, Chen J. Surface-enhanced Raman spectroscopy method for classification of doxycycline hydrochloride and tylosin in duck meat using gold nanoparticles. Poult Sci 2021; 100:101165. [PMID: 33975036 PMCID: PMC8131734 DOI: 10.1016/j.psj.2021.101165] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 10/20/2020] [Revised: 03/19/2021] [Accepted: 03/22/2021] [Indexed: 01/10/2023] Open
Abstract
This paper investigated 478 duck meat samples to identify 2 antibiotics, doxycycline hydrochloride and tylosin, which were classified using surface-enhanced Raman spectroscopy (SERS) combined with multivariate techniques. The optimal detection parameters, including the effects of the adsorption time and of 2 enhancement substrates (i.e., gold nanoparticles alone and gold nanoparticles with NaCl) on Raman intensities, were analyzed using the single-factor analysis method. The results showed that the optimal adsorption time between gold nanoparticles and analytes was 2 min, and that colloidal gold nanoparticles without NaCl as the active substrate gave greater enhancement of the Raman signal. The SERS data were pretreated using adaptive iteratively reweighted penalized least squares (air-PLS) and the second derivative, and feature vectors were then extracted by principal component analysis. The scores of the first four principal components were selected as the input values of a support vector machine model. The overall classification accuracy on the test set was 100%. The experimental results showed that the combination of SERS and multivariate analysis could identify residues of doxycycline hydrochloride and tylosin in duck meat quickly and sensitively.
Affiliation(s)
- Ting Wang
  - Key Laboratory of Modern Agricultural Equipment in Jiangxi Province, Jiangxi Agricultural University, Nanchang 330045, China
- Muhua Liu
  - Key Laboratory of Modern Agricultural Equipment in Jiangxi Province, Jiangxi Agricultural University, Nanchang 330045, China
- Shuanggen Huang
  - Key Laboratory of Modern Agricultural Equipment in Jiangxi Province, Jiangxi Agricultural University, Nanchang 330045, China
- Haichao Yuan
  - Key Laboratory of Modern Agricultural Equipment in Jiangxi Province, Jiangxi Agricultural University, Nanchang 330045, China
- Jinhui Zhao
  - Key Laboratory of Modern Agricultural Equipment in Jiangxi Province, Jiangxi Agricultural University, Nanchang 330045, China
- Jian Chen
  - Key Laboratory of Modern Agricultural Equipment in Jiangxi Province, Jiangxi Agricultural University, Nanchang 330045, China
7
Huo J, Ma Y, Lu C, Li C, Duan K, Li H. Mahalanobis distance based similarity regression learning of NIRS for quality assurance of tobacco product with different variable selection methods. SPECTROCHIMICA ACTA. PART A, MOLECULAR AND BIOMOLECULAR SPECTROSCOPY 2021; 251:119364. [PMID: 33493932 DOI: 10.1016/j.saa.2020.119364] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/30/2020] [Revised: 12/13/2020] [Accepted: 12/17/2020] [Indexed: 06/12/2023]
Abstract
Quality assurance is one of the key issues in the tobacco industry, and much effort has been put into quality control. This paper introduces a new chemometrics technique to estimate the "quality similarity rate", which is used for quality control. The quality similarity rate represents the degree of similarity between the products and the standard reference samples; it is a global parameter that can be generated by either human assessors or machine learning. Supervised similarity regression models are built to automatically estimate the quality similarity rate from near infrared spectroscopy (NIRS) data of tobacco leaf and smoke. For the similarity regression learning, the metric matrix is generated by a novel method that calculates the Mahalanobis distance from the segmented NIRS data. The results show that similarity regression learning can predict the quality similarity rate well and at high speed, and that it can be improved with lasso (least absolute shrinkage and selection operator) related feature selection algorithms such as sRDA (sparse redundancy analysis) and glmnet.
Affiliation(s)
- Juan Huo
  - Zhengzhou University, Henan Province, China
- Yuping Ma
  - China Tobacco Henan Industrial Co., Ltd, Zhengzhou 450000, China
- Changtong Lu
  - China Tobacco Henan Industrial Co., Ltd, Zhengzhou 450000, China
- Chenggang Li
  - China Tobacco Henan Industrial Co., Ltd, Zhengzhou 450000, China
- Kun Duan
  - China Tobacco Henan Industrial Co., Ltd, Zhengzhou 450000, China
- Huaiqi Li
  - China Tobacco Henan Industrial Co., Ltd, Zhengzhou 450000, China
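As a rough illustration of the Mahalanobis distance named in the abstract of entry 7, the sketch below computes it for two-dimensional points using only the Python standard library. The function name `mahalanobis_2d` and the toy reference data are assumptions made for illustration; the paper's segmented-NIRS metric matrix and its regression step are not reproduced here.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def mahalanobis_2d(point: Point, samples: Sequence[Point]) -> float:
    """Mahalanobis distance from `point` to the mean of `samples` (2-D only)."""
    n = len(samples)
    if n < 3:
        raise ValueError("need at least 3 reference samples")
    mean_x = sum(s[0] for s in samples) / n
    mean_y = sum(s[1] for s in samples) / n
    # Unbiased sample covariance matrix [[var_x, cov_xy], [cov_xy, var_y]].
    var_x = sum((s[0] - mean_x) ** 2 for s in samples) / (n - 1)
    var_y = sum((s[1] - mean_y) ** 2 for s in samples) / (n - 1)
    cov_xy = sum((s[0] - mean_x) * (s[1] - mean_y) for s in samples) / (n - 1)
    det = var_x * var_y - cov_xy ** 2
    if det <= 0.0:
        raise ValueError("covariance matrix is singular")
    dx = point[0] - mean_x
    dy = point[1] - mean_y
    # d^2 = v^T * inverse(covariance) * v, with the 2x2 inverse written out.
    d2 = (var_y * dx * dx - 2.0 * cov_xy * dx * dy + var_x * dy * dy) / det
    return math.sqrt(max(d2, 0.0))


# Hypothetical reference set: four corners of a square centred on (1, 1).
reference = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
print(mahalanobis_2d((3.0, 1.0), reference))  # about 1.732 (sqrt(3))
```

Unlike the Euclidean distance, this measure rescales each direction by the spread of the reference samples, which is why it is a common choice for comparing a product spectrum with a set of standard samples.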
8
Ortiz-Herrero L, Cardaba I, Bartolomé L, Alonso M, Maguregui M. Extension study of a statistical age prediction model for acrylic paints. Polym Degrad Stab 2020. [DOI: 10.1016/j.polymdegradstab.2020.109263] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
9
Cozzolino D. The Sample, the Spectra and the Maths-The Critical Pillars in the Development of Robust and Sound Applications of Vibrational Spectroscopy. Molecules 2020; 25:E3674. [PMID: 32806655 PMCID: PMC7466136 DOI: 10.3390/molecules25163674] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 07/10/2020] [Revised: 08/03/2020] [Accepted: 08/07/2020] [Indexed: 12/02/2022] Open
Abstract
The last two decades have witnessed an increasing interest in the use of the so-called rapid analytical methods or high throughput techniques. Most of these applications reported the use of vibrational spectroscopy methods (near infrared (NIR), mid infrared (MIR), and Raman) in a wide range of samples (e.g., food ingredients and natural products). In these applications, the analytical method is integrated with a wide range of multivariate data analysis (MVA) techniques (e.g., pattern recognition, modelling techniques, calibration, etc.) to develop the target application. The availability of modern and inexpensive instrumentation together with the access to easy to use software is determining a steady growth in the number of uses of these technologies. This paper underlines and briefly discusses the three critical pillars-the sample (e.g., sampling, variability, etc.), the spectra and the mathematics (e.g., algorithms, pre-processing, data interpretation, etc.)-that support the development and implementation of vibrational spectroscopy applications.
Affiliation(s)
- Daniel Cozzolino
  - Centre for Nutrition and Food Sciences, Queensland Alliance for Agriculture and Food Innovation (QAAFI), The University of Queensland, Brisbane, Queensland 4072, Australia
  - ARC Training Centre for Uniquely Australian Foods, Queensland Alliance for Agriculture and Food Innovation, The University of Queensland, Block 10, Level 1, 39 Kessels Rd, Coopers Plains Qld 4108, Australia
10
Chen H, Liu X, Chen A, Cai K, Lin B. Parametric-scaling optimization of pretreatment methods for the determination of trace/quasi-trace elements based on near infrared spectroscopy. SPECTROCHIMICA ACTA. PART A, MOLECULAR AND BIOMOLECULAR SPECTROSCOPY 2020; 229:117959. [PMID: 31884401 DOI: 10.1016/j.saa.2019.117959] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 09/06/2019] [Revised: 12/13/2019] [Accepted: 12/13/2019] [Indexed: 06/10/2023]
Abstract
This work proposes a parametric-scaling strategy to optimize the pretreatments of near infrared (NIR) spectroscopic data, so as to cope with the difficulty of NIR technology in detecting trace or quasi-trace elements. This novel strategy helps enhance the signal-to-noise ratio and contributes to extracting features from the raw spectrum, so that the information corresponding to the trace elements can be detected much more easily. However, due to the complexity of NIR data, it is difficult to comprehensively evaluate and compare the performance of different pretreatment methods, especially when multiple target components are determined simultaneously. For this reason, we create comprehensive model indicators to define the goodness of pretreatments in the simultaneous detection of multiple trace elements. In this paper, two near infrared data sets are investigated: one is used to determine the key indices in the primary screening of thalassemia, and the other is used to detect heavy metal pollutants in farmland soil. Results show that the proposed parametric-scaling optimization strategy can improve the effect of pretreatments in the determination of trace/quasi-trace elements, and the model performance with the optimized pretreated data is significantly superior to that with the raw data. The optimized Savitzky-Golay smoother (SGS) keeps its merits in the real data examples. In particular, the newly emerged methods, optical path length estimation and correction (OPLEC) and the Whittaker smoother (WTK), as well as their parametric-scaling modified versions, show their advantages in comparison with other pretreatments. According to the results of our experiments, they show promising potential for the rapid NIR analysis of trace/quasi-trace elements in biomedical science and agricultural science. This is expected to be tested for other analytes with larger variation.
Affiliation(s)
- Huazhou Chen
  - College of Science, Guilin University of Technology, Guilin 541004, China
  - Center for Data analysis and Algorithm Technology, Guilin University of Technology, Guilin 541004, China
- Xiaoke Liu
  - Department of Statistical Science, University College London, WC1E 6BT, United Kingdom
- An Chen
  - College of Science, Guilin University of Technology, Guilin 541004, China
  - Center for Data analysis and Algorithm Technology, Guilin University of Technology, Guilin 541004, China
- Ken Cai
  - College of Automation, Zhongkai University of Agriculture and Engineering, Guangzhou 510225, China
- Bin Lin
  - College of Science, Guilin University of Technology, Guilin 541004, China
  - Center for Data analysis and Algorithm Technology, Guilin University of Technology, Guilin 541004, China
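For readers unfamiliar with the Savitzky-Golay smoother mentioned in the abstract of entry 10, the following is a minimal standard-library sketch of the classic 5-point quadratic/cubic variant. It shows only the plain smoother, not the parametric-scaling optimization the paper proposes; the function name `savgol5` and the edge handling (the first and last two points are left unchanged) are assumptions chosen for brevity.

```python
from typing import List, Sequence

# Classic 5-point Savitzky-Golay smoothing weights (quadratic/cubic fit).
SG5_WEIGHTS = (-3.0, 12.0, 17.0, 12.0, -3.0)
SG5_NORM = 35.0


def savgol5(signal: Sequence[float]) -> List[float]:
    """Smooth `signal` with a 5-point Savitzky-Golay window.

    Interior points are replaced by the value of a local least-squares
    polynomial fit; the two points at each end are copied unchanged.
    """
    out = [float(v) for v in signal]
    n = len(signal)
    if n < 5:
        return out  # too short for a full window
    for i in range(2, n - 2):
        window = signal[i - 2 : i + 3]
        out[i] = sum(w * v for w, v in zip(SG5_WEIGHTS, window)) / SG5_NORM
    return out


# A single spike is spread over its neighbours and reduced in height.
spike = [0.0, 0.0, 0.0, 35.0, 0.0, 0.0, 0.0]
print(savgol5(spike))
```

Because the weights come from a local polynomial fit, a signal that is already a quadratic passes through the interior of the filter unchanged, which is the property that lets the smoother reduce noise while largely preserving peak shape.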
11
A Machine Learning Approach for Efficient Selection of Enzyme Concentrations and Its Application for Flux Optimization. Catalysts 2020. [DOI: 10.3390/catal10030291] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 01/31/2023] Open
Abstract
The metabolic engineering of pathways has been used extensively to produce molecules of interest on an industrial scale. Methods like gene regulation or substrate channeling helped to improve the desired product yield. Cell-free systems are used to overcome the weaknesses of engineered strains. One of the challenges in a cell-free system is selecting the optimized enzyme concentration for optimal yield. Here, a machine learning approach is used to select the enzyme concentration for the upper part of glycolysis. The artificial neural network approach (ANN) is known to be inefficient in extrapolating predictions outside the box: high predicted values will bump into a sort of “glass ceiling”. In order to explore this “glass ceiling” space, we developed a new methodology named glass ceiling ANN (GC-ANN). Principal component analysis (PCA) and data classification methods are used to derive a rule for a high flux, and ANN to predict the flux through the pathway using the input data of 121 balances of four enzymes in the upper part of glycolysis. The outcomes of this study are i. in silico selection of optimum enzyme concentrations for a maximum flux through the pathway and ii. experimental in vitro validation of the “out-of-the-box” fluxes predicted using this new approach. Surprisingly, flux improvements of up to 63% were obtained. Gratifyingly, these improvements are coupled with a cost decrease of up to 25% for the assay.
12
Tkachev V, Sorokin M, Borisov C, Garazha A, Buzdin A, Borisov N. Flexible Data Trimming Improves Performance of Global Machine Learning Methods in Omics-Based Personalized Oncology. Int J Mol Sci 2020; 21:ijms21030713. [PMID: 31979006 PMCID: PMC7037338 DOI: 10.3390/ijms21030713] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 12/23/2019] [Revised: 01/16/2020] [Accepted: 01/17/2020] [Indexed: 12/21/2022] Open
Abstract
(1) Background: Machine learning (ML) methods are rarely used for an omics-based prescription of cancer drugs, due to a shortage of case histories with clinical outcome supplemented by high-throughput molecular data. This causes overtraining and high vulnerability of most ML methods. Recently, we proposed a hybrid global-local approach to ML termed floating window projective separator (FloWPS) that avoids extrapolation in the feature space. Its core property is data trimming, i.e., sample-specific removal of irrelevant features. (2) Methods: Here, we applied FloWPS to seven popular ML methods, including linear SVM, k nearest neighbors (kNN), random forest (RF), Tikhonov (ridge) regression (RR), binomial naïve Bayes (BNB), adaptive boosting (ADA) and multi-layer perceptron (MLP). (3) Results: We performed computational experiments on 21 high-throughput gene expression datasets (41–235 samples per dataset) representing 1778 cancer patients in total with known responses to chemotherapy treatments. FloWPS substantially improved the classifier quality for all global ML methods (SVM, RF, BNB, ADA, MLP), where the area under the receiver-operator curve (ROC AUC) for the treatment response classifiers increased from the 0.61–0.88 range to 0.70–0.94. We tested FloWPS-empowered methods for overtraining by interrogating the importance of different features for different ML methods in the same model datasets. (4) Conclusions: We showed that FloWPS increases the correlation of feature importance between the different ML methods, which indicates its robustness to overtraining. For all the datasets tested, the best performance of FloWPS data trimming was observed for the BNB method, which can be valuable for further building of ML classifiers in personalized oncology.
Affiliation(s)
- Victor Tkachev
  - OmicsWayCorp, Walnut, CA 91788, USA
- Maxim Sorokin
  - OmicsWayCorp, Walnut, CA 91788, USA
  - Institute for Personalized Medicine, I.M. Sechenov First Moscow State Medical University, 119991 Moscow, Russia
- Constantin Borisov
  - National Research University Higher School of Economics, 101000 Moscow, Russia
- Andrew Garazha
  - OmicsWayCorp, Walnut, CA 91788, USA
- Anton Buzdin
  - OmicsWayCorp, Walnut, CA 91788, USA
  - Institute for Personalized Medicine, I.M. Sechenov First Moscow State Medical University, 119991 Moscow, Russia
  - Moscow Institute of Physics and Technology, 141701 Moscow Oblast, Russia
  - Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, 117997 Moscow, Russia
- Nicolas Borisov
  - OmicsWayCorp, Walnut, CA 91788, USA
  - Institute for Personalized Medicine, I.M. Sechenov First Moscow State Medical University, 119991 Moscow, Russia
  - Moscow Institute of Physics and Technology, 141701 Moscow Oblast, Russia
  - Correspondence: Tel.: +7-903-218-7261
13
Deep learning for vibrational spectral analysis: Recent progress and a practical guide. Anal Chim Acta 2019; 1081:6-17. [PMID: 31446965 DOI: 10.1016/j.aca.2019.06.012] [Citation(s) in RCA: 88] [Impact Index Per Article: 17.6] [Received: 01/30/2019] [Revised: 05/13/2019] [Accepted: 06/05/2019] [Indexed: 12/19/2022]
Abstract
The development of chemometrics aims to provide an effective analysis approach for data generated by advanced analytical instruments. The success of existing analytical approaches in spectral analysis still relies on preprocessing and feature selection techniques to remove signal artifacts based on prior experience. Data-driven deep learning analysis has been developed and successfully applied in many domains in the last few years. How to integrate deep learning with spectral analysis has received increased attention in chemometrics. Approximately 20 recently published studies demonstrate that deep neural networks can learn critical patterns from raw spectra, which significantly reduces the demand for feature engineering. The composition of multiple processing layers improves the fitting and feature extraction capability and makes them applicable to various analytical tasks. This advance offers a new solution for chemometrics toward resolving challenges related to spectral data with rapidly increased sample numbers from various sources. We further provide a practical guide to the development of a deep convolutional neural network-based analytical workflow. The design of the network structure, the tuning of hyperparameters in the training process, and the repeatability of results are the main topics discussed. Future studies are needed on the interpretability and repeatability of the deep learning approach in spectral analysis.
Collapse
|
14
|
de Carvalho Rocha WF, Sheen DA. Determination of physicochemical properties of petroleum derivatives and biodiesel using GC/MS and chemometric methods with uncertainty estimation. FUEL (LONDON, ENGLAND) 2019; 243:413-422. [PMID: 38516536 PMCID: PMC10956500 DOI: 10.1016/j.fuel.2018.12.126] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 03/23/2024]
Abstract
The physicochemical properties of a substance, such as a fuel, can vary significantly with composition. Determining these properties with ASTM standard methods is both expensive and time-consuming, which has led to a desire to use chemometric modeling as an alternative. In this study, we compare the accuracy and robustness of two chemometric models, partial least squares (PLS) regression and support vector machine (SVM) with uncertainty estimation, to determine how the physicochemical properties depend on the composition. A set of hydrocarbon mixtures, including crude oil, oil, gasoline, and biofuel/biodiesel, was collected. GC-MS data were taken, and physicochemical properties were measured for these mixtures using ASTM standard methods. PLS and SVM were used to develop predictive models of the physicochemical properties. Uncertainty in the estimated property values was quantified using a bootstrapping technique. With this uncertainty estimate, it is possible to assess the trustworthiness of any prediction, which ensures that the chemometric models can be applied for general purposes. SVM was found to be generally better for predicting the physicochemical properties, although we expect that with a more comprehensive data set the performance of the PLS models can be improved. We show in this work that PLS and SVM can be used to generate a predictive model of physicochemical properties based on GC-MS data. Combined with uncertainty analysis, these models provide robust predictions that can be used for regulatory, economic, and safety purposes.
Affiliation(s)
- David A Sheen
- Chemical Sciences Division, National Institute of Standards and Technology, Gaithersburg, MD 20899, USA
|
15
|
Tkachev V, Sorokin M, Mescheryakov A, Simonov A, Garazha A, Buzdin A, Muchnik I, Borisov N. FLOating-Window Projective Separator (FloWPS): A Data Trimming Tool for Support Vector Machines (SVM) to Improve Robustness of the Classifier. Front Genet 2019; 9:717. [PMID: 30697229 PMCID: PMC6341065 DOI: 10.3389/fgene.2018.00717] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 09/01/2018] [Accepted: 12/21/2018] [Indexed: 01/31/2023] Open
Abstract
Here, we propose a heuristic data-trimming technique for SVM termed FLOating Window Projective Separator (FloWPS), tailored for personalized predictions based on molecular data. This procedure can operate on high-throughput genetic datasets such as gene expression or mutation profiles. Its application prevents SVM from extrapolating by excluding non-informative features. FloWPS requires training on data from individuals with known clinical outcomes to create a clinically relevant classifier. The genetic profiles linked with the outcomes are split, as usual, into training and validation datasets. The unique property of FloWPS is that irrelevant features in the validation dataset that do not have a significant number of neighboring hits in the training dataset are removed from further analyses. Next, similarly to the k nearest neighbors (kNN) method, for each point of the validation dataset, FloWPS takes into account only the proximal points of the training dataset. Thus, for every point of the validation dataset, the training dataset is adjusted to form a floating window. FloWPS performance was tested on ten gene expression datasets for 992 cancer patients who either responded or did not respond to different types of chemotherapy. We confirmed experimentally by leave-one-out cross-validation that FloWPS significantly improves the quality of a classifier built on the classical SVM in most of the applications, particularly for polynomial kernels.
Affiliation(s)
- Victor Tkachev
- Department of Bioinformatics and Molecular Networks, OmicsWay Corporation, Walnut, CA, United States
- Maxim Sorokin
- Department of Bioinformatics and Molecular Networks, OmicsWay Corporation, Walnut, CA, United States; Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Moscow, Russia
- Alexander Simonov
- Department of Bioinformatics and Molecular Networks, OmicsWay Corporation, Walnut, CA, United States
- Andrew Garazha
- Department of Bioinformatics and Molecular Networks, OmicsWay Corporation, Walnut, CA, United States
- Anton Buzdin
- Department of Bioinformatics and Molecular Networks, OmicsWay Corporation, Walnut, CA, United States; Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Moscow, Russia; I.M. Sechenov First Moscow State Medical University (Sechenov University), Moscow, Russia
- Ilya Muchnik
- Hill Center, Rutgers University, Piscataway, NJ, United States
- Nicolas Borisov
- Department of Bioinformatics and Molecular Networks, OmicsWay Corporation, Walnut, CA, United States; I.M. Sechenov First Moscow State Medical University (Sechenov University), Moscow, Russia
|
16
|
Peng Z, Li J, Li S, Pardo J, Zhou Y, Al-Youbi AO, Bashammakh AS, El-Shahawi MS, Leblanc RM. Quantification of Nucleic Acid Concentration in the Nanoparticle or Polymer Conjugates Using Circular Dichroism Spectroscopy. Anal Chem 2018; 90:2255-2262. [DOI: 10.1021/acs.analchem.7b04621] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 11/30/2022]
Affiliation(s)
- Zhili Peng
- College of Pharmacy and Chemistry, Dali University, Dali, Yunnan 671000, P. R. China
- Department of Chemistry, University of Miami, 1301 Memorial Drive, Coral Gables, Florida 33146, United States
- Jiaojiao Li
- Department of Cellular Biology and Pharmacology, Herbert Wertheim College of Medicine, Florida International University, 11200 S.W. 8th Street, Miami, Florida 33199, United States
- Shanghao Li
- Department of Chemistry, University of Miami, 1301 Memorial Drive, Coral Gables, Florida 33146, United States
- MP Biomedicals, 3 Hutton Center Drive, #100, Santa Ana, California 92707, United States
- Joel Pardo
- Department of Chemistry, University of Miami, 1301 Memorial Drive, Coral Gables, Florida 33146, United States
- Yiqun Zhou
- Department of Chemistry, University of Miami, 1301 Memorial Drive, Coral Gables, Florida 33146, United States
- Abdulrahman O. Al-Youbi
- Department of Chemistry, Faculty of Science, King Abdulaziz University, P.O. Box 80203, Jeddah 21589, Kingdom of Saudi Arabia
- Abdulaziz S. Bashammakh
- Department of Chemistry, Faculty of Science, King Abdulaziz University, P.O. Box 80203, Jeddah 21589, Kingdom of Saudi Arabia
- Mohammad S. El-Shahawi
- Department of Chemistry, Faculty of Science, King Abdulaziz University, P.O. Box 80203, Jeddah 21589, Kingdom of Saudi Arabia
- Roger M. Leblanc
- Department of Chemistry, University of Miami, 1301 Memorial Drive, Coral Gables, Florida 33146, United States
|
17
|
Yuan LM, Chen X, Lai Y, Chen X, Shi Y, Zhu D, Li L. A Novel Strategy of Clustering Informative Variables for Quantitative Analysis of Potential Toxics Element in Tegillarca Granosa Using Laser-Induced Breakdown Spectroscopy. FOOD ANAL METHOD 2017. [DOI: 10.1007/s12161-017-1096-7] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Indexed: 11/29/2022]
|
18
|
Palou A, Miró A, Blanco M, Larraz R, Gómez JF, Martínez T, González JM, Alcalà M. Calibration sets selection strategy for the construction of robust PLS models for prediction of biodiesel/diesel blends physico-chemical properties using NIR spectroscopy. SPECTROCHIMICA ACTA. PART A, MOLECULAR AND BIOMOLECULAR SPECTROSCOPY 2017; 180:119-126. [PMID: 28284157 DOI: 10.1016/j.saa.2017.03.008] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 11/25/2016] [Revised: 02/28/2017] [Accepted: 03/01/2017] [Indexed: 06/06/2023]
Abstract
Although the feasibility of using near infrared (NIR) spectroscopy combined with partial least squares (PLS) regression to predict the physico-chemical properties of biodiesel/diesel blends has been widely demonstrated, including in the calibration sets the whole variability of diesel samples from diverse production origins remains an important challenge when constructing the models. This work presents a useful strategy for the systematic selection of calibration sets of samples of biodiesel/diesel blends from diverse origins, based on a binary code, principal component analysis (PCA) and the Kennard-Stone algorithm. Results show that with this methodology the models can keep their robustness over time. PLS calculations were done using specialized chemometric software as well as the software of the NIR instrument installed in the plant, and both produced RMSEP values below the reproducibility values of the reference methods. The models have been validated for on-line simultaneous determination of seven properties: density, cetane index, fatty acid methyl esters (FAME) content, cloud point, boiling point at 95% of recovery, flash point and sulphur.
Affiliation(s)
- Anna Palou
- Department of Chemistry, Faculty of Sciences, Universitat Autònoma de Barcelona, 08193 Bellaterra, Barcelona, Spain
- Aira Miró
- Department of Chemistry, Faculty of Sciences, Universitat Autònoma de Barcelona, 08193 Bellaterra, Barcelona, Spain
- Marcelo Blanco
- Department of Chemistry, Faculty of Sciences, Universitat Autònoma de Barcelona, 08193 Bellaterra, Barcelona, Spain
- Rafael Larraz
- Centro de Investigación CEPSA, Avda. Punto Com 1, 28805 Alcalá de Henares, Madrid, Spain
- José Francisco Gómez
- Refinería Gibraltar - San Roque CEPSA, Puente Mayorga, s/n, 11360 San Roque, Cádiz, Spain
- Teresa Martínez
- CEPSA, Campo de las Naciones, Avda. del Partenón 12, 28042 Madrid, Spain
- Manel Alcalà
- Department of Chemistry, Faculty of Sciences, Universitat Autònoma de Barcelona, 08193 Bellaterra, Barcelona, Spain
|
19
|
Filgueiras PR, Terra LA, Castro EV, Oliveira LM, Dias JC, Poppi RJ. Prediction of the distillation temperatures of crude oils using 1H NMR and support vector regression with estimated confidence intervals. Talanta 2015; 142:197-205. [DOI: 10.1016/j.talanta.2015.04.046] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Received: 02/02/2015] [Revised: 04/13/2015] [Accepted: 04/16/2015] [Indexed: 10/23/2022]
|
20
|
Wiesner K, Fuchs K, Gigler AM, Pastusiak R. Trends in Near Infrared Spectroscopy and Multivariate Data Analysis From an Industrial Perspective. Procedia Engineering 2014. [DOI: 10.1016/j.proeng.2014.11.292] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Indexed: 10/24/2022]
|
21
|
Alves JCL, Poppi RJ. Simultaneous determination of hydrocarbon renewable diesel, biodiesel and petroleum diesel contents in diesel fuel blends using near infrared (NIR) spectroscopy and chemometrics. Analyst 2013; 138:6477-87. [DOI: 10.1039/c3an00883e] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Indexed: 11/21/2022]
|