1
Machine Learning Models Using Data Mining for Biomass Production from Yarrowia lipolytica Fermentation. FERMENTATION-BASEL 2023. [DOI: 10.3390/fermentation9030239] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/05/2023]
Abstract
In this paper, a database of biomass production from Yarrowia lipolytica fermentation is prepared and constructed using machine learning and data mining approaches. The database is curated from 15 publications and consists of 301 rows of data with 25 predictors and 1 label. The predictors include inoculum size, temperature, pH, and time, while the label is the corresponding biomass production. The database is then divided into training, validation, and test datasets and analyzed as a supervised machine learning task for regression. Twenty-six regression models are employed and compared for their performance in predicting biomass production. The best-performing model is the Matern 5/2 Gaussian process regression model, which has the lowest root-mean-squared error of 0.75 g/L, the highest R squared of 0.90, and the lowest mean absolute error of 0.52 g/L. The t-test is used to identify the most important predictors, and 14 predictors are sufficient for creating an accurate model. These 14 predictors are fermentation time, peptone, temperature, total Kjeldahl nitrogen, shaking rate, total nitrogen, inoculum size, yeast extract, crude glycerol, glucose, oil and grease, media pH, ammonium sulfate, and olive oil. This research demonstrates the application of machine learning and data mining to estimate biomass production and gives insight into which parameters are essential for Yarrowia lipolytica fermentation.
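As a rough illustration of the quantities named in this abstract, the sketch below writes out the standard Matérn 5/2 covariance function and the three reported error metrics (RMSE, MAE, R squared) in plain Python. The function names, the `length_scale` and `sigma_f` parameters, and the toy numbers are assumptions made for illustration; they are not taken from the paper, which does not describe its implementation here.

```python
import math


def matern52(r, length_scale=1.0, sigma_f=1.0):
    """Standard Matern 5/2 covariance for a distance r >= 0.

    length_scale and sigma_f are hypothetical hyperparameter names.
    """
    s = math.sqrt(5.0) * r / length_scale
    return sigma_f ** 2 * (1.0 + s + s * s / 3.0) * math.exp(-s)


def rmse(y_true, y_pred):
    """Root-mean-squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values (made up, not from the paper's database).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
print(rmse(y_true, y_pred), mae(y_true, y_pred), r_squared(y_true, y_pred))
```

In a full Gaussian process regression, the kernel would be evaluated on pairwise distances between predictor vectors; that step is omitted here.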
2
Busa J, Polaka I. Variability of Classification Results in Data with High Dimensionality and Small Sample Size. INFORMATION TECHNOLOGY AND MANAGEMENT SCIENCE 2021. [DOI: 10.7250/itms-2021-0007] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/20/2022]
Abstract
The study focuses on the analysis of biological data containing information on the number of genome sequences of intestinal microbiome bacteria before and after antibiotic use. The data have high dimensionality (bacterial taxa) and a small number of records, which is typical of bioinformatics data. Classification models induced on data sets like this usually are not stable and the accuracy metrics have high variance. The aim of the study is to create a preprocessing workflow and a classification model that can perform the most accurate classification of the microbiome into groups before and after the use of antibiotics and lessen the variability of accuracy measures of the classifier. To evaluate the accuracy of the model, measures of the area under the ROC curve and the overall accuracy of the classifier were used. In the experiments, the authors examined how classification results were affected by feature selection and increased size of the data set.
Affiliation(s)
- Jana Busa
- Riga Technical University, Riga, Latvia
3
Tian X, Chen M. Descriptor selection for predicting interfacial thermal resistance by machine learning methods. Sci Rep 2021; 11:739. [PMID: 33436976 PMCID: PMC7804206 DOI: 10.1038/s41598-020-80795-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 09/12/2020] [Accepted: 12/28/2020] [Indexed: 01/29/2023]
Abstract
Interfacial thermal resistance (ITR) is a critical property for the performance of nanostructured devices where phonon mean free paths are larger than the characteristic length scales. The affordable, accurate and reliable prediction of ITR is essential for material selection in thermal management. In this work, state-of-the-art machine learning methods were employed for this purpose. Descriptor selection was conducted to build robust models and to provide guidelines for determining the most important characteristics for the targets. First, a decision tree (DT) was adopted to calculate the descriptor importances, and descriptor subsets with the topX highest importances were chosen (topX-DT, X = 20, 15, 10, 5) to build models. To verify the transferability of the descriptors picked by the decision tree, models based on kernel ridge regression, Gaussian process regression and K-nearest neighbors were also evaluated. Afterwards, univariate selection (UV) was used to sort the descriptors. Finally, the top 5 common descriptors selected by DT and UV were used to build concise models. The performance of these refined models is comparable to that of models using all descriptors, which indicates the high accuracy and reliability of these selection methods. Our strategy results in concise machine learning models for fast prediction of ITR in thermal management applications.
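The top-X selection step described in this abstract can be sketched as below. This is a minimal illustration only: the descriptor names (`d1` to `d6`) and scores are invented, and the way the two rankings are combined into "common" descriptors is an assumption, since the abstract does not spell out that rule.

```python
def top_x(importances, x):
    """Return the x descriptor names with the highest importance scores."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked[:x]


def common_top(rank_a, rank_b, x):
    """First x descriptors of rank_a (in order) that also appear in rank_b.

    One possible reading of "common descriptors"; an assumption, not the
    paper's stated procedure.
    """
    in_b = set(rank_b)
    return [d for d in rank_a if d in in_b][:x]


# Hypothetical importance scores from a decision tree (dt) and a
# univariate test (uv); the values are made up.
dt = {"d1": 0.40, "d2": 0.25, "d3": 0.15, "d4": 0.10, "d5": 0.06, "d6": 0.04}
uv = {"d1": 9.0, "d2": 1.0, "d3": 7.5, "d4": 0.5, "d5": 6.0, "d6": 3.0}

dt_top4 = top_x(dt, 4)
uv_top4 = top_x(uv, 4)
print(common_top(dt_top4, uv_top4, 2))
```

Descriptors that survive both rankings are the candidates for a concise model; the choice of X trades model size against accuracy.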
Affiliation(s)
- Xiaojuan Tian
- Department of Chemical Engineering, China University of Petroleum, Beijing, 102249, China.
- Mingguang Chen
- Physical Science and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia.
4
Sun Y, Hewitt M, Wilkinson SC, Davey N, Adams RG, Gullick DR, Moss GP. Development of a Gaussian Process - feature selection model to characterise (poly)dimethylsiloxane (Silastic®) membrane permeation. J Pharm Pharmacol 2020; 72:873-888. [PMID: 32246470 DOI: 10.1111/jphp.13263] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 01/17/2020] [Accepted: 03/08/2020] [Indexed: 11/27/2022]
Abstract
OBJECTIVES The current study aims to determine the effect of physicochemical descriptor selection on models of polydimethylsiloxane permeation. METHODS A total of 2942 descriptors were calculated for a data set of 77 chemicals. Data were processed to remove redundancy, single values, imbalanced and highly correlated data, yielding 1363 relevant descriptors. For four independent test sets, feature selection methods were applied and modelled via a variety of Machine Learning methods. KEY FINDINGS Two sets of molecular descriptors which can provide improved predictions, compared to existing models, have been identified. Best permeation predictions were found with Gaussian Process methods. The molecular descriptors describe lipophilicity, partial charge and hydrogen bonding as key determinants of PDMS permeation. CONCLUSIONS This study highlights important considerations in the development of relevant models and in the construction and use of the data sets used in such studies, particularly that highly correlated descriptors should be removed from data sets. Predictive models are improved by the methodology adopted in this study, notably the systematic evaluation of descriptors, rather than simply using any and all available descriptors, often based empirically on in vitro experiments. Such findings also have clear relevance to a number of other fields.
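The descriptor clean-up mentioned in the METHODS (removing single-valued and highly correlated descriptors) might look roughly like the following. The 0.95 correlation threshold, the greedy keep-first strategy, and the column names are assumptions for illustration; the abstract does not give the criteria actually used.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def filter_descriptors(columns, threshold=0.95):
    """Drop single-valued columns, then greedily drop any column whose
    absolute correlation with an already kept column exceeds threshold."""
    kept = {}
    for name, values in columns.items():
        if len(set(values)) <= 1:
            continue  # single-valued descriptor carries no information
        if any(abs(pearson(values, other)) > threshold for other in kept.values()):
            continue  # redundant with a descriptor that was already kept
        kept[name] = values
    return list(kept)


# Hypothetical descriptor table (values are made up).
columns = {
    "logP": [1.0, 2.0, 3.0, 4.0],
    "logP_x2": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with logP
    "const": [5.0, 5.0, 5.0, 5.0],     # single-valued
    "charge": [0.3, -0.1, 0.4, -0.2],
}
print(filter_descriptors(columns))
```

Because the filter is greedy, the result depends on column order; a real workflow would need a deliberate rule for which member of a correlated pair to keep.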
Affiliation(s)
- Yi Sun
- School of Computer Science, University of Hertfordshire, Hatfield, UK
- Mark Hewitt
- School of Pharmacy, University of Wolverhampton, Wolverhampton, UK
- Simon C Wilkinson
- School of Biomedical, Nutritional and Sports Sciences, Medical School, University of Newcastle-upon-Tyne, Newcastle-upon-Tyne, UK
- Neil Davey
- School of Computer Science, University of Hertfordshire, Hatfield, UK
- Roderick G Adams
- School of Computer Science, University of Hertfordshire, Hatfield, UK
- Darren R Gullick
- School of Pharmacy & Biomedical Sciences, University of Portsmouth, Portsmouth, UK
- Gary P Moss
- The School of Pharmacy, Keele University, Keele, UK
5
Yamashita AY, Falcão AX, Leite NJ. The Residual Center of Mass: An Image Descriptor for the Diagnosis of Alzheimer Disease. Neuroinformatics 2019; 17:307-321. [PMID: 30328551 DOI: 10.1007/s12021-018-9390-0] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 11/26/2022]
Abstract
A crucial quest in neuroimaging is the discovery of image features (biomarkers) associated with neurodegenerative disorders. Recent works show that such biomarkers can be obtained by image analysis techniques. However, these techniques cannot be directly compared since they use different databases and validation protocols. In this paper, we present an extensive study of image descriptors for the diagnosis of Alzheimer Disease (AD) and introduce a new one, named Residual Center of Mass (RCM). The RCM descriptor explores image moments and other techniques to enhance brain regions and select discriminative features for the diagnosis of AD. For validation, a Support Vector Machine (SVM) is trained with the selected features to classify images from normal subjects and patients with AD. We show that RCM with SVM achieves the best accuracies on a considerable number of exams under 10-fold cross-validation: 95.1% on 507 FDG-PET scans and 90.3% on 1374 MRI scans.
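For readers unfamiliar with the 10-fold cross-validation protocol mentioned in this abstract, here is a minimal sketch of how the train/test index splits can be generated. The function name, the shuffling seed, and the unstratified split are assumptions; the paper may well have used stratified folds or a different partitioning.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Every sample lands in exactly one test fold; fold sizes differ by at
    most one. The split is not stratified by class (an assumption).
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f_idx, fold in enumerate(folds) if f_idx != i for j in fold]
        yield train, test


# 507 matches the number of FDG-PET scans quoted in the abstract.
for train, test in k_fold_indices(507, k=10):
    # A classifier would be fitted on `train` and scored on `test` here.
    pass
```

The reported accuracy would then be an aggregate over the ten test folds.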
6
Li H, Mendel KR, Lan L, Sheth D, Giger ML. Digital Mammography in Breast Cancer: Additive Value of Radiomics of Breast Parenchyma. Radiology 2019; 291:15-20. [PMID: 30747591 PMCID: PMC6445042 DOI: 10.1148/radiol.2019181113] [Citation(s) in RCA: 49] [Impact Index Per Article: 8.2] [Received: 05/09/2018] [Revised: 12/14/2018] [Accepted: 01/02/2019] [Indexed: 11/11/2022]
Abstract
Background Previous studies have suggested that breast parenchymal texture features may reflect the biologic risk factors associated with breast cancer development. Therefore, combining the characteristics of normal parenchyma from the contralateral breast with radiomic features of breast tumors may improve the accuracy of digital mammography in the diagnosis of breast cancer. Purpose To determine whether the addition of radiomic analysis of contralateral breast parenchyma to the characterization of breast lesions with digital mammography improves lesion classification over that with radiomic tumor features alone. Materials and Methods This HIPAA-compliant, retrospective study included 182 patients (age range, 25-90 years; mean age, 55.9 years ± 14.9) who underwent mammography between June 2002 and July 2009. There were 106 malignant and 76 benign lesions. Automatic lesion segmentation and radiomic analysis were performed for each breast lesion. Radiomic texture analysis was applied in the normal regions of interest in the contralateral breast parenchyma to assess the mammographic parenchymal patterns. The classification performance of both individual features and the output from a Bayesian artificial neural network classifier was evaluated with the leave-one-patient-out method by using the area under the receiver operating characteristic curve (AUC) as the figure of merit in the task of differentiating between malignant and benign lesions. Results The performance of the combined lesion and parenchyma classifier in the differentiation between malignant and benign mammographic lesions was better than that with the lesion features alone (AUC = 0.84 ± 0.03 vs 0.79 ± 0.03, respectively; P = .047). Overall, six radiomic features (spiculation, margin sharpness, size, and circularity from the tumor feature set, and skewness and power law beta from the parenchymal feature set) were selected more than 50% of the time during the feature selection process on the combined feature set. Conclusion Combining quantitative radiomic data from tumors with contralateral parenchyma characterizations may improve diagnostic accuracy for breast cancer. © RSNA, 2019 Online supplemental material is available for this article. See also the editorial by Shaffer in this issue.
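The AUC used as the figure of merit in this abstract can be computed directly from classifier scores as the probability that a randomly chosen malignant case scores higher than a randomly chosen benign one (ties counting one half). The sketch below shows that empirical estimate; the function name and the scores are invented, and the study's actual AUC estimation and its standard errors may have been obtained differently.

```python
def auc(scores_pos, scores_neg):
    """Empirical AUC: fraction of (positive, negative) pairs in which the
    positive case has the higher score, counting ties as one half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical held-out scores, e.g. collected one patient at a time in a
# leave-one-patient-out loop (values are made up).
malignant_scores = [0.9, 0.8, 0.4]
benign_scores = [0.7, 0.3, 0.2]
print(auc(malignant_scores, benign_scores))
```

In a leave-one-patient-out evaluation, each score would come from a model trained without that patient's data, and the pooled scores would then be passed to a function like this one.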
Affiliation(s)
- Hui Li
- From the Department of Radiology, University of Chicago, 5841 S Maryland Ave, Chicago, IL 60637
- Kayla R. Mendel
- From the Department of Radiology, University of Chicago, 5841 S Maryland Ave, Chicago, IL 60637
- Li Lan
- From the Department of Radiology, University of Chicago, 5841 S Maryland Ave, Chicago, IL 60637
- Deepa Sheth
- From the Department of Radiology, University of Chicago, 5841 S Maryland Ave, Chicago, IL 60637
- Maryellen L. Giger
- From the Department of Radiology, University of Chicago, 5841 S Maryland Ave, Chicago, IL 60637
7
Urbaniak B, Nowicki P, Sikorska D, Samborski W, Kokot ZJ. The feature selection approach for evaluation of potential rheumatoid arthritis markers using MALDI-TOF datasets. Anal Biochem 2017; 525:29-37. [DOI: 10.1016/j.ab.2017.02.016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2016] [Revised: 02/20/2017] [Accepted: 02/23/2017] [Indexed: 10/20/2022]
8
Li GQ, Liu Z, Shen HB, Yu DJ. TargetM6A: Identifying N6-Methyladenosine Sites From RNA Sequences via Position-Specific Nucleotide Propensities and a Support Vector Machine. IEEE Trans Nanobioscience 2016; 15:674-682. [DOI: 10.1109/tnb.2016.2599115] [Citation(s) in RCA: 57] [Impact Index Per Article: 6.3] [Indexed: 11/05/2022]
9
Hajduk J, Klupczynska A, Dereziński P, Matysiak J, Kokot P, Nowak DM, Gajęcka M, Nowak-Markwitz E, Kokot ZJ. A Combined Metabolomic and Proteomic Analysis of Gestational Diabetes Mellitus. Int J Mol Sci 2015; 16:30034-45. [PMID: 26694367 PMCID: PMC4691080 DOI: 10.3390/ijms161226133] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 07/03/2015] [Revised: 10/28/2015] [Accepted: 11/20/2015] [Indexed: 12/12/2022]
Abstract
The aim of this pilot study was to apply a novel combined metabolomic and proteomic approach to the analysis of gestational diabetes mellitus. The investigation was performed with plasma samples derived from pregnant women with diagnosed gestational diabetes mellitus (n = 18) and a matched control group (n = 13). The mass spectrometry-based analyses made it possible to determine 42 free amino acids and low-molecular-weight peptide profiles. Different expression of several peptides and altered amino acid profiles were observed in the analyzed groups. The combination of proteomic and metabolomic data yielded a model with high discriminatory power, in which the amino acids ethanolamine, L-citrulline, and L-asparagine and the peptide ions with m/z 1488.59, 4111.89, and 2913.15 made the highest contribution to the model. The sensitivity (94.44%) and specificity (84.62%), as well as the total group membership classification value (90.32%), calculated from the post hoc classification matrix of the joint model, were the highest when compared with a single analysis of either amino acid levels or peptide ion intensities. The results indicate a high potential for integrating proteomic and metabolomic analysis regardless of the sample size. This promising approach, together with clinical evaluation of the subjects, can also be used in the study of other diseases.
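The reported percentages can be related back to the group sizes with a small worked calculation. The confusion-matrix counts used below (17 of 18 patients and 11 of 13 controls classified correctly) are not stated in the abstract; they are inferred here as the counts consistent with the reported 94.44%, 84.62%, and 90.32%, and should be read as an assumption.

```python
def classification_rates(tp, fn, tn, fp):
    """Return (sensitivity, specificity, overall accuracy) in percent,
    each rounded to two decimals."""
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + fn + tn + fp)
    return round(sensitivity, 2), round(specificity, 2), round(accuracy, 2)


# Inferred counts (assumption): 18 patients -> 17 correct, 1 missed;
# 13 controls -> 11 correct, 2 misclassified.
print(classification_rates(tp=17, fn=1, tn=11, fp=2))
# -> (94.44, 84.62, 90.32)
```

With groups this small, a single reclassified sample moves sensitivity by about 5.6 percentage points and specificity by about 7.7, which is worth keeping in mind when reading the figures.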
Affiliation(s)
- Joanna Hajduk
- Department of Inorganic and Analytical Chemistry, Poznan University of Medical Sciences, 6 Grunwaldzka Street, Poznań 60-780, Poland.
- Agnieszka Klupczynska
- Department of Inorganic and Analytical Chemistry, Poznan University of Medical Sciences, 6 Grunwaldzka Street, Poznań 60-780, Poland.
- Paweł Dereziński
- Department of Inorganic and Analytical Chemistry, Poznan University of Medical Sciences, 6 Grunwaldzka Street, Poznań 60-780, Poland.
- Jan Matysiak
- Department of Inorganic and Analytical Chemistry, Poznan University of Medical Sciences, 6 Grunwaldzka Street, Poznań 60-780, Poland.
- Piotr Kokot
- Obstetrics and Gynecology Ward, District Hospital in Mielec, 22a Żeromskiego Street, Mielec 39-300, Poland.
- Dorota M Nowak
- Department of Genetics and Pharmaceutical Microbiology, Poznan University of Medical Sciences, Święcickiego 4 Street, Poznań 60-781, Poland.
- Marzena Gajęcka
- Department of Genetics and Pharmaceutical Microbiology, Poznan University of Medical Sciences, Święcickiego 4 Street, Poznań 60-781, Poland.
- Institute of Human Genetics, Polish Academy of Sciences, 32 Strzeszyńska Street, Poznań 60-479, Poland.
- Ewa Nowak-Markwitz
- Gynecologic Oncology Department, Poznan University of Medical Sciences, Polna 33 Street, Poznań 60-535, Poland.
- Zenon J Kokot
- Department of Inorganic and Analytical Chemistry, Poznan University of Medical Sciences, 6 Grunwaldzka Street, Poznań 60-780, Poland.
10
Mamun KA, Mace M, Lutman ME, Stein J, Liu X, Aziz T, Vaidyanathan R, Wang S. Movement decoding using neural synchronization and inter-hemispheric connectivity from deep brain local field potentials. J Neural Eng 2015; 12:056011. [DOI: 10.1088/1741-2560/12/5/056011] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.8] [Indexed: 12/14/2022]