1
Schipper A, Rutten M, van Gammeren A, Harteveld CL, Urrechaga E, Weerkamp F, den Besten G, Krabbe J, Slomp J, Schoonen L, Broeren M, van Wijnen M, Huijskens MJAJ, Koopmann T, van Ginneken B, Kusters R, Kurstjens S. Machine Learning-Based Prediction of Hemoglobinopathies Using Complete Blood Count Data. Clin Chem 2024:hvae081. [PMID: 38906831 DOI: 10.1093/clinchem/hvae081] [Received: 01/12/2024] [Accepted: 05/13/2024] [Indexed: 06/23/2024]
Abstract
BACKGROUND Hemoglobinopathies, the most common inherited blood disorders, are frequently underdiagnosed. Early identification of carriers is important for genetic counseling of couples at risk. The aim of this study was to develop and validate a novel machine learning model on a multicenter data set, covering a wide spectrum of hemoglobinopathies based on routine complete blood count (CBC) testing. METHODS Hemoglobinopathy test results from 10 322 adults were extracted retrospectively from 8 Dutch laboratories. eXtreme Gradient Boosting (XGB) and logistic regression models were developed to differentiate negative from positive hemoglobinopathy cases, using 7 routine CBC parameters. External validation was conducted on a data set from an independent Dutch laboratory, with an additional external validation on a Spanish data set (n = 2629) specifically for differentiating thalassemia from iron deficiency anemia (IDA). RESULTS The XGB and logistic regression models achieved an area under the receiver operating characteristic curve (AUROC) of 0.88 and 0.84, respectively, in distinguishing negative from positive hemoglobinopathy cases in the independent external validation set. Subclass analysis showed that the XGB model reached an AUROC of 0.97 for β-thalassemia, 0.98 for α⁰-thalassemia, 0.95 for homozygous α⁺-thalassemia, 0.78 for heterozygous α⁺-thalassemia, and 0.94 for the structural hemoglobin variants Hemoglobin C, Hemoglobin D, and Hemoglobin E. Both models attained AUROCs of 0.95 in differentiating IDA from thalassemia. CONCLUSIONS Both the XGB and logistic regression models demonstrate high accuracy in predicting a broad range of hemoglobinopathies and are effective in differentiating hemoglobinopathies from IDA. Integration of these models into the laboratory information system facilitates automated hemoglobinopathy detection using routine CBC parameters.
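The AUROC values reported in this abstract can be read as the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case. The following is a minimal sketch of that pairwise definition, not the authors' evaluation code. The function name `auroc` and the toy labels and scores are hypothetical and are not taken from the study.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUROC: fraction of (positive, negative) pairs in which the
    positive case has the higher score; tied pairs count as half a win."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 1 = hemoglobinopathy positive, 0 = negative.
toy_labels = [0, 0, 1, 1, 0, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90]
print(f"AUROC on toy data: {auroc(toy_labels, toy_scores):.3f}")
```

The double loop is quadratic in the number of cases, which is fine for illustration; on data sets the size of the one described here, a rank-based implementation from an established library would be the more practical choice.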
Affiliation(s)
- Anoeska Schipper
  - Laboratory of Clinical Chemistry and Hematology, Jeroen Bosch Hospital, 's-Hertogenbosch, the Netherlands
  - Diagnostic Image Analysis Group, Radboudumc, Nijmegen, the Netherlands
- Matthieu Rutten
  - Diagnostic Image Analysis Group, Radboudumc, Nijmegen, the Netherlands
  - Department of Radiology, Jeroen Bosch Hospital, 's-Hertogenbosch, the Netherlands
- Adriaan van Gammeren
  - Laboratory of Clinical Chemistry and Laboratory Medicine, Amphia Hospital, Breda, the Netherlands
- Cornelis L Harteveld
  - Department of Clinical Genetics, Laboratory for Genome Diagnostics, Leiden University Medical Center, Leiden, the Netherlands
- Eloísa Urrechaga
  - Laboratory of Hematology, Hospital Universitario Galdakao Usansolo, Galdakao, Spain
- Floor Weerkamp
  - Laboratory of Clinical Chemistry, Maasstad Hospital, Rotterdam, the Netherlands
- Gijs den Besten
  - Laboratory of Clinical Chemistry and Laboratory Medicine, Isala Hospital, Zwolle, the Netherlands
- Johannes Krabbe
  - Laboratory of Clinical Chemistry and Hematology, Medisch Spectrum Twente/Medlon BV, Enschede, the Netherlands
- Jennichjen Slomp
  - Laboratory of Clinical Chemistry and Hematology, Medisch Spectrum Twente/Medlon BV, Enschede, the Netherlands
- Lise Schoonen
  - Laboratory of Clinical Chemistry, Maasstad Hospital, Rotterdam, the Netherlands
  - Laboratory of Clinical Chemistry and Laboratory Medicine, Canisius Wilhelmina Hospital, Nijmegen, the Netherlands
- Maarten Broeren
  - Laboratory of Clinical Chemistry and Laboratory Medicine, Máxima Medical Center, Eindhoven, the Netherlands
- Merel van Wijnen
  - Laboratory of Clinical Chemistry and Laboratory Medicine, Meander Medical Center, Amersfoort, the Netherlands
- Mirelle J A J Huijskens
  - Department of Clinical Chemistry and Haematology, Zuyderland Medical Center, Sittard/Heerlen, the Netherlands
- Tamara Koopmann
  - Department of Clinical Genetics, Laboratory for Genome Diagnostics, Leiden University Medical Center, Leiden, the Netherlands
- Bram van Ginneken
  - Diagnostic Image Analysis Group, Radboudumc, Nijmegen, the Netherlands
- Ron Kusters
  - Laboratory of Clinical Chemistry and Hematology, Jeroen Bosch Hospital, 's-Hertogenbosch, the Netherlands
  - Department of Health Technology and Services Research, Technical Medical Centre, University of Twente, Enschede, the Netherlands
- Steef Kurstjens
  - Laboratory of Clinical Chemistry and Hematology, Jeroen Bosch Hospital, 's-Hertogenbosch, the Netherlands
2
Zhang F, Zhan J, Wang Y, Cheng J, Wang M, Chen P, Ouyang J, Li J. Enhancing thalassemia gene carrier identification in non-anemic populations using artificial intelligence erythrocyte morphology analysis and machine learning. Eur J Haematol 2024; 112:692-700. [PMID: 38154920 DOI: 10.1111/ejh.14160] [Received: 09/15/2023] [Revised: 12/06/2023] [Accepted: 12/07/2023] [Indexed: 12/30/2023]
Abstract
BACKGROUND Non-anemic thalassemia trait (TT) accounts for a high proportion of TT cases in South China. OBJECTIVE To use artificial intelligence (AI) analysis of erythrocyte morphology and machine learning (ML) to identify TT gene carriers in a non-anemic population. METHODS Digital morphological data from 76 TT gene carriers and 97 controls were collected. The AI-based Mindray MC-100i was used to quantify the percentage of abnormal erythrocytes, and ML was then used to construct a prediction model. RESULTS Non-anemic TT carriers accounted for over 60% of the TT cases. Random Forest was selected as the prediction model and named TT@Normal. The TT@Normal algorithm showed outstanding performance in the training, validation, and external validation sets and could efficiently identify TT carriers in the non-anemic population. The three highest-weighted features in the TT@Normal model were target cells, microcytes, and teardrop cells. Elevated percentages of abnormal erythrocytes should raise a strong suspicion that a person is a TT gene carrier. TT@Normal could be promoted and used as a visualization and sharing tool. It is accessible through a URL link and can be used online by medical staff to predict the probability of TT gene carriage in a non-anemic population. CONCLUSIONS The ML-based model TT@Normal could efficiently identify TT carriers in non-anemic people. Elevated percentages of target cells, microcytes, and teardrop cells should raise a strong suspicion that a person is a TT gene carrier.
Affiliation(s)
- Fan Zhang
  - Department of Laboratory Science, First Affiliated Hospital of Sun Yat-sen University, Guangzhou, China
- Jieyu Zhan
  - Department of Pediatric, Baiyun District Maternal and Child Healthcare Centre, Guangzhou, China
- Yang Wang
  - Department of Laboratory Science, First Affiliated Hospital of Sun Yat-sen University, Guangzhou, China
- Jing Cheng
  - Department of Laboratory Science, First Affiliated Hospital of Sun Yat-sen University, Guangzhou, China
- Meinan Wang
  - IVD Domestic Clinical Application Department, Mindray Biomedical Electronics Co., Ltd, Shenzhen City, China
- Peisong Chen
  - Department of Laboratory Science, First Affiliated Hospital of Sun Yat-sen University, Guangzhou, China
- Juan Ouyang
  - Department of Laboratory Science, First Affiliated Hospital of Sun Yat-sen University, Guangzhou, China
- Junxun Li
  - Department of Laboratory Science, First Affiliated Hospital of Sun Yat-sen University, Guangzhou, China
3
Saleem M, Aslam W, Lali MIU, Rauf HT, Nasr EA. Predicting Thalassemia Using Feature Selection Techniques: A Comparative Analysis. Diagnostics (Basel) 2023; 13:3441. [PMID: 37998577 PMCID: PMC10670018 DOI: 10.3390/diagnostics13223441] [Received: 08/20/2023] [Revised: 10/25/2023] [Accepted: 11/06/2023] [Indexed: 11/25/2023]
Abstract
Thalassemia is one of the most common genetic disorders worldwide and is characterized by defects in hemoglobin synthesis. Affected individuals have malfunctions in one or more of the four globin genes, leading to chronic hemolytic anemia, an imbalance in the hemoglobin chain ratio, iron overload, and ineffective erythropoiesis. Despite the challenges posed by this condition, recent years have brought significant advances in diagnosis, therapy, and transfusion support, substantially improving the prognosis for thalassemia patients. This research empirically evaluates the efficacy of models constructed using classification methods and explores the effectiveness of relevant features derived using various machine-learning techniques. Five feature selection approaches, namely Chi-Square (χ²), Exploratory Factor Score (EFS), tree-based Recursive Feature Elimination (RFE), gradient-based RFE, and Linear Regression Coefficient, were employed to determine the optimal feature set. Nine classifiers, namely K-Nearest Neighbors (KNN), Decision Trees (DT), Gradient Boosting Classifier (GBC), Linear Regression (LR), AdaBoost, Extreme Gradient Boosting (XGB), Random Forest (RF), Light Gradient Boosting Machine (LGBM), and Support Vector Machine (SVM), were used to evaluate performance. The χ² method, when paired with the LR classifier, registered 91.56% precision, 91.04% recall, and a 92.65% F-score. Moreover, the results underscore that combining over-sampling by the Synthetic Minority Over-sampling Technique (SMOTE) with RFE and 10-fold cross-validation markedly improves detection accuracy for αT patients. Notably, the GBC achieves 93.46% accuracy, 93.89% recall, and a 92.72% F1 score.
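To make the chi-square feature selection step more concrete, the following is a minimal sketch of how a Pearson chi-square statistic can rank categorical features against a class label. It is an illustration under stated assumptions, not the authors' pipeline: the function `chi_square_score`, the feature names `feature_a` and `feature_b`, and the toy data are all hypothetical, and library implementations of chi-square selection may compute the statistic differently.

```python
from collections import Counter
from typing import Hashable, Sequence


def chi_square_score(feature: Sequence[Hashable], target: Sequence[Hashable]) -> float:
    """Pearson chi-square statistic of the feature-by-target contingency table.
    A larger value means the feature is more strongly associated with the class."""
    n = len(feature)
    if n == 0 or n != len(target):
        raise ValueError("feature and target must be non-empty and equally long")
    observed = Counter(zip(feature, target))  # cell counts
    feature_totals = Counter(feature)         # row totals
    target_totals = Counter(target)           # column totals
    statistic = 0.0
    for f_value, f_count in feature_totals.items():
        for t_value, t_count in target_totals.items():
            expected = f_count * t_count / n  # count expected under independence
            statistic += (observed[(f_value, t_value)] - expected) ** 2 / expected
    return statistic


# Hypothetical toy data: 1 = thalassemia, 0 = control; features are binary flags.
toy_target = [1, 1, 1, 1, 0, 0, 0, 0]
toy_features = {
    "feature_a": [1, 1, 1, 0, 0, 0, 0, 1],  # mostly tracks the class label
    "feature_b": [1, 0, 1, 0, 1, 0, 1, 0],  # independent of the class label
}

# Rank features from most to least associated with the class.
ranking = sorted(
    toy_features,
    key=lambda name: chi_square_score(toy_features[name], toy_target),
    reverse=True,
)
print(ranking)
```

Continuous measurements would need to be binned before being scored this way, and the choice of bins affects the ranking.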
Affiliation(s)
- Muniba Saleem
  - Department of Computer Science & Information Technology, The Government Sadiq College Women University Bahawalpur, Bahawalpur 63100, Pakistan
- Waqar Aslam
  - Department of Information Security, The Islamia University of Bahawalpur, Bahawalpur 63100, Pakistan
- Hafiz Tayyab Rauf
  - Centre for Smart Systems, AI and Cybersecurity, Staffordshire University, Stoke-on-Trent ST4 2DE, UK
- Emad Abouel Nasr
  - Industrial Engineering Department, College of Engineering, King Saud University, Riyadh 11421, Saudi Arabia