Sajjadnia Z, Khayami R, Moosavi MR. Preprocessing Breast Cancer Data to Improve the Data Quality, Diagnosis Procedure, and Medical Care Services.
Cancer Inform 2020;19:1176935120917955. [PMID: 32528221; PMCID: PMC7262833; DOI: 10.1177/1176935120917955]
[Received: 12/13/2019] [Accepted: 03/09/2020] [Indexed: 11/17/2022]
Abstract
In recent years, as the incidence of different cancers has increased,
various data sources have become available in this field. Consequently, many researchers
have become interested in the discovery of useful knowledge from available data
to assist faster decision-making by doctors and reduce the negative consequences
of such diseases. Data mining includes a set of useful techniques in the
discovery of knowledge from the data: detecting hidden patterns and finding
unknown relations. However, these techniques face several challenges with
real-world data. Particularly, dealing with inconsistencies, errors, noise, and
missing values requires appropriate preprocessing and data preparation
procedures. In this article, we investigate the impact of preprocessing to
provide high-quality data for classification techniques. A wide range of
preprocessing and data preparation methods is studied, and a set of
preprocessing steps is applied to obtain appropriate classification results.
The preprocessing is performed on a real-world breast cancer dataset from the Reza
Radiation Oncology Center in Mashhad, which has various features and a high
percentage of null values, and the results are reported in this article. To
evaluate the impact of the preprocessing steps on the results of classification
algorithms, this case study was divided into the following 3 experiments:
1. Breast cancer recurrence prediction without data preprocessing
2. Breast cancer recurrence prediction by error removal
3. Breast cancer recurrence prediction by error removal and filling null values
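As a rough illustration of the two preprocessing stages that distinguish these experiments, the following Python sketch treats out-of-range values as errors and then fills null values. The abstract does not state which error-detection or imputation rules were used, so the field names, valid ranges, and the median/mode fill rule below are all illustrative assumptions rather than the authors' method.

```python
from statistics import median, mode

# Hypothetical records; field names and values are illustrative assumptions.
records = [
    {"age": 52, "tumor_size_mm": 21.0, "grade": "II"},
    {"age": 430, "tumor_size_mm": None, "grade": "II"},  # age is a data-entry error
    {"age": 47, "tumor_size_mm": 35.0, "grade": None},
    {"age": None, "tumor_size_mm": 18.0, "grade": "III"},
]

# Assumed plausibility limits for the numeric fields.
VALID_RANGES = {"age": (18, 110), "tumor_size_mm": (0.0, 300.0)}


def remove_errors(rows):
    """Replace out-of-range numeric values with None so they count as missing."""
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are left unchanged
        for field, (low, high) in VALID_RANGES.items():
            value = new_row.get(field)
            if value is not None and not (low <= value <= high):
                new_row[field] = None
        cleaned.append(new_row)
    return cleaned


def fill_nulls(rows):
    """Fill numeric nulls with the column median and other nulls with the mode."""
    fields = list(rows[0].keys())
    fill_values = {}
    for field in fields:
        observed = [r[field] for r in rows if r[field] is not None]
        if all(isinstance(v, (int, float)) for v in observed):
            fill_values[field] = median(observed)
        else:
            fill_values[field] = mode(observed)
    return [
        {f: (r[f] if r[f] is not None else fill_values[f]) for f in fields}
        for r in rows
    ]


no_errors = remove_errors(records)  # loosely corresponds to experiment 2
prepared = fill_nulls(no_errors)    # loosely corresponds to experiment 3
```

In this sketch an erroneous value is converted into a null rather than dropping the whole record, so the later filling step handles both kinds of gaps; a column with no observed values at all would need separate handling.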
Then, in each experiment, dimensionality reduction techniques are used to select
a suitable subset of features for the problem at hand. Breast cancer recurrence
prediction models are constructed using 3 widely used classification
algorithms, namely, naïve Bayes, k-nearest neighbor, and
sequential minimal optimization. The experiments are evaluated in
terms of accuracy, sensitivity, F-measure, precision, and G-mean. Our
results show that recurrence prediction is significantly improved after data
preprocessing, especially in terms of sensitivity, F-measure, precision, and
G-mean measures.
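The evaluation measures named above can all be derived from a binary confusion matrix. The Python sketch below uses the standard definitions; the abstract does not define G-mean, so the common form (the square root of sensitivity times specificity) is assumed here, and the counts in the usage example are hypothetical, not results from the study.

```python
from math import sqrt


def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluate(tp, fp, tn, fn):
    """Compute common binary classification measures from confusion-matrix counts."""
    sensitivity = safe_div(tp, tp + fn)  # recall on the positive (recurrence) class
    specificity = safe_div(tn, tn + fp)  # recall on the negative class
    precision = safe_div(tp, tp + fp)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "precision": precision,
        "f_measure": safe_div(2 * precision * sensitivity, precision + sensitivity),
        # Assumed definition: geometric mean of sensitivity and specificity.
        "g_mean": sqrt(sensitivity * specificity),
    }


# Hypothetical imbalanced case: 20 recurrences among 200 patients.
scores = evaluate(tp=5, fp=5, tn=175, fn=15)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

With these hypothetical counts, accuracy is 0.9 while sensitivity is only 0.25, which illustrates why measures other than accuracy are informative when the recurrence class is rare.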