1
Kefeli J, Berkowitz J, Acitores Cortina JM, Tsang KK, Tatonetti NP. Generalizable and automated classification of TNM stage from pathology reports with external validation. Nat Commun 2024; 15:8916. [PMID: 39414770 PMCID: PMC11484761 DOI: 10.1038/s41467-024-53190-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2023] [Accepted: 10/04/2024] [Indexed: 10/18/2024] Open
Abstract
Cancer staging is an essential clinical attribute informing patient prognosis and clinical trial eligibility. However, it is not routinely recorded in structured electronic health records. Here, we present BB-TEN: Big Bird - TNM staging Extracted from Notes, a generalizable method for the automated classification of TNM stage directly from pathology report text. We train a BERT-based model using publicly available pathology reports across approximately 7000 patients and 23 cancer types. We explore the use of different model types, with differing input sizes, parameters, and model architectures. Our final model goes beyond term-extraction, inferring TNM stage from context when it is not included in the report text explicitly. As external validation, we test our model on almost 8000 pathology reports from Columbia University Medical Center, finding that our trained model achieved an AU-ROC of 0.815-0.942. This suggests that our model can be applied broadly to other institutions without additional institution-specific fine-tuning.
Affiliation(s)
- Jenna Kefeli
  - Department of Systems Biology, Columbia University, New York, NY, USA
- Jacob Berkowitz
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, USA
  - Cedars-Sinai Cancer, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Jose M Acitores Cortina
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, USA
  - Cedars-Sinai Cancer, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Kevin K Tsang
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, USA
  - Cedars-Sinai Cancer, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Nicholas P Tatonetti
  - Department of Systems Biology, Columbia University, New York, NY, USA
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, USA
  - Cedars-Sinai Cancer, Cedars-Sinai Medical Center, Los Angeles, CA, USA
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA
2
Kefeli J, Tatonetti N. TCGA-Reports: A machine-readable pathology report resource for benchmarking text-based AI models. Patterns (New York, N.Y.) 2024; 5:100933. [PMID: 38487800 PMCID: PMC10935496 DOI: 10.1016/j.patter.2024.100933] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2023] [Revised: 10/16/2023] [Accepted: 01/25/2024] [Indexed: 03/17/2024]
Abstract
In cancer research, pathology report text is a largely untapped data source. Pathology reports are routinely generated, more nuanced than structured data, and contain added insight from pathologists. However, there are no publicly available datasets for benchmarking report-based models. Two recent advances suggest the urgent need for a benchmark dataset. First, improved optical character recognition (OCR) techniques will make it possible to access older pathology reports in an automated way, increasing the data available for analysis. Second, recent improvements in natural language processing (NLP) techniques using artificial intelligence (AI) allow more accurate prediction of clinical targets from text. We apply state-of-the-art OCR and customized post-processing to report PDFs from The Cancer Genome Atlas, generating a machine-readable corpus of 9,523 reports. Finally, we perform a proof-of-principle cancer-type classification across 32 tissues, achieving 0.992 average AU-ROC. This dataset will be useful to researchers across specialties, including research clinicians, clinical trial investigators, and clinical NLP researchers.
Affiliation(s)
- Jenna Kefeli
  - Department of Systems Biology, Columbia University, New York, NY 10032, USA
- Nicholas Tatonetti
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA 90048, USA
3
Kaufmann B, Busby D, Das CK, Tillu N, Menon M, Tewari AK, Gorin MA. Validation of a Zero-shot Learning Natural Language Processing Tool to Facilitate Data Abstraction for Urologic Research. Eur Urol Focus 2024; 10:279-287. [PMID: 38278710 DOI: 10.1016/j.euf.2024.01.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/24/2023] [Revised: 12/18/2023] [Accepted: 01/15/2024] [Indexed: 01/28/2024]
Abstract
BACKGROUND Urologic research often requires data abstraction from unstructured text contained within the electronic health record. A number of natural language processing (NLP) tools have been developed to aid with this time-consuming task; however, the generalizability of these tools is typically limited by the need for task-specific training. OBJECTIVE To describe the development and validation of a zero-shot learning NLP tool to facilitate data abstraction from unstructured text for use in downstream urologic research. DESIGN, SETTING, AND PARTICIPANTS An NLP tool based on the GPT-3.5 model from OpenAI was developed and compared with three physicians for time to task completion and accuracy for abstracting 14 unique variables from a set of 199 deidentified radical prostatectomy pathology reports. The reports were processed in vectorized and scanned formats to establish the impact of optical character recognition on data abstraction. INTERVENTION A zero-shot learning NLP tool for data abstraction. OUTCOME MEASUREMENTS AND STATISTICAL ANALYSIS The tool was compared with the human abstractors in terms of superiority for data abstraction speed and noninferiority for accuracy. RESULTS AND LIMITATIONS The human abstractors required a median (interquartile range) of 93 s (72-122 s) per report for data abstraction, whereas the software required a median of 12 s (10-15 s) for the vectorized reports and 15 s (13-17 s) for the scanned reports (p < 0.001 for all paired comparisons). The accuracies of the three human abstractors were 94.7% (95% confidence interval [CI], 93.8-95.5%), 97.8% (95% CI, 97.2-98.3%), and 96.4% (95% CI, 95.6-97%) for the combined set of 2786 data points. The tool had accuracy of 94.2% (95% CI, 93.3-94.9%) for the vectorized reports and was noninferior to the human abstractors at a margin of -10% (α = 0.025). 
The tool had slightly lower accuracy of 88.7% (95% CI 87.5-89.9%) for the scanned reports, making it noninferior to two of three human abstractors. CONCLUSIONS The developed zero-shot learning NLP tool offers urologic researchers a highly generalizable and accurate method for data abstraction from unstructured text. An open access version of the tool is available for immediate use by the urologic community. PATIENT SUMMARY In this report, we describe the design and validation of an artificial intelligence tool for abstracting discrete data from unstructured notes contained within the electronic medical record. This freely available tool, which is based on the GPT-3.5 technology from OpenAI, is intended to facilitate research and scientific discovery by the urologic community.
Affiliation(s)
- Basil Kaufmann
  - Milton and Carroll Petrie Department of Urology, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Department of Urology, University Hospital Zurich, University of Zurich, Zurich, Switzerland
- Dallin Busby
  - Milton and Carroll Petrie Department of Urology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Chandan Krushna Das
  - Milton and Carroll Petrie Department of Urology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Neeraja Tillu
  - Milton and Carroll Petrie Department of Urology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Mani Menon
  - Milton and Carroll Petrie Department of Urology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Ashutosh K Tewari
  - Milton and Carroll Petrie Department of Urology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Michael A Gorin
  - Milton and Carroll Petrie Department of Urology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
4
Simoulin A, Thiebaut N, Neuberger K, Ibnouhsein I, Brunel N, Viné R, Bousquet N, Latapy J, Reix N, Molière S, Lodi M, Mathelin C. From free-text electronic health records to structured cohorts: Onconum, an innovative methodology for real-world data mining in breast cancer. Computer Methods and Programs in Biomedicine 2023; 240:107693. [PMID: 37453367 DOI: 10.1016/j.cmpb.2023.107693] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2022] [Revised: 05/25/2023] [Accepted: 06/23/2023] [Indexed: 07/18/2023]
Abstract
PURPOSE A considerable amount of valuable information is present in electronic health records (EHRs); however, it remains inaccessible because it is embedded in unstructured narrative documents that cannot be easily analyzed. We aimed to develop and evaluate a methodology able to extract and structure information from electronic health records in breast cancer. METHODS We developed a software platform called Onconum (ClinicalTrials.gov Identifier: NCT02810093), which uses a hybrid method relying on machine learning approaches and rule-based lexical methods. It is based on natural language processing techniques that allow a targeted analysis of free-text medical data related to breast cancer, independently of any pre-existing dictionary, in a French context (available in N files). We then evaluated it on a validation cohort called Senometry. FINDINGS The Senometry cohort included 9,599 patients with breast cancer (both invasive and in situ) treated between 2000 and 2017 in the breast cancer unit of Strasbourg University Hospitals. Extraction rates ranged from 45% to 100%, depending on the parameter. Precision of extracted information was 68%-94% compared to a structured cohort and 89%-98% compared to manually structured databases, and the method retrieved more rare occurrences than another database search engine (+17%). INTERPRETATION This innovative method can accurately structure relevant medical information embedded in EHRs in the context of breast cancer. Missing data handling is the main limitation of this method; however, multiple sources can be incorporated to mitigate this limitation. Nevertheless, this methodology needs neither pre-existing dictionaries nor manually annotated corpora. It can therefore be easily implemented in non-English-speaking countries and in diseases other than breast cancer, and it allows prospective inclusion of new patients.
Affiliation(s)
- Nicolas Bousquet
  - Quantmetry, 52 rue d'Anjou, 75008 Paris, France; Sorbonne University, 4 place Jussieu, 75005 Paris, France
- Nathalie Reix
  - ICube UMR 7537, Strasbourg University / CNRS, Fédération de Médecine Translationnelle de Strasbourg, 67200 Strasbourg, France; Biochemistry and Molecular Biology Laboratory, Strasbourg University Hospitals, 1 place de l'Hôpital, 67091 Strasbourg, France
- Sébastien Molière
  - Radiology Department, Strasbourg University Hospitals, 1 avenue Molière, 67098 Strasbourg, France
- Massimo Lodi
  - Institut de cancérologie Strasbourg Europe (ICANS), 17 avenue Albert Calmette, 67033 Strasbourg Cedex, France; Department of Functional Genomics and Cancer, Institut de Génétique et de Biologie Moléculaire et Cellulaire, CNRS UMR 7104, INSERM U964, Strasbourg University, Illkirch, France; Strasbourg University Hospitals, 1 place de l'Hôpital, 67091 Strasbourg, France
- Carole Mathelin
  - Institut de cancérologie Strasbourg Europe (ICANS), 17 avenue Albert Calmette, 67033 Strasbourg Cedex, France; Department of Functional Genomics and Cancer, Institut de Génétique et de Biologie Moléculaire et Cellulaire, CNRS UMR 7104, INSERM U964, Strasbourg University, Illkirch, France; Strasbourg University Hospitals, 1 place de l'Hôpital, 67091 Strasbourg, France
5
Kefeli J, Tatonetti N. Benchmark Pathology Report Text Corpus with Cancer Type Classification. medRxiv: The Preprint Server for Health Sciences 2023:2023.08.03.23293618. [PMID: 37609238 PMCID: PMC10441484 DOI: 10.1101/2023.08.03.23293618] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/24/2023]
Abstract
In cancer research, pathology report text is a largely untapped data source. Pathology reports are routinely generated, more nuanced than structured data, and contain added insight from pathologists. However, there are no publicly available datasets for benchmarking report-based models. Two recent advances suggest the urgent need for a benchmark dataset. First, improved optical character recognition (OCR) techniques will make it possible to access older pathology reports in an automated way, increasing the data available for analysis. Second, recent improvements in natural language processing (NLP) techniques using AI allow more accurate prediction of clinical targets from text. We apply state-of-the-art OCR and customized post-processing to publicly available report PDFs from The Cancer Genome Atlas, generating a machine-readable corpus of 9,523 reports. We perform a proof-of-principle cancer-type classification across 32 tissues, achieving 0.992 average AU-ROC. This dataset will be useful to researchers across specialties, including research clinicians, clinical trial investigators, and clinical NLP researchers.
Affiliation(s)
- Jenna Kefeli
  - Department of Systems Biology, Columbia University, New York, New York, 10032, United States
- Nicholas Tatonetti
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, California, 90048, United States
6
Kefeli J, Tatonetti N. Generalizable and Automated Classification of TNM Stage from Pathology Reports with External Validation. medRxiv: The Preprint Server for Health Sciences 2023:2023.06.26.23291912. [PMID: 37425701 PMCID: PMC10327265 DOI: 10.1101/2023.06.26.23291912] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/11/2023]
Abstract
Cancer staging is an essential clinical attribute informing patient prognosis and clinical trial eligibility. However, it is not routinely recorded in structured electronic health records. Here, we present a generalizable method for the automated classification of TNM stage directly from pathology report text. We train a BERT-based model using publicly available pathology reports across approximately 7,000 patients and 23 cancer types. We explore the use of different model types, with differing input sizes, parameters, and model architectures. Our final model goes beyond term-extraction, inferring TNM stage from context when it is not included in the report text explicitly. As external validation, we test our model on almost 8,000 pathology reports from Columbia University Medical Center, finding that our trained model achieved an AU-ROC of 0.815-0.942. This suggests that our model can be applied broadly to other institutions without additional institution-specific fine-tuning.
Affiliation(s)
- Jenna Kefeli
  - Department of Systems Biology, Columbia University, New York, NY, USA
- Nicholas Tatonetti
  - Department of Systems Biology, Columbia University, New York, NY, USA
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, California, USA
  - Cedars-Sinai Cancer, Cedars-Sinai Medical Center, Los Angeles, CA, USA
7
Wu G, Cheligeer C, Brisson AM, Quan ML, Cheung WY, Brenner D, Lupichuk S, Teman C, Basmadjian RB, Popwich B, Xu Y. A New Method of Identifying Pathologic Complete Response After Neoadjuvant Chemotherapy for Breast Cancer Patients Using a Population-Based Electronic Medical Record System. Ann Surg Oncol 2023; 30:2095-2103. [PMID: 36542249 DOI: 10.1245/s10434-022-12955-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 08/02/2022] [Accepted: 12/01/2022] [Indexed: 12/24/2022]
Abstract
BACKGROUND Accurate identification of pathologic complete response (pCR) from population-based electronic narrative data in a timely and cost-efficient manner is critical. This study aimed to derive and validate a set of natural language processing (NLP)-based machine-learning algorithms to capture pCR from surgical pathology reports of breast cancer patients who underwent neoadjuvant chemotherapy (NAC). METHODS This retrospective cohort study included all invasive breast cancer patients who underwent NAC and subsequent curative-intent surgery during their admission at all four tertiary acute care hospitals in Calgary, Alberta, Canada, between 1 January 2010 and 31 December 2017. Surgical pathology reports were extracted and processed with NLP. Decision tree classifiers were constructed and validated against chart review results. Machine-learning algorithms were evaluated with a performance matrix including sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, area under the receiver operating characteristic curve (AUC), and F1 score. RESULTS The study included 351 female patients. Of these patients, 102 (29%) achieved pCR after NAC. The high-sensitivity model achieved a sensitivity of 90.5% (95% confidence interval [CI], 69.6-98.9%), a PPV of 76% (95% CI, 59.6-87.2%), an accuracy of 88.6% (95% CI, 78.7-94.9%), an AUC of 0.891 (95% CI, 0.795-0.987), and an F1 score of 82.61. The high-PPV algorithm reached a sensitivity of 85.7% (95% CI, 63.7-97%), a PPV of 81.8% (95% CI, 63.4-92.1%), an accuracy of 90% (95% CI, 80.5-95.9%), an AUC of 0.888 (95% CI, 0.790-0.985), and an F1 score of 83.72. The high-F1 score algorithm obtained a performance equivalent to that of the high-PPV algorithm. CONCLUSION The developed algorithms demonstrated excellent accuracy in identifying pCR from surgical pathology reports of breast cancer patients who received NAC treatment.
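As a rough consistency check on the figures in this abstract, the F1 score can be recomputed as the harmonic mean of sensitivity and PPV. The sketch below is illustrative only: `f1_score` is a hypothetical helper, not code from the study, and it uses the rounded percentages quoted in the abstract, so small differences from the published 82.61 and 83.72 are expected (the authors presumably worked from unrounded counts).

```python
def f1_score(sensitivity: float, ppv: float) -> float:
    """Harmonic mean of sensitivity (recall) and PPV (precision), both as fractions."""
    if sensitivity + ppv == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * sensitivity * ppv / (sensitivity + ppv)


# Rounded values quoted in the abstract, converted to fractions.
high_sensitivity_f1 = f1_score(0.905, 0.76)   # high-sensitivity model
high_ppv_f1 = f1_score(0.857, 0.818)          # high-PPV algorithm

print(f"high-sensitivity model: F1 = {100 * high_sensitivity_f1:.1f}")
print(f"high-PPV algorithm:     F1 = {100 * high_ppv_f1:.1f}")
```

Run as written, this gives roughly 82.6 and 83.7, in line with the reported F1 scores to within rounding.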
Affiliation(s)
- Guosong Wu
  - Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
  - The Centre for Health Informatics, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
- Cheligeer Cheligeer
  - The Centre for Health Informatics, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
  - Alberta Health Services, Calgary, AB, Canada
- Anne-Marie Brisson
  - Departments of Oncology, Community Health Sciences, and Surgery, and The Center for Health Informatics, Cumming School of Medicine, University of Calgary, 3280 Hospital Drive NW, Calgary, AB, T2N4Z6, Canada
- May Lynn Quan
  - Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
  - Departments of Oncology, Community Health Sciences, and Surgery, and The Center for Health Informatics, Cumming School of Medicine, University of Calgary, 3280 Hospital Drive NW, Calgary, AB, T2N4Z6, Canada
  - Department of Surgery, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
- Winson Y Cheung
  - Department of Radiology, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
  - Department of Surgery, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
- Darren Brenner
  - Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
  - Departments of Oncology, Community Health Sciences, and Surgery, and The Center for Health Informatics, Cumming School of Medicine, University of Calgary, 3280 Hospital Drive NW, Calgary, AB, T2N4Z6, Canada
- Sasha Lupichuk
  - Departments of Oncology, Community Health Sciences, and Surgery, and The Center for Health Informatics, Cumming School of Medicine, University of Calgary, 3280 Hospital Drive NW, Calgary, AB, T2N4Z6, Canada
- Carolin Teman
  - Department of Pathology and Laboratory Medicine, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
- Robert Barkev Basmadjian
  - Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
- Brittany Popwich
  - Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
  - Departments of Oncology, Community Health Sciences, and Surgery, and The Center for Health Informatics, Cumming School of Medicine, University of Calgary, 3280 Hospital Drive NW, Calgary, AB, T2N4Z6, Canada
  - Department of Surgery, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
- Yuan Xu
  - Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
  - Departments of Oncology, Community Health Sciences, and Surgery, and The Center for Health Informatics, Cumming School of Medicine, University of Calgary, 3280 Hospital Drive NW, Calgary, AB, T2N4Z6, Canada
  - Department of Surgery, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
8
Harrison JM, Yala A, Mikhael P, Roldan J, Ciprani D, Michelakos T, Bolm L, Qadan M, Ferrone C, Fernandez-Del Castillo C, Lillemoe KD, Santus E, Hughes K. Successful Development of a Natural Language Processing Algorithm for Pancreatic Neoplasms and Associated Histologic Features. Pancreas 2023; 52:e219-e223. [PMID: 37716007 DOI: 10.1097/mpa.0000000000002242] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/18/2023]
Abstract
OBJECTIVES Natural language processing (NLP) algorithms can interpret unstructured text for commonly used terms and phrases. Pancreatic pathologies are diverse and include benign and malignant entities with associated histologic features. Creating a pancreas NLP algorithm can aid in electronic health record coding as well as large database creation and curation. METHODS Text-based pancreatic anatomic and cytopathologic reports for pancreatic cancer, pancreatic ductal adenocarcinoma, neuroendocrine tumor, intraductal papillary neoplasm, tumor dysplasia, and suspicious findings were collected. This dataset was split 80/20 for model training and development. A separate set was held out for testing purposes. We trained a convolutional neural network to predict each heading. RESULTS Over 14,000 reports were obtained from the Mass General Brigham Healthcare System electronic record. Of these, 1252 reports were used for algorithm development. Final accuracy and F1 scores relative to the test set ranged from 95% to 98% for each queried pathology. To understand the dependence of our results on training set size, we also generated learning curves. Scoring metrics improved as more reports were submitted for training; however, some queries had high index performance. CONCLUSIONS Natural language processing algorithms can be used for pancreatic pathologies. Increased training volume, nonoverlapping terminology, and conserved text structure improve NLP algorithm performance.
Affiliation(s)
- Jon Michael Harrison
  - Department of GI and General Surgery, Massachusetts General Hospital, Boston
- Adam Yala
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Mass
- Peter Mikhael
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Mass
- Jorge Roldan
  - Department of GI and General Surgery, Massachusetts General Hospital, Boston
- Debora Ciprani
  - Department of GI and General Surgery, Massachusetts General Hospital, Boston
- Theodoros Michelakos
  - Department of GI and General Surgery, Massachusetts General Hospital, Boston
- Louisa Bolm
  - Department of GI and General Surgery, Massachusetts General Hospital, Boston
- Motaz Qadan
  - Department of GI and General Surgery, Massachusetts General Hospital, Boston
- Cristina Ferrone
  - Department of GI and General Surgery, Massachusetts General Hospital, Boston
- Enrico Santus
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Mass
- Kevin Hughes
  - Department of GI and General Surgery, Massachusetts General Hospital, Boston
9
Yang Y, Lu Y, Yan W. A comprehensive review on knowledge graphs for complex diseases. Brief Bioinform 2023; 24:6931722. [PMID: 36528805 DOI: 10.1093/bib/bbac543] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/22/2022] [Revised: 11/02/2022] [Accepted: 11/10/2022] [Indexed: 12/23/2022] Open
Abstract
In recent years, knowledge graphs (KGs) have gained a great deal of popularity as a tool for storing relationships between entities and for performing higher-level reasoning. KGs in biomedicine and clinical practice aim to provide an elegant solution for diagnosing and treating complex diseases more efficiently and flexibly. Here, we provide a systematic review to characterize the state of the art of KGs in the area of complex disease research. We cover the following topics: (1) knowledge sources, (2) entity extraction methods, (3) relation extraction methods, and (4) the application of KGs in complex diseases. As a result, we offer a complete picture of the domain. Finally, we discuss the challenges in the field by identifying gaps and opportunities for further research and propose potential research directions of KGs for complex disease diagnosis and treatment.
Affiliation(s)
- Yang Yang
  - School of Computer Science & Technology, Soochow University, Suzhou 215000, China; Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
- Yuwei Lu
  - School of Computer Science & Technology, Soochow University, Suzhou 215000, China; Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
- Wenying Yan
  - Department of Bioinformatics, School of Biology and Basic Medical Sciences, Medical College of Soochow University, and Center for Systems Biology, Soochow University, Suzhou 215123, China
10
Schiappa R, Contu S, Culie D, Thamphya B, Chateau Y, Gal J, Bailleux C, Haudebourg J, Ferrero JM, Barranger E, Chamorey E. RUBY: Natural Language Processing of French Electronic Medical Records for Breast Cancer Research. JCO Clin Cancer Inform 2022; 6:e2100199. [PMID: 35960900 PMCID: PMC9470144 DOI: 10.1200/cci.21.00199] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 12/13/2021] [Revised: 05/06/2022] [Accepted: 07/08/2022] [Indexed: 11/20/2022] Open
Abstract
PURPOSE Electronic medical records are a valuable source of information about patients' clinical status but are often free-text documents that require laborious manual review to be exploited. Techniques from computer science have been investigated, but the literature has marginally focused on non-English language texts. We developed RUBY, a tool designed in collaboration with IBM-France to automatically structure clinical information from French medical records of patients with breast cancer. MATERIALS AND METHODS RUBY, which exploits state-of-the-art Named Entity Recognition models combined with keyword extraction and postprocessing rules, was applied on clinical texts. We investigated the precision of RUBY in extracting the target information. RESULTS RUBY has an average precision of 92.8% for the Surgery report, 92.7% for the Pathology report, 98.1% for the Biopsy report, and 81.8% for the Consultation report. CONCLUSION These results show that the automatic approach has the potential to effectively extract clinical knowledge from an extensive set of electronic medical records, reducing the manual effort required and saving a significant amount of time. A deeper semantic analysis and further understanding of the context in the text, as well as training on a larger and more recent set of reports, including those containing highly variable entities and the use of ontologies, could further improve the results.
Affiliation(s)
- Renaud Schiappa
  - Department of Epidemiology, Biostatistics and Health Data, Centre Antoine Lacassagne, University of Côte d'Azur, Nice, France
- Sara Contu
  - Department of Epidemiology, Biostatistics and Health Data, Centre Antoine Lacassagne, University of Côte d'Azur, Nice, France
- Dorian Culie
  - Cervico-facial Oncology Surgical Department, University Institute of Face and Neck, University of Côte d'Azur, Nice, France
- Brice Thamphya
  - Department of Epidemiology, Biostatistics and Health Data, Centre Antoine Lacassagne, University of Côte d'Azur, Nice, France
- Yann Chateau
  - Department of Epidemiology, Biostatistics and Health Data, Centre Antoine Lacassagne, University of Côte d'Azur, Nice, France
- Jocelyn Gal
  - Department of Epidemiology, Biostatistics and Health Data, Centre Antoine Lacassagne, University of Côte d'Azur, Nice, France
- Caroline Bailleux
  - Department of Medical Oncology, Centre Antoine Lacassagne, University of Côte d'Azur, Nice, France
- Juliette Haudebourg
  - Anatomy and Pathological Cytology Laboratory, Centre Antoine Lacassagne, University of Côte d'Azur, Nice, France
- Jean-Marc Ferrero
  - Anatomy and Pathological Cytology Laboratory, Centre Antoine Lacassagne, University of Côte d'Azur, Nice, France
- Emmanuel Barranger
  - Department of Medical Oncology, Centre Antoine Lacassagne, University of Côte d'Azur, Nice, France
- Emmanuel Chamorey
  - Department of Epidemiology, Biostatistics and Health Data, Centre Antoine Lacassagne, University of Côte d'Azur, Nice, France
11
Wang L, Fu S, Wen A, Ruan X, He H, Liu S, Moon S, Mai M, Riaz IB, Wang N, Yang P, Xu H, Warner JL, Liu H. Assessment of Electronic Health Record for Cancer Research and Patient Care Through a Scoping Review of Cancer Natural Language Processing. JCO Clin Cancer Inform 2022; 6:e2200006. [PMID: 35917480 PMCID: PMC9470142 DOI: 10.1200/cci.22.00006] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 01/19/2022] [Revised: 03/18/2022] [Accepted: 06/15/2022] [Indexed: 11/20/2022] Open
Abstract
PURPOSE The advancement of natural language processing (NLP) has promoted the use of detailed textual data in electronic health records (EHRs) to support cancer research and to facilitate patient care. In this review, we aim to assess EHRs for cancer research and patient care by using the Minimal Common Oncology Data Elements (mCODE), which is a community-driven effort to define a minimal set of data elements for cancer research and practice. Specifically, we aim to assess the alignment of NLP-extracted data elements with mCODE and review existing NLP methodologies for extracting said data elements. METHODS Published literature was searched to retrieve cancer-related NLP articles that were written in English and published between January 2010 and September 2020 from main literature databases. After the retrieval, articles with EHRs as the data source were manually identified. A charting form was developed for relevant study analysis and used to categorize data under four main topics: metadata, EHR data and targeted cancer types, NLP methodology, and oncology data elements and standards. RESULTS A total of 123 publications were ultimately selected and included in our analysis. We found that, as expected, cancer research and patient care require some data elements beyond mCODE. Transparency and reproducibility are not sufficient in NLP methods, and inconsistency in NLP evaluation exists. CONCLUSION We conducted a comprehensive review of cancer NLP for research and patient care using EHR data. Issues and barriers for wide adoption of cancer NLP were identified and discussed.
Affiliation(s)
- Liwei Wang
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
| | - Sunyang Fu
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
| | - Andrew Wen
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
| | - Xiaoyang Ruan
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
| | - Huan He
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
| | - Sijia Liu
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
| | - Sungrim Moon
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
| | - Michelle Mai
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
| | - Irbaz B. Riaz
- Department of Hematology/Oncology, Mayo Clinic, Scottsdale, AZ
| | - Nan Wang
- Department of Computer Science and Engineering, College of Science and Engineering, University of Minnesota, Minneapolis, MN
| | - Ping Yang
- Department of Quantitative Health Sciences, Mayo Clinic, Scottsdale, AZ
| | - Hua Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX
| | - Jeremy L. Warner
- Departments of Medicine (Hematology/Oncology), Vanderbilt University, Nashville, TN
- Department Biomedical Informatics, Vanderbilt University, Nashville, TN
| | - Hongfang Liu
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
|
12
|
Zhou S, Wang N, Wang L, Liu H, Zhang R. CancerBERT: a cancer domain-specific language model for extracting breast cancer phenotypes from electronic health records. J Am Med Inform Assoc 2022; 29:1208-1216. [PMID: 35333345 PMCID: PMC9196678 DOI: 10.1093/jamia/ocac040] [Citation(s) in RCA: 26] [Impact Index Per Article: 13.0] [Received: 12/15/2021] [Revised: 03/06/2022] [Accepted: 03/09/2022] [Indexed: 11/16/2022] Open
Abstract
OBJECTIVE Accurate extraction of breast cancer patients' phenotypes is important for clinical decision support and clinical research. This study developed and evaluated cancer domain pretrained CancerBERT models for extracting breast cancer phenotypes from clinical texts. We also investigated the effect of customized cancer-related vocabulary on the performance of CancerBERT models. MATERIALS AND METHODS A cancer-related corpus of breast cancer patients was extracted from the electronic health records of a local hospital. We annotated named entities in 200 pathology reports and 50 clinical notes for 8 cancer phenotypes for fine-tuning and evaluation. We kept pretraining the BlueBERT model on the cancer corpus with expanded vocabularies (using both term frequency-based and manually reviewed methods) to obtain CancerBERT models. The CancerBERT models were evaluated and compared with other baseline models on the cancer phenotype extraction task. RESULTS All CancerBERT models outperformed all other models on the cancer phenotyping NER task. Both CancerBERT models with customized vocabularies outperformed the CancerBERT with the original BERT vocabulary. The CancerBERT model with manually reviewed customized vocabulary achieved the best performance with macro F1 scores equal to 0.876 (95% CI, 0.873-0.879) and 0.904 (95% CI, 0.902-0.906) for exact match and lenient match, respectively. CONCLUSIONS The CancerBERT models were developed to extract the cancer phenotypes in clinical notes and pathology reports. The results validated that using customized vocabulary may further improve the performances of domain specific BERT models in clinical NLP tasks. The CancerBERT models developed in the study would further help clinical decision support.
Affiliation(s)
- Sicheng Zhou
- Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA
| | - Nan Wang
- School of Statistics, University of Minnesota, Minneapolis, Minnesota, USA
| | - Liwei Wang
- Department of AI and Informatics Research, Mayo Clinic, Rochester, Minnesota, USA
| | - Hongfang Liu
- Department of AI and Informatics Research, Mayo Clinic, Rochester, Minnesota, USA
| | - Rui Zhang
- Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA.,Department of Pharmaceutical Care & Health Systems, University of Minnesota, Minneapolis, Minnesota, USA
|
13
|
Mehrpour O, Hoyte C, Goss F, Shirazi FM, Nakhaee S. Decision tree algorithm can determine the outcome of repeated supratherapeutic ingestion (RSTI) exposure to acetaminophen: review of 4500 national poison data system cases. Drug Chem Toxicol 2022:1-7. [DOI: 10.1080/01480545.2022.2083149] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 12/13/2022]
Affiliation(s)
- Omid Mehrpour
- Data Science Institute, Southern Methodist University, Dallas, TX, USA
- Denver Health and Hospital Authority, Denver, CO, USA
| | - Christopher Hoyte
- Department of Emergency Medicine, University of Colorado Hospital, Aurora, Colorado
| | - Foster Goss
- Department of Emergency Medicine, University of Colorado Hospital, Aurora, Colorado
| | - Farshad M. Shirazi
- Arizona Poison & Drug Information Center, University of Arizona, College of Pharmacy and University of Arizona, College of Medicine, Tucson, AZ, USA
| | - Samaneh Nakhaee
- Medical Toxicology and Drug Abuse Research Center (MTDRC), Birjand University of Medical Sciences (BUMS), Birjand, Iran
|
14
|
Blanchard AE, Gao S, Yoon HJ, Christian JB, Durbin EB, Wu XC, Stroup A, Doherty J, Schwartz SM, Wiggins C, Coyle L, Penberthy L, Tourassi GD. A Keyword-Enhanced Approach to Handle Class Imbalance in Clinical Text Classification. IEEE J Biomed Health Inform 2022; 26:2796-2803. [PMID: 35020599 PMCID: PMC9533247 DOI: 10.1109/jbhi.2022.3141976] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/29/2022]
Abstract
Recent applications of deep learning have shown promising results for classifying unstructured text in the healthcare domain. However, the reliability of models in production settings has been hindered by imbalanced data sets in which a small subset of the classes dominate. In the absence of adequate training data, rare classes necessitate additional model constraints for robust performance. Here, we present a strategy for incorporating short sequences of text (i.e. keywords) into training to boost model accuracy on rare classes. In our approach, we assemble a set of keywords, including short phrases, associated with each class. The keywords are then used as additional data during each batch of model training, resulting in a training loss that has contributions from both raw data and keywords. We evaluate our approach on classification of cancer pathology reports, which shows a substantial increase in model performance for rare classes. Furthermore, we analyze the impact of keywords on model output probabilities for bigrams, providing a straightforward method to identify model difficulties for limited training data.
|
15
|
A BERT model generates diagnostically relevant semantic embeddings from pathology synopses with active learning. Communications Medicine 2022; 1:11. [PMID: 35602188 PMCID: PMC9053264 DOI: 10.1038/s43856-021-00008-0] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 01/13/2021] [Accepted: 05/13/2021] [Indexed: 02/08/2023] Open
Abstract
Background Pathology synopses consist of semi-structured or unstructured text summarizing visual information by observing human tissue. Experts write and interpret these synopses with high domain-specific knowledge to extract tissue semantics and formulate a diagnosis in the context of ancillary testing and clinical information. The limited number of specialists available to interpret pathology synopses restricts the utility of the inherent information. Deep learning offers a tool for information extraction and automatic feature generation from complex datasets. Methods Using an active learning approach, we developed a set of semantic labels for bone marrow aspirate pathology synopses. We then trained a transformer-based deep-learning model to map these synopses to one or more semantic labels, and extracted learned embeddings (i.e., meaningful attributes) from the model's hidden layer. Results Here we demonstrate that with a small amount of training data, a transformer-based natural language model can extract embeddings from pathology synopses that capture diagnostically relevant information. On average, these embeddings can be used to generate semantic labels mapping patients to probable diagnostic groups with a micro-average F1 score of 0.779 ± 0.025. Conclusions We provide a generalizable deep learning model and approach to unlock the semantic information inherent in pathology synopses toward improved diagnostics, biodiscovery and AI-assisted computational pathology.
|
16
|
Vision for Improving Pregnancy Health: Innovation and the Future of Pregnancy Research. Reprod Sci 2022; 29:2908-2920. [PMID: 35534766 PMCID: PMC9537127 DOI: 10.1007/s43032-022-00951-w] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 03/04/2022] [Accepted: 04/15/2022] [Indexed: 10/25/2022]
Abstract
Understanding, predicting, and preventing pregnancy disorders have been a major research target. Nonetheless, the lack of progress is illustrated by research results related to preeclampsia and other hypertensive pregnancy disorders. These remain a major cause of maternal and infant mortality worldwide. There is a general consensus that the rate of progress toward understanding pregnancy disorders lags behind progress in other aspects of human health. In this presentation, we advance an explanation for this failure and suggest solutions. We propose that progress has been impeded by narrowly focused research training and limited imagination and innovation, resulting in the failure to think beyond conventional research approaches and analytical strategies. Investigations have been largely limited to hypothesis-generating approaches constrained by attempts to force poorly defined complex disorders into a single "unifying" hypothesis. Future progress could be accelerated by rethinking this approach. We advise taking advantage of innovative approaches that will generate new research strategies for investigating pregnancy abnormalities. Studies should begin before conception, assessing pregnancy longitudinally, before, during, and after pregnancy. Pregnancy disorders should be defined by pathophysiology rather than phenotype, and state of the art agnostic assessment of data should be adopted to generate new ideas. Taking advantage of new approaches mandates emphasizing innovation, inclusion of large datasets, and use of state of the art experimental and analytical techniques. A revolution in understanding pregnancy-associated disorders will depend on networks of scientists who are driven by an intense biological curiosity, a team spirit, and the tools to make new discoveries.
|
17
|
Yoo S, Yoon E, Boo D, Kim B, Kim S, Paeng JC, Yoo IR, Choi IY, Kim K, Ryoo HG, Lee SJ, Song E, Joo YH, Kim J, Lee HY. Transforming Thyroid Cancer Diagnosis and Staging Information from Unstructured Reports to the Observational Medical Outcome Partnership Common Data Model. Appl Clin Inform 2022; 13:521-531. [PMID: 35705182 PMCID: PMC9200482 DOI: 10.1055/s-0042-1748144] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022] Open
Abstract
BACKGROUND Cancer staging information is an essential component of cancer research. However, the information is primarily stored as either a full or semistructured free-text clinical document which is limiting the data use. By transforming the cancer-specific data to the Observational Medical Outcome Partnership Common Data Model (OMOP CDM), the information can contribute to establish multicenter observational cancer studies. To the best of our knowledge, there have been no studies on OMOP CDM transformation and natural language processing (NLP) for thyroid cancer to date. OBJECTIVE We aimed to demonstrate the applicability of the OMOP CDM oncology extension module for thyroid cancer diagnosis and cancer stage information by processing free-text medical reports. METHODS Thyroid cancer diagnosis and stage-related modifiers were extracted with rule-based NLP from 63,795 thyroid cancer pathology reports and 56,239 Iodine whole-body scan reports from three medical institutions in the Observational Health Data Sciences and Informatics data network. The data were converted into the OMOP CDM v6.0 according to the OMOP CDM oncology extension module. The cancer staging group was derived and populated using the transformed CDM data. RESULTS The extracted thyroid cancer data were completely converted into the OMOP CDM. The distributions of histopathological types of thyroid cancer were approximately 95.3 to 98.8% of papillary carcinoma, 0.9 to 3.7% of follicular carcinoma, 0.04 to 0.54% of adenocarcinoma, 0.17 to 0.81% of medullary carcinoma, and 0 to 0.3% of anaplastic carcinoma. Regarding cancer staging, stage-I thyroid cancer accounted for 55 to 64% of the cases, while stage III accounted for 24 to 26% of the cases. Stage-II and -IV thyroid cancers were detected at a low rate of 2 to 6%. 
CONCLUSION As a first study on OMOP CDM transformation and NLP for thyroid cancer, this study will help other institutions to standardize thyroid cancer-specific data for retrospective observational research and participate in multicenter studies.
Affiliation(s)
- Sooyoung Yoo
- Office of eHealth Research and Business, Healthcare Innovation Park, Seoul National University Bundang Hospital, Seongnam, South Korea
| | - Eunsil Yoon
- Office of eHealth Research and Business, Healthcare Innovation Park, Seoul National University Bundang Hospital, Seongnam, South Korea
| | - Dachung Boo
- Office of eHealth Research and Business, Healthcare Innovation Park, Seoul National University Bundang Hospital, Seongnam, South Korea
| | - Borham Kim
- Office of eHealth Research and Business, Healthcare Innovation Park, Seoul National University Bundang Hospital, Seongnam, South Korea
| | - Seok Kim
- Office of eHealth Research and Business, Healthcare Innovation Park, Seoul National University Bundang Hospital, Seongnam, South Korea
| | - Jin Chul Paeng
- Department of Nuclear Medicine, Seoul National University, College of Medicine, Seoul, South Korea
| | - Ie Ryung Yoo
- Division of Nuclear Medicine, Department of Radiology, Seoul St. Mary's Hospital, College of Medicine, The Catholic University of Korea, Seoul, South Korea
| | - In Young Choi
- Department of Medical Informatics, The Catholic University of Korea, College of Medicine, Seoul, South Korea.,Department of Biomedicine and Health Sciences, The Catholic University of Korea, College of Medicine, Seoul, South Korea
| | - Kwangsoo Kim
- Transdisciplinary Department of Medicine and Advanced Technology, Seoul National University Hospital, Seoul, South Korea
| | - Hyun Gee Ryoo
- Department of Nuclear Medicine, Seoul National University Hospital, Seoul, South Korea.,Department of Nuclear Medicine, Seoul National University Bundang Hospital, Seongnam, South Korea
| | - Sun Jung Lee
- Department of Medical Informatics, The Catholic University of Korea, College of Medicine, Seoul, South Korea.,Department of Biomedicine and Health Sciences, The Catholic University of Korea, College of Medicine, Seoul, South Korea
| | - Eunhye Song
- Department of Data Science Research, Innovative Medical Technology Research Institute, Seoul National University Hospital, Seoul, South Korea
| | - Young-Hwan Joo
- Biomedical Research Institute, Seoul National University Hospital, Seoul, South Korea
| | - Junmo Kim
- Interdisciplinary Program in Bioengineering, Seoul National University, Seoul, South Korea
| | - Ho-Young Lee
- Office of eHealth Research and Business, Healthcare Innovation Park, Seoul National University Bundang Hospital, Seongnam, South Korea.,Department of Nuclear Medicine, Seoul National University, College of Medicine, Seoul, South Korea
|
18
|
Yang X, Mu D, Peng H, Li H, Wang Y, Wang P, Wang Y, Han S. Research and Application of Artificial Intelligence Based on Electronic Health Records of Patients With Cancer: Systematic Review. JMIR Med Inform 2022; 10:e33799. [PMID: 35442195 PMCID: PMC9069295 DOI: 10.2196/33799] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/24/2021] [Revised: 01/24/2022] [Accepted: 03/14/2022] [Indexed: 01/12/2023] Open
Abstract
BACKGROUND With the accumulation of electronic health records and the development of artificial intelligence, patients with cancer urgently need new evidence of more personalized clinical and demographic characteristics and more sophisticated treatment and prevention strategies. However, no research has systematically analyzed the application and significance of artificial intelligence based on electronic health records in cancer care. OBJECTIVE The aim of this study was to conduct a review to introduce the current state and limitations of artificial intelligence based on electronic health records of patients with cancer and to summarize the performance of artificial intelligence in mining electronic health records and its impact on cancer care. METHODS Three databases were systematically searched to retrieve potentially relevant papers published from January 2009 to October 2020. Four principal reviewers assessed the quality of the papers and reviewed them for eligibility based on the inclusion criteria in the extracted data. The summary measures used in this analysis were the number and frequency of occurrence of the themes. RESULTS Of the 1034 papers considered, 148 papers met the inclusion criteria. Cancer care, especially cancers of female organs and digestive organs, could benefit from artificial intelligence based on electronic health records through cancer emergencies and prognostic estimates, cancer diagnosis and prediction, tumor stage detection, cancer case detection, and treatment pattern recognition. The models can always achieve an area under the curve of 0.7. Ensemble methods and deep learning are on the rise. In addition, electronic medical records in the existing studies are mainly in English and from private institutional databases. CONCLUSIONS Artificial intelligence based on electronic health records performed well and could be useful for cancer care. 
Improving the performance of artificial intelligence can help patients receive more scientific-based and accurate treatments. There is a need for the development of new methods and electronic health record data sharing and for increased passion and support from cancer specialists.
Affiliation(s)
- Xinyu Yang
- Division of Clinical Research, The First Hospital of Jilin University, Changchun, China
- Department of Medical Informatics, School of Public Health, Jilin University, Changchun, China
| | - Dongmei Mu
- Division of Clinical Research, The First Hospital of Jilin University, Changchun, China
- Department of Medical Informatics, School of Public Health, Jilin University, Changchun, China
| | - Hao Peng
- Division of Clinical Research, The First Hospital of Jilin University, Changchun, China
- Department of Medical Informatics, School of Public Health, Jilin University, Changchun, China
| | - Hua Li
- Division of Clinical Research, The First Hospital of Jilin University, Changchun, China
- Department of Medical Informatics, School of Public Health, Jilin University, Changchun, China
| | - Ying Wang
- Division of Clinical Research, The First Hospital of Jilin University, Changchun, China
- Department of Medical Informatics, School of Public Health, Jilin University, Changchun, China
| | - Ping Wang
- Division of Clinical Research, The First Hospital of Jilin University, Changchun, China
- Department of Medical Informatics, School of Public Health, Jilin University, Changchun, China
| | - Yue Wang
- Division of Clinical Research, The First Hospital of Jilin University, Changchun, China
- Department of Medical Informatics, School of Public Health, Jilin University, Changchun, China
| | - Siqi Han
- Division of Clinical Research, The First Hospital of Jilin University, Changchun, China
- Department of Medical Informatics, School of Public Health, Jilin University, Changchun, China
|
19
|
Santos T, Tariq A, Gichoya JW, Trivedi H, Banerjee I. Automatic Classification of Cancer Pathology Reports: A Systematic Review. J Pathol Inform 2022; 13:100003. [PMID: 35242443 PMCID: PMC8860734 DOI: 10.1016/j.jpi.2022.100003] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/10/2021] [Accepted: 11/12/2021] [Indexed: 11/30/2022] Open
Abstract
Pathology reports primarily consist of unstructured free text and thus the clinical information contained in the reports is not trivial to access or query. Multiple natural language processing (NLP) techniques have been proposed to automate the coding of pathology reports via text classification. In this systematic review, we follow the guidelines proposed by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA; Page et al., 2020: BMJ.) to identify the NLP systems for classifying pathology reports published between the years of 2010 and 2021. Based on our search criteria, a total of 3445 records were retrieved, and 25 articles met the final review criteria. We benchmarked the systems based on methodology, complexity of the prediction task and core types of NLP models: i) Rule-based and Intelligent systems, ii) statistical machine learning, and iii) deep learning. While certain tasks are well addressed by these models, many others have limitations and remain as open challenges, such as, extraction of many cancer characteristics (size, shape, type of cancer, others) from pathology reports. We investigated the final set of papers (25) and addressed their potential as well as their limitations. We hope that this systematic review helps researchers prioritize the development of innovated approaches to tackle the current limitations and help the advancement of cancer research.
Affiliation(s)
- Thiago Santos
- Department of Computer Science, Emory University, Atlanta, GA, USA
- Department of Biomedical Informatics, Emory School of Medicine, Atlanta, GA, USA
- Corresponding author.
| | - Amara Tariq
- Department of Radiology, Mayo Clinic, Phoenix, AZ, USA
| | - Judy Wawira Gichoya
- Department of Biomedical Informatics, Emory School of Medicine, Atlanta, GA, USA
- Department of Radiology, Emory School of Medicine, Atlanta, GA, USA
| | - Hari Trivedi
- Department of Biomedical Informatics, Emory School of Medicine, Atlanta, GA, USA
- Department of Radiology, Emory School of Medicine, Atlanta, GA, USA
| | - Imon Banerjee
- Department of Radiology, Mayo Clinic, Phoenix, AZ, USA
- Department of Computer Engineering, Arizona State University, AZ, USA
|
20
|
Arabahmadi M, Farahbakhsh R, Rezazadeh J. Deep Learning for Smart Healthcare-A Survey on Brain Tumor Detection from Medical Imaging. Sensors (Basel, Switzerland) 2022; 22:1960. [PMID: 35271115 PMCID: PMC8915095 DOI: 10.3390/s22051960] [Citation(s) in RCA: 25] [Impact Index Per Article: 12.5] [Received: 01/17/2022] [Revised: 02/18/2022] [Accepted: 02/28/2022] [Indexed: 12/13/2022]
Abstract
Advances in technology have been able to affect all aspects of human life. For example, the use of technology in medicine has made significant contributions to human society. In this article, we focus on technology assistance for one of the most common and deadly diseases to exist, which is brain tumors. Every year, many people die due to brain tumors; based on "braintumor" website estimation in the U.S., about 700,000 people have primary brain tumors, and about 85,000 people are added to this estimation every year. To solve this problem, artificial intelligence has come to the aid of medicine and humans. Magnetic resonance imaging (MRI) is the most common method to diagnose brain tumors. Additionally, MRI is commonly used in medical imaging and image processing to diagnose dissimilarity in different parts of the body. In this study, we conducted a comprehensive review on the existing efforts for applying different types of deep learning methods on the MRI data and determined the existing challenges in the domain followed by potential future directions. One of the branches of deep learning that has been very successful in processing medical images is CNN. Therefore, in this survey, various architectures of CNN were reviewed with a focus on the processing of medical images, especially brain MRI images.
Affiliation(s)
| | - Reza Farahbakhsh
- Institut Polytechnique de Paris, Telecom SudParis, 91000 Evry, France;
| | - Javad Rezazadeh
- North Tehran Branch, Azad University, Tehran 1667914161, Iran;
- Kent Institute Australia, Sydney, NSW 2000, Australia
|
21
|
Bashath S, Perera N, Tripathi S, Manjang K, Dehmer M, Streib FE. A data-centric review of deep transfer learning with applications to text data. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2021.11.061] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Indexed: 11/05/2022]
|
22
|
Beinecke J, Heider D. Gaussian noise up-sampling is better suited than SMOTE and ADASYN for clinical decision making. BioData Min 2021; 14:49. [PMID: 34844620 PMCID: PMC8628399 DOI: 10.1186/s13040-021-00283-6] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 05/17/2021] [Accepted: 11/10/2021] [Indexed: 02/08/2023] Open
Abstract
Clinical data sets have very special properties and suffer from many caveats in machine learning. They typically show a high class imbalance, have a small number of samples and a large number of parameters, and have missing values. While feature selection approaches and imputation techniques address the former problems, the class imbalance is typically addressed using augmentation techniques. However, these techniques have been developed for big data analytics, and their suitability for clinical data sets is unclear. This study analyzed different augmentation techniques for use in clinical data sets and subsequent employment of machine learning-based classification. It turns out that Gaussian Noise Up-Sampling (GNUS) is generally, though not always, as good as SMOTE and ADASYN and even outperforms them on some datasets. However, it has also been shown that augmentation does not improve classification at all in some cases.
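As a rough illustration of the up-sampling idea named in this abstract, the sketch below draws minority-class rows with replacement and perturbs each feature with zero-mean Gaussian noise. The function name `gaussian_noise_upsample`, the `noise_scale` parameter, and the choice to scale the noise by each feature's standard deviation are assumptions made for this example; the abstract does not specify how the cited study implements GNUS.

```python
import random
from typing import List, Sequence


def gaussian_noise_upsample(
    minority: Sequence[Sequence[float]],
    n_new: int,
    noise_scale: float = 0.1,  # hypothetical knob, not taken from the cited study
    seed: int = 0,
) -> List[List[float]]:
    """Create n_new synthetic rows by jittering randomly chosen minority rows."""
    if not minority:
        raise ValueError("minority must contain at least one row")
    rng = random.Random(seed)
    n_features = len(minority[0])

    # Per-feature (population) standard deviation, used to scale the noise.
    stds = []
    for j in range(n_features):
        column = [row[j] for row in minority]
        mean = sum(column) / len(column)
        variance = sum((x - mean) ** 2 for x in column) / len(column)
        stds.append(variance ** 0.5)

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)  # sample an existing row with replacement
        synthetic.append(
            [x + rng.gauss(0.0, noise_scale * s) for x, s in zip(base, stds)]
        )
    return synthetic


if __name__ == "__main__":
    rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
    for new_row in gaussian_noise_upsample(rows, n_new=3):
        print(new_row)
```

Unlike SMOTE and ADASYN, this sketch does not interpolate between neighbouring samples; each synthetic row stays near one original row, which keeps the procedure usable when the minority class has very few samples.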
Affiliation(s)
- Jacqueline Beinecke
- Department of Mathematics and Computer Science, Philipps-University of Marburg, Hans-Meerwein-Str. 6, 35043, Marburg, Germany
| | - Dominik Heider
- Department of Mathematics and Computer Science, Philipps-University of Marburg, Hans-Meerwein-Str. 6, 35043, Marburg, Germany.
|
23
|
Pironet A, Poirel HA, Tambuyzer T, De Schutter H, van Walle L, Mattheijssens J, Henau K, Van Eycken L, Van Damme N. Machine Learning-Based Extraction of Breast Cancer Receptor Status From Bilingual Free-Text Pathology Reports. Front Digit Health 2021; 3:692077. [PMID: 34713168 PMCID: PMC8522027 DOI: 10.3389/fdgth.2021.692077] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/07/2021] [Accepted: 07/19/2021] [Indexed: 11/13/2022] Open
Abstract
As part of its core business of gathering population-based information on new cancer diagnoses, the Belgian Cancer Registry receives free-text pathology reports, describing results of (pre-)malignant specimens. These reports are provided by 82 laboratories and written in 2 national languages, Dutch or French. For breast cancer, the reports characterize the status of estrogen receptor, progesterone receptor, and Erb-b2 receptor tyrosine kinase 2. These biomarkers are related with tumor growth and prognosis and are essential to define therapeutic management. The availability of population-scale information about their status in breast cancer patients can therefore be considered crucial to enrich real-world scientific studies and to guide public health policies regarding personalized medicine. The main objective of this study is to expand the data available at the Belgian Cancer Registry by automatically extracting the status of these biomarkers from the pathology reports. Various types of numeric features are computed from over 1,300 manually annotated reports linked to breast tumors diagnosed in 2014. A range of popular machine learning classifiers, such as support vector machines, random forests and logistic regressions, are trained on this data and compared using their F1 scores on a separate validation set. On a held-out test set, the best performing classifiers achieve F1 scores ranging from 0.89 to 0.92 for the four classification tasks. The extraction is thus reliable and allows to significantly increase the availability of this valuable information on breast cancer receptor status at a population level.
Affiliation(s)
- Kris Henau
- Belgian Cancer Registry, Brussels, Belgium
|
24
|
Gandouz M, Holzmann H, Heider D. Machine learning with asymmetric abstention for biomedical decision-making. BMC Med Inform Decis Mak 2021; 21:294. [PMID: 34702225 PMCID: PMC8549182 DOI: 10.1186/s12911-021-01655-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/11/2021] [Accepted: 10/13/2021] [Indexed: 02/08/2023] Open
Abstract
Machine learning and artificial intelligence have entered biomedical decision-making for diagnostics, prognostics, or therapy recommendations. However, these methods need to be interpreted with care because of the severe consequences for patients. In contrast to human decision-making, computational models typically make a decision also with low confidence. Machine learning with abstention better reflects human decision-making by introducing a reject option for samples with low confidence. The abstention intervals are typically symmetric intervals around the decision boundary. In the current study, we use asymmetric abstention intervals, which we demonstrate to be better suited for biomedical data that is typically highly imbalanced. We evaluate symmetric and asymmetric abstention on three real-world biomedical datasets and show that both approaches can significantly improve classification performance. However, asymmetric abstention rejects as many or fewer samples compared to symmetric abstention and thus, should be used in imbalanced data.
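The reject option described in this abstract can be pictured with a small sketch: a prediction is withheld when the predicted probability falls inside an interval around the decision boundary, and that interval need not be centred on the boundary. The function name `classify_with_abstention` and the default bounds 0.35 and 0.55 are hypothetical values chosen for illustration, not settings from the cited study.

```python
from typing import Optional


def classify_with_abstention(
    prob_positive: float,
    lower: float = 0.35,  # hypothetical lower bound of the abstention interval
    upper: float = 0.55,  # hypothetical upper bound of the abstention interval
) -> Optional[int]:
    """Return 1 or 0 for a confident prediction, or None to abstain."""
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("require 0 <= lower <= upper <= 1")
    if prob_positive >= upper:
        return 1
    if prob_positive <= lower:
        return 0
    return None  # probability lies inside the abstention interval


if __name__ == "__main__":
    for p in (0.10, 0.45, 0.58, 0.90):
        asymmetric = classify_with_abstention(p)            # interval (0.35, 0.55)
        symmetric = classify_with_abstention(p, 0.4, 0.6)   # interval (0.40, 0.60)
        print(p, asymmetric, symmetric)
```

With a decision boundary at 0.5, the default interval extends 0.15 below the boundary but only 0.05 above it, so the two classes are treated differently; a symmetric interval such as (0.4, 0.6) extends equally on both sides.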
Affiliation(s)
- Mariem Gandouz
- Department of Data Science in Biomedicine, Faculty of Mathematics and Computer Science, University of Marburg, 35032, Marburg, Germany
- Hajo Holzmann
- Department of Statistics, Faculty of Mathematics and Computer Science, University of Marburg, 35032, Marburg, Germany
- Dominik Heider
- Department of Data Science in Biomedicine, Faculty of Mathematics and Computer Science, University of Marburg, 35032, Marburg, Germany
25
Mehrpour O, Saeedi F, Hoyte C. Decision tree outcome prediction of acute acetaminophen exposure in the United States: A study of 30,000 cases from the National Poison Data System. Basic Clin Pharmacol Toxicol 2021; 130:191-199. [PMID: 34649297 DOI: 10.1111/bcpt.13674] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 08/05/2021] [Revised: 10/08/2021] [Accepted: 10/12/2021] [Indexed: 12/25/2022]
Abstract
Acetaminophen is one of the most commonly used analgesic drugs in the United States. However, the outcomes of acute acetaminophen overdose might be very serious in some cases. Therefore, prediction of the outcomes of acute acetaminophen exposure is crucial. This study is a 6-year retrospective cohort study using National Poison Data System (NPDS) data. A decision tree algorithm was used to determine the risk predictors of acetaminophen exposure. The decision tree model had an accuracy of 0.839, a precision of 0.836, a recall of 0.72, a specificity of 0.86 and an F1_score of 0.76 for the test group and an accuracy of 0.848, a precision of 0.85, a recall of 0.74, a specificity of 0.87 and an F1_score of 0.78 for the training group. Our results showed that elevated serum levels of liver enzymes, other liver function test abnormality, anorexia, acidosis, electrolyte abnormality, increased bilirubin, coagulopathy, abdominal pain, coma, increased anion gap, tachycardia and hypotension were the most important factors in determining the outcome of acute acetaminophen exposure. Therefore, the decision tree model is a reliable approach in determining the prognosis of acetaminophen exposure cases and can be used in an emergency room or during hospitalization.
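For readers comparing the figures in this abstract, the usual binary-classification metrics (accuracy, precision, recall, specificity, F1) all derive from the four cells of a confusion matrix. The sketch below shows the standard definitions. The counts are hypothetical and are not taken from the study, which may also have averaged across outcome classes.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute standard binary-classification metrics from confusion-matrix counts."""

    def ratio(numerator: float, denominator: float) -> float:
        # Return 0.0 instead of raising when a denominator is empty.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # also called sensitivity
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical counts for illustration only (not the NPDS data).
metrics = binary_metrics(tp=72, fp=14, tn=86, fn=28)
```

Because F1 is the harmonic mean of precision and recall, it always lies between the two and sits closer to the smaller one.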
Affiliation(s)
- Omid Mehrpour
- Data Science Institute, Southern Methodist University, Dallas, Texas, USA; Rocky Mountain Poison and Drug Safety, Denver Health and Hospital Authority, Denver, Colorado, USA
- Farhad Saeedi
- Student Research Committee, Birjand University of Medical Sciences, Birjand, Iran; Medical Toxicology and Drug Abuse Research Center (MTDRC), Birjand University of Medical Sciences (BUMS), Birjand, Iran
- Christopher Hoyte
- University of Colorado Anschutz Medical Campus, Aurora, Colorado, USA; University of Colorado Hospital, Aurora, Colorado, USA
26
Abedian S, Sholle ET, Adekkanattu PM, Cusick MM, Weiner SE, Shoag JE, Hu JC, Campion TR. Automated Extraction of Tumor Staging and Diagnosis Information From Surgical Pathology Reports. JCO Clin Cancer Inform 2021; 5:1054-1061. [PMID: 34694896 PMCID: PMC8812635 DOI: 10.1200/cci.21.00065] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 04/14/2021] [Revised: 08/25/2021] [Accepted: 09/29/2021] [Indexed: 11/20/2022] Open
Abstract
PURPOSE Typically stored as unstructured notes, surgical pathology reports contain data elements valuable to cancer research that require labor-intensive manual extraction. Although studies have described natural language processing (NLP) of surgical pathology reports to automate information extraction, efforts have focused on specific cancer subtypes rather than across multiple oncologic domains. To address this gap, we developed and evaluated an NLP method to extract tumor staging and diagnosis information across multiple cancer subtypes. METHODS The NLP pipeline was implemented on an open-source framework called Leo. We used a total of 555,681 surgical pathology reports of 329,076 patients to develop the pipeline and evaluated our approach on subsets of reports from patients with breast, prostate, colorectal, and randomly selected cancer subtypes. RESULTS Averaged across all four cancer subtypes, the NLP pipeline achieved an accuracy of 1.00 for International Classification of Diseases, Tenth Revision codes, 0.89 for T staging, 0.90 for N staging, and 0.97 for M staging. It achieved an F1 score of 1.00 for International Classification of Diseases, Tenth Revision codes, 0.88 for T staging, 0.90 for N staging, and 0.24 for M staging. CONCLUSION The NLP pipeline was developed to extract tumor staging and diagnosis information across multiple cancer subtypes to support the research enterprise in our institution. Although it was not possible to demonstrate generalizability of our NLP pipeline to other institutions, other institutions may find value in adopting a similar NLP approach (and reusing code available at GitHub) to support the oncology research enterprise with elements extracted from surgical pathology reports.
Affiliation(s)
- Sajjad Abedian
- Information Technologies and Services Department, Weill Cornell Medicine, New York, NY
- Evan T. Sholle
- Information Technologies and Services Department, Weill Cornell Medicine, New York, NY
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY
- Marika M. Cusick
- Information Technologies and Services Department, Weill Cornell Medicine, New York, NY
- Stephanie E. Weiner
- Information Technologies and Services Department, Weill Cornell Medicine, New York, NY
- Jonathan E. Shoag
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY
- Jim C. Hu
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY
- Thomas R. Campion
- Information Technologies and Services Department, Weill Cornell Medicine, New York, NY
- Department of Urology, Weill Cornell Medicine, New York, NY
- Clinical and Translational Science Center, Weill Cornell Medicine, New York, NY
- Department of Pediatrics, Weill Cornell Medicine, New York, NY
27
Conceição SIR, Couto FM. Text Mining for Building Biomedical Networks Using Cancer as a Case Study. Biomolecules 2021; 11:biom11101430. [PMID: 34680062 PMCID: PMC8533101 DOI: 10.3390/biom11101430] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 09/09/2021] [Revised: 09/24/2021] [Accepted: 09/27/2021] [Indexed: 12/15/2022] Open
Abstract
In the assembly of biological networks it is important to provide reliable interactions in an effort to have the most accurate possible representation of real-life systems. Commonly, the data used to build a network comes from diverse high-throughput assays; however, most of the interaction data is available through scientific literature. This has become a challenge with the notable increase in scientific literature being published, as it is hard for human curators to track all recent discoveries without using efficient tools to help them identify these interactions in an automatic way. This challenge can be overcome by using text mining approaches, which are capable of extracting knowledge from scientific documents. One of the most important tasks in text mining for biological network building is relation extraction, which identifies relations between the entities of interest. Many interaction databases already use text mining systems, and the development of these tools will lead to more reliable networks, as well as the possibility to personalize the networks by selecting the desired relations. This review will focus on different approaches of automatic information extraction from biomedical text that can be used to enhance existing networks or create new ones, such as deep learning state-of-the-art approaches, focusing on cancer disease as a case-study.
28
Albaradei S, Thafar M, Alsaedi A, Van Neste C, Gojobori T, Essack M, Gao X. Machine learning and deep learning methods that use omics data for metastasis prediction. Comput Struct Biotechnol J 2021; 19:5008-5018. [PMID: 34589181 PMCID: PMC8450182 DOI: 10.1016/j.csbj.2021.09.001] [Citation(s) in RCA: 65] [Impact Index Per Article: 21.7] [Received: 01/18/2021] [Revised: 08/16/2021] [Accepted: 09/02/2021] [Indexed: 12/14/2022] Open
Abstract
The knowledge that metastasis is the primary cause of cancer-related deaths has incentivized research directed towards unraveling the complex cellular processes that drive metastasis. Advancement in technology, and specifically the advent of high-throughput sequencing, provides knowledge of such processes. This knowledge led to the development of therapeutic and clinical applications, and is now being used to predict the onset of metastasis to improve diagnostics and disease therapies. In this regard, predicting metastasis onset has also been explored using artificial intelligence approaches based on machine learning and, more recently, deep learning. This review summarizes the different machine learning and deep learning-based metastasis prediction methods developed to date. We also detail the different types of molecular data used to build the models and the critical signatures derived from the different methods. We further highlight the challenges associated with using machine learning and deep learning methods, and provide suggestions to improve the predictive performance of such methods.
Key Words
- AE, autoencoder
- ANN, Artificial Neural Network
- AUC, area under the curve
- Acc, Accuracy
- Artificial intelligence
- BC, Betweenness centrality
- BH, Benjamini-Hochberg
- BioGRID, Biological General Repository for Interaction Datasets
- CCP, compound covariate predictor
- CEA, Carcinoembryonic antigen
- CNN, convolution neural networks
- CV, cross-validation
- Cancer
- DBN, deep belief network
- DDBN, discriminative deep belief network
- DEGs, differentially expressed genes
- DIP, Database of Interacting Proteins
- DNN, Deep neural network
- DT, Decision Tree
- Deep learning
- EMT, epithelial-mesenchymal transition
- FC, fully connected
- GA, Genetic Algorithm
- GANs, generative adversarial networks
- GEO, Gene Expression Omnibus
- HCC, hepatocellular carcinoma
- HPRD, Human Protein Reference Database
- KNN, K-nearest neighbor
- L-SVM, linear SVM
- LIMMA, linear models for microarray data
- LOOCV, Leave-one-out cross-validation
- LR, Logistic Regression
- MCCV, Monte Carlo cross-validation
- MLP, multilayer perceptron
- Machine learning
- Metastasis
- NPV, negative predictive value
- PCA, Principal component analysis
- PPI, protein-protein interaction
- PPV, positive predictive value
- RC, ridge classifier
- RF, Random Forest
- RFE, recursive feature elimination
- RMA, robust multi‐array average
- RNN, recurrent neural networks
- SGD, stochastic gradient descent
- SMOTE, synthetic minority over-sampling technique
- SVM, Support Vector Machine
- Se, sensitivity
- Sp, specificity
- TCGA, The Cancer Genome Atlas
- k-CV, k-fold cross validation
- mRMR, minimum redundancy maximum relevance
Affiliation(s)
- Somayah Albaradei
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- King Abdulaziz University, Faculty of Computing and Information Technology, Jeddah, Saudi Arabia
- Maha Thafar
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Taif University, College of Computers and Information Technology, Taif, Saudi Arabia
- Asim Alsaedi
- King Saud bin Abdulaziz University for Health Sciences, Jeddah, Saudi Arabia
- King Abdulaziz Medical City, Jeddah, Saudi Arabia
- Christophe Van Neste
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Takashi Gojobori
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Biological and Environmental Sciences and Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Magbubah Essack
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Xin Gao
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
29
Gao S, Alawad M, Young MT, Gounley J, Schaefferkoetter N, Yoon HJ, Wu XC, Durbin EB, Doherty J, Stroup A, Coyle L, Tourassi G. Limitations of Transformers on Clinical Text Classification. IEEE J Biomed Health Inform 2021; 25:3596-3607. [PMID: 33635801 PMCID: PMC8387496 DOI: 10.1109/jbhi.2021.3062322] [Citation(s) in RCA: 36] [Impact Index Per Article: 12.0] [Indexed: 01/22/2023]
Abstract
Bidirectional Encoder Representations from Transformers (BERT) and BERT-based approaches are the current state-of-the-art in many natural language processing (NLP) tasks; however, their application to document classification on long clinical texts is limited. In this work, we introduce four methods to scale BERT, which by default can only handle input sequences up to approximately 400 words long, to perform document classification on clinical texts several thousand words long. We compare these methods against two much simpler architectures - a word-level convolutional neural network and a hierarchical self-attention network - and show that BERT often cannot beat these simpler baselines when classifying MIMIC-III discharge summaries and SEER cancer pathology reports. In our analysis, we show that two key components of BERT - pretraining and WordPiece tokenization - may actually be inhibiting BERT's performance on clinical text classification tasks where the input document is several thousand words long and where correctly identifying labels may depend more on identifying a few key words or phrases rather than understanding the contextual meaning of sequences of text.
Affiliation(s)
- Shang Gao
- Oak Ridge National Laboratory, Oak Ridge, TN 37830
- John Gounley
- Oak Ridge National Laboratory, Oak Ridge, TN 37830
- Linda Coyle
- Information Management Services Inc., Calverton, MD 20705
30
Santus E, Schuster T, Tahmasebi AM, Li C, Yala A, Lanahan CR, Prinsen P, Thompson SF, Coons S, Mynderse L, Barzilay R, Hughes K. Exploiting Rules to Enhance Machine Learning in Extracting Information From Multi-Institutional Prostate Pathology Reports. JCO Clin Cancer Inform 2021; 4:865-874. [PMID: 33006906 DOI: 10.1200/cci.20.00028] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 12/13/2022] Open
Abstract
PURPOSE Literature on clinical note mining has highlighted the superiority of machine learning (ML) over hand-crafted rules. Nevertheless, most studies assume the availability of large training sets, which is rarely the case. For this reason, in the clinical setting, rules are still common. We suggest 2 methods to leverage the knowledge encoded in pre-existing rules to inform ML decisions and obtain high performance, even with scarce annotations. METHODS We collected 501 prostate pathology reports from 6 American hospitals. Reports were split into 2,711 core segments, annotated with 20 attributes describing the histology, grade, extension, and location of tumors. The data set was split by institutions to generate a cross-institutional evaluation setting. We assessed 4 systems, namely a rule-based approach, an ML model, and 2 hybrid systems integrating the previous methods: a Rule as Feature model and a Classifier Confidence model. Several ML algorithms were tested, including logistic regression (LR), support vector machine (SVM), and eXtreme gradient boosting (XGB). RESULTS When training on data from a single institution, LR lags behind the rules by 3.5% (F1 score: 92.2% v 95.7%). Hybrid models, instead, obtain competitive results, with Classifier Confidence outperforming the rules by +0.5% (96.2%). When a larger amount of data from multiple institutions is used, LR improves by +1.5% over the rules (97.2%), whereas hybrid systems obtain +2.2% for Rule as Feature (97.7%) and +2.6% for Classifier Confidence (98.3%). Replacing LR with SVM or XGB yielded similar performance gains. CONCLUSION We developed methods to use pre-existing handcrafted rules to inform ML algorithms. These hybrid systems obtain better performance than either rules or ML models alone, even when training data are limited.
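The abstract names two hybrid designs, Rule as Feature and Classifier Confidence, without spelling out their mechanics. The Python sketch below shows one plausible reading of each: the rule's output appended to the feature vector, and a fallback to the rule when the classifier is unsure. The toy rule, the labels, the 0.8 threshold, and the stub model are all hypothetical and should not be taken as the authors' implementation.

```python
from typing import Callable, List, Tuple


def toy_rule(segment: str) -> str:
    """Hypothetical hand-crafted rule: flag any segment that mentions carcinoma."""
    return "tumor" if "carcinoma" in segment.lower() else "benign"


def rule_as_feature(features: List[float], segment: str) -> List[float]:
    """Append the rule's output as one extra input feature for an ML model."""
    return features + [1.0 if toy_rule(segment) == "tumor" else 0.0]


def classifier_confidence(
    segment: str,
    ml_predict: Callable[[str], Tuple[str, float]],
    threshold: float = 0.8,  # assumed cut-off, not a value from the paper
) -> str:
    """Keep the ML label when the model is confident; otherwise defer to the rule."""
    label, confidence = ml_predict(segment)
    return label if confidence >= threshold else toy_rule(segment)


def unsure_model(segment: str) -> Tuple[str, float]:
    """Stub standing in for a trained classifier that is never confident."""
    return ("benign", 0.55)


augmented = rule_as_feature([0.2, 0.7], "Prostatic adenocarcinoma, Gleason 3+4")
fallback = classifier_confidence("Prostatic adenocarcinoma", unsure_model)
```

Under this reading, both designs let existing rules compensate for a classifier trained on few annotations, which matches the motivation stated in the abstract.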
Affiliation(s)
- Enrico Santus
- Department of Electrical Engineering and Computer Science, CSAIL, MIT, Cambridge, MA
- Tal Schuster
- Department of Electrical Engineering and Computer Science, CSAIL, MIT, Cambridge, MA
- Clara Li
- Department of Electrical Engineering and Computer Science, CSAIL, MIT, Cambridge, MA
- Adam Yala
- Department of Electrical Engineering and Computer Science, CSAIL, MIT, Cambridge, MA
- Conor R Lanahan
- Department of Oncology, Massachusetts General Hospital, Boston, MA
- Regina Barzilay
- Department of Electrical Engineering and Computer Science, CSAIL, MIT, Cambridge, MA
- Kevin Hughes
- Department of Oncology, Massachusetts General Hospital, Boston, MA
31
Alkaitis MS, Agrawal MN, Riely GJ, Razavi P, Sontag D. Automated NLP Extraction of Clinical Rationale for Treatment Discontinuation in Breast Cancer. JCO Clin Cancer Inform 2021; 5:550-560. [PMID: 33989016 DOI: 10.1200/cci.20.00139] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/20/2022] Open
Abstract
PURPOSE Key oncology end points are not routinely encoded into electronic medical records (EMRs). We assessed whether natural language processing (NLP) can abstract treatment discontinuation rationale from unstructured EMR notes to estimate toxicity incidence and progression-free survival (PFS). METHODS We constructed a retrospective cohort of 6,115 patients with early-stage and 701 patients with metastatic breast cancer initiating care at Memorial Sloan Kettering Cancer Center from 2008 to 2019. Each cohort was divided into training (70%), validation (15%), and test (15%) subsets. Human abstractors identified the clinical rationale associated with treatment discontinuation events. Concatenated EMR notes were used to train high-dimensional logistic regression and convolutional neural network models. Kaplan-Meier analyses were used to compare toxicity incidence and PFS estimated by our NLP models to estimates generated by manual labeling and time-to-treatment discontinuation (TTD). RESULTS Our best high-dimensional logistic regression models identified toxicity events in early-stage patients with an area under the curve of the receiver-operator characteristic of 0.857 ± 0.014 (standard deviation) and progression events in metastatic patients with an area under the curve of 0.752 ± 0.027 (standard deviation). NLP-extracted toxicity incidence and PFS curves were not significantly different from manually extracted curves (P = .95 and P = .67, respectively). By contrast, TTD overestimated toxicity in early-stage patients (P < .001) and underestimated PFS in metastatic patients (P < .001). Additionally, we tested an extrapolation approach in which 20% of the metastatic cohort were labeled manually, and NLP algorithms were used to abstract the remaining 80%. 
This extrapolated outcomes approach resolved PFS differences between receptor subtypes (P < .001 for hormone receptor+/human epidermal growth factor receptor 2- v human epidermal growth factor receptor 2+ v triple-negative) that could not be resolved with TTD. CONCLUSION NLP models are capable of abstracting treatment discontinuation rationale with minimal manual labeling.
Affiliation(s)
- Matthew S Alkaitis
- CSAIL & IMES, Massachusetts Institute of Technology, Cambridge, MA; Harvard Medical School, Boston, MA
- Monica N Agrawal
- CSAIL & IMES, Massachusetts Institute of Technology, Cambridge, MA
- Gregory J Riely
- Memorial Sloan Kettering Cancer Center, New York, NY; Weill-Cornell Medical College, New York, NY
- Pedram Razavi
- Memorial Sloan Kettering Cancer Center, New York, NY; Weill-Cornell Medical College, New York, NY
- David Sontag
- CSAIL & IMES, Massachusetts Institute of Technology, Cambridge, MA
32
Supervised line attention for tumor attribute classification from pathology reports: Higher performance with less data. J Biomed Inform 2021; 122:103872. [PMID: 34411709 DOI: 10.1016/j.jbi.2021.103872] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/19/2020] [Revised: 03/08/2021] [Accepted: 07/18/2021] [Indexed: 11/20/2022]
Abstract
OBJECTIVE We aim to build an accurate machine learning-based system for classifying tumor attributes from cancer pathology reports in the presence of a small amount of annotated data, motivated by the expensive and time-consuming nature of pathology report annotation. An enriched labeling scheme that includes the location of relevant information along with the final label is used along with a corresponding hierarchical method for classifying reports that leverages these enriched annotations. MATERIALS AND METHODS Our data consists of 250 colon cancer and 250 kidney cancer pathology reports from 2002 to 2019 at the University of California, San Francisco. For each report, we classify attributes such as procedure performed, tumor grade, and tumor site. For each attribute and document, an annotator trained by an oncologist labeled both the value of that attribute as well as the specific lines in the document that indicated the value. We develop a model that uses these enriched annotations that first predicts the relevant lines of the document, then predicts the final value given the predicted lines. We compare our model to multiple state-of-the-art methods for classifying tumor attributes from pathology reports. RESULTS Our results show that across colon and kidney cancers and varying training set sizes, our hierarchical method consistently outperforms state-of-the-art methods. Furthermore, performance comparable to these methods can be achieved with approximately half the amount of labeled data. CONCLUSION Document annotations that are enriched with location information are shown to greatly increase the sample efficiency of machine learning methods for classifying attributes of pathology reports.
33
Teo K, Yong CW, Chuah JH, Hum YC, Tee YK, Xia K, Lai KW. Current Trends in Readmission Prediction: An Overview of Approaches. Arabian Journal for Science and Engineering 2021; 48:1-18. [PMID: 34422543 PMCID: PMC8366485 DOI: 10.1007/s13369-021-06040-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/20/2021] [Accepted: 07/30/2021] [Indexed: 12/03/2022]
Abstract
Hospital readmission shortly after discharge threatens the quality of patient care and leads to increased medical care costs. In the United States, hospitals with high readmission rates are subject to federal financial penalties. This concern calls for incentives for healthcare facilities to reduce their readmission rates by predicting patients who are at high risk of readmission. Conventional practices involve the use of rule-based assessment scores and traditional statistical methods, such as logistic regression, in developing risk prediction models. The recent advancements in machine learning driven by improved computing power and sophisticated algorithms have the potential to produce highly accurate predictions. However, the value of such models could be overrated. Meanwhile, the use of other flexible models that leverage simple algorithms offers great transparency in terms of feature interpretation, which is beneficial in clinical settings. This work presents an overview of the current trends in risk prediction models developed in the field of readmission. The various techniques adopted by researchers in recent years are described, and the topic of whether complex models outperform simple ones in readmission risk stratification is investigated.
Affiliation(s)
- Kareen Teo
- Department of Biomedical Engineering, Universiti Malaya, 50603 Kuala Lumpur, Malaysia
- Ching Wai Yong
- Department of Biomedical Engineering, Universiti Malaya, 50603 Kuala Lumpur, Malaysia
- Joon Huang Chuah
- Department of Electrical Engineering, Universiti Malaya, 50603 Kuala Lumpur, Malaysia
- Yan Chai Hum
- Department of Mechatronics and Biomedical Engineering, Universiti Tunku Abdul Rahman, 43000 Sungai Long, Malaysia
- Yee Kai Tee
- Department of Mechatronics and Biomedical Engineering, Universiti Tunku Abdul Rahman, 43000 Sungai Long, Malaysia
- Kaijian Xia
- Changshu Institute of Technology, Changshu, 215500 Jiangsu China
- Khin Wee Lai
- Department of Biomedical Engineering, Universiti Malaya, 50603 Kuala Lumpur, Malaysia
34
Deshmukh PR, Phalnikar R. Prognostic elements extraction from documents to detect prognostic stage. Comput Methods Biomech Biomed Engin 2021; 25:371-386. [PMID: 34319178 DOI: 10.1080/10255842.2021.1955359] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
Abstract
For cancer prediction, the prognostic stage is the main factor that helps medical experts decide the optimal treatment for a patient. The main objective of this study is to predict prognostic stage from the medical records of various health institutions. A total of 465 pathological and clinical reports of people living with breast cancer were collected from India's reputed treatment institutions. Different anatomic and biologic factors are extracted from unstructured medical records using a novel combination of natural language processing (NLP) and fuzzy decision tree (FDT) for prognostic stage detection. This study has extracted the anatomic and biologic factors from medical reports with high accuracy. The average accuracy of the prognostic stage prediction was found to be 93% and 83% in rural and urban regions, respectively. A generalized method for cancer staging with great accuracy in different medical institutions from dissimilar regional areas suggests that the proposed research improves the prognosis of breast cancer.
Affiliation(s)
- Pratiksha R Deshmukh
- School of Computer Engineering and Technology, MIT World Peace University, Pune, India; Department of Computer Science and Information Technology, College of Engineering Pune, Pune, India
- Rashmi Phalnikar
- School of Computer Engineering and Technology, MIT World Peace University, Pune, India
35
Park B, Altieri N, DeNero J, Odisho AY, Yu B. Improving natural language information extraction from cancer pathology reports using transfer learning and zero-shot string similarity. JAMIA Open 2021; 4:ooab085. [PMID: 34604711 PMCID: PMC8484934 DOI: 10.1093/jamiaopen/ooab085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2021] [Revised: 09/06/2021] [Accepted: 09/22/2021] [Indexed: 11/16/2022] Open
Abstract
OBJECTIVE We develop natural language processing (NLP) methods capable of accurately classifying tumor attributes from pathology reports given minimal labeled examples. Our hierarchical cancer to cancer transfer (HCTC) and zero-shot string similarity (ZSS) methods are designed to exploit shared information between cancers and auxiliary class features, respectively, to boost performance using enriched annotations which give both location-based information and document level labels for each pathology report. MATERIALS AND METHODS Our data consists of 250 pathology reports each for kidney, colon, and lung cancer from 2002 to 2019 from a single institution (UCSF). For each report, we classified 5 attributes: procedure, tumor location, histology, grade, and presence of lymphovascular invasion. We develop novel NLP techniques involving transfer learning and string similarity trained on enriched annotations. We compare HCTC and ZSS methods to the state-of-the-art including conventional machine learning methods as well as deep learning methods. RESULTS For our HCTC method, we see an improvement of up to 0.1 micro-F1 score and 0.04 macro-F1 averaged across cancer and applicable attributes. For our ZSS method, we see an improvement of up to 0.26 micro-F1 and 0.23 macro-F1 averaged across cancer and applicable attributes. These comparisons are made after adjusting training data sizes to correct for the 20% increase in annotation time for enriched annotations compared to ordinary annotations. CONCLUSIONS Methods based on transfer learning across cancers and augmenting information methods with string similarity priors can significantly reduce the amount of labeled data needed for accurate information extraction from pathology reports.
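The zero-shot string similarity idea in this abstract, using the text of the class names themselves as a prior, can be illustrated with a small Python sketch. It assigns a report line to whichever candidate label it most resembles, with no labeled training examples. The abstract does not state which similarity measure the authors use; `difflib.SequenceMatcher` from the Python standard library is swapped in here purely as a stand-in, and the report line and label set are invented.

```python
from difflib import SequenceMatcher
from typing import Sequence


def most_similar_label(text: str, labels: Sequence[str]) -> str:
    """Return the candidate label whose name is most similar to the given text."""

    def similarity(label: str) -> float:
        # ratio() is 1.0 for identical strings and 0.0 when nothing matches.
        return SequenceMatcher(None, text.lower(), label.lower()).ratio()

    return max(labels, key=similarity)


# Hypothetical report line and label set, for illustration only.
procedure_labels = ["partial nephrectomy", "radical nephrectomy", "biopsy"]
predicted = most_similar_label("Procedure: partial nephrectomy", procedure_labels)
```

A class never seen in training can still be predicted this way, provided its name appears more or less verbatim in the relevant line of the report.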
Affiliation(s)
- Briton Park
- Department of Statistics, University of California, Berkeley, California, USA
- Nicholas Altieri
- Department of Statistics, University of California, Berkeley, California, USA
- John DeNero
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California, USA
- Anobel Y Odisho
- Department of Urology and Helen Diller Family Comprehensive Cancer Center, School of Medicine, University of California, San Francisco, California, USA
- Department of Epidemiology & Biostatistics, School of Medicine, University of California, San Francisco, California, USA
- Center for Digital Health Innovation, University of California, San Francisco, California, USA
- Bin Yu
- Department of Statistics, University of California, Berkeley, California, USA
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California, USA
- Chan-Zuckerberg Biohub, San Francisco, California, USA
36
Bitterman DS, Miller TA, Mak RH, Savova GK. Clinical Natural Language Processing for Radiation Oncology: A Review and Practical Primer. Int J Radiat Oncol Biol Phys 2021; 110:641-655. [PMID: 33545300 DOI: 10.1016/j.ijrobp.2021.01.044] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 06/25/2020] [Revised: 12/22/2020] [Accepted: 01/23/2021] [Indexed: 02/07/2023]
Abstract
Natural language processing (NLP), which aims to convert human language into expressions that can be analyzed by computers, is one of the most rapidly developing and widely used technologies in the field of artificial intelligence. Natural language processing algorithms convert unstructured free text data into structured data that can be extracted and analyzed at scale. In medicine, this unlocking of the rich, expressive data within clinical free text in electronic medical records will help untap the full potential of big data for research and clinical purposes. Recent major NLP algorithmic advances have significantly improved the performance of these algorithms, leading to a surge in academic and industry interest in developing tools to automate information extraction and phenotyping from clinical texts. Thus, these technologies are poised to transform medical research and alter clinical practices in the future. Radiation oncology stands to benefit from NLP algorithms if they are appropriately developed and deployed, as they may enable advances such as automated inclusion of radiation therapy details into cancer registries, discovery of novel insights about cancer care, and improved patient data curation and presentation at the point of care. However, challenges remain before the full value of NLP is realized, such as the plethora of jargon specific to radiation oncology, nonstandard nomenclature, a lack of publicly available labeled data for model development, and interoperability limitations between radiation oncology data silos. Successful development and implementation of high quality and high value NLP models for radiation oncology will require close collaboration between computer scientists and the radiation oncology community. Here, we present a primer on artificial intelligence algorithms in general and NLP algorithms in particular; provide guidance on how to assess the performance of such algorithms; review prior research on NLP algorithms for oncology; and describe future avenues for NLP in radiation oncology research and clinics.
Affiliation(s)
- Danielle S Bitterman
- Department of Radiation Oncology, Brigham and Women's Hospital/Dana-Farber Cancer Institute, Boston, Massachusetts; Computational Health Informatics Program, Boston Children's Hospital, Boston, Massachusetts; Artificial Intelligence in Medicine Program, Brigham and Women's Hospital, Boston, Massachusetts.
| | - Timothy A Miller
- Computational Health Informatics Program, Boston Children's Hospital, Boston, Massachusetts
| | - Raymond H Mak
- Department of Radiation Oncology, Brigham and Women's Hospital/Dana-Farber Cancer Institute, Boston, Massachusetts; Artificial Intelligence in Medicine Program, Brigham and Women's Hospital, Boston, Massachusetts
| | - Guergana K Savova
- Computational Health Informatics Program, Boston Children's Hospital, Boston, Massachusetts
| |
|
37
|
Management and outcomes of men diagnosed with primary breast cancer. Breast Cancer Res Treat 2021; 188:561-569. [PMID: 33830393 DOI: 10.1007/s10549-021-06174-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/08/2021] [Accepted: 03/03/2021] [Indexed: 12/18/2022]
Abstract
BACKGROUND Fewer than 1% of all breast cancers occur in men. As a result, a distinct lack of data exists regarding the management and outcomes in this cohort. METHODS All male patients with pathologically confirmed breast cancer diagnosed between August 2000 and October 2017 at either Massachusetts General Hospital or Brigham and Women's Hospital/Dana-Farber Cancer Institute and their affiliate satellite locations were included. Primary chart review was used to assess clinical and pathologic characteristics. Patient and treatment variables were reported via descriptive statistics. Local-regional failure (LRF), overall survival (OS), breast cancer-specific survival (BCSS), and disease-free survival (DFS) were estimated using the Kaplan-Meier method. RESULTS One hundred patients were included in this study. Median follow-up was 112 months (range 1-220 months). Approximately one third of patients experienced at least a 3-month delay to presentation. Eighty-three patients ultimately underwent mastectomy as definitive surgical treatment. Forty-six patients received adjuvant radiation therapy, and 37 patients received chemotherapy. Of 82 hormone receptor-positive patients with invasive cancer, 94% (n = 77) received endocrine therapy. Of the 58 patients who underwent genetic testing, 15 (26%) tested positive. The 5-year OS, BCSS, DFS, and LRF rates were 91.5%, 96.2%, 86%, and 4.8%, respectively. Delay to presentation was not associated with worse survival. CONCLUSIONS Male breast cancer remains a rare diagnosis. Despite this, the majority of patients in this study received standard of care therapy and experienced excellent oncologic outcomes. Penetration for genetic testing improved over time.
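The Kaplan-Meier method named above multiplies, at each observed event time, the fraction of at-risk patients who survive that time. A minimal sketch follows, assuming follow-up times in months and an event flag of 1 (event observed) or 0 (censored); the data are invented and `kaplan_meier` is a hypothetical helper, not the study's analysis code.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        # Everyone whose follow-up ends at time t leaves the risk set afterwards.
        removed = sum(1 for tt, _ in data if tt == t)
        deaths = sum(1 for tt, e in data if tt == t and e)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
        i += removed
    return curve


# Invented follow-up data: 1 = event observed, 0 = censored.
months = [6, 12, 12, 30, 60, 60]
observed = [1, 0, 1, 1, 0, 0]
for t, s in kaplan_meier(months, observed):
    print(f"S({t}) = {s:.3f}")
```

Censored patients reduce the size of the risk set without lowering the survival estimate, which is what distinguishes this estimator from a simple proportion of survivors.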
|
38
|
Ahmad Z, Rahim S, Zubair M, Abdul-Ghafar J. Artificial intelligence (AI) in medicine, current applications and future role with special emphasis on its potential and promise in pathology: present and future impact, obstacles including costs and acceptance among pathologists, practical and philosophical considerations. A comprehensive review. Diagn Pathol 2021; 16:24. [PMID: 33731170 PMCID: PMC7971952 DOI: 10.1186/s13000-021-01085-4] [Citation(s) in RCA: 51] [Impact Index Per Article: 17.0] [Received: 11/09/2020] [Accepted: 03/04/2021] [Indexed: 02/08/2023] Open
Abstract
BACKGROUND The role of Artificial intelligence (AI) which is defined as the ability of computers to perform tasks that normally require human intelligence is constantly expanding. Medicine was slow to embrace AI. However, the role of AI in medicine is rapidly expanding and promises to revolutionize patient care in the coming years. In addition, it has the ability to democratize high level medical care and make it accessible to all parts of the world. MAIN TEXT Among specialties of medicine, some like radiology were relatively quick to adopt AI whereas others especially pathology (and surgical pathology in particular) are only just beginning to utilize AI. AI promises to play a major role in accurate diagnosis, prognosis and treatment of cancers. In this paper, the general principles of AI are defined first followed by a detailed discussion of its current role in medicine. In the second half of this comprehensive review, the current and future role of AI in surgical pathology is discussed in detail including an account of the practical difficulties involved and the fear of pathologists of being replaced by computer algorithms. A number of recent studies which demonstrate the usefulness of AI in the practice of surgical pathology are highlighted. CONCLUSION AI has the potential to transform the practice of surgical pathology by ensuring rapid and accurate results and enabling pathologists to focus on higher level diagnostic and consultative tasks such as integrating molecular, morphologic and clinical information to make accurate diagnosis in difficult cases, determine prognosis objectively and in this way contribute to personalized care.
Affiliation(s)
- Zubair Ahmad
- Department of Pathology and Laboratory Medicine, Aga Khan University Hospital, Karachi, Pakistan
| | - Shabina Rahim
- Department of Pathology and Laboratory Medicine, Aga Khan University Hospital, Karachi, Pakistan
| | - Maha Zubair
- Department of Pathology and Laboratory Medicine, Aga Khan University Hospital, Karachi, Pakistan
| | - Jamshid Abdul-Ghafar
- Department of Pathology and Clinical Laboratory, French Medical Institute for Mothers and Children (FMIC), Kabul, Afghanistan.
| |
|
39
|
Alawad M, Gao S, Qiu JX, Yoon HJ, Blair Christian J, Penberthy L, Mumphrey B, Wu XC, Coyle L, Tourassi G. Automatic extraction of cancer registry reportable information from free-text pathology reports using multitask convolutional neural networks. J Am Med Inform Assoc 2021; 27:89-98. [PMID: 31710668 PMCID: PMC7489089 DOI: 10.1093/jamia/ocz153] [Citation(s) in RCA: 44] [Impact Index Per Article: 14.7] [Received: 04/26/2019] [Revised: 07/09/2019] [Accepted: 07/22/2019] [Indexed: 11/13/2022] Open
Abstract
OBJECTIVE We implement 2 different multitask learning (MTL) techniques, hard parameter sharing and cross-stitch, to train a word-level convolutional neural network (CNN) specifically designed for automatic extraction of cancer data from unstructured text in pathology reports. We show the importance of learning related information extraction (IE) tasks leveraging shared representations across the tasks to achieve state-of-the-art performance in classification accuracy and computational efficiency. MATERIALS AND METHODS Multitask CNN (MTCNN) attempts to tackle document information extraction by learning to extract multiple key cancer characteristics simultaneously. We trained our MTCNN to perform 5 information extraction tasks: (1) primary cancer site (65 classes), (2) laterality (4 classes), (3) behavior (3 classes), (4) histological type (63 classes), and (5) histological grade (5 classes). We evaluated the performance on a corpus of 95 231 pathology documents (71 223 unique tumors) obtained from the Louisiana Tumor Registry. We compared the performance of the MTCNN models against single-task CNN models and 2 traditional machine learning approaches, namely support vector machine (SVM) and random forest classifier (RFC). RESULTS MTCNNs offered superior performance across all 5 tasks in terms of classification accuracy as compared with the other machine learning models. Based on retrospective evaluation, the hard parameter sharing and cross-stitch MTCNN models correctly classified 59.04% and 57.93% of the pathology reports respectively across all 5 tasks. The baseline models achieved 53.68% (CNN), 46.37% (RFC), and 36.75% (SVM). Based on prospective evaluation, the percentages of correctly classified cases across the 5 tasks were 60.11% (hard parameter sharing), 58.13% (cross-stitch), 51.30% (single-task CNN), 42.07% (RFC), and 35.16% (SVM). Moreover, hard parameter sharing MTCNNs outperformed the other models in computational efficiency by using about the same number of trainable parameters as a single-task CNN. CONCLUSIONS The hard parameter sharing MTCNN offers superior classification accuracy for automated coding support of pathology documents across a wide range of cancers and multiple information extraction tasks while maintaining similar training and inference time as those of a single task-specific model.
Affiliation(s)
- Mohammed Alawad
- Computational Sciences and Engineering Division, Health Data Sciences Institute, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
| | - Shang Gao
- Computational Sciences and Engineering Division, Health Data Sciences Institute, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
| | - John X Qiu
- Computational Sciences and Engineering Division, Health Data Sciences Institute, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
| | - Hong Jun Yoon
- Computational Sciences and Engineering Division, Health Data Sciences Institute, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
| | - J Blair Christian
- Computational Sciences and Engineering Division, Health Data Sciences Institute, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
| | - Lynne Penberthy
- Surveillance Research Program, Division of Cancer Control and Population Sciences, National Cancer Institute, Bethesda, Maryland, USA
| | - Brent Mumphrey
- Louisiana Tumor Registry, Louisiana State University Health Sciences Center School of Public Health, New Orleans, Louisiana, USA
| | - Xiao-Cheng Wu
- Louisiana Tumor Registry, Louisiana State University Health Sciences Center School of Public Health, New Orleans, Louisiana, USA
| | - Linda Coyle
- Information Management Services Inc, Calverton, Maryland, USA
| | - Georgia Tourassi
- Computational Sciences and Engineering Division, Health Data Sciences Institute, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
| |
|
40
|
Validation of deep learning natural language processing algorithm for keyword extraction from pathology reports in electronic health records. Sci Rep 2020; 10:20265. [PMID: 33219276 PMCID: PMC7679382 DOI: 10.1038/s41598-020-77258-w] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 06/24/2020] [Accepted: 11/05/2020] [Indexed: 11/20/2022] Open
Abstract
Pathology reports contain essential data for both clinical and research purposes. However, extracting meaningful, qualitative data from the original document is difficult because of the narrative and complex nature of such reports. Keyword extraction for pathology reports is necessary to summarize the informative text and reduce the time spent reviewing it. In this study, we employed a deep learning model for natural language processing to extract keywords from pathology reports and presented a supervised keyword extraction algorithm. We considered three types of pathological keywords, namely specimen, procedure, and pathology types. We compared the performance of the present algorithm with conventional keyword extraction methods on 3115 pathology reports that were manually labeled by professional pathologists. Additionally, we applied the present algorithm to 36,014 unlabeled pathology reports and analysed the extracted keywords with biomedical vocabulary sets. The results demonstrated the suitability of our model for practical application in extracting important data from pathology reports.
|
41
|
Odisho AY, Park B, Altieri N, DeNero J, Cooperberg MR, Carroll PR, Yu B. Natural language processing systems for pathology parsing in limited data environments with uncertainty estimation. JAMIA Open 2020; 3:431-438. [PMID: 33381748 PMCID: PMC7751177 DOI: 10.1093/jamiaopen/ooaa029] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 04/21/2020] [Revised: 06/09/2020] [Accepted: 07/13/2020] [Indexed: 12/05/2022] Open
Abstract
OBJECTIVE Cancer is a leading cause of death, but much of the diagnostic information is stored as unstructured data in pathology reports. We aim to improve uncertainty estimates of machine learning-based pathology parsers and evaluate performance in low data settings. MATERIALS AND METHODS Our data comes from the Urologic Outcomes Database at UCSF which includes 3232 annotated prostate cancer pathology reports from 2001 to 2018. We approach 17 separate information extraction tasks, involving a wide range of pathologic features. To handle the diverse range of fields, we required 2 statistical models, a document classification method for pathologic features with a small set of possible values and a token extraction method for pathologic features with a large set of values. For each model, we used isotonic calibration to improve the model's estimates of its likelihood of being correct. RESULTS Our best document classifier method, a convolutional neural network, achieves a weighted F1 score of 0.97 averaged over 12 fields and our best extraction method achieves an accuracy of 0.93 averaged over 5 fields. The performance saturates as a function of dataset size with as few as 128 data points. Furthermore, while our document classifier methods have reliable uncertainty estimates, our extraction-based methods do not, but after isotonic calibration, expected calibration error drops to below 0.03 for all extraction fields. CONCLUSIONS We find that when applying machine learning to pathology parsing, large datasets may not always be needed, and that calibration methods can improve the reliability of uncertainty estimates.
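Isotonic calibration and expected calibration error (ECE) can be illustrated with a small standard-library sketch. It fits a non-decreasing step function with the pool-adjacent-violators algorithm and then measures ECE before and after. The scores, outcomes, and function names are assumptions made for illustration and do not reproduce the authors' pipeline; calibrating and evaluating on the same data, as done here for brevity, gives an optimistic ECE.

```python
from itertools import groupby


def pav_fit(scores, outcomes):
    """Pool-adjacent-violators fit; returns [(upper_score, calibrated_prob)]."""
    pairs = sorted(zip(scores, outcomes))
    blocks = []  # each block: [sum_of_outcomes, count, largest_score_in_block]
    for score, group in groupby(pairs, key=lambda p: p[0]):
        ys = [y for _, y in group]
        blocks.append([float(sum(ys)), len(ys), score])
        # Merge backwards while a block's mean exceeds the mean of the next block.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]


def pav_predict(model, score):
    """Look up the calibrated probability for a raw score."""
    for upper, value in model:
        if score <= upper:
            return value
    return model[-1][1]


def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted mean gap between average confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    ece = 0.0
    for b in bins:
        if b:
            confidence = sum(p for p, _ in b) / len(b)
            accuracy = sum(y for _, y in b) / len(b)
            ece += len(b) / len(probs) * abs(confidence - accuracy)
    return ece


# Invented, overconfident extraction scores; 1 means the extraction was correct.
raw = [0.65] * 4 + [0.75] * 2 + [0.95] * 4
correct = [1, 0, 0, 0, 0, 0, 1, 1, 0, 0]
model = pav_fit(raw, correct)
calibrated = [pav_predict(model, s) for s in raw]
print(expected_calibration_error(raw, correct))
print(expected_calibration_error(calibrated, correct))
```

In practice the step function would be fitted on a held-out calibration set and ECE reported on separate test data.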
Affiliation(s)
- Anobel Y Odisho
- Department of Urology, UCSF Helen Diller Family Comprehensive Cancer Center, San Francisco, California, USA
| | - Briton Park
- Department of Statistics, University of California, Berkeley, California, USA
| | - Nicholas Altieri
- Department of Statistics, University of California, Berkeley, California, USA
| | - John DeNero
- Department of Electrical Engineering and Computer Science, University of California, Berkeley, California, USA
| | - Matthew R Cooperberg
- Department of Urology, UCSF Helen Diller Family Comprehensive Cancer Center, San Francisco, California, USA
- Department of Epidemiology & Biostatistics, University of California, San Francisco, California, USA
| | - Peter R Carroll
- Department of Urology, UCSF Helen Diller Family Comprehensive Cancer Center, San Francisco, California, USA
| | - Bin Yu
- Department of Statistics, University of California, Berkeley, California, USA
- Department of Electrical Engineering and Computer Science, University of California, Berkeley, California, USA
- Chan-Zuckerberg Biohub, San Francisco, California, USA
| |
|
42
|
Yoon HJ, Klasky HB, Gounley JP, Alawad M, Gao S, Durbin EB, Wu XC, Stroup A, Doherty J, Coyle L, Penberthy L, Blair Christian J, Tourassi GD. Accelerated training of bootstrap aggregation-based deep information extraction systems from cancer pathology reports. J Biomed Inform 2020; 110:103564. [PMID: 32919043 DOI: 10.1016/j.jbi.2020.103564] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 04/13/2020] [Revised: 07/14/2020] [Accepted: 09/06/2020] [Indexed: 12/24/2022]
Abstract
OBJECTIVE In machine learning, it is evident that classification task performance improves when bootstrap aggregation (bagging) is applied. However, the bagging of deep neural networks takes tremendous amounts of computational resources and training time. The research question that we aimed to answer in this research is whether we could achieve higher task performance scores and accelerate the training by dividing a problem into sub-problems. MATERIALS AND METHODS The data used in this study consist of free text from electronic cancer pathology reports. We applied bagging and partitioned data training using Multi-Task Convolutional Neural Network (MT-CNN) and Multi-Task Hierarchical Convolutional Attention Network (MT-HCAN) classifiers. We split a big problem into 20 sub-problems, resampled the training cases 2,000 times, and trained the deep learning model for each bootstrap sample and each sub-problem, thus generating up to 40,000 models. We performed the training of many models concurrently in a high-performance computing environment at Oak Ridge National Laboratory (ORNL). RESULTS We demonstrated that aggregation of the models improves task performance compared with the single-model approach, which is consistent with other research studies; and we demonstrated that the two proposed partitioned bagging methods achieved higher classification accuracy scores on four tasks. Notably, the improvements were significant for the extraction of cancer histology data, which had more than 500 class labels in the task; these results show that data partition may alleviate the complexity of the task. On the contrary, the methods did not achieve superior scores for the tasks of site and subsite classification. Intrinsically, since data partitioning was based on the primary cancer site, the accuracy depended on the determination of the partitions, which needs further investigation and improvement. CONCLUSION Results in this research demonstrate that (1) the data partitioning and bagging strategy achieved higher performance scores, and (2) we achieved faster training leveraged by the high-performance Summit supercomputer at ORNL.
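The bagging idea in this abstract (resample the training cases, train one model per bootstrap sample, aggregate the predictions) can be sketched with a toy base learner. Everything below is an assumption for illustration: the one-feature nearest-centroid learner, the site labels, and the sample data stand in for the deep networks and registry data used in the study.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) training cases with replacement."""
    return [rng.choice(data) for _ in data]


def train_centroid_classifier(sample):
    """Toy base learner: nearest class centroid on one numeric feature."""
    sums, counts = Counter(), Counter()
    for x, label in sample:
        sums[label] += x
        counts[label] += 1
    centroids = {label: sums[label] / counts[label] for label in counts}
    return lambda x: min(centroids, key=lambda label: abs(x - centroids[label]))


def bagged_predict(models, x):
    """Aggregate the ensemble by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
data = [(0.1, "colon"), (0.2, "colon"), (0.3, "colon"),
        (0.8, "lung"), (0.9, "lung"), (1.0, "lung")]
models = [train_centroid_classifier(bootstrap_sample(data, rng)) for _ in range(25)]
print(bagged_predict(models, 0.15), bagged_predict(models, 0.95))
```

Partitioned bagging as described above would additionally split the training data into sub-problems and build one such ensemble per partition, which is how 20 sub-problems and 2,000 resamples lead to up to 40,000 models.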
Affiliation(s)
- Hong-Jun Yoon
- Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830, United States of America.
| | - Hilda B Klasky
- Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830, United States of America.
| | - John P Gounley
- Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830, United States of America.
| | - Mohammed Alawad
- Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830, United States of America.
| | - Shang Gao
- Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830, United States of America.
| | - Eric B Durbin
- College of Medicine, University of Kentucky, Lexington, KY 40536, United States of America.
| | - Xiao-Cheng Wu
- Louisiana Tumor Registry, Louisiana State University Health Sciences Center, School of Public Health, New Orleans, LA 70112, United States of America.
| | - Antoinette Stroup
- New Jersey State Cancer Registry, Rutgers Cancer Institute of New Jersey, New Brunswick, NJ, 08901, United States of America.
| | - Jennifer Doherty
- Utah Cancer Registry, University of Utah School of Medicine, Salt Lake City, UT 84132, United States of America.
| | - Linda Coyle
- Information Management Services Inc., Calverton, MD 20705, United States of America.
| | - Lynne Penberthy
- Surveillance Research Program, Division of Cancer Control and Population Sciences, National Cancer Institute, Bethesda, MD 20814, United States of America.
| | - J Blair Christian
- Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830, United States of America.
| | - Georgia D Tourassi
- National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, TN 37830, United States of America.
| |
|
43
|
Novel Body Composition Predictors of Outcome in Patients With Angiosarcoma of the Breast: A Preliminary Study. J Comput Assist Tomogr 2020; 44:605-609. [DOI: 10.1097/rct.0000000000001066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
|
44
|
Deng Z, Yin K, Bao Y, Armengol VD, Wang C, Tiwari A, Barzilay R, Parmigiani G, Braun D, Hughes KS. Validation of a Semiautomated Natural Language Processing-Based Procedure for Meta-Analysis of Cancer Susceptibility Gene Penetrance. JCO Clin Cancer Inform 2020; 3:1-9. [PMID: 31419182 DOI: 10.1200/cci.19.00043] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Indexed: 11/20/2022] Open
Abstract
PURPOSE Quantifying the risk of cancer associated with pathogenic mutations in germline cancer susceptibility genes (that is, penetrance) enables the personalization of preventive management strategies. Conducting a meta-analysis is the best way to obtain robust risk estimates. We have previously developed a natural language processing (NLP)-based abstract classifier which classifies abstracts as relevant to penetrance, prevalence of mutations, both, or neither. In this work, we evaluate the performance of this NLP-based procedure. MATERIALS AND METHODS We compared the semiautomated NLP-based procedure, which involves automated abstract classification and text mining, followed by human review of identified studies, with the traditional procedure that requires human review of all studies. Ten high-quality gene-cancer penetrance meta-analyses spanning 16 gene-cancer associations were used as the gold standard by which to evaluate the performance of our procedure. For each meta-analysis, we evaluated the number of abstracts that required human review (workload) and the ability to identify the studies that were included by the authors in their quantitative analysis (coverage). RESULTS Compared with the traditional procedure, the semiautomated NLP-based procedure led to a lower workload across all 10 meta-analyses, with an overall 84% reduction (2,774 abstracts v 16,941 abstracts) in the amount of human review required. Overall coverage was 93% (we were able to identify 132 of 142 studies) before reviewing references of identified studies. Reasons for the 10 missed studies included blank and poorly written abstracts. After reviewing references, nine of the previously missed studies were identified and coverage improved to 99% (141 of 142 studies). CONCLUSION We demonstrated that an NLP-based procedure can significantly reduce the review workload without compromising the ability to identify relevant studies. NLP algorithms have promising potential for reducing human efforts in the literature review process.
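The workload and coverage figures quoted in this abstract follow directly from the reported counts. The short check below recomputes them; the function names are hypothetical and only the counts come from the abstract.

```python
def workload_reduction(reviewed, total):
    """Fraction of abstracts that no longer need human review."""
    return 1 - reviewed / total


def coverage(identified, included):
    """Fraction of gold-standard studies that the procedure identified."""
    return identified / included


# Counts as reported in the abstract.
print(f"workload reduction: {workload_reduction(2774, 16941):.1%}")
print(f"coverage before reference review: {coverage(132, 142):.1%}")
print(f"coverage after reference review: {coverage(141, 142):.1%}")
```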
Affiliation(s)
| | - Kanhua Yin
- Massachusetts General Hospital, Boston, MA
| | - Yujia Bao
- Massachusetts Institute of Technology, Boston, MA
| | | | - Cathy Wang
- Harvard TH Chan School of Public Health, Boston, MA; Dana-Farber Cancer Institute, Boston, MA
| | | | | | - Giovanni Parmigiani
- Harvard TH Chan School of Public Health, Boston, MA; Dana-Farber Cancer Institute, Boston, MA
| | - Danielle Braun
- Harvard TH Chan School of Public Health, Boston, MA; Dana-Farber Cancer Institute, Boston, MA
| | - Kevin S Hughes
- Massachusetts General Hospital, Boston, MA; Harvard Medical School, Boston, MA
| |
|
45
|
Oliwa T, Maron SB, Chase LM, Lomnicki S, Catenacci DVT, Furner B, Volchenboum SL. Obtaining Knowledge in Pathology Reports Through a Natural Language Processing Approach With Classification, Named-Entity Recognition, and Relation-Extraction Heuristics. JCO Clin Cancer Inform 2020; 3:1-8. [PMID: 31365274 DOI: 10.1200/cci.19.00008] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 11/20/2022] Open
Abstract
PURPOSE Robust institutional tumor banks depend on continuous sample curation or else subsequent biopsy or resection specimens are overlooked after initial enrollment. Curation automation is hindered by semistructured free-text clinical pathology notes, which complicate data abstraction. Our motivation is to develop a natural language processing method that dynamically identifies existing pathology specimen elements necessary for locating specimens for future use in a manner that can be re-implemented by other institutions. PATIENTS AND METHODS Pathology reports from patients with gastroesophageal cancer enrolled in The University of Chicago GI oncology tumor bank were used to train and validate a novel composite natural language processing-based pipeline with a supervised machine learning classification step to separate notes into internal (primary review) and external (consultation) reports; a named-entity recognition step to obtain label (accession number), location, date, and sublabels (block identifiers); and a results proofreading step. RESULTS We analyzed 188 pathology reports, including 82 internal reports and 106 external consult reports, and successfully extracted named entities grouped as sample information (label, date, location). Our approach identified up to 24 additional unique samples in external consult notes that could have been overlooked. Our classification model obtained 100% accuracy on the basis of 10-fold cross-validation. Precision, recall, and F1 for class-specific named-entity recognition models show strong performance. CONCLUSION Through a combination of natural language processing and machine learning, we devised a re-implementable and automated approach that can accurately extract specimen attributes from semistructured pathology notes to dynamically populate a tumor registry.
Affiliation(s)
| | | | - Leah M Chase
- The University of Chicago Medical Center, Chicago, IL
| | | | | | | | - Samuel L Volchenboum
- The University of Chicago, Chicago, IL; The University of Chicago Medical Center, Chicago, IL
| |
|
46
|
Schwarz J, Heider D. GUESS: projecting machine learning scores to well-calibrated probability estimates for clinical decision-making. Bioinformatics 2020; 35:2458-2465. [PMID: 30496351 DOI: 10.1093/bioinformatics/bty984] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 10/17/2018] [Revised: 11/21/2018] [Accepted: 11/28/2018] [Indexed: 12/19/2022] Open
Abstract
MOTIVATION Clinical decision support systems have been applied in numerous fields, ranging from cancer survival to drug resistance prediction. Nevertheless, clinical decision support systems typically have a caveat: many of them are perceived as black boxes by non-experts and, unfortunately, the obtained scores cannot usually be interpreted as class probability estimates. In probability-focused medical applications, it is not sufficient to perform well with regard to discrimination and, consequently, various calibration methods have been developed to enable probabilistic interpretation. The aims of this study were (i) to develop a tool for fast and comparative analysis of different calibration methods, (ii) to demonstrate their limitations for use on clinical data and (iii) to introduce our novel method GUESS. RESULTS We compared the performances of two different state-of-the-art calibration methods, namely histogram binning and Bayesian Binning in Quantiles, as well as our novel method GUESS on both simulated and real-world datasets. GUESS demonstrated calibration performance comparable to the state-of-the-art methods and always retained accurate class discrimination. GUESS showed superior calibration performance in small datasets and therefore may be an optimal calibration method for typical clinical datasets. Moreover, we provide a framework (CalibratR) for R, which can be used to identify the most suitable calibration method for novel datasets in a timely and efficient manner. Using calibrated probability estimates instead of original classifier scores will contribute to the acceptance and dissemination of machine learning based classification models in cost-sensitive applications, such as clinical research. AVAILABILITY AND IMPLEMENTATION GUESS as part of CalibratR can be downloaded at CRAN.
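Of the methods compared in this abstract, histogram binning is the simplest to show: raw scores are grouped into bins and each bin is assigned its observed positive rate. The sketch below is a generic Python illustration with invented scores; it is not the CalibratR implementation, it does not implement GUESS, and the midpoint fallback for empty bins is an assumption of this sketch.

```python
def fit_histogram_binning(scores, labels, n_bins=5):
    """Equal-width bins on [0, 1]; each bin stores its observed positive rate."""
    positives = [0] * n_bins
    counts = [0] * n_bins
    for score, label in zip(scores, labels):
        b = min(int(score * n_bins), n_bins - 1)
        positives[b] += label
        counts[b] += 1
    # Assumption: an empty bin falls back to its midpoint.
    return [positives[b] / counts[b] if counts[b] else (b + 0.5) / n_bins
            for b in range(n_bins)]


def calibrate(bin_rates, score):
    """Map a raw classifier score to the positive rate of its bin."""
    n_bins = len(bin_rates)
    return bin_rates[min(int(score * n_bins), n_bins - 1)]


# Invented classifier scores with binary outcomes.
scores = [0.05, 0.15, 0.45, 0.55, 0.85, 0.95, 0.90, 0.88]
labels = [0, 0, 0, 1, 1, 1, 0, 1]
rates = fit_histogram_binning(scores, labels)
print(rates)
print(calibrate(rates, 0.92))
```

With few samples per bin the estimated rates become noisy, which is the small-dataset limitation the abstract refers to.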
|
47
|
Santus E, Li C, Yala A, Peck D, Soomro R, Faridi N, Mamshad I, Tang R, Lanahan CR, Barzilay R, Hughes K. Do Neural Information Extraction Algorithms Generalize Across Institutions? JCO Clin Cancer Inform 2020; 3:1-8. [PMID: 31310566 DOI: 10.1200/cci.18.00160] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 11/20/2022] Open
Abstract
PURPOSE Natural language processing (NLP) techniques have been adopted to reduce the curation costs of electronic health records. However, studies have questioned whether such techniques can be applied to data from previously unseen institutions. We investigated the performance of a common neural NLP algorithm on data from both known and heldout (ie, institutions whose data were withheld from the training set and only used for testing) hospitals. We also explored how diversity in the training data affects the system's generalization ability. METHODS We collected 24,881 breast pathology reports from seven hospitals and manually annotated them with nine key attributes that describe types of atypia and cancer. We trained a convolutional neural network (CNN) on annotations from either only one (CNN1), only two (CNN2), or only four (CNN4) hospitals. The trained systems were tested on data from five organizations, including both known and heldout ones. For every setting, we provide the accuracy scores as well as the learning curves that show how much data are necessary to achieve good performance and generalizability. RESULTS The system achieved a cross-institutional accuracy of 93.87% when trained on reports from only one hospital (CNN1). Performance improved to 95.7% and 96%, respectively, when the system was trained on reports from two (CNN2) and four (CNN4) hospitals. The introduction of diversity during training did not lead to improvements on the known institutions, but it boosted performance on the heldout institutions. When tested on reports from heldout hospitals, CNN4 outperformed CNN1 and CNN2 by 2.13% and 0.3%, respectively. CONCLUSION Real-world scenarios require that neural NLP approaches scale to data from previously unseen institutions. We show that a common neural NLP algorithm for information extraction can achieve this goal, especially when diverse data are used during training.
Affiliation(s)
- Enrico Santus
  - Massachusetts Institute of Technology, Cambridge, MA
- Clara Li
  - Massachusetts Institute of Technology, Cambridge, MA
- Adam Yala
  - Massachusetts Institute of Technology, Cambridge, MA
- Donald Peck
  - Henry Ford Health System, Detroit, MI
  - Michigan Technological University, Houghton, MI
- Rufina Soomro
  - Liaquat National Hospital & Medical College, Karachi, Pakistan
- Naveen Faridi
  - Liaquat National Hospital & Medical College, Karachi, Pakistan
- Isra Mamshad
  - Liaquat National Hospital & Medical College, Karachi, Pakistan
- Rong Tang
  - Rochester General Hospital, Rochester, NY
|
48
|
Zhao B. Clinical Data Extraction and Normalization of Cyrillic Electronic Health Records Via Deep-Learning Natural Language Processing. JCO Clin Cancer Inform 2020; 3:1-9. [PMID: 31577448 PMCID: PMC6874014 DOI: 10.1200/cci.19.00057] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 11/20/2022]
Abstract
PURPOSE A substantial portion of medical data is unstructured. Extracting data from unstructured text presents a barrier to advancing clinical research and improving patient care. In addition, ongoing studies have been focused predominately on the English language, whereas inflected languages with non-Latin alphabets (such as Slavic languages with a Cyrillic alphabet) present numerous linguistic challenges. We developed deep-learning–based natural language processing algorithms for automatically extracting biomarker status of patients with breast cancer from three oncology centers in Bulgaria. METHODS We used dual embeddings for English and Bulgarian languages, encoding both syntactic and polarity information for the words. The embeddings were subsequently aligned so that they were in the same vector space. The embeddings were used as input to convolutional or recurrent neural networks to derive the biomarker status of estrogen receptor, progesterone receptor, and human epidermal growth factor receptor 2. RESULTS We showed that we can resolve ambiguity in highly variable medical text containing both Latin and Cyrillic text. Final models incorporating both English and Bulgarian syntax and polarity embeddings achieved F1 scores of 0.90 or higher for all estrogen receptor, progesterone receptor, and human epidermal growth factor receptor 2 biomarkers. The models were robust against human errors originally found in the training set. In addition, such models can be extended for analyzing text containing words not seen during training. CONCLUSION By using several techniques that incorporate dual-word embeddings encoding syntactic and polarity information in two languages followed by deep neural network architectures, we show that researchers can extract and normalize parameters within medical data. The principles described here can be used to analyze Cyrillic or Latin mixed medical text and extract other parameters.
|
49
|
Giannaris PS, Al-Taie Z, Kovalenko M, Thanintorn N, Kholod O, Innokenteva Y, Coberly E, Frazier S, Laziuk K, Popescu M, Shyu CR, Xu D, Hammer RD, Shin D. Artificial Intelligence-Driven Structurization of Diagnostic Information in Free-Text Pathology Reports. J Pathol Inform 2020; 11:4. [PMID: 32166042 PMCID: PMC7045509 DOI: 10.4103/jpi.jpi_30_19] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 05/22/2019] [Accepted: 12/18/2019] [Indexed: 12/15/2022]
Abstract
BACKGROUND Free-text sections of pathology reports contain the most important information from a diagnostic standpoint. However, this information is largely underutilized for computer-based analytics. The vast majority of NLP-based methods lack the capacity to accurately extract complex diagnostic entities and the relationships among them, as well as to provide an adequate knowledge representation for downstream data-mining applications. METHODS In this paper, we introduce a novel informatics pipeline that extends open information extraction (openIE) techniques with artificial intelligence (AI)-based modeling to extract and transform complex diagnostic entities and the relationships among them into Knowledge Graphs (KGs) of relational triples (RTs). RESULTS Evaluation studies have demonstrated that the pipeline's output significantly differs from a random process. The semantic similarity with original reports is high (Mean Weighted Overlap of 0.83). The precision and recall of extracted RTs based on experts' assessment were 0.925 and 0.841, respectively (P < 0.0001). Inter-rater agreement was significant at 93.6% and inter-rater reliability was 81.8%. CONCLUSION The results demonstrated important properties of the pipeline such as high accuracy, minimality, and adequate knowledge representation. Therefore, we conclude that the pipeline can be used in various downstream data-mining applications to assist diagnostic medicine.
Affiliation(s)
- Pericles S. Giannaris
  - Institute for Data Science and Informatics, University of Missouri, Columbia, MO 65211, United States
  - Department of Pathology and Anatomical Sciences, School of Medicine, University of Missouri, Columbia, MO 65212, United States
- Zainab Al-Taie
  - Institute for Data Science and Informatics, University of Missouri, Columbia, MO 65211, United States
  - Department of Computer Science, College of Science for Women, University of Baghdad, Baghdad, Iraq
- Mikhail Kovalenko
  - Institute for Data Science and Informatics, University of Missouri, Columbia, MO 65211, United States
  - Department of Pathology and Anatomical Sciences, School of Medicine, University of Missouri, Columbia, MO 65212, United States
- Nattapon Thanintorn
  - Department of Pathology and Anatomical Sciences, School of Medicine, University of Missouri, Columbia, MO 65212, United States
- Olha Kholod
  - Institute for Data Science and Informatics, University of Missouri, Columbia, MO 65211, United States
  - Department of Pathology and Anatomical Sciences, School of Medicine, University of Missouri, Columbia, MO 65212, United States
- Yulia Innokenteva
  - Institute for Data Science and Informatics, University of Missouri, Columbia, MO 65211, United States
- Emily Coberly
  - Department of Pathology and Anatomical Sciences, School of Medicine, University of Missouri, Columbia, MO 65212, United States
- Shellaine Frazier
  - Department of Pathology and Anatomical Sciences, School of Medicine, University of Missouri, Columbia, MO 65212, United States
- Katsiarina Laziuk
  - Department of Pathology and Anatomical Sciences, School of Medicine, University of Missouri, Columbia, MO 65212, United States
- Mihail Popescu
  - Institute for Data Science and Informatics, University of Missouri, Columbia, MO 65211, United States
  - Department of Electrical Engineering and Computer Science, College of Engineering, University of Missouri, Columbia, MO 65211, United States
  - Department of Health Management and Informatics, School of Medicine, University of Missouri, Columbia, MO 65212, United States
- Chi-Ren Shyu
  - Institute for Data Science and Informatics, University of Missouri, Columbia, MO 65211, United States
  - Department of Electrical Engineering and Computer Science, College of Engineering, University of Missouri, Columbia, MO 65211, United States
- Dong Xu
  - Institute for Data Science and Informatics, University of Missouri, Columbia, MO 65211, United States
  - Department of Electrical Engineering and Computer Science, College of Engineering, University of Missouri, Columbia, MO 65211, United States
- Richard D. Hammer
  - Institute for Data Science and Informatics, University of Missouri, Columbia, MO 65211, United States
  - Department of Pathology and Anatomical Sciences, School of Medicine, University of Missouri, Columbia, MO 65212, United States
- Dmitriy Shin
  - Institute for Data Science and Informatics, University of Missouri, Columbia, MO 65211, United States
  - Department of Pathology and Anatomical Sciences, School of Medicine, University of Missouri, Columbia, MO 65212, United States
  - Department of Electrical Engineering and Computer Science, College of Engineering, University of Missouri, Columbia, MO 65211, United States
|
50
|
Habib G, Kiryati N, Sklair-Levy M, Shalmon A, Halshtok Neiman O, Faermann Weidenfeld R, Yagil Y, Konen E, Mayer A. Automatic Breast Lesion Classification by Joint Neural Analysis of Mammography and Ultrasound. Multimodal Learning for Clinical Decision Support and Clinical Image-Based Procedures 2020. [DOI: 10.1007/978-3-030-60946-7_13] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 02/07/2023]
|