1
Hu J, Fu J, Zhao W, Lou P, Feng M, Ren H, Feng S, Li Y, Fang A. Characterizing pituitary adenomas in clinical notes: Corpus construction and its application in LLMs. Health Informatics J 2024; 30:14604582241291442. [PMID: 39379071 DOI: 10.1177/14604582241291442] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/10/2024]
Abstract
Objective: Pituitary adenomas have complex clinical manifestations and high pathological heterogeneity, which complicates differential diagnosis. This study aims to construct a high-quality annotated corpus that characterizes pituitary adenomas in clinical notes containing rich diagnosis and treatment information. Methods: A dataset from a pituitary adenoma neurosurgery treatment center of a tertiary first-class hospital in China was retrospectively collected. A semi-automatic corpus construction framework was designed. A total of 2000 documents containing 9430 sentences and 524,232 words were annotated, and the text corpus of pituitary adenomas (TCPA) was constructed and analyzed. Its potential application in large language models (LLMs) was explored through fine-tuning and prompting experiments. Results: TCPA had 4782 medical entities and 28,998 tokens, achieving good quality with inter-annotator agreement values of 0.862-0.986. The LLM experiments showed that TCPA can be used to automatically identify clinical information from free text, and that introducing instances with clinical characteristics can effectively reduce the need for training data, thereby reducing labor costs. Conclusion: This study characterized pituitary adenomas in clinical notes, and the proposed method can serve as a reference for related research in medical natural language scenarios with highly specialized language structure and terminology.
Affiliation(s)
- Jiahui Hu
- Institute of Medical Information, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
- Jin Fu
- Peking Union Medical College Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
- Wanqing Zhao
- Institute of Medical Information, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
- Pei Lou
- Institute of Medical Information, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
- Ming Feng
- Peking Union Medical College Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
- Huiling Ren
- Institute of Medical Information, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
- Shanshan Feng
- Peking Union Medical College Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
- Yansheng Li
- DHC Mediway Technology Co., Ltd., Beijing, China
- An Fang
- Institute of Medical Information, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
2
Zelin C, Chung WK, Jeanne M, Zhang G, Weng C. Rare disease diagnosis using knowledge guided retrieval augmentation for ChatGPT. J Biomed Inform 2024; 157:104702. [PMID: 39084480 PMCID: PMC11402564 DOI: 10.1016/j.jbi.2024.104702] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/27/2024] [Revised: 07/19/2024] [Accepted: 07/24/2024] [Indexed: 08/02/2024]
Abstract
Although rare diseases individually have a low prevalence, they collectively affect nearly 400 million individuals around the world. On average, it takes five years to reach an accurate rare disease diagnosis, and many patients remain undiagnosed or misdiagnosed. As machine learning technologies have been used to aid diagnostics in the past, this study aims to test ChatGPT's suitability for rare disease diagnostic support with the enhancement provided by Retrieval Augmented Generation (RAG). RareDxGPT, our enhanced ChatGPT model, supplies ChatGPT with information about 717 rare diseases from an external knowledge resource, the RareDis Corpus, through RAG. In RareDxGPT, when a query is entered, the three documents most relevant to the query in the RareDis Corpus are retrieved. Along with the query, they are passed to ChatGPT to provide a diagnosis. Additionally, phenotypes for thirty different diseases were extracted from free text from PubMed's Case Reports. They were each entered with three different prompt types: "Prompt", "Prompt + Explanation", and "Prompt + Role Play". The accuracy of ChatGPT and RareDxGPT with each prompt was then measured. With "Prompt", RareDxGPT had 40% accuracy, while ChatGPT 3.5 got 37% of the cases correct. With "Prompt + Explanation", RareDxGPT had 43% accuracy, while ChatGPT 3.5 got 23% of the cases correct. With "Prompt + Role Play", RareDxGPT had 40% accuracy, while ChatGPT 3.5 got 23% of the cases correct. To conclude, ChatGPT, especially when supplied with extra domain-specific knowledge, demonstrates early potential for rare disease diagnosis with adjustments.
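The retrieval step described in this abstract (fetch the three documents most relevant to a query, then pass them to the model together with the query) can be sketched roughly as follows. This is a minimal illustration only: the bag-of-words cosine scoring, the function names `retrieve_top_k` and `build_prompt`, and the prompt wording are assumptions made for the example, not the retriever or prompts used by RareDxGPT.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query, documents, k=3):
    """Return the k documents most similar to the query (assumed scoring)."""
    query_vec = Counter(tokenize(query))
    scored = [
        (cosine(query_vec, Counter(tokenize(doc))), index)
        for index, doc in enumerate(documents)
    ]
    # Highest score first; ties are broken by original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [documents[index] for _, index in scored[:k]]


def build_prompt(query, documents, k=3):
    """Combine the retrieved context with the query (hypothetical wording)."""
    context = "\n\n".join(retrieve_top_k(query, documents, k))
    return (
        f"Context:\n{context}\n\n"
        f"Patient description: {query}\n"
        "What is the most likely diagnosis?"
    )
```

A production system would more likely score documents with dense embeddings than with raw token counts; the token-count version is used here only so the sketch stays dependency-free.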
Affiliation(s)
- Wendy K Chung
- Department of Pediatrics, Boston Children's Hospital, Boston, MA, USA; Harvard Medical School, Boston, MA, USA
- Mederic Jeanne
- Department of Pediatrics, Boston Children's Hospital, Boston, MA, USA; Harvard Medical School, Boston, MA, USA
- Gongbo Zhang
- Department of Biomedical Informatics, Columbia University, New York City, NY 10032, USA
- Chunhua Weng
- Department of Biomedical Informatics, Columbia University, New York City, NY 10032, USA
3
Shyr C, Hu Y, Bastarache L, Cheng A, Hamid R, Harris P, Xu H. Identifying and Extracting Rare Diseases and Their Phenotypes with Large Language Models. J Healthc Inform Res 2024; 8:438-461. [PMID: 38681753 PMCID: PMC11052982 DOI: 10.1007/s41666-023-00155-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/09/2023] [Revised: 10/24/2023] [Accepted: 11/13/2023] [Indexed: 05/01/2024]
Abstract
Purpose: Phenotyping is critical for informing rare disease diagnosis and treatment, but disease phenotypes are often embedded in unstructured text. While natural language processing (NLP) can automate extraction, a major bottleneck is developing annotated corpora. Recently, prompt learning with large language models (LLMs) has been shown to lead to generalizable results without any (zero-shot) or with few annotated samples (few-shot), but no studies have explored this for rare diseases. Our work is the first to study prompt learning for identifying and extracting rare disease phenotypes in the zero- and few-shot settings. Methods: We compared the performance of prompt learning with ChatGPT and fine-tuning with BioClinicalBERT. We engineered novel prompts for ChatGPT to identify and extract rare diseases and their phenotypes (e.g., diseases, symptoms, and signs), established a benchmark for evaluating its performance, and conducted an in-depth error analysis. Results: Overall, fine-tuning BioClinicalBERT resulted in higher performance (F1 of 0.689) than ChatGPT (F1 of 0.472 and 0.610 in the zero- and few-shot settings, respectively). However, ChatGPT achieved higher accuracy for rare diseases and signs in the one-shot setting (F1 of 0.778 and 0.725). Conversational, sentence-based prompts generally achieved higher accuracy than structured lists. Conclusion: Prompt learning using ChatGPT has the potential to match or outperform fine-tuning BioClinicalBERT at extracting rare diseases and signs with just one annotated sample. Given its accessibility, ChatGPT could be leveraged to extract these entities without relying on a large, annotated corpus. While LLMs can support rare disease phenotyping, researchers should critically evaluate model outputs to ensure phenotyping accuracy.
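The F1 scores quoted in this abstract combine precision and recall over extracted entities. As a rough sketch of how such a score can be computed, the function below compares a set of gold annotations with a set of predictions. The exact-match criterion and the (entity type, text) tuple representation are assumptions for illustration; the paper's benchmark may use different matching rules.

```python
def precision_recall_f1(gold, predicted):
    """Compute exact-match precision, recall, and F1 over entity sets.

    Both arguments are iterables of hashable items, for example
    (entity_type, text) tuples. Exact matching is an assumed criterion.
    """
    gold = set(gold)
    predicted = set(predicted)
    true_positives = len(gold & predicted)

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, if two of three predicted entities match a gold set of three, precision and recall are both 2/3 and so is F1.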
Affiliation(s)
- Cathy Shyr
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203 USA
- Yan Hu
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77225 USA
- Lisa Bastarache
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203 USA
- Alex Cheng
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203 USA
- Rizwan Hamid
- Division of Medical Genetics and Genomic Medicine, Vanderbilt University Medical Center, Nashville, TN 37203 USA
- Paul Harris
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203 USA
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203 USA
- Department of Biomedical Engineering, Vanderbilt University Medical Center, 2525 West End Avenue, Nashville, TN 37203 USA
- Hua Xu
- Section of Biomedical Informatics and Data Science, Yale School of Medicine, 100 College Street, New Haven, CT 06510 USA
4
Wang W, Zhao Z, Ning H. A tree-based corpus annotated with Cyber-Syndrome, symptoms, and acupoints. Sci Data 2024; 11:482. [PMID: 38730023 PMCID: PMC11087536 DOI: 10.1038/s41597-024-03321-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/22/2024] [Accepted: 04/29/2024] [Indexed: 05/12/2024] Open
Abstract
Prolonged and excessive interaction with cyberspace poses a threat to people's health and leads to the occurrence of Cyber-Syndrome, which covers not only physiological but also psychological disorders. This paper aims to create a tree-shaped gold-standard corpus, designated CS-A, that annotates Cyber-Syndrome, its clinical manifestations, and the acupoints that can alleviate its symptoms or signs. In the CS-A corpus, this paper defines six entities and relations subject to annotation. In total, 448 texts were manually annotated. After three rounds of updating the annotation guidelines, the inter-annotator agreement (IAA) improved significantly, reaching 86.05%. The purpose of constructing the CS-A corpus is to raise awareness of Cyber-Syndrome and draw attention to its subtle impact on people's health. In addition, the annotated corpus promotes the development of natural language processing technology. Model experiments can be carried out on this corpus, such as optimizing and improving models for discontinuous entity recognition and nested entity recognition. The CS-A corpus has been uploaded to figshare.
Affiliation(s)
- Wenxi Wang
- School of Computer & Communication Engineering, University of Science and Technology Beijing, Beijing, 100083, China
- Zhan Zhao
- School of Computer & Communication Engineering, University of Science and Technology Beijing, Beijing, 100083, China
- Huansheng Ning
- School of Computer & Communication Engineering, University of Science and Technology Beijing, Beijing, 100083, China
5
Zhang J, Xu W, Lei C, Pu Y, Zhang Y, Zhang J, Yu H, Su X, Huang Y, Gong R, Zhang L, Shi Q. Using Clinician-Patient WeChat Group Communication Data to Identify Symptom Burdens in Patients With Uterine Fibroids Under Focused Ultrasound Ablation Surgery Treatment: Qualitative Study. JMIR Form Res 2023; 7:e43995. [PMID: 37656501 PMCID: PMC10504630 DOI: 10.2196/43995] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2022] [Revised: 12/26/2022] [Accepted: 07/24/2023] [Indexed: 09/02/2023] Open
Abstract
BACKGROUND: Unlike research project-based health data collection (questionnaires and interviews), social media platforms allow patients to freely discuss their health status and obtain peer support. Previous literature has pointed out that both public and private social platforms can serve as data sources for analysis. OBJECTIVE: This study aimed to use natural language processing (NLP) techniques to identify concerns regarding the postoperative quality of life and symptom burdens in patients with uterine fibroids after focused ultrasound ablation surgery. METHODS: Screenshots taken from clinician-patient WeChat groups were converted into free texts using image text recognition technology and used as the research object of this study. From 408 patients diagnosed with uterine fibroids in Chongqing Haifu Hospital between 2010 and 2020, we searched for symptom burdens in over 900,000 words of WeChat group chats. We first built a corpus of symptoms by manually coding 30% of the WeChat texts and then used regular expressions in Python to crawl symptom information from the remaining texts based on this corpus. We compared the results with a manual review (gold standard) of the same records. Finally, we analyzed the relationship between the population baseline data and conceptual symptoms; quantitative and qualitative results were examined. RESULTS: A total of 408 patients with uterine fibroids were included in the study; 190,000 words of free text were obtained after data cleaning. The mean age of the patients was 39.94 (SD 6.81) years, and their mean BMI was 22.18 (SD 2.78) kg/m². The median reporting times of the 7 major symptoms were 21, 26, 57, 2, 18, 30, and 49 days. Logistic regression models identified preoperative menstrual duration (odds ratio [OR] 1.14, 95% CI 5.86-6.37; P=.009), age of menophania (OR -1.02, 95% CI 11.96-13.47; P=.03), and the number (OR 2.34, 95% CI 1.45-1.83; P=.04) and size of fibroids (OR 0.12, 95% CI 2.43-3.51; P=.04) as significant risk factors for postoperative symptoms. CONCLUSIONS: Unstructured free texts from social media platforms extracted by NLP technology can be used for analysis. By extracting the conceptual information about patients' health-related quality of life, we can adopt personalized treatment for patients at different stages of recovery to improve their quality of life. Python-based text mining of free-text data can accurately extract symptom burden and save considerable time compared to manual review, maximizing the utility of the extant information in population-based electronic health records for comparative effectiveness research.
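The abstract describes building a symptom corpus from manually coded texts and then applying regular expressions in Python to the remaining messages. A minimal sketch of that idea is shown below. The symptom names, the English patterns in `SYMPTOM_PATTERNS`, and the function `extract_symptoms` are hypothetical stand-ins: the study's actual lexicon was derived from WeChat group chats and is not reproduced here.

```python
import re

# Hypothetical symptom lexicon: canonical symptom name -> regex of surface forms.
# In the study, such patterns would come from the manually coded 30% of texts.
SYMPTOM_PATTERNS = {
    "abdominal pain": r"(?:abdominal|belly|stomach)\s+(?:pain|ache)",
    "vaginal bleeding": r"vaginal\s+bleeding|spotting",
    "fever": r"fever|high\s+temperature",
}

# Compile once so the patterns can be reused across many messages.
COMPILED = {
    name: re.compile(pattern, re.IGNORECASE)
    for name, pattern in SYMPTOM_PATTERNS.items()
}


def extract_symptoms(message):
    """Return the sorted canonical names of all symptoms mentioned in a message."""
    return sorted(
        name for name, pattern in COMPILED.items() if pattern.search(message)
    )
```

Calling `extract_symptoms` on each cleaned chat message yields per-message symptom labels that can then be compared against a manual review, as the abstract describes.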
Affiliation(s)
- Jiayuan Zhang
- State Key Laboratory of Ultrasound in Medicine and Engineering, College of Biomedical Engineering, Chongqing Medical University, Chongqing, China
- Wei Xu
- School of Public Health, Chongqing Medical University, Chongqing, China
- Cheng Lei
- School of Public Health, Chongqing Medical University, Chongqing, China
- Yang Pu
- State Key Laboratory of Ultrasound in Medicine and Engineering, College of Biomedical Engineering, Chongqing Medical University, Chongqing, China
- Yubo Zhang
- State Key Laboratory of Ultrasound in Medicine and Engineering, College of Biomedical Engineering, Chongqing Medical University, Chongqing, China
- Jingyu Zhang
- State Key Laboratory of Ultrasound in Medicine and Engineering, College of Biomedical Engineering, Chongqing Medical University, Chongqing, China
- Hongfan Yu
- School of Public Health, Chongqing Medical University, Chongqing, China
- Xueyao Su
- School of Public Health, Chongqing Medical University, Chongqing, China
- Yanyan Huang
- School of Public Health, Chongqing Medical University, Chongqing, China
- Ruoyan Gong
- School of Public Health, Chongqing Medical University, Chongqing, China
- Lijun Zhang
- School of Public Health, Chongqing Medical University, Chongqing, China
- Qiuling Shi
- State Key Laboratory of Ultrasound in Medicine and Engineering, College of Biomedical Engineering, Chongqing Medical University, Chongqing, China
- School of Public Health, Chongqing Medical University, Chongqing, China
6
Hens D, Wyers L, Claeys KG. Validation of an Artificial Intelligence driven framework to automatically detect red flag symptoms in screening for rare diseases in electronic health records: hereditary transthyretin amyloidosis polyneuropathy as a key example. J Peripher Nerv Syst 2023; 28:79-85. [PMID: 36468607 DOI: 10.1111/jns.12523] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 09/08/2022] [Revised: 11/19/2022] [Accepted: 11/28/2022] [Indexed: 12/07/2022]
Abstract
Rare life-threatening conditions, such as multisystemic hereditary transthyretin amyloidosis (ATTRv) polyneuropathy, are often underdiagnosed or diagnosed late in the disease course, although early diagnosis is crucial for treatment success. Red flag symptoms have been identified, but manual screening of multidisciplinary medical records on this set of symptoms is time-consuming. This study aimed to validate a Natural Language Processing (NLP) algorithm to perform such a search in an automated manner, in order to improve early diagnosis and treatment. A novel state-of-the-art NLP procedure was applied to extract red flag symptoms from patients' electronic medical records and to select patients at risk for ATTRv polyneuropathy for further clinical review. Accuracy of the algorithm was assessed through comparison with a manual standard on a random sample of 300 patients. Out of a retrospective sample of 1015 patients, the NLP algorithm yielded 128 patients with three or more red flag symptoms of which 69 patients were considered eligible for genetic testing after clinical review. High accuracy was found in the detection of red flag symptoms, with F1 scores between 0.88 and 0.98. A relative increase of 48.6% in genetic testing, to identify patients with a rare disease earlier, was demonstrated. An NLP algorithm, after clinical validation, offers a valid and accurate tool to detect red flag symptoms in medical records across multiple disciplines, supporting better screening for patients with rare diseases. This opens the door to further NLP applications, facilitating rapid diagnosis and early treatment of rare diseases.
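The selection rule mentioned in this abstract (patients with three or more red flag symptoms are forwarded for clinical review) can be sketched as a simple threshold over per-patient symptom sets. The function name `select_at_risk`, the patient identifiers, and the flag labels used in the usage comment are illustrative assumptions; the symptom extraction itself is assumed to have been done upstream by the NLP step.

```python
def select_at_risk(patient_flags, min_flags=3):
    """Return sorted IDs of patients with at least `min_flags` distinct red flags.

    `patient_flags` maps a patient ID to an iterable of red flag symptom
    labels detected in that patient's records (assumed upstream output).
    """
    return sorted(
        patient_id
        for patient_id, flags in patient_flags.items()
        if len(set(flags)) >= min_flags  # repeated mentions count once
    )


# Usage with made-up data (labels are illustrative only):
# select_at_risk({
#     "p1": ["polyneuropathy", "carpal tunnel syndrome", "cardiomyopathy"],
#     "p2": ["polyneuropathy", "polyneuropathy"],
# })
```

Counting distinct flags rather than raw mentions is a design choice made for this sketch, so that a symptom repeated across several notes does not inflate the count.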
Affiliation(s)
- Kristl G Claeys
- Department of Neurology, University Hospitals Leuven, Leuven, Belgium; Laboratory for Muscle Diseases and Neuropathies, Department of Neurosciences, KU Leuven, and Leuven Brain Institute (LBI), Leuven, Belgium
7
Zhang J, Xu W, Lei C, Pu Y, Zhang Y, Zhang J, Yu H, Su X, Huang Y, Gong R, Zhang L, Shi Q. Using WeChat clinician-patient group communication data to identify symptom burdens in patients with uterine fibroids under focused ultrasound ablation surgery treatment: Qualitative Study (Preprint). [DOI: 10.2196/preprints.43995] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/01/2023]
Abstract
BACKGROUND
Unlike research project-based health data collection (questionnaires, interviews), social media platforms allow patients to freely discuss their health status and obtain peer support. Previous literature has pointed out that both public and private social platforms can serve as data sources for analysis.
OBJECTIVE
This study aimed to use natural language processing (NLP) techniques to identify concerns regarding the postoperative quality of life and symptom burdens in patients with uterine fibroids after focused ultrasound ablation surgery.
METHODS
Screenshots taken from the clinician-patient WeChat groups were converted into free texts using image text recognition technology and used as the research object of this study. We used regular expressions in Python to search for symptom burdens in over 900,000 words of WeChat group chats associated with 408 patients diagnosed with uterine fibroids in Chongqing Haifu Hospital between 2010 and 2020. We first built a corpus of symptoms by manually coding 30% of the WeChat texts, and then used regular expressions to crawl symptom information from the remaining texts based on this corpus. We compared the results with a manual review (gold standard) of the same records. We then analyzed the relationship between the population baseline data and conceptual symptoms; quantitative and qualitative results were examined.
RESULTS
A total of 190,000 words of free text from patients with uterine fibroids were obtained after data cleaning. A total of 408 patients were included in the study. The age of the patients was 39.94±6.81 years, and their BMI was 22.18±2.78 kg/m². The median reporting times of the seven major symptoms were 21, 26, 57, 2, 18, 30, and 49 days. Results showed that patients with dysmenorrhea were younger (mean 38.26 [SD 7.05]; P=.004) and slimmer (mean 22.37 [SD 3.81]; P=.04), with lower fertility and parity (P<.05), and tended to stay longer in the hospital (P<.05). Logistic regression models identified preoperative menstrual duration (OR 1.14, 95% CI 5.86-6.37; P=.009), age of menophania (OR -1.02, 95% CI 11.96-13.47; P=.03), and the number (OR 2.34, 95% CI 1.45-1.83; P=.04) and size of fibroids (OR 0.12, 95% CI 2.43-3.51; P=.04) as significant risk factors for postoperative symptoms.
CONCLUSIONS
Unstructured free texts from social media platforms extracted by NLP technology can be used for analysis. By extracting the conceptual information about patients' health-related quality of life (HRQoL), personalized treatment can be adopted for patients at different stages of recovery to improve their quality of life. Python-based text mining of free-text data can accurately extract symptom burden and save considerable time compared to manual review, maximizing the utility of the extant information in population-based electronic health records for comparative effectiveness research.
8
Segura-Bedmar I, Camino-Perdones D, Guerrero-Aspizua S. Exploring deep learning methods for recognizing rare diseases and their clinical manifestations from texts. BMC Bioinformatics 2022; 23:263. [PMID: 35794528 PMCID: PMC9258216 DOI: 10.1186/s12859-022-04810-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/18/2022] [Accepted: 06/21/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND AND OBJECTIVE: Although rare diseases are characterized by low prevalence, approximately 400 million people are affected by a rare disease. The early and accurate diagnosis of these conditions is a major challenge for general practitioners, who do not have enough knowledge to identify them. In addition, rare diseases usually show a wide variety of manifestations, which can make diagnosis even more difficult. A delayed diagnosis can negatively affect the patient's life. Therefore, there is an urgent need to increase scientific and medical knowledge about rare diseases. Natural Language Processing (NLP) and Deep Learning can help to extract relevant information about rare diseases to facilitate their diagnosis and treatment. METHODS: The paper explores several deep learning techniques, such as Bidirectional Long Short Term Memory (BiLSTM) networks and deep contextualized word representations based on Bidirectional Encoder Representations from Transformers (BERT), to recognize rare diseases and their clinical manifestations (signs and symptoms). RESULTS: BioBERT, a domain-specific language representation based on BERT and trained on biomedical corpora, obtains the best results with an F1 of 85.2% for rare diseases. Since many signs are usually described by complex noun phrases that involve overlapped, nested, and discontinuous entities, the model obtains lower results for signs, with an F1 of 57.2%. CONCLUSIONS: While our results are promising, there is still much room for improvement, especially with respect to the identification of clinical manifestations (signs and symptoms).
Affiliation(s)
- Isabel Segura-Bedmar
- Human Language and Accessibility Technologies, Computer Science Department, Universidad Carlos III de Madrid, Avenida de la Universidad, 30, Leganés, 28911 Madrid, Spain
- David Camino-Perdones
- Human Language and Accessibility Technologies, Computer Science Department, Universidad Carlos III de Madrid, Avenida de la Universidad, 30, Leganés, 28911 Madrid, Spain
- Sara Guerrero-Aspizua
- Tissue Engineering and Regenerative Medicine group, Department of Bioengineering, Universidad Carlos III de Madrid, Avenida de la Universidad, 30, Leganés, 28911 Madrid, Spain
- Hospital Fundación Jiménez Díaz e Instituto de Investigación, FJD, Av. de los Reyes Católicos, 2, 28040 Madrid, Spain
- Epithelial Biomedicine Division, CIEMAT, Avda. Complutense 40, 28029 Madrid, Spain
- Centre for Biomedical Network Research on Rare Diseases (CIBERER), C/Monforte de Lemos 3-5, 28029 Madrid, Spain
9
Yates T, Lain A, Campbell J, FitzPatrick DR, Simpson TI. Creation and evaluation of full-text literature-derived, feature-weighted disease models of genetically determined developmental disorders. Database (Oxford) 2022; 2022:baac038. [PMID: 35670729 PMCID: PMC9216525 DOI: 10.1093/database/baac038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2021] [Revised: 03/26/2022] [Accepted: 05/25/2022] [Indexed: 11/24/2022]
Abstract
There are >2500 different genetically determined developmental disorders (DD), which, as a group, show very high levels of both locus and allelic heterogeneity. This has led to the wide-spread use of evidence-based filtering of genome-wide sequence data as a diagnostic tool in DD. Determining whether the association of a filtered variant at a specific locus is a plausible explanation of the phenotype in the proband is crucial and commonly requires extensive manual literature review by both clinical scientists and clinicians. Access to a database of weighted clinical features extracted from rigorously curated literature would increase the efficiency of this process and facilitate the development of robust phenotypic similarity metrics. However, given the large and rapidly increasing volume of published information, conventional biocuration approaches are becoming impractical. Here, we present a scalable, automated method for the extraction of categorical phenotypic descriptors from the full-text literature. Papers identified through literature review were downloaded and parsed using the Cadmus custom retrieval package. Human Phenotype Ontology terms were extracted using MetaMap, with 76-84% precision and 65-73% recall. Mean terms per paper increased from 9 in title + abstract, to 68 using full text. We demonstrate that these literature-derived disease models plausibly reflect true disease expressivity more accurately than widely used manually curated models, through comparison with prospectively gathered data from the Deciphering Developmental Disorders study. The area under the curve for receiver operating characteristic (ROC) curves increased by 5-10% through the use of literature-derived models. This work shows that scalable automated literature curation increases performance and adds weight to the need for this strategy to be integrated into informatic variant analysis pipelines. Database URL: https://doi.org/10.1093/database/baac038.
Affiliation(s)
- T M Yates
- MRC Human Genetics Unit, Western General Hospital, Institute of Genetics and Cancer, The University of Edinburgh, Crewe Road South, Edinburgh EH4 2XU, UK
- Transforming Genetic Medicine Initiative, European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- A Lain
- Institute for Adaptive and Neural Computation, Informatics Forum, The University of Edinburgh, 10 Crichton Street, Edinburgh EH8 9AB, UK
- J Campbell
- MRC Human Genetics Unit, Western General Hospital, Institute of Genetics and Cancer, The University of Edinburgh, Crewe Road South, Edinburgh EH4 2XU, UK
- Simons Initiative for the Developing Brain, The University of Edinburgh, Hugh Robson Building, George Square, Edinburgh EH8 9XF, UK
- D R FitzPatrick
- MRC Human Genetics Unit, Western General Hospital, Institute of Genetics and Cancer, The University of Edinburgh, Crewe Road South, Edinburgh EH4 2XU, UK
- Transforming Genetic Medicine Initiative, European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Simons Initiative for the Developing Brain, The University of Edinburgh, Hugh Robson Building, George Square, Edinburgh EH8 9XF, UK
- T I Simpson
- Institute for Adaptive and Neural Computation, Informatics Forum, The University of Edinburgh, 10 Crichton Street, Edinburgh EH8 9AB, UK
- Simons Initiative for the Developing Brain, The University of Edinburgh, Hugh Robson Building, George Square, Edinburgh EH8 9XF, UK