1
Darmofal M, Suman S, Atwal G, Toomey M, Chen JF, Chang JC, Vakiani E, Varghese AM, Balakrishnan Rema A, Syed A, Schultz N, Berger MF, Morris Q. Deep-Learning Model for Tumor-Type Prediction Using Targeted Clinical Genomic Sequencing Data. Cancer Discov 2024; 14:1064-1081. [PMID: 38416134 PMCID: PMC11145170 DOI: 10.1158/2159-8290.cd-23-0996] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/30/2023] [Revised: 12/07/2023] [Accepted: 02/23/2024] [Indexed: 02/29/2024]
Abstract
Tumor type guides clinical treatment decisions in cancer, but histology-based diagnosis remains challenging. Genomic alterations are highly diagnostic of tumor type, and tumor-type classifiers trained on genomic features have been explored, but the most accurate methods are not clinically feasible, relying on features derived from whole-genome sequencing (WGS), or predicting across limited cancer types. We use genomic features from a data set of 39,787 solid tumors sequenced using a clinically targeted cancer gene panel to develop Genome-Derived-Diagnosis Ensemble (GDD-ENS): a hyperparameter ensemble for classifying tumor type using deep neural networks. GDD-ENS achieves 93% accuracy for high-confidence predictions across 38 cancer types, rivaling the performance of WGS-based methods. GDD-ENS can also guide diagnoses of rare type and cancers of unknown primary and incorporate patient-specific clinical information for improved predictions. Overall, integrating GDD-ENS into prospective clinical sequencing workflows could provide clinically relevant tumor-type predictions to guide treatment decisions in real time. SIGNIFICANCE We describe a highly accurate tumor-type prediction model, designed specifically for clinical implementation. Our model relies only on widely used cancer gene panel sequencing data, predicts across 38 distinct cancer types, and supports integration of patient-specific nongenomic information for enhanced decision support in challenging diagnostic situations. See related commentary by Garg, p. 906. This article is featured in Selected Articles from This Issue, p. 897.
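The abstract describes an ensemble of neural networks whose high-confidence predictions are reported separately. The following is a minimal sketch of that general idea, not the GDD-ENS implementation: it assumes each ensemble member outputs per-class probabilities, that the ensemble averages them, and that a prediction counts as high-confidence when the top averaged probability reaches a cutoff. The function name `ensemble_predict`, the cutoff of 0.75, and the toy probabilities are all hypothetical.

```python
from statistics import fmean


def ensemble_predict(member_probs, labels, threshold=0.75):
    """Average class probabilities over ensemble members.

    member_probs: list of per-member probability lists (one value per class).
    threshold: hypothetical cutoff for calling a prediction high-confidence.
    Returns (predicted label, averaged probability, is_high_confidence).
    """
    n_classes = len(labels)
    # Mean probability of each class across all ensemble members.
    avg = [fmean(member[c] for member in member_probs) for c in range(n_classes)]
    best = max(range(n_classes), key=avg.__getitem__)
    return labels[best], avg[best], avg[best] >= threshold


labels = ["Lung", "Breast", "Colorectal"]

# Toy outputs from three hypothetical ensemble members that agree.
agreeing = [
    [0.80, 0.15, 0.05],
    [0.90, 0.05, 0.05],
    [0.70, 0.20, 0.10],
]
label, confidence, high_conf = ensemble_predict(agreeing, labels)

# Toy outputs from members that disagree, giving a low averaged confidence.
disagreeing = [
    [0.50, 0.40, 0.10],
    [0.30, 0.60, 0.10],
    [0.45, 0.45, 0.10],
]
label2, confidence2, high_conf2 = ensemble_predict(disagreeing, labels)
```

In this sketch, disagreement between members lowers the averaged top probability, so such samples fall below the cutoff and would not be reported as high-confidence.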
Affiliation(s)
- Madison Darmofal
  - Computational and Systems Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, New York
  - Tri-Institutional Training Program in Computational Biology and Medicine, Weill Cornell Medicine, New York, New York
- Shalabh Suman
  - Department of Pathology, Memorial Sloan Kettering Cancer Center, New York, New York
- Gurnit Atwal
  - Computational Biology Program, Ontario Institute for Cancer Research, Toronto, Ontario, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
  - Vector Institute, Toronto, Ontario, Canada
- Michael Toomey
  - Computational and Systems Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, New York
  - Tri-Institutional Training Program in Computational Biology and Medicine, Weill Cornell Medicine, New York, New York
- Jie-Fu Chen
  - Department of Pathology, Memorial Sloan Kettering Cancer Center, New York, New York
- Jason C. Chang
  - Department of Pathology, Memorial Sloan Kettering Cancer Center, New York, New York
- Efsevia Vakiani
  - Department of Pathology, Memorial Sloan Kettering Cancer Center, New York, New York
- Anna M. Varghese
  - Department of Medicine, Memorial Sloan Kettering Cancer Center, New York, New York
- Aijazuddin Syed
  - Department of Pathology, Memorial Sloan Kettering Cancer Center, New York, New York
- Nikolaus Schultz
  - Marie-Josée and Henry R. Kravis Center for Molecular Oncology, Memorial Sloan Kettering Cancer Center, New York, New York
  - Human Oncology and Pathogenesis Program, Memorial Sloan Kettering Cancer Center, New York, New York
  - Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, New York
- Michael F. Berger
  - Department of Pathology, Memorial Sloan Kettering Cancer Center, New York, New York
  - Marie-Josée and Henry R. Kravis Center for Molecular Oncology, Memorial Sloan Kettering Cancer Center, New York, New York
  - Human Oncology and Pathogenesis Program, Memorial Sloan Kettering Cancer Center, New York, New York
- Quaid Morris
  - Computational and Systems Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, New York
2
Ma W, Wu H, Chen Y, Xu H, Jiang J, Du B, Wan M, Ma X, Chen X, Lin L, Su X, Bao X, Shen Y, Xu N, Ruan J, Jiang H, Ding Y. New techniques to identify the tissue of origin for cancer of unknown primary in the era of precision medicine: progress and challenges. Brief Bioinform 2024; 25:bbae028. [PMID: 38343328 PMCID: PMC10859692 DOI: 10.1093/bib/bbae028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/16/2023] [Revised: 12/10/2023] [Accepted: 01/11/2024] [Indexed: 02/15/2024] [Open access]
Abstract
Despite a standardized diagnostic examination, cancer of unknown primary (CUP) is a rare metastatic malignancy with an unidentified tissue of origin (TOO). Patients diagnosed with CUP are typically treated with empiric chemotherapy, although their prognosis is worse than those with metastatic cancer of a known origin. TOO identification of CUP has been employed in precision medicine, and subsequent site-specific therapy is clinically helpful. For example, molecular profiling, including genomic profiling, gene expression profiling, epigenetics and proteins, has facilitated TOO identification. Moreover, machine learning has improved identification accuracy, and non-invasive methods, such as liquid biopsy and image omics, are gaining momentum. However, the heterogeneity in prediction accuracy, sample requirements and technical fundamentals among the various techniques is noteworthy. Accordingly, we systematically reviewed the development and limitations of novel TOO identification methods, compared their pros and cons and assessed their potential clinical usefulness. Our study may help patients shift from empirical to customized care and improve their prognoses.
Affiliation(s)
- Wenyuan Ma
  - Department of Medical Oncology, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Hui Wu
  - Department of Medical Oncology, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Yiran Chen
  - Department of Surgical Oncology, The First Affiliated Hospital, College of Medicine, Zhejiang University, Hangzhou, Zhejiang, China
- Hongxia Xu
  - Zhejiang University-University of Edinburgh Institute (ZJU-UoE Institute), Zhejiang University School of Medicine, Zhejiang University, Haining, China
- Junjie Jiang
  - Department of Gastroenterology, Affiliated Hangzhou First People's Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Bang Du
  - Real Doctor AI Research Centre, School of Medicine, Zhejiang University, Hangzhou 310058, China
- Mingyu Wan
  - Department of Medical Oncology, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Xiaolu Ma
  - Department of Medical Oncology, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Xiaoyu Chen
  - Department of Medical Oncology, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Lili Lin
  - Department of Nuclear Medicine, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Xinhui Su
  - Department of Nuclear Medicine, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Xuanwen Bao
  - Department of Medical Oncology, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Yifei Shen
  - Department of Laboratory Medicine, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Nong Xu
  - Department of Medical Oncology, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Jian Ruan
  - Department of Medical Oncology, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Haiping Jiang
  - Department of Medical Oncology, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Yongfeng Ding
  - Department of Medical Oncology, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
3
Darmofal M, Suman S, Atwal G, Chen JF, Chang JC, Toomey M, Vakiani E, Varghese AM, Rema AB, Syed A, Schultz N, Berger M, Morris Q. Deep Learning Model for Tumor Type Prediction using Targeted Clinical Genomic Sequencing Data. medRxiv 2023:2023.09.08.23295131. [PMID: 37732244 PMCID: PMC10508812 DOI: 10.1101/2023.09.08.23295131] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/22/2023]
Abstract
Tumor type guides clinical treatment decisions in cancer, but histology-based diagnosis remains challenging. Genomic alterations are highly diagnostic of tumor type, and tumor type classifiers trained on genomic features have been explored, but the most accurate methods are not clinically feasible, relying on features derived from whole genome sequencing (WGS), or predicting across limited cancer types. We use genomic features from a dataset of 39,787 solid tumors sequenced using a clinical targeted cancer gene panel to develop Genome-Derived-Diagnosis Ensemble (GDD-ENS): a hyperparameter ensemble for classifying tumor type using deep neural networks. GDD-ENS achieves 93% accuracy for high-confidence predictions across 38 cancer types, rivalling performance of WGS-based methods. GDD-ENS can also guide diagnoses on rare type and cancers of unknown primary, and incorporate patient-specific clinical information for improved predictions. Overall, integrating GDD-ENS into prospective clinical sequencing workflows has enabled clinically-relevant tumor type predictions to guide treatment decisions in real time.
Affiliation(s)
- Madison Darmofal
  - Computational and Systems Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center; New York, NY 10065, USA
  - Tri-Institutional Training Program in Computational Biology and Medicine, Weill Cornell Medicine; New York, NY 10065, USA
- Shalabh Suman
  - Department of Pathology, Memorial Sloan Kettering Cancer Center; New York, NY 10065, USA
- Gurnit Atwal
  - Computational Biology Program, Ontario Institute for Cancer Research; Toronto, ON M5G 0A3, Canada
  - Department of Molecular Genetics, University of Toronto; Toronto, ON M5S 1A8, Canada
  - Vector Institute; Toronto, ON M5G 1M1, Canada
- Jie-Fu Chen
  - Department of Pathology, Memorial Sloan Kettering Cancer Center; New York, NY 10065, USA
- Jason C. Chang
  - Department of Pathology, Memorial Sloan Kettering Cancer Center; New York, NY 10065, USA
- Michael Toomey
  - Computational and Systems Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center; New York, NY 10065, USA
  - Tri-Institutional Training Program in Computational Biology and Medicine, Weill Cornell Medicine; New York, NY 10065, USA
- Efsevia Vakiani
  - Department of Pathology, Memorial Sloan Kettering Cancer Center; New York, NY 10065, USA
- Anna M Varghese
  - Department of Medicine, Memorial Sloan Kettering Cancer Center; New York, NY 10065, USA
- Aijazuddin Syed
  - Department of Pathology, Memorial Sloan Kettering Cancer Center; New York, NY 10065, USA
- Nikolaus Schultz
  - Marie-Josée and Henry R. Kravis Center for Molecular Oncology, Memorial Sloan Kettering Cancer Center; New York, NY 10065, USA
  - Human Oncology and Pathogenesis Program, Memorial Sloan Kettering Cancer Center; New York, NY 10065, USA
  - Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center; New York, NY 10065, USA
- Michael Berger
  - Department of Pathology, Memorial Sloan Kettering Cancer Center; New York, NY 10065, USA
  - Marie-Josée and Henry R. Kravis Center for Molecular Oncology, Memorial Sloan Kettering Cancer Center; New York, NY 10065, USA
  - Human Oncology and Pathogenesis Program, Memorial Sloan Kettering Cancer Center; New York, NY 10065, USA
- Quaid Morris
  - Computational and Systems Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center; New York, NY 10065, USA
4
Nguyen L, Van Hoeck A, Cuppen E. Machine learning-based tissue of origin classification for cancer of unknown primary diagnostics using genome-wide mutation features. Nat Commun 2022; 13:4013. [PMID: 35817764 PMCID: PMC9273599 DOI: 10.1038/s41467-022-31666-w] [Citation(s) in RCA: 26] [Impact Index Per Article: 13.0] [Received: 10/13/2021] [Accepted: 06/23/2022] [Indexed: 12/25/2022] [Open access]
Abstract
Cancers of unknown primary (CUP) origin account for ∼3% of all cancer diagnoses, whereby the tumor tissue of origin (TOO) cannot be determined. Using a uniformly processed dataset encompassing 6756 whole-genome sequenced primary and metastatic tumors, we develop Cancer of Unknown Primary Location Resolver (CUPLR), a random forest TOO classifier that employs 511 features based on simple and complex somatic driver and passenger mutations. CUPLR distinguishes 35 cancer (sub)types with ∼90% recall and ∼90% precision based on cross-validation and test set predictions. We find that structural variant derived features increase the performance and utility for classifying specific cancer types. With CUPLR, we could determine the TOO for 82/141 (58%) of CUP patients. Although CUPLR is based on machine learning, it provides a human interpretable graphical report with detailed feature explanations. The comprehensive output of CUPLR complements existing histopathological procedures and can enable improved diagnostics for CUP patients.
Affiliation(s)
- Luan Nguyen
  - University Medical Center Utrecht, Universiteitsweg 100, 3584 CG, Utrecht, The Netherlands
- Arne Van Hoeck
  - University Medical Center Utrecht, Universiteitsweg 100, 3584 CG, Utrecht, The Netherlands
- Edwin Cuppen
  - University Medical Center Utrecht, Universiteitsweg 100, 3584 CG, Utrecht, The Netherlands
  - Hartwig Medical Foundation, Science Park 408, 1098 XH, Amsterdam, The Netherlands
5
Wang Z, Zhang T, Wu W, Wu L, Li J, Huang B, Liang Y, Li Y, Li P, Li K, Wang W, Guo R, Wang Q. Detection and Localization of Solid Tumors Utilizing the Cancer-Type-Specific Mutational Signatures. Front Bioeng Biotechnol 2022; 10:883791. [PMID: 35547159 PMCID: PMC9081532 DOI: 10.3389/fbioe.2022.883791] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/25/2022] [Accepted: 04/07/2022] [Indexed: 11/17/2022] [Open access]
Abstract
Accurate detection and localization of tumor lesions are essential for improving diagnosis and personalized cancer therapy. However, the diagnosis of lesions with ambiguous histology depends mainly on clinical experience and suffers from low accuracy and efficiency. Here, we developed a logistic regression model based on mutational signatures (MS) for each cancer type to trace the tumor origin. We observed that MS could distinguish cancer from inflammation and from healthy individuals. By collecting extensive datasets of samples from ten tumor types in the training cohort (5,001 samples) and an independent testing cohort (2,580 samples), cancer-type-specific MS patterns (CTS-MS) were identified and showed robust performance in distinguishing different types of primary and metastatic solid tumors (AUC: 0.76 to 0.93). Moreover, we validated our model in an Asian population and found that its AUC in predicting the tumor origin in this population was higher than 0.7. The metastatic tumor lesions inherited the MS pattern of the primary tumor, suggesting that MS can identify the tissue-of-origin of metastatic cancers. Furthermore, we distinguished breast cancer and prostate cancer with 90% accuracy by combining somatic mutations and CTS-MS from cfDNA, indicating that CTS-MS could improve the accuracy of cfDNA-based cancer-type prediction. In summary, our study demonstrated that MS is a novel, reliable biomarker for diagnosing solid tumors and provided new insights into predicting tissue-of-origin.
Affiliation(s)
- Ziyu Wang
  - Jiangsu Cancer Hospital, Jiangsu Institute of Cancer Research, The Affiliated Cancer Hospital of Nanjing Medical University, Nanjing, China
  - Department of Bioinformatics, Nanjing Medical University, Nanjing, China
  - Institute for Brain Tumors, Jiangsu Collaborative Innovation Center for Cancer Personalized Medicine, Nanjing Medical University, Nanjing, China
- Tingting Zhang
  - Jiangsu Cancer Hospital, Jiangsu Institute of Cancer Research, The Affiliated Cancer Hospital of Nanjing Medical University, Nanjing, China
  - Department of Bioinformatics, Nanjing Medical University, Nanjing, China
  - Institute for Brain Tumors, Jiangsu Collaborative Innovation Center for Cancer Personalized Medicine, Nanjing Medical University, Nanjing, China
- Wei Wu
  - Jiangsu Cancer Hospital, Jiangsu Institute of Cancer Research, The Affiliated Cancer Hospital of Nanjing Medical University, Nanjing, China
  - Department of Bioinformatics, Nanjing Medical University, Nanjing, China
  - Institute for Brain Tumors, Jiangsu Collaborative Innovation Center for Cancer Personalized Medicine, Nanjing Medical University, Nanjing, China
- Lingxiang Wu
  - Jiangsu Cancer Hospital, Jiangsu Institute of Cancer Research, The Affiliated Cancer Hospital of Nanjing Medical University, Nanjing, China
  - Department of Bioinformatics, Nanjing Medical University, Nanjing, China
  - Institute for Brain Tumors, Jiangsu Collaborative Innovation Center for Cancer Personalized Medicine, Nanjing Medical University, Nanjing, China
- Jie Li
  - Jiangsu Cancer Hospital, Jiangsu Institute of Cancer Research, The Affiliated Cancer Hospital of Nanjing Medical University, Nanjing, China
  - Department of Bioinformatics, Nanjing Medical University, Nanjing, China
  - Institute for Brain Tumors, Jiangsu Collaborative Innovation Center for Cancer Personalized Medicine, Nanjing Medical University, Nanjing, China
- Bin Huang
  - Jiangsu Cancer Hospital, Jiangsu Institute of Cancer Research, The Affiliated Cancer Hospital of Nanjing Medical University, Nanjing, China
  - Department of Bioinformatics, Nanjing Medical University, Nanjing, China
  - Institute for Brain Tumors, Jiangsu Collaborative Innovation Center for Cancer Personalized Medicine, Nanjing Medical University, Nanjing, China
- Yuan Liang
  - Jiangsu Cancer Hospital, Jiangsu Institute of Cancer Research, The Affiliated Cancer Hospital of Nanjing Medical University, Nanjing, China
  - Department of Bioinformatics, Nanjing Medical University, Nanjing, China
  - Institute for Brain Tumors, Jiangsu Collaborative Innovation Center for Cancer Personalized Medicine, Nanjing Medical University, Nanjing, China
- Yan Li
  - Jiangsu Cancer Hospital, Jiangsu Institute of Cancer Research, The Affiliated Cancer Hospital of Nanjing Medical University, Nanjing, China
  - Department of Bioinformatics, Nanjing Medical University, Nanjing, China
  - Institute for Brain Tumors, Jiangsu Collaborative Innovation Center for Cancer Personalized Medicine, Nanjing Medical University, Nanjing, China
- Pengping Li
  - Jiangsu Cancer Hospital, Jiangsu Institute of Cancer Research, The Affiliated Cancer Hospital of Nanjing Medical University, Nanjing, China
  - Department of Bioinformatics, Nanjing Medical University, Nanjing, China
  - Institute for Brain Tumors, Jiangsu Collaborative Innovation Center for Cancer Personalized Medicine, Nanjing Medical University, Nanjing, China
- Kening Li
  - Jiangsu Cancer Hospital, Jiangsu Institute of Cancer Research, The Affiliated Cancer Hospital of Nanjing Medical University, Nanjing, China
  - Department of Bioinformatics, Nanjing Medical University, Nanjing, China
  - Institute for Brain Tumors, Jiangsu Collaborative Innovation Center for Cancer Personalized Medicine, Nanjing Medical University, Nanjing, China
  - *Correspondence: Kening Li; Wei Wang; Renhua Guo; Qianghu Wang
- Wei Wang
  - Department of Thoracic Surgery, The First Affiliated Hospital of Nanjing Medical University, Nanjing, China
- Renhua Guo
  - Department of Oncology, The First Affiliated Hospital of Nanjing Medical University, Nanjing, China
- Qianghu Wang
  - Jiangsu Cancer Hospital, Jiangsu Institute of Cancer Research, The Affiliated Cancer Hospital of Nanjing Medical University, Nanjing, China
  - Department of Bioinformatics, Nanjing Medical University, Nanjing, China
  - Institute for Brain Tumors, Jiangsu Collaborative Innovation Center for Cancer Personalized Medicine, Nanjing Medical University, Nanjing, China
6
Liu H, Qiu C, Wang B, Bing P, Tian G, Zhang X, Ma J, He B, Yang J. Evaluating DNA Methylation, Gene Expression, Somatic Mutation, and Their Combinations in Inferring Tumor Tissue-of-Origin. Front Cell Dev Biol 2021; 9:619330. [PMID: 34012960 PMCID: PMC8126648 DOI: 10.3389/fcell.2021.619330] [Citation(s) in RCA: 74] [Impact Index Per Article: 24.7] [Received: 10/20/2020] [Accepted: 03/22/2021] [Indexed: 12/18/2022] [Open access]
Abstract
Carcinoma of unknown primary (CUP) is a type of metastatic cancer whose primary tumor site cannot be identified. CUP accounts for approximately 5% of cancer cases in the United States and usually has an unfavorable prognosis, making it a major threat to public health. Traditional methods to identify the tissue-of-origin (TOO) of CUP, such as immunohistochemistry, can resolve only around 20% of CUP patients. In recent years, a growing number of studies have suggested that the problem can be addressed by integrating machine learning techniques with large biomedical datasets involving multiple types of biomarkers, including epigenetic, genetic, and gene expression profiles, such as DNA methylation. Different biomarkers play different roles in cancer research; for example, genomic mutations in a patient's tumor could point to specific anticancer drugs for treatment, while DNA methylation and copy number variation could reveal tumor tissue of origin and molecular classification. However, there has been no systematic comparison of which biomarker is better at identifying the cancer type and site of origin. In addition, it might be possible to further improve inference accuracy by integrating multiple types of biomarkers. In this study, we used primary tumor data rather than metastatic tumor data. Although the use of primary tumors may introduce some bias into our classification model, their tissues of origin are known. In addition, previous studies have suggested that a CUP prediction model built from primary tumors can efficiently predict the TOO of metastatic cancers (Lal et al., 2013; Brachtel et al., 2016). We systematically compared the performance of three types of biomarkers (DNA methylation, gene expression profile, and somatic mutation), as well as their combinations, in inferring the TOO of CUP patients.
First, we downloaded the gene expression profile, somatic mutation, and DNA methylation data of 7,224 tumor samples across 21 common cancer types from The Cancer Genome Atlas (TCGA) and generated seven different feature matrices through various combinations. Second, we performed feature selection by the Pearson correlation method. The selected features for each matrix were used to build an XGBoost multi-label classification model to infer cancer TOO, an algorithm shown to be effective in a few previous studies. The performance of each biomarker and combination was compared by 10-fold cross-validation. Our results showed that the TOO tracing accuracy using gene expression profile was the highest, followed by DNA methylation, while somatic mutation performed the worst. We also found that simply combining multiple biomarkers does little to improve prediction accuracy.
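The abstract mentions feature selection "by the Pearson correlation method" without giving details. One common reading, assumed here and not confirmed by the source, is to rank each feature by the absolute Pearson correlation between its values and a class indicator and keep the top k. The sketch below illustrates that reading on toy data; `pearson`, `select_top_features`, the feature names, and all values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_top_features(columns, target, k):
    """Return the k feature names with the largest |correlation| to the target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data: six samples, target is a one-vs-rest class indicator.
target = [1, 1, 1, 0, 0, 0]
columns = {
    "feature_a": [5, 6, 7, 1, 2, 1],  # tracks the class closely
    "feature_b": [3, 3, 3, 3, 3, 3],  # constant, carries no signal
    "feature_c": [1, 0, 1, 0, 1, 0],  # weakly related to the class
}
selected = select_top_features(columns, target, k=2)
```

The selected columns would then be passed to the classifier; the study uses XGBoost for that step, which this sketch does not reproduce.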
Affiliation(s)
- Haiyan Liu
  - Academician Workstation, Changsha Medical University, Changsha, China
  - College of Information Engineering, Changsha Medical University, Changsha, China
- Chun Qiu
  - Department of Oncology, Hainan General Hospital, Haikou, China
- Bo Wang
  - Geneis Beijing Co., Ltd., Beijing, China
  - Qingdao Geneis Institute of Big Data Mining and Precision Medicine, Qingdao, China
- Pingping Bing
  - Academician Workstation, Changsha Medical University, Changsha, China
- Geng Tian
  - Geneis Beijing Co., Ltd., Beijing, China
  - Qingdao Geneis Institute of Big Data Mining and Precision Medicine, Qingdao, China
- Xueliang Zhang
  - Department of Oncology, Jiamusi Cancer Hospital, Jiamusi, China
- Jun Ma
  - College of Information Engineering, Changsha Medical University, Changsha, China
- Bingsheng He
  - Academician Workstation, Changsha Medical University, Changsha, China
- Jialiang Yang
  - Academician Workstation, Changsha Medical University, Changsha, China
  - Geneis Beijing Co., Ltd., Beijing, China
  - Qingdao Geneis Institute of Big Data Mining and Precision Medicine, Qingdao, China
7
He B, Dai C, Lang J, Bing P, Tian G, Wang B, Yang J. A machine learning framework to trace tumor tissue-of-origin of 13 types of cancer based on DNA somatic mutation. Biochim Biophys Acta Mol Basis Dis 2020; 1866:165916. [PMID: 32771416 DOI: 10.1016/j.bbadis.2020.165916] [Citation(s) in RCA: 38] [Impact Index Per Article: 9.5] [Received: 05/28/2020] [Revised: 07/20/2020] [Accepted: 08/03/2020] [Indexed: 12/13/2022]
Abstract
Carcinoma of unknown primary (CUP), defined as metastatic cancer with unknown origin, occurs in 3-5 per 100 cancer patients in the United States. The heterogeneity and metastasis of cancer make follow-up diagnosis and treatment of CUP very difficult. Multiple methods have been proposed to find the tissue-of-origin (TOO) of CUP. However, the accuracies of computed tomography (CT) and positron emission tomography (PET) in identifying the TOO were 20%-27% and 24%-40%, respectively, which is not enough to determine targeted therapies. In this study, we provide a machine learning framework to trace tumor tissue origin by using gene length-normalized somatic mutation sequencing data. Somatic mutation data were downloaded from the Data Portal (Release 28) of the International Cancer Genome Consortium (ICGC), and 4909 samples from 13 cancers were used to identify the primary site of cancers. Optimal results were obtained with a 600-gene set by using the random forest algorithm with 10-fold cross-validation; the average accuracy and F1-score were 0.8822 and 0.8886, respectively, across the 13 types of cancer. In conclusion, we provide an effective computational framework to infer cancer tissue-of-origin by combining DNA sequencing and machine learning techniques, which is promising for assisting clinical diagnosis of cancers.
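The abstract does not spell out how the somatic mutation data are normalized by gene length. A minimal sketch of one plausible form, assumed here for illustration only, divides each gene's mutation count in a sample by the gene's length in kilobases so that long genes do not dominate simply because they offer more sites to mutate. The function `length_normalize`, the placeholder gene names, and the lengths are hypothetical and are not taken from the study.

```python
def length_normalize(mutation_counts, gene_lengths_bp):
    """Convert per-gene mutation counts into mutations per kilobase.

    mutation_counts: dict of gene -> mutation count in one sample.
    gene_lengths_bp: dict of gene -> gene length in base pairs (defines the feature set).
    Genes with no recorded mutation get a value of 0.0.
    """
    features = {}
    for gene, length_bp in gene_lengths_bp.items():
        count = mutation_counts.get(gene, 0)
        features[gene] = count / (length_bp / 1000)
    return features


# Placeholder genes with made-up lengths (not real annotations).
gene_lengths_bp = {"GENE_A": 2000, "GENE_B": 50000, "GENE_C": 10000}

# Mutation counts observed in one hypothetical tumor sample.
sample_counts = {"GENE_A": 2, "GENE_B": 5}

features = length_normalize(sample_counts, gene_lengths_bp)
```

Here GENE_B has more raw mutations than GENE_A but a lower normalized value, which is the effect the normalization is meant to have before the features reach a classifier such as the random forest used in the study.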
Affiliation(s)
- Bingsheng He
  - Academician Workstation, Changsha Medical University, Changsha 410219, China
- Chan Dai
  - Geneis Beijing Co., Ltd., Beijing 100102, China
- Jidong Lang
  - Geneis Beijing Co., Ltd., Beijing 100102, China
- Pingping Bing
  - Academician Workstation, Changsha Medical University, Changsha 410219, China
- Geng Tian
  - Geneis Beijing Co., Ltd., Beijing 100102, China
- Bo Wang
  - Geneis Beijing Co., Ltd., Beijing 100102, China
- Jialiang Yang
  - Academician Workstation, Changsha Medical University, Changsha 410219, China
  - Geneis Beijing Co., Ltd., Beijing 100102, China
8
Liu X, Li L, Peng L, Wang B, Lang J, Lu Q, Zhang X, Sun Y, Tian G, Zhang H, Zhou L. Predicting Cancer Tissue-of-Origin by a Machine Learning Method Using DNA Somatic Mutation Data. Front Genet 2020; 11:674. [PMID: 32760423 PMCID: PMC7372518 DOI: 10.3389/fgene.2020.00674] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 02/24/2020] [Accepted: 06/02/2020] [Indexed: 12/11/2022] [Open access]
Abstract
Patients with carcinoma of unknown primary (CUP) account for 3-5% of all cancer cases. A large number of metastatic cancers require further diagnosis to determine their tissue of origin. However, diagnosis of CUP and identification of its primary site are challenging. Previous studies have suggested that molecular profiling of tissue-specific genes could be useful in inferring the primary tissue of a tumor. The purpose of this study was to evaluate the performance of somatic mutations detected in a tumor in identifying the cancer tissue of origin. We downloaded the somatic mutation datasets from the International Cancer Genome Consortium project. The random forest algorithm was used to extract features, and a classifier was built using logistic regression. Specifically, the somatic mutations of 300 genes were extracted; these genes are significantly enriched in functions such as cell-to-cell adhesion. In addition, the prediction accuracy of tissue-of-origin inference for 3,374 cancer samples across 13 cancer types reached 81% in 10-fold cross-validation. Our method could be useful in the identification of cancer tissue of origin, as well as in the diagnosis and treatment of cancers.
Affiliation(s)
- Xiaojun Liu
  - School of Computer Science, Hunan University of Technology, Zhuzhou, China
- Lihong Peng
  - School of Computer Science, Hunan University of Technology, Zhuzhou, China
- Bo Wang
  - Genesis Beijing Co., Ltd., Beijing, China
- Yi Sun
  - Chifeng Municipal Hospital, Chifeng, China
- Geng Tian
  - Genesis Beijing Co., Ltd., Beijing, China
- Huajun Zhang
  - College of Mathematics and Computer Science, Zhejiang Normal University, Jinhua, China
- Liqian Zhou
  - School of Computer Science, Hunan University of Technology, Zhuzhou, China
9
He B, Lang J, Wang B, Liu X, Lu Q, He J, Gao W, Bing P, Tian G, Yang J. TOOme: A Novel Computational Framework to Infer Cancer Tissue-of-Origin by Integrating Both Gene Mutation and Expression. Front Bioeng Biotechnol 2020; 8:394. [PMID: 32509741 PMCID: PMC7248358 DOI: 10.3389/fbioe.2020.00394] [Citation(s) in RCA: 65] [Impact Index Per Article: 16.3] [Received: 02/19/2020] [Accepted: 04/08/2020] [Indexed: 02/05/2023] [Open access]
Abstract
Metastatic cancers require further diagnosis to determine their primary tumor sites. However, according to statistics from the United States, the tissue-of-origin of around 5% of tumors cannot be identified by routine medical diagnosis. With the development of machine learning techniques and the accumulation of large cancer data sets from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO), it is now feasible to predict cancer tissue-of-origin with computational tools. A metastatic tumor inherits characteristics from its tissue-of-origin, and both gene expression profiles and somatic mutations have tissue specificity. Thus, we developed a computational framework to infer tumor tissue-of-origin by integrating both gene mutation and expression (TOOme). Specifically, we first perform feature selection on both gene expression and mutation data using a random forest method. The selected features are then used to build a multi-label classification model to infer cancer tissue-of-origin. We adopt a few popular multi-label classification methods, which are compared by 10-fold cross-validation. We applied TOOme to TCGA data containing 7,008 non-metastatic samples across 20 solid tumors. Seventy-four genes from the gene expression profiles and six genes from the gene mutation data are selected by the random forest process; they can be divided into two categories: (1) cancer-type-specific genes and (2) genes expressed or mutated in several cancers with different expression levels or mutation rates. Functional analysis indicates that the selected genes are significantly enriched in gland development, urogenital system development, hormone metabolic process, thyroid hormone generation, prostate hormone generation, and so on. Among the multi-label classification methods, random forest performs best, with a 10-fold cross-validation prediction accuracy of 96%. We also use the 19 metastatic samples from TCGA and 256 cancer samples downloaded from GEO as independent testing data, for which TOOme achieves a prediction accuracy of 89%. The cross-validation accuracy is better than that obtained using gene expression alone (95%) or gene mutation alone (53%). In conclusion, TOOme provides a quick yet accurate alternative to traditional medical methods for inferring cancer tissue-of-origin. In addition, methods combining somatic mutation and gene expression outperform those using gene expression or mutation alone.
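The 10-fold cross-validation used to compare classifiers in this abstract can be illustrated with a minimal, standard-library-only sketch. It is not the TOOme implementation: the nearest-centroid classifier is a deliberately simple stand-in for the random forest, and every function name here (`k_fold_indices`, `fit_centroids`, `predict`, `cross_val_accuracy`) is hypothetical.

```python
import random
from collections import defaultdict


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_centroids(X, y):
    """Stand-in classifier (assumption, not a random forest):
    store one mean feature vector per class."""
    sums, counts = {}, defaultdict(int)
    for row, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def predict(centroids, row):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def cross_val_accuracy(X, y, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, and pool the results."""
    correct = 0
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train_X = [X[i] for i in range(len(X)) if i not in held]
        train_y = [y[i] for i in range(len(X)) if i not in held]
        model = fit_centroids(train_X, train_y)
        correct += sum(predict(model, X[i]) == y[i] for i in held_out)
    return correct / len(X)


if __name__ == "__main__":
    # Toy data: two well-separated classes with two features each.
    X_demo = [[i * 0.1, 0.0] for i in range(20)] + [[10 + i * 0.1, 10.0] for i in range(20)]
    y_demo = ["A"] * 20 + ["B"] * 20
    print(cross_val_accuracy(X_demo, y_demo, k=10))
```

In a pipeline like the one described, any feature selection step would normally be repeated inside each training fold so that held-out samples do not influence which features are chosen; the abstract does not say how TOOme handles this.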
Affiliation(s)
- Binsheng He
- Academician Workstation, Changsha Medical University, Changsha, China
- Bo Wang
- Geneis Beijing Co., Ltd., Beijing, China
- Jianjun He
- Academician Workstation, Changsha Medical University, Changsha, China
- Wei Gao
- Fujian Provincial Cancer Hospital, Fuzhou, China
- Pingping Bing
- Academician Workstation, Changsha Medical University, Changsha, China
- Geng Tian
- Geneis Beijing Co., Ltd., Beijing, China
|
10
|
Bavafaye Haghighi E, Knudsen M, Elmedal Laursen B, Besenbacher S. Hierarchical Classification of Cancers of Unknown Primary Using Multi-Omics Data. Cancer Inform 2019; 18:1176935119872163. [PMID: 31516310 PMCID: PMC6719477 DOI: 10.1177/1176935119872163] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/20/2019] [Accepted: 07/25/2019] [Indexed: 12/25/2022] Open
Abstract
A cancer of unknown primary (CUP) is a metastatic cancer for which standard diagnostic tests fail to locate the primary cancer. As standard treatments are based on the cancer type, such cases are hard to treat and have very poor prognosis. Using molecular data from the metastatic cancer to predict the primary site can make treatment choice easier and enable targeted therapy. In this article, we first examine the ability to predict cancer type using different types of omics data. Methylation data lead to slightly better prediction than gene expression and both these are superior to classification using somatic mutations. After using 3 data types independently, we notice some differences between the classes that tend to be misclassified, suggesting that integrating the data might improve accuracy. In light of the different levels of information provided by different omics types and to be able to handle missing data, we perform multi-omics classification by hierarchically combining the classifiers. The proposed hierarchical method first classifies based on the most informative type of omics data and then uses the other types of omics data to classify samples that did not get a high confidence classification in the first step. The resulting hierarchical classifier has higher accuracy than any of the single omics classifiers and thus proves that the combination of different data types is beneficial. Our results show that using multi-omics data can improve the classification of cancer types. We confirm this by testing our method on metastatic cancers from the MET500 dataset.
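The hierarchical scheme described in this abstract (classify with the most informative omics type first, then fall back to the other types when confidence is low or data are missing) might be sketched as follows. This is an illustrative sketch, not the authors' code: the 0.8 threshold, the fallback to the most confident low-confidence prediction, the `make_lookup` stand-in classifiers, and the tumor-type labels in the usage example are all assumptions.

```python
from typing import Callable, List, Optional, Tuple

Prediction = Tuple[str, float]  # (predicted class, confidence in [0, 1])
Classifier = Callable[[dict], Optional[Prediction]]


def hierarchical_predict(
    sample: dict,
    classifiers: List[Tuple[str, Classifier]],
    threshold: float = 0.8,  # hypothetical cutoff, not taken from the paper
) -> Optional[Tuple[str, float, str]]:
    """Try classifiers in order (most informative omics type first).

    Returns (label, confidence, source) from the first classifier whose
    confidence reaches the threshold. If none does, falls back to the most
    confident prediction seen (an assumption of this sketch). Returns None
    when no data type is available for the sample.
    """
    best = None
    for name, classify in classifiers:
        result = classify(sample)
        if result is None:  # this omics type is missing for the sample
            continue
        label, confidence = result
        if confidence >= threshold:
            return label, confidence, name
        if best is None or confidence > best[1]:
            best = (label, confidence, name)
    return best


def make_lookup(key: str) -> Classifier:
    """Hypothetical stand-in for a trained single-omics classifier:
    it simply reads a precomputed (label, confidence) pair from the sample."""
    return lambda sample: sample.get(key)


# Order follows the abstract: methylation, then expression, then mutations.
CASCADE = [
    ("methylation", make_lookup("methylation")),
    ("expression", make_lookup("expression")),
    ("mutation", make_lookup("mutation")),
]

if __name__ == "__main__":
    sample = {"methylation": ("LUAD", 0.6), "expression": ("LUSC", 0.9)}
    print(hierarchical_predict(sample, CASCADE))
```

Because each classifier may return `None`, the same cascade handles samples with missing data types, which the abstract names as one motivation for the hierarchical design.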
Affiliation(s)
- Michael Knudsen
- Department of Molecular Medicine (MOMA), Aarhus University Hospital, Aarhus, Denmark
- Britt Elmedal Laursen
- Department of Molecular Medicine (MOMA), Aarhus University Hospital, Aarhus, Denmark
- Søren Besenbacher
- Department of Molecular Medicine (MOMA), Aarhus University Hospital, Aarhus, Denmark
|
11
|
Van Hoeck A, Tjoonk NH, van Boxtel R, Cuppen E. Portrait of a cancer: mutational signature analyses for cancer diagnostics. BMC Cancer 2019; 19:457. [PMID: 31092228 PMCID: PMC6521503 DOI: 10.1186/s12885-019-5677-2] [Citation(s) in RCA: 67] [Impact Index Per Article: 13.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/07/2019] [Accepted: 05/03/2019] [Indexed: 01/10/2023] Open
Abstract
BACKGROUND In the past decade, systematic and comprehensive analyses of cancer genomes have identified cancer driver genes and revealed unprecedented insight into the molecular mechanisms underlying the initiation and progression of cancer. These studies illustrate that although every cancer has a unique genetic make-up, there are only a limited number of mechanisms that shape the mutational landscapes of cancer genomes, as reflected by characteristic computationally-derived mutational signatures. Importantly, the molecular mechanisms underlying specific signatures can now be dissected and coupled to treatment strategies. Systematic characterization of mutational signatures in a cancer patient's genome may thus be a promising new tool for molecular tumor diagnosis and classification. RESULTS In this review, we describe the status of mutational signature analysis in cancer genomes and discuss the opportunities and relevance, as well as future challenges, for further implementation of mutational signatures in clinical tumor diagnostics and therapy guidance. CONCLUSIONS Scientific studies have illustrated the potential of mutational signature analysis in cancer research. As such, we believe that the implementation of mutational signature analysis within the diagnostic workflow will improve cancer diagnosis in the future.
Affiliation(s)
- Arne Van Hoeck
- Center for Molecular Medicine and Oncode Institute, University Medical Centre Utrecht, Heidelberglaan 100, 3584CX Utrecht, The Netherlands
- Niels H. Tjoonk
- Center for Molecular Medicine and Oncode Institute, University Medical Centre Utrecht, Heidelberglaan 100, 3584CX Utrecht, The Netherlands
- Princess Máxima Center for Pediatric Oncology and Oncode Institute, Heidelberglaan 25, 3584CS Utrecht, The Netherlands
- Ruben van Boxtel
- Princess Máxima Center for Pediatric Oncology and Oncode Institute, Heidelberglaan 25, 3584CS Utrecht, The Netherlands
- Edwin Cuppen
- Center for Molecular Medicine and Oncode Institute, University Medical Centre Utrecht, Heidelberglaan 100, 3584CX Utrecht, The Netherlands
- Hartwig Medical Foundation, Science Park 408, 1098XH Amsterdam, The Netherlands
|
12
|
Analysis of renal cancer cell lines from two major resources enables genomics-guided cell line selection. Nat Commun 2017; 8:15165. [PMID: 28489074 PMCID: PMC5436135 DOI: 10.1038/ncomms15165] [Citation(s) in RCA: 49] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/11/2016] [Accepted: 03/06/2017] [Indexed: 12/19/2022] Open
Abstract
The utility of cancer cell lines is affected by the similarity to endogenous tumour cells. Here we compare genomic data from 65 kidney-derived cell lines from the Cancer Cell Line Encyclopedia and the COSMIC Cell Lines Project to three renal cancer subtypes from The Cancer Genome Atlas: clear cell renal cell carcinoma (ccRCC, also known as kidney renal clear cell carcinoma), papillary (pRCC, also known as kidney papillary) and chromophobe (chRCC, also known as kidney chromophobe) renal cell carcinoma. Clustering copy number alterations shows that most cell lines resemble ccRCC, a few (including some often used as models of ccRCC) resemble pRCC, and none resemble chRCC. Human ccRCC tumours clustering with cell lines display clinical and genomic features of more aggressive disease, suggesting that cell lines best represent aggressive tumours. We stratify mutations and copy number alterations for important kidney cancer genes by the consistency between databases, and classify cell lines into established gene expression-based indolent and aggressive subtypes. Our results could aid investigators in analysing appropriate renal cancer cell lines.
|
13
|
Big Data and Cancer Research. BIG DATA ANALYTICS 2016. [DOI: 10.1007/978-81-322-3628-3_14] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/20/2022] Open
|
14
|
Marquard AM, Birkbak NJ, Thomas CE, Favero F, Krzystanek M, Lefebvre C, Ferté C, Jamal-Hanjani M, Wilson GA, Shafi S, Swanton C, André F, Szallasi Z, Eklund AC. TumorTracer: a method to identify the tissue of origin from the somatic mutations of a tumor specimen. BMC Med Genomics 2015; 8:58. [PMID: 26429708 PMCID: PMC4590711 DOI: 10.1186/s12920-015-0130-0] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/27/2015] [Accepted: 08/17/2015] [Indexed: 02/07/2023] Open
Abstract
BACKGROUND A substantial proportion of cancer cases present with a metastatic tumor and require further testing to determine the primary site; many of these are never fully diagnosed and remain cancer of unknown primary origin (CUP). It has been previously demonstrated that the somatic point mutations detected in a tumor can be used to identify its site of origin with limited accuracy. We hypothesized that higher accuracy could be achieved by a classification algorithm based on the following feature sets: 1) the number of nonsynonymous point mutations in a set of 232 specific cancer-associated genes, 2) frequencies of the 96 classes of single-nucleotide substitution determined by the flanking bases, and 3) copy number profiles, if available. METHODS We used publicly available somatic mutation data from the COSMIC database to train random forest classifiers to distinguish among those tissues of origin for which sufficient data was available. We selected feature sets using cross-validation and then derived two final classifiers (with or without copy number profiles) using 80 % of the available tumors. We evaluated the accuracy using the remaining 20 %. For further validation, we assessed accuracy of the without-copy-number classifier on three independent data sets: 1669 newly available public tumors of various types, a cohort of 91 breast metastases, and a set of 24 specimens from 9 lung cancer patients subjected to multiregion sequencing. RESULTS The cross-validation accuracy was highest when all three types of information were used. On the left-out COSMIC data not used for training, we achieved a classification accuracy of 85 % across 6 primary sites (with copy numbers), and 69 % across 10 primary sites (without copy numbers). Importantly, a derived confidence score could distinguish tumors that could be identified with 95 % accuracy (32 %/75 % of tumors with/without copy numbers) from those that were less certain. 
Accuracy in the independent data sets was 46 %, 53 % and 89 % respectively, similar to the accuracy expected from the training data. CONCLUSIONS Identification of primary site from point mutation and/or copy number data may be accurate enough to aid clinical diagnosis of cancers of unknown primary origin.
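The second feature set in this abstract, the frequencies of the 96 classes of single-nucleotide substitution determined by the flanking bases, can be illustrated with a short sketch. It assumes the commonly used convention of reporting each substitution from the pyrimidine (C or T) strand, giving 6 substitution types times 16 flanking-base combinations; the abstract does not spell out TumorTracer's exact encoding, and the function names and class label format here are hypothetical.

```python
from collections import Counter
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Six substitution types, written with a pyrimidine (C or T) reference base.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]

# 6 substitutions x 4 five-prime bases x 4 three-prime bases = 96 classes.
CLASSES = [
    f"{five}[{sub}]{three}"
    for sub in SUBSTITUTIONS
    for five, three in product("ACGT", repeat=2)
]


def classify(five, ref, alt, three):
    """Map one substitution and its flanking bases to one of the 96 classes.

    Purine references (A or G) are converted to the opposite strand, which
    reverse-complements the context and swaps the flanking bases.
    """
    if ref == alt:
        raise ValueError("reference and alternate bases must differ")
    if ref in "AG":
        five, ref, alt, three = (
            COMPLEMENT[three], COMPLEMENT[ref], COMPLEMENT[alt], COMPLEMENT[five]
        )
    return f"{five}[{ref}>{alt}]{three}"


def mutation_spectrum(mutations):
    """Return the fraction of mutations in each of the 96 classes.

    `mutations` is an iterable of (five_prime, ref, alt, three_prime) tuples.
    """
    counts = Counter(classify(*m) for m in mutations)
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in CLASSES}


if __name__ == "__main__":
    tumor = [("T", "G", "A", "C"), ("A", "C", "T", "G"), ("G", "C", "T", "A")]
    spectrum = mutation_spectrum(tumor)
    print({c: f for c, f in spectrum.items() if f > 0})
```

The resulting 96-element frequency vector is the kind of per-tumor feature that could be passed, alongside per-gene mutation counts, to a classifier such as the random forests described in the abstract.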
Affiliation(s)
- Andrea Marion Marquard
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kemitorvet 8, DK-2800, Lyngby, Denmark.
- Nicolai Juul Birkbak
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kemitorvet 8, DK-2800, Lyngby, Denmark.
- Cancer Research UK Lung Cancer Centre of Excellence, University College London Cancer Institute, 72 Huntley Street, London, WC1E 6BT, UK.
- Cecilia Engel Thomas
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kemitorvet 8, DK-2800, Lyngby, Denmark.
- NNF Center for Protein Research, University of Copenhagen, Blegdamsvej 3B, DK-2200, Copenhagen, Denmark.
- Francesco Favero
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kemitorvet 8, DK-2800, Lyngby, Denmark.
- Marcin Krzystanek
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kemitorvet 8, DK-2800, Lyngby, Denmark.
- Charles Ferté
- Inserm Unit U981, Gustave Roussy, Villejuif, France.
- Department of Medical Oncology, Gustave Roussy, Villejuif, France.
- Mariam Jamal-Hanjani
- Cancer Research UK Lung Cancer Centre of Excellence, University College London Cancer Institute, 72 Huntley Street, London, WC1E 6BT, UK.
- Gareth A Wilson
- Cancer Research UK Lung Cancer Centre of Excellence, University College London Cancer Institute, 72 Huntley Street, London, WC1E 6BT, UK.
- Seema Shafi
- Cancer Research UK Lung Cancer Centre of Excellence, University College London Cancer Institute, 72 Huntley Street, London, WC1E 6BT, UK.
- Charles Swanton
- Cancer Research UK Lung Cancer Centre of Excellence, University College London Cancer Institute, 72 Huntley Street, London, WC1E 6BT, UK.
- Cancer Research UK London Research Institute, London, UK.
- Fabrice André
- Inserm Unit U981, Gustave Roussy, Villejuif, France.
- Department of Medical Oncology, Gustave Roussy, Villejuif, France.
- Zoltan Szallasi
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kemitorvet 8, DK-2800, Lyngby, Denmark.
- Children's Hospital Informatics Program at the Harvard-MIT Division of Health Sciences and Technology (CHIP@HST), Harvard Medical School, Boston, USA.
- Aron Charles Eklund
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kemitorvet 8, DK-2800, Lyngby, Denmark.
|
15
|
Dietlein F, Thelen L, Jokic M, Jachimowicz RD, Ivan L, Knittel G, Leeser U, van Oers J, Edelmann W, Heukamp LC, Reinhardt HC. A Functional Cancer Genomics Screen Identifies a Druggable Synthetic Lethal Interaction between MSH3 and PRKDC. Cancer Discov 2014; 4:592-605. [DOI: 10.1158/2159-8290.cd-13-0907] [Citation(s) in RCA: 55] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
|