Dark-matter matters: Discriminating subtle blood cancers using the darkest DNA.
PLoS Comput Biol 2019; 15:e1007332. [PMID: 31469830 PMCID: PMC6742441 DOI: 10.1371/journal.pcbi.1007332]
[Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 11/20/2018] [Revised: 09/12/2019] [Accepted: 08/14/2019] [Indexed: 12/14/2022]
Abstract
The confluence of deep sequencing and powerful machine learning is providing an unprecedented peek at the darkest of the dark genomic matter, the non-coding genomic regions lacking any functional annotation. While deep sequencing uncovers rare tumor variants, the heterogeneity of the disease confounds the best of machine learning (ML) algorithms. Here we set out to ask whether the dark-matter of the genome encompasses signals that can distinguish fine subtypes of disease that are otherwise genomically indistinguishable. We introduce a novel stochastic regularization, ReVeaL, that empowers ML to discriminate subtle cancer subtypes even from the same ‘cell of origin’. Analogous to heritability, which is implicitly defined on the whole genome, we use predictability (F1 score), which can be defined on portions of the genome. In an effort to distinguish cancer subtypes using dark-matter DNA, we applied ReVeaL to a new WGS dataset from 727 patient samples with seven forms of hematological cancers and assessed the predictivity over several genomic regions including genic, non-dark, non-coding, non-genic, and dark. ReVeaL enabled improved discrimination of cancer subtypes for all segments of the genome. The non-genic, non-coding, and dark-matter regions had the highest F1 scores, with dark-matter having the highest level of predictability. Based on ReVeaL’s predictability of different genomic regions, dark-matter contains enough signal to significantly discriminate fine subtypes of disease. Hence, the agglomeration of rare variants, even in the hitherto unannotated and ill-understood regions of the genome, may play a substantial role in the disease etiology and deserves much more attention.
Many subtypes of cancer cannot be distinguished based on their genomic profiles. With the ever-increasing use of sequencing, we now have the ability to look deeper into the genome and pick up on hidden signals in areas typically considered irrelevant to disease. To overcome the issue of rare variants and the vast amount of heterogeneity found in these non-coding sectors, we introduce ReVeaL, a new algorithm capable of correcting for both challenges. Using this approach, we demonstrate that the non-coding regions of the genome carry more signal for distinguishing subtle subtypes of disease than all the coding regions. Specifically, we show that the darkest unexplored genomic regions, the non-coding genome with no functional annotation whatsoever in the literature, have the strongest signal. Thus, dark-matter does indeed matter and should not be ignored but rather considered in the continued, pressing task of finding biomarkers of disease to adequately treat our patients.
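The abstract uses the F1 score as its measure of predictability for a genomic region. As a minimal sketch of how such a score can be computed for a multi-class subtype classifier, the Python below derives a one-vs-rest F1 per class and then takes the unweighted (macro) average. The macro averaging, the function names `per_class_f1` and `macro_f1`, and the labels in the usage note are assumptions made for illustration; the listing does not say how the authors aggregated F1 across subtypes, and this is not the ReVeaL method itself.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label seen.

    F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # label p was predicted, but the true label was t
            fn[t] += 1  # true label t was missed
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed aggregation)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)
```

For example, with hypothetical subtype labels `y_true = ["A", "A", "B", "B", "C", "C"]` and predictions `y_pred = ["A", "B", "B", "B", "C", "A"]`, the per-class scores are 0.5 for A, 0.8 for B, and 2/3 for C. Computing the same score on predictions obtained from different genomic regions would give one number per region that can be compared directly.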