Meyerson W, Leisman J, Navarro FCP, Gerstein M. Origins and characterization of variants shared between databases of somatic and germline human mutations. BMC Bioinformatics 2020; 21:227. [PMID: 32498674 PMCID: PMC7273669 DOI: 10.1186/s12859-020-3508-8] [Received: 02/24/2020] [Accepted: 04/20/2020] [Indexed: 01/26/2023]
Abstract
Background: Mutations arise in the human genome in two major settings: the germline and the soma. These settings involve different inheritance patterns, time scales, chromatin structures, and environmental exposures, all of which impact the resulting distribution of substitutions. Nonetheless, many of the same single nucleotide variants (SNVs) are shared between germline and somatic mutation databases, such as between the gnomAD database of 120,000 germline exomes and the TCGA database of 10,000 somatic exomes. Here, we sought to explain this overlap.

Results: After strict filtering to exclude common germline polymorphisms and sites with poor coverage or mappability, we found 336,987 variants shared between the somatic and germline databases. A uniform statistical model explains 34% of these shared variants; a model that incorporates the varying mutation rates of the basic mutation types explains another 50% of shared variants; and a model that includes extended nucleotide contexts (e.g. surrounding 3 bases on either side) explains an additional 4% of shared variants. Analysis of read depth finds mixed evidence that up to 4% of the shared variants may represent germline variants leaked into somatic call sets. 9% of the shared variants are not explained by any model. Sequencing errors and convergent evolution did not account for these. We surveyed other factors as well: cancers driven by endogenous mutational processes share a greater fraction of variants with the germline, and recently derived germline variants were more likely to be somatically shared than were ancient germline ones.

Conclusions: Overall, we find that shared variants largely represent bona fide biological occurrences of the same variant in the germline and somatic setting and arise primarily because DNA has some of the same basic chemical vulnerabilities in either setting. Moreover, we find mixed evidence that somatic call sets leak appreciable numbers of germline variants, which is relevant to genomic privacy regulations. In future studies, the similar chemical vulnerability of DNA between the somatic and germline settings might be used to help identify disease-related genes by guiding the development of background-mutation models that are informed by both somatic and germline patterns of variation.
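To make the "uniform statistical model" mentioned in the Results concrete, here is a minimal sketch of one way such a baseline could be computed. It is not taken from the paper: it assumes every possible SNV (site and alternate allele) is equally likely and that the two databases are independent draws, so the expected overlap is the hypergeometric mean. The function names and all input numbers below are hypothetical placeholders, not values from gnomAD or TCGA.

```python
def expected_shared_uniform(n_germline: int, n_somatic: int, n_possible: int) -> float:
    """Expected number of variants shared by two independent, uniform draws.

    Assumption (not from the paper): each of the n_possible SNVs is equally
    likely, and each database holds distinct variants drawn without
    replacement. The expected overlap is then the hypergeometric mean.
    """
    if n_possible <= 0:
        raise ValueError("n_possible must be positive")
    if not (0 <= n_germline <= n_possible and 0 <= n_somatic <= n_possible):
        raise ValueError("variant counts must lie between 0 and n_possible")
    return n_germline * n_somatic / n_possible


def fraction_explained(expected: float, observed: int) -> float:
    """Share of the observed overlap that the model accounts for, capped at 1."""
    if observed <= 0:
        raise ValueError("observed must be positive")
    return min(expected / observed, 1.0)


# Hypothetical toy inputs: 1,000,000 callable sites, 3 alternate alleles each.
n_possible = 3 * 1_000_000
expected = expected_shared_uniform(300_000, 50_000, n_possible)
print(expected)                              # 5000.0
print(fraction_explained(expected, 20_000))  # 0.25
```

A model that accounts for mutation type or nucleotide context, as described in the abstract, would replace the single uniform probability with per-category probabilities and sum the expected overlap across categories.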
Affiliation(s)
- William Meyerson
  - Computational Biology & Bioinformatics, Yale University, New Haven, CT, 06511, USA
  - Yale School of Medicine, Yale University, New Haven, CT, 06510, USA
- John Leisman
  - Molecular, Cellular and Developmental Biology, Yale University, New Haven, CT, 06510, USA
- Fabio C P Navarro
  - Computational Biology & Bioinformatics, Yale University, New Haven, CT, 06511, USA
  - Molecular Biophysics & Biochemistry, Yale University, New Haven, CT, 06511, USA
- Mark Gerstein
  - Computational Biology & Bioinformatics, Yale University, New Haven, CT, 06511, USA
  - Yale School of Medicine, Yale University, New Haven, CT, 06510, USA
  - Molecular Biophysics & Biochemistry, Yale University, New Haven, CT, 06511, USA
  - Department of Computer Science, Yale University, New Haven, CT, 06511, USA