1
Sinha R, Pal RK, De RK. GenSeg and MR-GenSeg: A Novel Segmentation Algorithm and its Parallel MapReduce Based Approach for Identifying Genomic Regions With Copy Number Variations. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:443-454. [PMID: 32750860 DOI: 10.1109/tcbb.2020.3000661] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/11/2023]
Abstract
Identifying intragenic as well as intergenic DNA sequences that carry structural alterations is an important research area, since such alterations may be the root cause of many neurological and autoimmune diseases, as well as cancer. Working with whole-genome NGS data has provided new insight in this regard, but has led to an explosion of data that is growing exponentially. Hence, the challenges lie in efficiently storing and processing this big data. In this study, we have developed a novel segmentation algorithm, called GenSeg, and its parallel MapReduce-based counterpart, called MR-GenSeg, for detecting copy number variations (CNVs). In order to annotate CNVs, the segments formed by GenSeg/MR-GenSeg have been represented in a novel way using a binary tree, where each node is a CNV event. GenSeg considers the position-specific data of the whole-genome DNA sequence, so that precise identification of breakpoints is possible. GenSeg/MR-GenSeg has been compared with twelve popular CNV detection algorithms; it has outperformed the others in terms of sensitivity and has achieved a good F-score. MR-GenSeg has also excelled in terms of speedup when compared with these algorithms. The effect of CNVs on immunoglobulin (IG) genes has also been analysed in this study. Availability: The source code is available at https://github.com/rituparna-sinha/MapReduce-GENSEG.
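To make the idea of position-level segmentation split across parallel workers more concrete, here is a minimal sketch. It is not the GenSeg or MR-GenSeg algorithm, whose details the abstract does not give. It assumes a per-position read depth, a fixed expected depth, and simple ratio thresholds; the function names `call_state`, `map_chunk`, and `reduce_segments` and the thresholds 0.75 and 1.25 are all hypothetical. The map step segments each chunk independently, and the reduce step merges segments that continue across a chunk boundary.

```python
from itertools import groupby


def call_state(depth, expected, low=0.75, high=1.25):
    """Classify one position as 'loss', 'gain', or 'normal' by its depth ratio.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    ratio = depth / expected
    if ratio < low:
        return "loss"
    if ratio > high:
        return "gain"
    return "normal"


def map_chunk(start, depths, expected):
    """Map step: turn one chunk of depths into (start, end, state) runs."""
    segments = []
    pos = start
    for state, run in groupby(call_state(d, expected) for d in depths):
        length = sum(1 for _ in run)
        segments.append((pos, pos + length - 1, state))
        pos += length
    return segments


def reduce_segments(chunk_results):
    """Reduce step: concatenate chunk results in genomic order and merge
    runs that touch across a chunk boundary and share the same state."""
    merged = []
    for segments in chunk_results:
        for seg in segments:
            if merged and merged[-1][2] == seg[2] and merged[-1][1] + 1 == seg[0]:
                merged[-1] = (merged[-1][0], seg[1], seg[2])
            else:
                merged.append(seg)
    return merged


# Toy per-position depths (0-based positions); expected depth is 30.
depths = [30, 31, 29, 14, 15, 13, 16, 30, 62, 60, 61, 30]
chunk_size = 4
chunks = [(i, depths[i:i + chunk_size]) for i in range(0, len(depths), chunk_size)]

mapped = [map_chunk(start, chunk, expected=30) for start, chunk in chunks]
segments = reduce_segments(mapped)
print(segments)
```

In this toy input the loss spanning positions 3 to 6 is cut by a chunk boundary, so the reduce step has to stitch it back together. A real pipeline would presumably need normalisation and a statistical test in place of fixed thresholds.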
2
Kim MJ, Lee S, Yun H, Cho SI, Kim B, Lee JS, Chae JH, Sun C, Park SS, Seong MW. Consistent count region-copy number variation (CCR-CNV): an expandable and robust tool for clinical diagnosis of copy number variation at the exon level using next-generation sequencing data. Genet Med 2021; 24:663-672. [PMID: 34906491 DOI: 10.1016/j.gim.2021.10.025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/29/2021] [Accepted: 10/29/2021] [Indexed: 11/29/2022]
Abstract
PURPOSE Despite the importance of exonic copy number variations (CNVs) in human genetic diseases, reliable next-generation sequencing-based methods for detecting them are unavailable. We developed an expandable and robust exonic CNV detection tool called consistent count region (CCR)-CNV. METHODS In total, about 1000 samples of the truth set were used for validating CCR-CNV. We compared the performance of CCR-CNV with that of 2 well-known CNV tools. Finally, to overcome the limitations of CCR-CNV, we devised a combined approach. RESULTS The mean sensitivity and specificity of CCR-CNV alone were above 95%, which were superior to those of other CNV tools, such as DECoN and Atlas-CNV. However, its low covered region, low positive predictive value, and high false discovery rate are obstacles to its use in clinical settings. The combined approach performed much better than CCR-CNV alone. CONCLUSION In this study, we present a novel diagnostic tool that allows the identification of exonic CNVs with high confidence using various reagents and clinical next-generation sequencing platforms. We validated this method using the largest multiplex ligation-dependent probe amplification-confirmed data set, including sufficient copy normal control data. The approach, combined with existing CNV tools, allows the implementation of CCR-CNV in clinical settings.
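The abstract reports sensitivity and specificity above 95% alongside a low positive predictive value and a high false discovery rate. These can coexist when true CNVs are rare among the tested exons, as the following sketch illustrates. The counts are hypothetical and chosen only for the arithmetic; they are not taken from the study, and `diagnostic_metrics` is an illustrative helper name.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, PPV, and FDR from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true CNVs that were called
        "specificity": tn / (tn + fp),  # normal exons correctly left uncalled
        "ppv": tp / (tp + fp),          # calls that are true CNVs
        "fdr": fp / (tp + fp),          # calls that are false (1 - PPV)
    }


# Hypothetical counts: 1000 exons tested, of which 20 truly carry a CNV.
metrics = diagnostic_metrics(tp=19, fp=49, tn=931, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these assumed counts, sensitivity and specificity are both 0.95, yet only 19 of the 68 calls are true, so the PPV is about 0.28 and the FDR about 0.72.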
Affiliation(s)
- Man Jin Kim
- Department of Genomic Medicine, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Korea; Department of Laboratory Medicine, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Korea
- Sungyoung Lee
- Department of Genomic Medicine, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Korea; Center for Precision Medicine, Seoul National University Hospital, Seoul, Korea
- Hongseok Yun
- Department of Genomic Medicine, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Korea; Center for Precision Medicine, Seoul National University Hospital, Seoul, Korea
- Sung Im Cho
- Center for Precision Medicine, Seoul National University Hospital, Seoul, Korea
- Boram Kim
- Center for Precision Medicine, Seoul National University Hospital, Seoul, Korea
- Jee-Soo Lee
- Department of Laboratory Medicine, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Korea
- Jong Hee Chae
- Department of Genomic Medicine, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Korea; Department of Pediatrics, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Korea
- Sung Sup Park
- Department of Laboratory Medicine, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Korea
- Moon-Woo Seong
- Department of Laboratory Medicine, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Korea
3
García-Fernández J, Vilches-Arroyo S, Olavarrieta L, Pérez-Pérez J, Rodríguez de Córdoba S. Detection of Genetic Rearrangements in the Regulators of Complement Activation RCA Cluster by High-Throughput Sequencing and MLPA. Methods Mol Biol 2021; 2227:159-178. [PMID: 33847941 DOI: 10.1007/978-1-0716-1016-9_16] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 12/14/2022]
Abstract
The regulators of complement activation (RCA) gene cluster in 1q31-1q32 includes most of the genes encoding complement regulatory proteins. Genetic variability in the RCA gene cluster frequently involves copy number variations (CNVs), a type of chromosomal structural variation that alters the number of copies of specific regions of DNA. CNVs in the RCA gene cluster are often related to gene rearrangements that generate novel genes, carrying internal duplications or deletions, and hybrid genes, resulting from the fusion or exchange of genetic material between two different genes. These gene rearrangements are strongly associated with a number of rare and common diseases characterized by complement dysregulation. Identification of CNVs in the RCA gene cluster is critical in the molecular diagnosis of these diseases. It can be done by bioinformatics analysis of DNA sequence data generated by massively parallel sequencing techniques (NGS, next-generation sequencing) but often requires special techniques such as multiplex ligation-dependent probe amplification (MLPA). This is because the massively parallel DNA sequencing approaches currently in use do not easily identify all the structural variations in the RCA gene cluster. We describe here how to use MLPA assays and two computational tools for analyzing NGS data, NextGENe and ONCOCNV, to detect CNVs and gene rearrangements in the RCA gene cluster.
4
Cocchi E, Nestor JG, Gharavi AG. Clinical Genetic Screening in Adult Patients with Kidney Disease. Clin J Am Soc Nephrol 2020; 15:1497-1510. [PMID: 32646915 PMCID: PMC7536756 DOI: 10.2215/cjn.15141219] [Citation(s) in RCA: 42] [Impact Index Per Article: 10.5] [Indexed: 12/14/2022]
Abstract
Expanded accessibility of genetic sequencing technologies, such as chromosomal microarray and massively parallel sequencing approaches, is changing the management of hereditary kidney diseases. Genetic causes account for a substantial proportion of pediatric kidney disease cases, and with increased utilization of diagnostic genetic testing in nephrology, they are now also detected at appreciable frequencies in adult populations. Establishing a molecular diagnosis can have many potential benefits for patient care, such as guiding treatment, enabling familial testing, and providing deeper insights into the molecular pathogenesis of kidney diseases. Today, with wider clinical use of genetic testing as part of the diagnostic evaluation, nephrologists have the challenging task of selecting the most suitable genetic test for each patient and then applying the results in the appropriate clinical context. This review is intended to familiarize nephrologists with the various technical, logistical, and ethical considerations accompanying the increasing utilization of genetic testing in nephrology care.
Affiliation(s)
- Enrico Cocchi
- Division of Nephrology and Center for Precision Medicine and Genomics, Department of Medicine, Columbia University, New York, New York
- Department of Pediatrics, Università degli Studi di Torino, Torino, Italy
- Jordan Gabriela Nestor
- Division of Nephrology and Center for Precision Medicine and Genomics, Department of Medicine, Columbia University, New York, New York
- Ali G Gharavi
- Division of Nephrology and Center for Precision Medicine and Genomics, Department of Medicine, Columbia University, New York, New York
- Institute of Genomic Medicine, Columbia University, New York, New York
5
Assessing the performance of methods for copy number aberration detection from single-cell DNA sequencing data. PLoS Comput Biol 2020; 16:e1008012. [PMID: 32658894 PMCID: PMC7377518 DOI: 10.1371/journal.pcbi.1008012] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 11/07/2019] [Revised: 07/23/2020] [Accepted: 06/03/2020] [Indexed: 12/22/2022]
Abstract
Single-cell DNA sequencing technologies are enabling the study of mutations and their evolutionary trajectories in cancer. Somatic copy number aberrations (CNAs) have been implicated in the development and progression of various types of cancer. A wide array of methods for CNA detection has been either developed specifically for or adapted to single-cell DNA sequencing data. Understanding the strengths and limitations that are unique to each of these methods is very important for obtaining accurate copy number profiles from single-cell DNA sequencing data. We benchmarked three widely used methods (Ginkgo, HMMcopy, and CopyNumber) on simulated as well as real datasets. To facilitate this, we developed a novel simulator of single-cell genome evolution in the presence of CNAs. Furthermore, to assess performance on empirical data where the ground truth is unknown, we introduce a phylogeny-based measure for identifying potentially erroneous inferences. While single-cell DNA sequencing is very promising for elucidating and understanding CNAs, our findings show that even the best existing method does not exceed 80% accuracy. New methods that significantly improve upon the accuracy of these three methods are needed. Furthermore, with the large datasets being generated, the methods must be computationally efficient.
Copy number aberrations, or CNAs, refer to evolutionary events that act on cancer genomes by deleting segments of the genomes or introducing new copies of existing segments. These events have been implicated in various types of cancer; consequently, their accurate detection could shed light on the initiation and progression of tumors, as well as on the development of potential targeted therapeutics. Single-cell DNA sequencing technologies are now producing the type of data that would allow such detection at the resolution of individual cells. However, to achieve this detection task, methods have to implement several steps of “data wrangling” and deal with technical artifacts. In this work, we benchmarked three widely used methods for CNA detection from single-cell DNA data, namely Ginkgo, HMMcopy, and CopyNumber. To accomplish this study, we developed a novel simulator and devised a phylogeny-based measure of potentially erroneous CNA calls. We find that none of these methods has high accuracy, and all of them can be computationally very demanding. These findings call for the development of more accurate and more efficient methods for CNA detection from single-cell DNA data.
6
Karim MR, Rahman A, Jares JB, Decker S, Beyan O. A snapshot neural ensemble method for cancer-type prediction based on copy number variations. Neural Comput Appl 2019. [DOI: 10.1007/s00521-019-04616-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 10/25/2022]
Abstract
An accurate diagnosis and prognosis for cancer are specific to patients with particular cancer types and molecular traits, which need to be addressed carefully. The discovery of important biomarkers is becoming an important step toward understanding the molecular mechanisms of carcinogenesis, in which genomics data and clinical outcomes need to be analyzed before making any clinical decision. Copy number variations (CNVs) are found to be associated with the risk of individual cancers and hence can be used to reveal genetic predispositions before cancer develops. In this paper, we collect CNV data on about 8000 cancer patients covering 14 different cancer types from The Cancer Genome Atlas. Then, two different sparse representations of CNVs based on 578 oncogenes and 20,308 protein-coding genes, including genomic deletions and duplications across the samples, are prepared. Then, we train Conv-LSTM and convolutional autoencoder (CAE) networks using both representations and create snapshot models. While the Conv-LSTM can capture locally and globally important features, the CAE can utilize unsupervised pretraining to initialize the weights in the subsequent convolutional layers against the sparsity. A model averaging ensemble (MAE) is then applied to combine the snapshot models in order to make a single prediction. Finally, we identify the most significant CNV biomarkers using guided-gradient class activation map plus (GradCAM++) and rank the top genes for different cancer types. Results covering several experiments show fairly high prediction accuracies for the majority of cancer types. In particular, using protein-coding genes, the Conv-LSTM and CAE networks can predict cancer types correctly in at least 72.96% and 76.77% of the cases, respectively. In contrast, using oncogenes gives moderately higher accuracies of 74.25% and 78.32%, whereas the snapshot model based on MAE shows an overall accuracy improvement of 2.5%.
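The model averaging step described above can be sketched in a few lines, assuming each snapshot model outputs a vector of class probabilities for a sample and that the ensemble takes their unweighted mean. The abstract does not state how the averaging is weighted, so the equal weighting is an assumption. The labels, the probability values, and the function names `average_ensemble` and `predict` are hypothetical.

```python
def average_ensemble(snapshot_probs):
    """Average the class-probability vectors produced by several snapshot models."""
    n_models = len(snapshot_probs)
    n_classes = len(snapshot_probs[0])
    return [sum(p[c] for p in snapshot_probs) / n_models for c in range(n_classes)]


def predict(snapshot_probs, labels):
    """Return the label with the highest averaged probability, plus the averages."""
    avg = average_ensemble(snapshot_probs)
    best = max(range(len(avg)), key=avg.__getitem__)
    return labels[best], avg


# Hypothetical cancer-type labels and per-snapshot probabilities for one sample.
labels = ["type_A", "type_B", "type_C"]
snapshots = [
    [0.50, 0.40, 0.10],  # this snapshot alone would predict type_A
    [0.30, 0.60, 0.10],
    [0.35, 0.55, 0.10],
]

label, avg = predict(snapshots, labels)
print(label, [round(p, 3) for p in avg])
```

Here the first snapshot disagrees with the other two, and averaging lets the majority evidence decide the single prediction.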
7
Keel BN, Nonneman DJ, Lindholm-Perry AK, Oliver WT, Rohrer GA. A Survey of Copy Number Variation in the Porcine Genome Detected From Whole-Genome Sequence. Front Genet 2019; 10:737. [PMID: 31475038 PMCID: PMC6707380 DOI: 10.3389/fgene.2019.00737] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 02/19/2019] [Accepted: 07/12/2019] [Indexed: 12/11/2022]
Abstract
Copy number variations (CNVs) are gains and losses of large regions of genomic sequence between individuals of a species. Although CNVs have been associated with various phenotypic traits in humans and other species, the extent to which CNVs impact phenotypic variation remains unclear. In swine, as in many other species, relatively little is understood about the frequency of CNVs in the genome or about their sizes, locations, and other chromosomal properties. In this work, we identified and characterized CNVs using whole-genome sequence from 240 members of an intensely phenotyped experimental swine herd at the U.S. Meat Animal Research Center (USMARC). These animals included all 24 of the purebred founding boars (12 Duroc and 12 Landrace), 48 of the founding Yorkshire-Landrace composite sows, 109 composite animals from generations 4 through 9, 29 composite animals from generation 15, and 30 purebred industry boars (15 Landrace and 15 Yorkshire) used as sires in generations 10 through 15. Using a combination of split reads, paired-end mapping, and read depth approaches, we identified a total of 3,538 copy number variable regions (CNVRs), including 1,820 novel CNVRs not reported in previous studies. The CNVRs covered 0.94% of the porcine genome and overlapped 1,401 genes. Gene ontology analysis identified that CNV-overlapped genes were enriched for functions related to organism development. Additionally, CNVRs overlapped with many known quantitative trait loci (QTL). In particular, analysis of QTL previously identified in the USMARC herd showed that CNVRs overlapped most often with QTL for reproductive traits, such as age of puberty and ovulation rate, and that CNVRs were significantly enriched for reproductive QTL.
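A figure such as "CNVRs covered 0.94% of the genome" comes from summing region lengths and dividing by genome length. The sketch below shows that arithmetic under one common assumption, namely that CNVRs are formed by merging overlapping CNV calls across animals; the abstract does not say how the regions were built. Coordinates are treated as half-open intervals, and the calls, the genome length, and the function names `merge_into_cnvrs` and `percent_covered` are hypothetical.

```python
def merge_into_cnvrs(cnv_calls):
    """Merge overlapping or adjacent (chrom, start, end) calls into regions.

    Intervals are half-open: start is included, end is excluded.
    """
    regions = []
    for chrom, start, end in sorted(cnv_calls):
        if regions and regions[-1][0] == chrom and start <= regions[-1][2]:
            last = regions[-1]
            regions[-1] = (chrom, last[1], max(last[2], end))
        else:
            regions.append((chrom, start, end))
    return regions


def percent_covered(regions, genome_length):
    """Percentage of the genome covered by non-overlapping regions."""
    covered = sum(end - start for _, start, end in regions)
    return 100.0 * covered / genome_length


# Hypothetical CNV calls pooled from several animals.
calls = [
    ("chr1", 100, 500),
    ("chr1", 400, 900),
    ("chr1", 2000, 2500),
    ("chr2", 100, 300),
    ("chr2", 300, 450),
]

cnvrs = merge_into_cnvrs(calls)
coverage = percent_covered(cnvrs, genome_length=100_000)
print(cnvrs)
print(f"{coverage:.2f}% of the toy genome")
```

Merging first matters: summing the raw calls would count the overlap between the first two chr1 calls twice and overstate the coverage.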
Affiliation(s)
- Brittney N Keel
- USDA, ARS, U.S. Meat Animal Research Center, Clay Center, NE, United States
- Dan J Nonneman
- USDA, ARS, U.S. Meat Animal Research Center, Clay Center, NE, United States
- William T Oliver
- USDA, ARS, U.S. Meat Animal Research Center, Clay Center, NE, United States
- Gary A Rohrer
- USDA, ARS, U.S. Meat Animal Research Center, Clay Center, NE, United States