1
|
Yuan T, Dong J, Jia B, Jiang H, Zhao Z, Zhou M. DTDHM: detection of tandem duplications based on hybrid methods using next-generation sequencing data. PeerJ 2024; 12:e17748. [PMID: 39076774 PMCID: PMC11285389 DOI: 10.7717/peerj.17748] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2024] [Accepted: 06/24/2024] [Indexed: 07/31/2024] Open
Abstract
Background Tandem duplication (TD) is a common and important type of structural variation in the human genome. TDs have been shown to play an essential role in many diseases, including cancer. However, it is difficult to accurately detect TDs due to the uneven distribution of reads and the inherent complexity of next-generation sequencing (NGS) data. Methods This article proposes a method called DTDHM (detection of tandem duplications based on hybrid methods), which utilizes NGS data to detect TDs in a single sample. DTDHM builds a pipeline that integrates read depth (RD), split read (SR), and paired-end mapping (PEM) signals. To solve the problem of uneven distribution of normal and abnormal samples, DTDHM uses the K-nearest neighbor (KNN) algorithm for multi-feature classification prediction. Then, the qualified split reads and discordant reads are extracted and analyzed to achieve accurate localization of variation sites. This article compares DTDHM with three other methods on 450 simulated datasets and five real datasets. Results In 450 simulated data samples, DTDHM consistently maintained the highest F1-score. The average F1-score of DTDHM, SVIM, TARDIS, and TIDDIT were 80.0%, 56.2%, 43.4%, and 67.1%, respectively. The F1-score of DTDHM had a small variation range and its detection effect was the most stable and 1.2 times that of the suboptimal method. Most of the boundary biases of DTDHM fluctuated around 20 bp, and its boundary deviation detection ability was better than TARDIS and TIDDIT. In real data experiments, five real sequencing samples (NA19238, NA19239, NA19240, HG00266, and NA12891) were used to test DTDHM. The results showed that DTDHM had the highest overlap density score (ODS) and F1-score of the four methods. Conclusions Compared with the other three methods, DTDHM achieved excellent results in terms of sensitivity, precision, F1-score, and boundary bias. These results indicate that DTDHM can be used as a reliable tool for detecting TDs from NGS data, especially in the case of low coverage depth and tumor purity samples.
Collapse
Affiliation(s)
- Tianting Yuan
- School of Computer Science and Technology, Liaocheng University, Liaocheng, China
| | - Jinxin Dong
- School of Computer Science and Technology, Liaocheng University, Liaocheng, China
| | - Baoxian Jia
- School of Computer Science and Technology, Liaocheng University, Liaocheng, China
| | - Hua Jiang
- School of Computer Science and Technology, Liaocheng University, Liaocheng, China
| | - Zuyao Zhao
- Orthopedics Department, Liaocheng People’s Hospital, Liaocheng, China
| | - Mengjiao Zhou
- School of Computer Science and Technology, Liaocheng University, Liaocheng, China
| |
Collapse
|
2
|
Zhang Y, Liu W, Duan J. On the core segmentation algorithms of copy number variation detection tools. Brief Bioinform 2024; 25:bbae022. [PMID: 38340093 PMCID: PMC10858679 DOI: 10.1093/bib/bbae022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2023] [Revised: 10/26/2023] [Indexed: 02/12/2024] Open
Abstract
Shotgun sequencing is a high-throughput method used to detect copy number variants (CNVs). Although there are numerous CNV detection tools based on shotgun sequencing, their quality varies significantly, leading to performance discrepancies. Therefore, we conducted a comprehensive analysis of next-generation sequencing-based CNV detection tools over the past decade. Our findings revealed that the majority of mainstream tools employ similar detection rationale: calculates the so-called read depth signal from aligned sequencing reads and then segments the signal by utilizing either circular binary segmentation (CBS) or hidden Markov model (HMM). Hence, we compared the performance of those two core segmentation algorithms in CNV detection, considering varying sequencing depths, segment lengths and complex types of CNVs. To ensure a fair comparison, we designed a parametrical model using mainstream statistical distributions, which allows for pre-excluding bias correction such as guanine-cytosine (GC) content during the preprocessing step. The results indicate the following key points: (1) Under ideal conditions, CBS demonstrates high precision, while HMM exhibits a high recall rate. (2) For practical conditions, HMM is advantageous at lower sequencing depths, while CBS is more competitive in detecting small variant segments compared to HMM. (3) In case involving complex CNVs resembling real sequencing, HMM demonstrates more robustness compared with CBS. (4) When facing large-scale sequencing data, HMM costs less time compared with the CBS, while their memory usage is approximately equal. This can provide an important guidance and reference for researchers to develop new tools for CNV detection.
Collapse
Affiliation(s)
- Yibo Zhang
- Key Laboratory of Biomedical Information Engineering of Ministry of Education and Department of Biomedical Engineering, School of Life Science and Technology, Xi’an Jiaotong University, Xi’an, China
| | - Wenyu Liu
- Key Laboratory of Biomedical Information Engineering of Ministry of Education and Department of Biomedical Engineering, School of Life Science and Technology, Xi’an Jiaotong University, Xi’an, China
| | - Junbo Duan
- Key Laboratory of Biomedical Information Engineering of Ministry of Education and Department of Biomedical Engineering, School of Life Science and Technology, Xi’an Jiaotong University, Xi’an, China
| |
Collapse
|
3
|
Sinha R, Pal RK, De RK. ENLIGHTENMENT: A Scalable Annotated Database of Genomics and NGS-Based Nucleotide Level Profiles. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024; 21:155-168. [PMID: 38055361 DOI: 10.1109/tcbb.2023.3340067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/08/2023]
Abstract
The revolution in sequencing technologies has enabled human genomes to be sequenced at a very low cost and time leading to exponential growth in the availability of whole-genome sequences. However, the complete understanding of our genome and its association with cancer is a far way to go. Researchers are striving hard to detect new variants and find their association with diseases, which further gives rise to the need for aggregation of this Big Data into a common standard scalable platform. In this work, a database named Enlightenment has been implemented which makes the availability of genomic data integrated from eight public databases, and DNA sequencing profiles of H. sapiens in a single platform. Annotated results with respect to cancer specific biomarkers, pharmacogenetic biomarkers and its association with variability in drug response, and DNA profiles along with novel copy number variants are computed and stored, which are accessible through a web interface. In order to overcome the challenge of storage and processing of NGS technology-based whole-genome DNA sequences, Enlightenment has been extended and deployed to a flexible and horizontally scalable database HBase, which is distributed over a hadoop cluster, which would enable the integration of other omics data into the database for enlightening the path towards eradication of cancer.
Collapse
|
4
|
Li C, Fan S, Zhao H, Liu X. CNV-FB: A Feature bagging strategy-based approach to detect copy number variants from NGS data. J Bioinform Comput Biol 2023; 21:2350026. [PMID: 38212874 DOI: 10.1142/s0219720023500269] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/13/2024]
Abstract
Copy number variation (CNV), as a type of genomic structural variation, accounts for a large proportion of structural variation and is related to the pathogenesis and susceptibility to some human diseases, playing an important role in the development and change of human diseases. The development of next-generation sequencing technology (NGS) provides strong support for the design of CNV detection algorithms. Although a large number of methods have been developed to detect CNVs using NGS data, it is still considered a difficult problem to detect CNVs with low purity and coverage. In this paper, a new calculation method CNV-FB is proposed to detect CNVs from NGS data. The core idea of CNV-FB is to randomly sample the read depth values of the genome fragment, and then each sample is individually detected for outliers, and finally combined into a final outlier score. The CNV-FB method was applied to simulation data and real data experiments and compared with the other five methods of the same type. The results show that the CNV-FB method has a better detection effect than other methods. Therefore, the CNV-FB method may be an effective algorithm for detecting genomic mutations.
Collapse
Affiliation(s)
- Chengyou Li
- School of Computer Science, Liaocheng University, Liaocheng 252000, P. R. China
| | - Shiqiang Fan
- School of Computer Science, Liaocheng University, Liaocheng 252000, P. R. China
| | - Haiyong Zhao
- School of Computer Science, Liaocheng University, Liaocheng 252000, P. R. China
| | - Xiaotong Liu
- School of Agronomy and Agricultural Engineering, Liaocheng University, Liaocheng 252000, P. R. China
| |
Collapse
|
5
|
Park J, Cho YG, Kim JK, Kim HH. STS and PUDP Deletion Identified by Targeted Panel Sequencing with CNV Analysis in X-Linked Ichthyosis: A Case Report and Literature Review. Genes (Basel) 2023; 14:1925. [PMID: 37895274 PMCID: PMC10606178 DOI: 10.3390/genes14101925] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2023] [Revised: 10/05/2023] [Accepted: 10/07/2023] [Indexed: 10/29/2023] Open
Abstract
X-linked recessive ichthyosis (XLI) is clinically characterized by dark brown, widespread dryness with polygonal scales. We describe the identification of STS and PUDP deletions using targeted panel sequencing combined with copy-number variation (CNV) analysis in XLI. A 9-month-old infant was admitted for genetic counseling. Since the second day after birth, the infant's skin tended to be dry and polygonal scales had accumulated over the abdomen and upper extremities. The infant's maternal uncle and brother (who had also exhibited similar skin symptoms from birth) presented with polygonal scales on their trunks. CNV analysis revealed a hemizygous deletion spanning 719.3 Kb on chromosome Xp22 (chrX:7,108,996-7,828,312), which included a segment of the STS gene and exhibited a Z ratio of -2 in the proband. Multiplex ligation-dependent probe amplification (MLPA) confirmed this interstitial Xp22.31 deletion. Our report underscores the importance of implementing CNV screening techniques, including sequencing data analysis and gene dosage assays such as MLPA, to detect substantial deletions that encompass the STS gene region of Xq22 in individuals suspected of having XLI.
Collapse
Affiliation(s)
- Joonhong Park
- Department of Laboratory Medicine, Jeonbuk National University Medical School and Hospital, Jeonju 54907, Republic of Korea; (J.P.); (Y.G.C.)
| | - Yong Gon Cho
- Department of Laboratory Medicine, Jeonbuk National University Medical School and Hospital, Jeonju 54907, Republic of Korea; (J.P.); (Y.G.C.)
| | - Jin Kyu Kim
- Department of Pediatrics, Jeonbuk National University Medical School and Hospital, Jeonju 54907, Republic of Korea;
| | - Hyun Ho Kim
- Department of Pediatrics, Jeonbuk National University Medical School and Hospital, Jeonju 54907, Republic of Korea;
- Research Institute of Clinical Medicine, Jeonbuk National University-Biomedical Research Institute, Jeonbuk National University Hospital, Jeonju 54907, Republic of Korea
| |
Collapse
|
6
|
Liu G, Yang H, He Z. Detection of copy number variations based on a local distance using next-generation sequencing data. Front Genet 2023; 14:1147761. [PMID: 37811148 PMCID: PMC10556732 DOI: 10.3389/fgene.2023.1147761] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/01/2023] [Accepted: 09/14/2023] [Indexed: 10/10/2023] Open
Abstract
As one of the main types of structural variation in the human genome, copy number variation (CNV) plays an important role in the occurrence and development of human cancers. Next-generation sequencing (NGS) technology can provide base-level resolution, which provides favorable conditions for the accurate detection of CNVs. However, it is still a very challenging task to accurately detect CNVs from cancer samples with different purity and low sequencing coverage. Local distance-based CNV detection (LDCNV), an innovative computational approach to predict CNVs using NGS data, is proposed in this work. LDCNV calculates the average distance between each read depth (RD) and its k nearest neighbors (KNNs) to define the distance of KNNs of each RD, and the average distance between the KNNs for each RD to define their internal distance. Based on the above definitions, a local distance score is constructed using the ratio between the distance of KNNs and the internal distance of KNNs for each RD. The local distance scores are used to fit a normal distribution to evaluate the significance level of each RDS, and then use the hypothesis test method to predict the CNVs. The performance of the proposed method is verified with simulated and real data and compared with several popular methods. The experimental results show that the proposed method is superior to various other techniques. Therefore, the proposed method can be helpful for cancer diagnosis and targeted drug development.
Collapse
Affiliation(s)
- Guojun Liu
- School of Mathematics, Xi’an University of Finance and Economics, Xi’an, China
| | - Hongzhi Yang
- Department of Radiology, XD Group Hospital, Xi’an, China
| | - Zongzhen He
- School of Mathematics, Xi’an University of Finance and Economics, Xi’an, China
| |
Collapse
|
7
|
Xie K, Liu K, Alvi HAK, Chen Y, Wang S, Yuan X. KNNCNV: A K-Nearest Neighbor Based Method for Detection of Copy Number Variations Using NGS Data. Front Cell Dev Biol 2022; 9:796249. [PMID: 35004691 PMCID: PMC8728060 DOI: 10.3389/fcell.2021.796249] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/16/2021] [Accepted: 11/23/2021] [Indexed: 11/19/2022] Open
Abstract
Copy number variation (CNV) is a well-known type of genomic mutation that is associated with the development of human cancer diseases. Detection of CNVs from the human genome is a crucial step for the pipeline of starting from mutation analysis to cancer disease diagnosis and treatment. Next-generation sequencing (NGS) data provides an unprecedented opportunity for CNVs detection at the base-level resolution, and currently, many methods have been developed for CNVs detection using NGS data. However, due to the intrinsic complexity of CNVs structures and NGS data itself, accurate detection of CNVs still faces many challenges. In this paper, we present an alternative method, called KNNCNV (K-Nearest Neighbor based CNV detection), for the detection of CNVs using NGS data. Compared to current methods, KNNCNV has several distinctive features: 1) it assigns an outlier score to each genome segment based solely on its first k nearest-neighbor distances, which is not only easy to extend to other data types but also improves the power of discovering CNVs, especially the local CNVs that are likely to be masked by their surrounding regions; 2) it employs the variational Bayesian Gaussian mixture model (VBGMM) to transform these scores into a series of binary labels without a user-defined threshold. To evaluate the performance of KNNCNV, we conduct both simulation and real sequencing data experiments and make comparisons with peer methods. The experimental results show that KNNCNV could derive better performance than others in terms of F1-score.
Collapse
Affiliation(s)
- Kun Xie
- School of Computer Science and Technology, Xidian University, Xi'an, China.,Hangzhou Institute of Technology, Xidian University, Hangzhou, China
| | - Kang Liu
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Haque A K Alvi
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Yuehui Chen
- Shandong Provincial Key Laboratory of Network Based Intelligent Computing, University of Jinan, Jinan, China
| | - Shuzhen Wang
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Xiguo Yuan
- School of Computer Science and Technology, Xidian University, Xi'an, China.,Hangzhou Institute of Technology, Xidian University, Hangzhou, China
| |
Collapse
|
8
|
Sinha R, Pal RK, De RK. GenSeg and MR-GenSeg: A Novel Segmentation Algorithm and its Parallel MapReduce Based Approach for Identifying Genomic Regions With Copy Number Variations. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:443-454. [PMID: 32750860 DOI: 10.1109/tcbb.2020.3000661] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Identifying intragenic as well as intergenic sequences of the DNA, having structural alterations, is a significantly important research area, since this may be the root cause of many neurological and autoimmune diseases, including cancer. Working with whole genome NGS data has provided a new insight in this regard, but has lead to huge explosion of data that is growing exponentially. Hence, the challenges lie in efficient means of storage and processing this big data. In this study, we have developed a novel segmentation algorithm, called GenSeg, and its parallel MapReduce based algorithm, called MR-GenSeg, for detecting copy number variations. In order to annotate CNVs (variants), segments formed by GenSeg/MR-GenSeg have been represented in a novel way using a binary tree, where each node is a CNV event. GenSeg considers each position specific data of whole genome DNA sequence, so that precise identification of breakpoints is possible. GenSeg/MR-GenSeg has been compared with twelve popular CNV detection algorithms, where it has outperformed the others in terms of sensitivity, and has achieved a good F-score value. MR-GenSeg has excelled in terms of SpeedUp, when compared with these algorithms. The effect of CNVs on immunoglobulin (IG) genes has also been analysed in this study. Availability: The source codes are available at https://github.com/rituparna-sinha/MapReduce-GENSEG.
Collapse
|
9
|
Yuan X, Ma C, Zhao H, Yang L, Wang S, Xi J. STIC: Predicting Single Nucleotide Variants and Tumor Purity in Cancer Genome. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:2692-2701. [PMID: 32086221 DOI: 10.1109/tcbb.2020.2975181] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Single nucleotide variant (SNV) plays an important role in cellular proliferation and tumorigenesis in various types of human cancer. Next-generation sequencing (NGS) has provided high-throughput data at an unprecedented resolution to predict SNVs. Currently, there exist many computational methods for either germline or somatic SNV discovery from NGS data, but very few of them are versatile enough to adapt to any situations. In the absence of matched normal samples, the prediction of somatic SNVs from single-tumor samples becomes considerably challenging, especially when the tumor purity is unknown. Here, we propose a new approach, STIC, to predict somatic SNVs and estimate tumor purity from NGS data without matched normal samples. The main features of STIC include: (1) extracting a set of SNV-relevant features on each site and training the BP neural network algorithm on the features to predict SNVs; (2) creating an iterative process to distinguish somatic SNVs from germline ones by disturbing allele frequency; and (3) establishing a reasonable relationship between tumor purity and allele frequencies of somatic SNVs to accurately estimate the purity. We quantitatively evaluate the performance of STIC on both simulation and real sequencing datasets, the results of which indicate that STIC outperforms competing methods.
Collapse
|
10
|
Huang T, Li J, Jia B, Sang H. CNV-MEANN: A Neural Network and Mind Evolutionary Algorithm-Based Detection of Copy Number Variations From Next-Generation Sequencing Data. Front Genet 2021; 12:700874. [PMID: 34484298 PMCID: PMC8415314 DOI: 10.3389/fgene.2021.700874] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/27/2021] [Accepted: 07/19/2021] [Indexed: 11/20/2022] Open
Abstract
Copy number variation (CNV), is defined as repetitions or deletions of genomic segments of 1 Kb to 5 Mb, and is a major trigger for human disease. The high-throughput and low-cost characteristics of next-generation sequencing technology provide the possibility of the detection of CNVs in the whole genome, and also greatly improve the clinical practicability of next-generation sequencing (NGS) testing. However, current methods for the detection of CNVs are easily affected by sequencing and mapping errors, and uneven distribution of reads. In this paper, we propose an improved approach, CNV-MEANN, for the detection of CNVs, involving changing the structure of the neural network used in the MFCNV method. This method has three differences relative to the MFCNV method: (1) it utilizes a new feature, mapping quality, to replace two features in MFCNV, (2) it considers the influence of the loss categories of CNV on disease prediction, and refines the output structure, and (3) it uses a mind evolutionary algorithm to optimize the backpropagation (neural network) neural network model, and calculates individual scores for each genome bin to predict CNVs. Using both simulated and real datasets, we tested the performance of CNV-MEANN and compared its performance with those of seven widely used CNV detection methods. Experimental results demonstrated that the CNV-MEANN approach outperformed other methods with respect to sensitivity, precision, and F1-score. The proposed method was able to detect many CNVs that other approaches could not, and it reduced the boundary bias. CNV-MEANN is expected to be an effective method for the analysis of changes in CNVs in the genome.
Collapse
Affiliation(s)
- Tihao Huang
- School of Computer Science and Technology, Liaocheng University, Liaocheng, China
| | - Junqing Li
- School of Computer Science and Technology, Liaocheng University, Liaocheng, China
| | - Baoxian Jia
- School of Computer Science and Technology, Liaocheng University, Liaocheng, China
| | - Hongyan Sang
- School of Computer Science and Technology, Liaocheng University, Liaocheng, China
| |
Collapse
|
11
|
Yuan X, Li J, Bai J, Xi J. A Local Outlier Factor-Based Detection of Copy Number Variations From NGS Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1811-1820. [PMID: 31880558 DOI: 10.1109/tcbb.2019.2961886] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Copy number variation (CNV) is a major type of genomic structural variations that play an important role in human disorders. Next generation sequencing (NGS) has fueled the advancement in algorithm design to detect CNVs at base-pair resolution. However, accurate detection of CNVs of low amplitudes remains a challenging task. This paper proposes a new computational method, CNV-LOF, to identify CNVs of full-range amplitudes from NGS data. CNV-LOF is distinctly different from traditional methods, which mainly consider aberrations from a global perspective and rely on some assumed distribution of NGS read depths. In contrast, CNV-LOF takes a local view on the read depths and assigns an outlier factor to each genome segment. With the outlier factor profile, CNV-LOF uses a boxplot procedure to declare CNVs without the reliance of any distribution assumptions. Simulation experiments indicate that CNV-LOF outperforms five existing methods with respect to F1-measure, sensitivity, and precision. CNV-LOF is further validated on real sequencing samples, yielding highly consistent results with peer methods. CNV-LOF is able to detect CNVs of low and moderate amplitudes where the other existing methods fail, and it is expected to become a routine approach for the discovery of novel CNVs on whole sequencing genome.
Collapse
|
12
|
Yuan X, Xu X, Zhao H, Duan J. ERINS: Novel Sequence Insertion Detection by Constructing an Extended Reference. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1893-1901. [PMID: 31751246 DOI: 10.1109/tcbb.2019.2954315] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Next generation sequencing technology has led to the development of methods for the detection of novel sequence insertions (nsINS). Multiple signatures from short reads are usually extracted to improve nsINS detection performance. However, characterization of nsINSs larger than the mean insert size is still challenging. This article presents a new method, ERINS, to detect nsINS contents and genotypes of full spectrum range size. It integrates the features of structural variations and mapping states of split reads to find nsINS breakpoints, and then adopts a left-most mapping strategy to infer nsINS content by iteratively extending the standard reference at each breakpoint. Finally, it realigns all reads to the extended reference and infers nsINS genotypes through statistical testing on read counts. We test and validate the performance of ERINS on simulation and real sequencing datasets. The simulation experimental results demonstrate that it outperforms several peer methods with respect to sensitivity and precision. The real data application indicates that ERINS obtains high consistent results with those of previously reported and detects nsINSs over 200 base pairs that many other methods fail. In conclusion, ERINS can be used as a supplement to existing tools and will become a routine approach for characterizing nsINSs.
Collapse
|
13
|
Zhao HY, Li Q, Tian Y, Chen YH, Alvi HAK, Yuan XG. CIRCNV: Detection of CNVs Based on a Circular Profile of Read Depth from Sequencing Data. BIOLOGY 2021; 10:biology10070584. [PMID: 34202028 PMCID: PMC8301091 DOI: 10.3390/biology10070584] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/29/2021] [Revised: 06/10/2021] [Accepted: 06/21/2021] [Indexed: 12/29/2022]
Abstract
Simple Summary In this study, we propose a copy number variation (CNV) detection method called CIRCNV, which is based on a circular profile of the read depth from sequencing data. The proposed method is an extended version of our previously developed method CNV-LOF. The main difference of CIRCNV from CNV-LOF lies in its two new features: (1) it transfers the read depth profile from a line shape to a circular shape via a polar coordinate transformation to generate a meaningful two-dimensional dataset for CNV analysis and promote fairness between the ends and middle part of the genome, and (2) it performs two rounds of CNV declaration via estimating tumor purity and recovering the truth circular RD profile. We test and evaluate the performance of CIRCNV via conducting simulation studies and real sequencing tumor sample applications. The experimental results show that CIRCNV outperforms peer methods with respect to sensitivity, precision, and the F1-score. The experiments prove that the proposed method is a reliable and effective tool in the field of variation analysis of tumor genomes. Abstract Copy number variation (CNV) is a common type of structural variation in the human genome. Accurate detection of CNVs from tumor genomes can provide crucial information for the study of tumor genesis and cancer precision diagnosis. However, the contamination of normal genomes in tumor genomes and the crude profiles of the read depth make such a task difficult. In this paper, we propose an alternative approach, called CIRCNV, for the detection of CNVs from sequencing data. CIRCNV is an extension of our previously developed method CNV-LOF, which uses local outlier factors to predict CNVs. Comparatively, CIRCNV can be performed on individual tumor samples and has the following two new features: (1) it transfers the read depth profile from a line shape to a circular shape via a polar coordinate transformation, in order to improve the efficiency of the read depth (RD) profile for the detection of CNVs; and (2) it performs a second round of CNV declaration based on the truth circular RD profile, which is recovered by estimating tumor purity. We test and validate the performance of CIRCNV based on simulation and real sequencing data and perform comparisons with several peer methods. The results demonstrate that CIRCNV can obtain superior performance in terms of sensitivity and precision. We expect that our proposed method will be a supplement to existing methods and become a routine tool in the field of variation analysis of tumor genomes.
Collapse
Affiliation(s)
- Hai-Yong Zhao
- School of Computer Science and Technology, Liaocheng University, Liaocheng 252000, China;
| | - Qi Li
- School of Computer Science and Technology, Xidian University, Xi’an 710071, China; (Q.L.); (Y.T.); (H.A.K.A.)
| | - Ye Tian
- School of Computer Science and Technology, Xidian University, Xi’an 710071, China; (Q.L.); (Y.T.); (H.A.K.A.)
| | - Yue-Hui Chen
- Shandong Provincial Key Laboratory of Network Based Intelligent Computing, University of Jinan, Ji’nan 250022, China;
| | - Haque A. K. Alvi
- School of Computer Science and Technology, Xidian University, Xi’an 710071, China; (Q.L.); (Y.T.); (H.A.K.A.)
| | - Xi-Guo Yuan
- School of Computer Science and Technology, Xidian University, Xi’an 710071, China; (Q.L.); (Y.T.); (H.A.K.A.)
- Correspondence:
| |
Collapse
|
14
|
Yuan X, Yu J, Xi J, Yang L, Shang J, Li Z, Duan J. CNV_IFTV: An Isolation Forest and Total Variation-Based Detection of CNVs from Short-Read Sequencing Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:539-549. [PMID: 31180897 DOI: 10.1109/tcbb.2019.2920889] [Citation(s) in RCA: 27] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
Accurate detection of copy number variations (CNVs) from short-read sequencing data is challenging due to the uneven distribution of reads and the unbalanced amplitudes of gains and losses. The direct use of read depths to measure CNVs tends to limit performance. Thus, robust computational approaches equipped with appropriate statistics are required to detect CNV regions and boundaries. This study proposes a new method called CNV_IFTV to address this need. CNV_IFTV assigns an anomaly score to each genome bin through a collection of isolation trees. The trees are trained based on isolation forest algorithm through conducting subsampling from measured read depths. With the anomaly scores, CNV_IFTV uses a total variation model to smooth adjacent bins, leading to a denoised score profile. Finally, a statistical model is established to test the denoised scores for calling CNVs. CNV_IFTV is tested on both simulated and real data in comparison to several peer methods. The results indicate that the proposed method outperforms the peer methods. CNV_IFTV is a reliable tool for detecting CNVs from short-read sequencing data even for low-level coverage and tumor purity. The detection results on tumor samples can aid to evaluate known cancer genes and to predict target drugs for disease diagnosis.
Collapse
|
15
|
Wei C, Zhang J, Yuan X, He Z, Liu G, Wu J. NeuroTIS: Enhancing the prediction of translation initiation sites in mRNA sequences via a hybrid dependency network and deep learning framework. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2020.106459] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022]
|
16
|
Zhao H, Huang T, Li J, Liu G, Yuan X. MFCNV: A New Method to Detect Copy Number Variations From Next-Generation Sequencing Data. Front Genet 2020; 11:434. [PMID: 32499814 PMCID: PMC7243272 DOI: 10.3389/fgene.2020.00434] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2020] [Accepted: 04/08/2020] [Indexed: 11/13/2022] Open
Abstract
Copy number variation (CNV) is a very important phenomenon in tumor genomes and plays a significant role in tumor genesis. Accurate detection of CNVs has become a routine and necessary procedure for a deep investigation of tumor cells and diagnosis of tumor patients. Next-generation sequencing (NGS) technique has provided a wealth of data for the detection of CNVs at base-pair resolution. However, such task is usually influenced by a number of factors, including GC-content bias, sequencing errors, and correlations among adjacent positions within CNVs. Although many existing methods have dealt with some of these artifacts by designing their own strategies, there is still a lack of comprehensive consideration of all the factors. In this paper, we propose a new method, MFCNV, for an accurate detection of CNVs from NGS data. Compared with existing methods, the characteristics of the proposed method include the following: (1) it makes a full consideration of the intrinsic correlations among adjacent positions in the genome to be analyzed, (2) it calculates read depth, GC-content bias, base quality, and correlation value for each genome bin and combines them as multiple features for the evaluation of genome bins, and (3) it addresses the joint effect among the factors via training a neural network algorithm for the prediction of CNVs. We test the performance of the MFCNV method by using simulation and real sequencing data and make comparisons with several peer methods. The results demonstrate that our method is superior to other methods in terms of sensitivity, precision, and F1-score and can detect many CNVs that other methods have not discovered. MFCNV is expected to be a complementary tool in the analysis of mutations in tumor genomes and can be extended to be applied to the analysis of single-cell sequencing data.
Collapse
Affiliation(s)
- Haiyong Zhao
- School of Computer Science and Technology, Liaocheng University, Liaocheng, China.,The School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Tihao Huang
- School of Computer Science and Technology, Liaocheng University, Liaocheng, China
| | - Junqing Li
- School of Computer Science and Technology, Liaocheng University, Liaocheng, China
| | - Guojun Liu
- The School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Xiguo Yuan
- The School of Computer Science and Technology, Xidian University, Xi'an, China
| |
Collapse
|
17
|
Yuan X, Li Z, Zhao H, Bai J, Zhang J. Accurate Inference of Tumor Purity and Absolute Copy Numbers From High-Throughput Sequencing Data. Front Genet 2020; 11:458. [PMID: 32425990 PMCID: PMC7205152 DOI: 10.3389/fgene.2020.00458] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2020] [Accepted: 04/14/2020] [Indexed: 02/06/2023] Open
Abstract
Inference of absolute copy numbers in tumor genomes is one of the key points in the study of tumor genesis. However, the mixture of tumor and normal cells poses a big challenge to this task. Accurate estimation of tumor purity (i.e., the fraction of tumor cells) is a necessary step to solve this problem. In this paper, we propose a new approach, AITAC, to accurately infer tumor purity and absolute copy numbers in a tumor sample by using high-throughput sequencing (HTS) data. In contrast to many existing algorithms for estimating tumor purity, which usually rely on pre-detected mutation genotypes (heterogeneity and homogeneity), AITAC just requires read depths (RDs) observed at the regions with copy number losses. AITAC creates a non-linear model to correlate tumor purity, observed and expected RDs. It adopts an exhaustive search strategy to scan tumor purity in a wide range, and chooses the tumor purity that minimizes the deviation between observed RDs and expected ones as the optimal solution. We apply the proposed approach to both simulation and real sequencing data sets and demonstrate its performance by comparing with two classical approaches. AITAC is freely available at https://github.com/BDanalysis/aitac and can be expected to become a useful approach for researchers to analyze copy numbers in cancer genome.
Collapse
Affiliation(s)
- Xiguo Yuan
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Zhe Li
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Haiyong Zhao
- School of Computer Science and Technology, Liaocheng University, Liaocheng, China
| | - Jun Bai
- Department of Medical Oncology, Shaanxi Provincial People's Hospital, Xi'an, China
| | - Junying Zhang
- School of Computer Science and Technology, Xidian University, Xi'an, China
| |
Collapse
|
18
|
SM-RCNV: a statistical method to detect recurrent copy number variations in sequenced samples. Genes Genomics 2019; 41:529-536. [PMID: 30779024 DOI: 10.1007/s13258-019-00788-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2018] [Accepted: 01/21/2019] [Indexed: 12/13/2022]
Abstract
BACKGROUND Copy number variation (CNV) is an important form of genomic structural variation and is linked to dozens of human diseases. Using next-generation sequencing (NGS) data and developing computational methods to characterize such structural variants is significant for understanding the mechanisms of diseases. OBJECTIVE The objective of this study is to develop a new statistical method of detection recurrent CNVs across multiple samples from genomic sequences. METHODS A statistical method is carried out to detect recurrent CNVs, referred to as SM-RCNV. This method uses a statistic associated with each location by combining the frequency of variation at one location across whole samples and the correlation among consecutive locations. The weights of the frequency and correlation are trained using real datasets with known CNVs. P-value is assessed for each location on the genome by permutation testing. RESULTS Compared with six peer methods, SM-RCNV outperforms the peer methods under receiver operating characteristic curves. SM-RCNV successfully identifies many consistent recurrent CNVs, most of which are known to be of biological significance and associated with diseased genes. The validation rate of SM-RCNV in the CEU call set and YRI call set with Database of Genomic Variants are 258/328 (79%) and (157/309) 51%, respectively. CONCLUSION SM-RCNV is a well-grounded statistical framework for detecting recurrent CNVs from multiple genomic sequences, providing valuable information to study genomes in human diseases. The source code is freely available at https://sourceforge.net/projects/sm-rcnv/ .
Collapse
|