1
Yu X, Luo X, Cai G, Xiao F. OSCAA: A two-dimensional Gaussian mixture model for copy number variation association analysis. Genet Epidemiol 2024. [PMID: 38533840 DOI: 10.1002/gepi.22558] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2023] [Revised: 01/30/2024] [Accepted: 03/05/2024] [Indexed: 03/28/2024]
Abstract
Copy number variants (CNVs) are prevalent in the human genome and are found to have a profound effect on genomic organization and human diseases. Discovering disease-associated CNVs is critical for understanding the pathogenesis of diseases and aiding their diagnosis and treatment. However, traditional methods for assessing the association between CNVs and disease risks adopt a two-stage strategy, conducting quantitative CNV measurements first and then testing for association. This may lead to biased association estimation and low statistical power, serving as a major barrier in routine genome-wide assessment of such variation. In this article, we developed One-Stage CNV-disease Association Analysis (OSCAA), a flexible algorithm to discover disease-associated CNVs for both quantitative and qualitative traits. OSCAA employs a two-dimensional Gaussian mixture model that is built upon the principal components (PCs) from copy number intensities, accounting for technical biases in CNV detection while simultaneously testing for their effect on outcome traits. In OSCAA, CNVs are identified and their associations with disease risk are evaluated simultaneously in a single step, taking into account the uncertainty of CNV identification in the statistical model. Our simulations demonstrated that OSCAA outperformed the existing one-stage method and traditional two-stage methods by yielding a more accurate estimate of the CNV-disease association, especially for short CNVs or CNVs with weak signals. In conclusion, OSCAA is a powerful and flexible approach for CNV association testing with high sensitivity and specificity, which can be easily applied to different traits and clinical risk predictions.
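As a rough illustration of how a Gaussian mixture can carry CNV-calling uncertainty forward instead of making a hard call, the sketch below computes posterior membership probabilities for a one-dimensional, two-component mixture. It is not the OSCAA model, which is two-dimensional and built on principal components; the component means, standard deviations, and weights are hypothetical values chosen for the example.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def responsibilities(x, means, sds, weights):
    """Posterior probability that x belongs to each mixture component."""
    joint = [w * normal_pdf(x, m, s) for m, s, w in zip(means, sds, weights)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical components: index 0 = normal copy number, index 1 = deletion.
means = [0.0, -1.0]
sds = [0.3, 0.3]
weights = [0.9, 0.1]

# An intensity halfway between the two means is equally likely under both
# components, so the posterior simply reproduces the prior weights.
probs = responsibilities(-0.5, means, sds, weights)
# An intensity at the deletion mean is assigned mostly to the deletion component.
probs_deletion = responsibilities(-1.0, means, sds, weights)
```

A downstream association model could weight each sample by these probabilities rather than by a single hard genotype, which is the general idea behind propagating identification uncertainty.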
Affiliation(s)
- Xuanxuan Yu
  - Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina, Columbia, South Carolina, USA
- Xizhi Luo
  - Data and Statistical Sciences, AbbVie Inc., North Chicago, Illinois, USA
- Guoshuai Cai
  - Department of Surgery, College of Medicine, University of Florida, Gainesville, Florida, USA
- Feifei Xiao
  - Department of Biostatistics, College of Public Health and Health Professions & College of Medicine, University of Florida, Gainesville, Florida, USA
2
Qin F, Cai G, Amos CI, Xiao F. A statistical learning method for simultaneous copy number estimation and subclone clustering with single-cell sequencing data. Genome Res 2024; 34:85-93. [PMID: 38290978 PMCID: PMC10903939 DOI: 10.1101/gr.278098.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2023] [Accepted: 01/08/2024] [Indexed: 02/01/2024]
Abstract
The availability of single-cell sequencing (SCS) enables us to assess intra-tumor heterogeneity and identify cellular subclones without the confounding effect of mixed cells. Copy number aberrations (CNAs) have been commonly used to identify subclones in SCS data using various clustering methods, as cells comprising a subpopulation are found to share a genetic profile. However, currently available methods may generate spurious results (e.g., falsely identified variants) in the procedure of CNA detection, thereby diminishing the accuracy of subclone identification within a large, complex cell population. In this study, we developed a subclone clustering method based on a fused lasso model, referred to as FLCNA, which can simultaneously detect CNAs in single-cell DNA sequencing (scDNA-seq) data. Spike-in simulations were conducted to evaluate the clustering and CNA detection performance of FLCNA, benchmarking it against existing copy number estimation methods (SCOPE, HMMcopy) in combination with commonly used clustering methods. Application of FLCNA to a scDNA-seq data set of breast cancer revealed different genomic variation patterns in neoadjuvant chemotherapy-treated samples and pretreated samples. We show that FLCNA is a practical and powerful method for subclone identification and CNA detection with scDNA-seq data.
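The fused lasso penalizes differences between adjacent coefficients, which favors piecewise-constant fits, a natural shape for copy number profiles. The sketch below evaluates the generic one-dimensional fused lasso objective on toy data. It is not FLCNA's actual model, whose details the abstract does not give; the penalty weight `lam` and the data are hypothetical.

```python
def fused_lasso_objective(y, beta, lam):
    """Squared-error fit plus an L1 penalty on adjacent differences of beta."""
    fit = 0.5 * sum((yi - bi) ** 2 for yi, bi in zip(y, beta))
    penalty = lam * sum(abs(b1 - b0) for b0, b1 in zip(beta, beta[1:]))
    return fit + penalty

# Toy signal with one jump between the third and fourth positions.
y = [0.1, -0.1, 0.0, 1.1, 0.9, 1.0]

# Candidate 1: a piecewise-constant profile with a single breakpoint.
piecewise = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
# Candidate 2: a profile that follows every noisy observation exactly.
interpolating = list(y)

obj_piecewise = fused_lasso_objective(y, piecewise, lam=1.0)
obj_interpolating = fused_lasso_objective(y, interpolating, lam=1.0)
```

With this penalty weight the single-breakpoint profile has the lower objective, because the interpolating profile pays for every small noisy jump.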
Affiliation(s)
- Fei Qin
  - Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina, Columbia, South Carolina 29208, USA
- Guoshuai Cai
  - Department of Environmental Health Science, Arnold School of Public Health, University of South Carolina, Columbia, South Carolina 29208, USA
- Christopher I Amos
  - Department of Quantitative Sciences, Baylor College of Medicine, Houston, Texas 77030, USA
- Feifei Xiao
  - Department of Biostatistics, College of Public Health and Health Professions and College of Medicine, University of Florida, Gainesville, Florida 32603, USA
3
Khowal S, Zhang D, Yong WH, Heaney AP. Whole-exome sequencing reveals genetic variants that may play a role in neurocytomas. J Neurooncol 2024; 166:471-483. [PMID: 38319496 DOI: 10.1007/s11060-024-04567-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2023] [Accepted: 01/09/2024] [Indexed: 02/07/2024]
Abstract
OBJECTIVES Neurocytomas (NCs) are rare intracranial tumors that can often be surgically resected. However, the disease course is unpredictable in many patients and medical therapies are lacking. We used whole-exome sequencing to explore the molecular etiology of neurocytoma and to assist in target identification for developing novel therapeutic interventions. METHODS We used whole-exome sequencing (WES) to compare the molecular landscape of 21 primary and recurrent NCs to five normal cerebellar control samples. WES data were analyzed using the Qiagen Clinical Insight program; variants of interest (VOI) were interrogated using ConSurf, ScoreCons, and Ingenuity Pathway Analysis software to predict their potential functional effects; and copy number variations (CNVs) in the genes of interest were analyzed by Genewiz (Azenta Life Sciences). RESULTS Of 40 VOI involving 36 genes, 7 were pathogenic, 17 likely pathogenic, and 16 of uncertain significance. Of the seven pathogenic NC-associated variants, Glucosylceramidase beta 1 [GBA1 c.703T>C (p.S235P)] was mutated in 5/21 (24%), Coagulation factor VIII [F8 c.3637dupA (p.I1213fs*28)] in 4/21 (19%), Phenylalanine hydroxylase [PAH c.975C>A (p.Y325*)] in 3/21 (14%), and Fanconi anemia complementation group C [FANCC c.1162G>T (p.G388*)], Chromodomain helicase DNA binding protein 7 [CHD7 c.2839C>T (p.R947*)], Myosin VIIA [MYO7A c.940G>T (p.E314*)] and Dynein axonemal heavy chain 11 [DNAH11 c.3544C>T (p.R1182*)] each in 2/21 (9.5%) NCs. CNVs were noted in 85% of these latter 7 genes. Interestingly, a Carboxy-terminal domain RNA polymerase II polypeptide A small phosphatase 2 variant [CTDSP2 c.472G>A (p.E158K)] of uncertain significance was also found in >70% of NC cases. INTERPRETATION The variants of interest we identified in the NCs regulate a variety of neurological processes, including cilia motility, cell metabolism, immune responses, and DNA damage repair, and provide novel insights into the molecular pathogenesis of these extremely rare tumors.
Affiliation(s)
- Sapna Khowal
  - Department of Medicine, David Geffen School of Medicine, University of California, Los Angeles, CA, 90095, USA
- Dongyun Zhang
  - Department of Medicine, David Geffen School of Medicine, University of California, Los Angeles, CA, 90095, USA
- William H Yong
  - Department of Pathology and Laboratory Medicine, University of California, Irvine, CA, 92868, USA
- Anthony P Heaney
  - Department of Medicine, David Geffen School of Medicine, University of California, Los Angeles, CA, 90095, USA
  - Department of Neurosurgery, David Geffen School of Medicine, University of California, Los Angeles, CA, 90095, USA
4
Yu X, Luo X, Cai G, Xiao F. OSCAA: A Two-Dimensional Gaussian Mixture Model for Copy Number Variation Association Analysis. bioRxiv 2023:2023.09.25.559392. [PMID: 37808739 PMCID: PMC10557568 DOI: 10.1101/2023.09.25.559392] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/10/2023]
Abstract
Copy number variants (CNVs) are prevalent in the human genome and have a profound effect on genomic organization and human diseases. Discovering disease-associated CNVs is critical for understanding the pathogenesis of diseases and aiding their diagnosis and treatment. However, traditional methods for assessing the association between CNVs and disease risks adopt a two-stage strategy, conducting quantitative CNV measurements first and then testing for association. This may lead to biased association estimation and low statistical power, serving as a major barrier in routine genome-wide assessment of such variation. In this article, we developed OSCAA, a flexible algorithm to discover disease-associated CNVs for both quantitative and qualitative traits. OSCAA employs a two-dimensional Gaussian mixture model that is built upon the principal components from copy number intensities, accounting for technical biases in CNV detection while simultaneously testing for their effect on outcome traits. In OSCAA, CNVs are identified and their associations with disease risk are evaluated simultaneously in a single step, taking into account the uncertainty of CNV identification in the statistical model. Our simulations demonstrated that OSCAA outperformed the existing one-stage method and traditional two-stage methods by yielding a more accurate estimate of the CNV-disease association, especially for short CNVs or CNVs with weak signals. In conclusion, OSCAA is a powerful and flexible approach for CNV association testing with high sensitivity and specificity, which can be easily applied to different traits and clinical risk predictions.
5
Qin F, Cai G, Xiao F. A statistical learning method for simultaneous copy number estimation and subclone clustering with single cell sequencing data. bioRxiv 2023:2023.04.18.537346. [PMID: 37131674 PMCID: PMC10153109 DOI: 10.1101/2023.04.18.537346] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
The availability of single-cell sequencing (SCS) enables us to assess intra-tumor heterogeneity and identify cellular subclones without the confounding effect of mixed cells. Copy number aberrations (CNAs) have been commonly used to identify subclones in SCS data using various clustering methods, since cells comprising a subpopulation are found to share a genetic profile. However, currently available methods may generate spurious results (e.g., falsely identified CNAs) in the procedure of CNA detection, hence diminishing the accuracy of subclone identification from a large, complex cell population. In this study, we developed a CNA detection method based on a fused lasso model, referred to as FLCNA, which can simultaneously identify subclones in single-cell DNA sequencing (scDNA-seq) data. Spike-in simulations were conducted to evaluate the clustering and CNA detection performance of FLCNA, benchmarking it against existing copy number estimation methods (SCOPE, HMMcopy) in combination with commonly used clustering methods. Interestingly, application of FLCNA to a real scDNA-seq dataset of breast cancer revealed remarkably different genomic variation patterns in neoadjuvant chemotherapy-treated samples and pre-treated samples. We show that FLCNA is a practical and powerful method for subclone identification and CNA detection with scDNA-seq data.
6
Luo X, Cai G, Mclain AC, Amos CI, Cai B, Xiao F. BMI-CNV: a Bayesian framework for multiple genotyping platforms detection of copy number variants. Genetics 2022; 222:iyac147. [PMID: 36171678 PMCID: PMC9713397 DOI: 10.1093/genetics/iyac147] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2022] [Accepted: 09/08/2022] [Indexed: 12/13/2022] Open
Abstract
Whole-exome sequencing (WES) enables the detection of copy number variants (CNVs) with high resolution in protein-coding regions. However, variants in the intergenic or intragenic regions are excluded from studies. Fortunately, many of these samples have been previously sequenced by other genotyping platforms, such as SNP arrays, which are sparse but cover a wide range of genomic regions. Moreover, conventional single-sample-based methods suffer from a high false discovery rate due to prominent data noise. Therefore, methods for integrating multiple genotyping platforms and multiple samples are in high demand for improved copy number variant detection. We developed BMI-CNV, a Bayesian Multisample and Integrative CNV profiling method for data sequenced by both whole-exome sequencing and microarray. For the multisample integration, we identify the shared copy number variant regions across samples using a Bayesian probit stick-breaking process model coupled with Gaussian mixture model estimation. In extensive simulations, BMI-CNV outperformed existing methods with improved accuracy. In the matched data from the 1000 Genomes Project and HapMap project, BMI-CNV also accurately detected common variants and significantly enlarged the detection spectrum of whole-exome sequencing. Further application to the data from The Research of International Cancer of Lung consortium (TRICL) identified lung cancer risk variant candidates in the 17q11.2, 1p36.12, 8q23.1, and 5q22.2 regions.
Affiliation(s)
- Xizhi Luo
  - Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina, Columbia, SC 29208, USA
- Guoshuai Cai
  - Department of Environmental Health Sciences, Arnold School of Public Health, University of South Carolina, Columbia, SC 29208, USA
- Alexander C Mclain
  - Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina, Columbia, SC 29208, USA
- Christopher I Amos
  - Department of Quantitative Sciences, Baylor College of Medicine, Houston, TX 77030, USA
- Bo Cai
  - Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina, Columbia, SC 29208, USA
- Feifei Xiao
  - Department of Biostatistics, University of Florida, Gainesville, FL 32603, USA
7
Li F, Yin J, Lu M, Mou M, Li Z, Zeng Z, Tan Y, Wang S, Chu X, Dai H, Hou T, Zeng S, Chen Y, Zhu F. DrugMAP: molecular atlas and pharma-information of all drugs. Nucleic Acids Res 2022; 51:D1288-D1299. [PMID: 36243961 PMCID: PMC9825453 DOI: 10.1093/nar/gkac813] [Citation(s) in RCA: 41] [Impact Index Per Article: 20.5] [Received: 08/01/2022] [Revised: 08/30/2022] [Accepted: 10/12/2022] [Indexed: 02/06/2023] Open
Abstract
The efficacy and safety of drugs are widely known to be determined by their interactions with multiple molecules of pharmacological importance, and it is therefore essential to systematically depict the molecular atlas and pharma-information of studied drugs. However, our understanding of such information is neither comprehensive nor precise, which necessitates the construction of a new database providing a network containing a large number of drugs and their interacting molecules. Here, a new database describing the molecular atlas and pharma-information of drugs (DrugMAP) was therefore constructed. It provides a comprehensive list of interacting molecules for >30 000 drugs/drug candidates, gives the differential expression patterns for >5000 interacting molecules among different disease sites, ADME (absorption, distribution, metabolism and excretion)-relevant organs and physiological tissues, and weaves a comprehensive and precise network containing >200 000 interactions among drugs and molecules. With the great efforts made to clarify the complex mechanism underlying drug pharmacokinetics and pharmacodynamics and rapidly emerging interests in artificial intelligence (AI)-based network analyses, DrugMAP is expected to become an indispensable supplement to existing databases to facilitate drug discovery. It is now fully and freely accessible at: https://idrblab.org/drugmap/.
Affiliation(s)
- Mingkun Lu
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
- Minjie Mou
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
- Zhaorong Li
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba–Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
- Zhenyu Zeng
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba–Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
- Ying Tan
  - State Key Laboratory of Chemical Oncogenomics, Key Laboratory of Chemical Biology, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China
- Shanshan Wang
  - Qian Xuesen Collaborative Research Center of Astrochemistry and Space Life Sciences, Institute of Drug Discovery Technology, Ningbo University, Ningbo 315211, China
- Xinyi Chu
  - Qian Xuesen Collaborative Research Center of Astrochemistry and Space Life Sciences, Institute of Drug Discovery Technology, Ningbo University, Ningbo 315211, China
- Haibin Dai
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
- Tingjun Hou
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
- Su Zeng
  - Correspondence may also be addressed to Su Zeng.
- Yuzong Chen
  - Correspondence may also be addressed to Yuzong Chen.
- Feng Zhu
  - To whom correspondence should be addressed.
8
Jewell S, Fearnhead P, Witten D. Testing for a Change in Mean After Changepoint Detection. J R Stat Soc Series B Stat Methodol 2022; 84:1082-1104. [PMID: 36419504 PMCID: PMC9678373 DOI: 10.1111/rssb.12501] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 09/03/2023]
Abstract
While many methods are available to detect structural changes in a time series, few procedures are available to quantify the uncertainty of these estimates post-detection. In this work, we fill this gap by proposing a new framework to test the null hypothesis that there is no change in mean around an estimated changepoint. We further show that it is possible to efficiently carry out this framework in the case of changepoints estimated by binary segmentation and its variants, ℓ0 segmentation, or the fused lasso. Our setup allows us to condition on much less information than existing approaches, which yields higher powered tests. We apply our proposals in a simulation study and on a dataset of chromosomal guanine-cytosine content. These approaches are freely available in the R package ChangepointInference at https://jewellsean.github.io/changepoint-inference/.
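Binary segmentation, one of the estimators named in this abstract, picks the split that most reduces the residual sum of squares and then recurses on each side. The sketch below shows only that single-split step on hypothetical toy data. It does not implement the post-detection test the paper proposes, and it is unrelated to the API of the ChangepointInference package.

```python
def best_split(y):
    """Return (index, gain) where y[:index] and y[index:] are the two segments.

    gain is the reduction in residual sum of squares obtained by fitting
    separate means to the two segments instead of one overall mean.
    """
    n = len(y)
    total = sum(y)
    best_idx, best_gain = None, 0.0
    left_sum = 0.0
    for k in range(1, n):
        left_sum += y[k - 1]
        right_sum = total - left_sum
        gain = left_sum ** 2 / k + right_sum ** 2 / (n - k) - total ** 2 / n
        if gain > best_gain:
            best_idx, best_gain = k, gain
    return best_idx, best_gain

# Toy series whose mean shifts from about 0 to about 2 after the fourth point.
y = [0.0, 0.2, -0.1, 0.1, 2.0, 2.1, 1.9, 2.2]
idx, gain = best_split(y)
```

Because the changepoint is chosen by looking at the data, a naive two-sample test on the resulting segments would be too optimistic, which is the gap the abstract describes.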
Affiliation(s)
- Sean Jewell
  - Department of Statistics, University of Washington, Seattle, USA
- Paul Fearnhead
  - Department of Mathematics and Statistics, Lancaster University, Lancaster, UK
- Daniela Witten
  - Departments of Statistics and Biostatistics, University of Washington, Seattle, USA
9
Huang T, Li J, Jia B, Sang H. CNV-MEANN: A Neural Network and Mind Evolutionary Algorithm-Based Detection of Copy Number Variations From Next-Generation Sequencing Data. Front Genet 2021; 12:700874. [PMID: 34484298 PMCID: PMC8415314 DOI: 10.3389/fgene.2021.700874] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 04/27/2021] [Accepted: 07/19/2021] [Indexed: 11/20/2022] Open
Abstract
Copy number variation (CNV) is defined as repetitions or deletions of genomic segments of 1 Kb to 5 Mb, and is a major trigger for human disease. The high-throughput and low-cost characteristics of next-generation sequencing (NGS) technology make it possible to detect CNVs across the whole genome, and also greatly improve the clinical practicability of NGS testing. However, current methods for the detection of CNVs are easily affected by sequencing and mapping errors and by the uneven distribution of reads. In this paper, we propose an improved approach, CNV-MEANN, for the detection of CNVs, which changes the structure of the neural network used in the MFCNV method. This method has three differences relative to the MFCNV method: (1) it utilizes a new feature, mapping quality, to replace two features in MFCNV; (2) it considers the influence of the loss categories of CNV on disease prediction, and refines the output structure; and (3) it uses a mind evolutionary algorithm to optimize the backpropagation neural network model, and calculates individual scores for each genome bin to predict CNVs. Using both simulated and real datasets, we tested the performance of CNV-MEANN and compared it with that of seven widely used CNV detection methods. Experimental results demonstrated that the CNV-MEANN approach outperformed other methods with respect to sensitivity, precision, and F1-score. The proposed method was able to detect many CNVs that other approaches could not, and it reduced the boundary bias. CNV-MEANN is expected to be an effective method for the analysis of changes in CNVs in the genome.
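For reference, the three evaluation metrics named in this abstract follow the standard definitions below. The counts are hypothetical and are not results from the paper; how a detected CNV is matched to a true CNV (for example, by an overlap threshold) is left out of this sketch.

```python
def detection_metrics(tp, fp, fn):
    """Return (sensitivity, precision, f1) from true/false positive and false negative counts."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return sensitivity, precision, f1

# Hypothetical counts: 8 true CNVs detected, 2 false calls, 2 true CNVs missed.
sensitivity, precision, f1 = detection_metrics(tp=8, fp=2, fn=2)
```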
Affiliation(s)
- Tihao Huang
  - School of Computer Science and Technology, Liaocheng University, Liaocheng, China
- Junqing Li
  - School of Computer Science and Technology, Liaocheng University, Liaocheng, China
- Baoxian Jia
  - School of Computer Science and Technology, Liaocheng University, Liaocheng, China
- Hongyan Sang
  - School of Computer Science and Technology, Liaocheng University, Liaocheng, China
10
Qin F, Luo X, Cai G, Xiao F. Shall genomic correlation structure be considered in copy number variants detection? Brief Bioinform 2021; 22:6295811. [PMID: 34114005 DOI: 10.1093/bib/bbab215] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/16/2021] [Revised: 04/16/2021] [Accepted: 05/17/2021] [Indexed: 11/14/2022] Open
Abstract
Copy number variation has been identified as a major source of genomic variation associated with disease susceptibility. With the advent of whole-exome sequencing (WES) technology, massive WES data have been generated, allowing for the identification of copy number variants (CNVs) in the protein-coding regions with direct functional interpretation. We have previously shown evidence of the genomic correlation structure in array data and developed a novel chromosomal breakpoint detection algorithm, LDcnv, which showed significantly improved detection power through integrating the correlation structure in a systematic modeling manner. However, it remains unexplored whether the genomic correlation exists in WES data and how such correlation structure integration can improve the CNV detection accuracy. In this study, we first explored the correlation structure of the WES data using the 1000 Genomes Project data. Both real raw read depth and median-normalized data showed strong evidence of the correlation structure. Motivated by this fact, we proposed a correlation-based method, CORRseq, as a novel release of the LDcnv algorithm in profiling WES data. The performance of CORRseq was evaluated in extensive simulation studies and real data analysis from the 1000 Genomes Project. CORRseq outperformed the existing methods in detecting medium and large CNVs. In conclusion, it would be more advantageous to model genomic correlation structure in detecting relatively long CNVs. This study provides great insights for methodology development of CNV detection with NGS data.
Affiliation(s)
- Fei Qin
  - Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina (USC), Discovery 449, 915 Greene St, Columbia, SC 29208, USA
- Xizhi Luo
  - Department of Epidemiology and Biostatistics, Arnold School of Public Health, USC, Discovery 449, 915 Greene St, Columbia, SC 29208, USA
- Guoshuai Cai
  - Department of Environmental Health Science, Arnold School of Public Health, USC, Discovery 449, 915 Greene St, Columbia, SC 29208, USA
- Feifei Xiao
  - Department of Epidemiology and Biostatistics, Arnold School of Public Health, USC, Discovery 449, 915 Greene St, Columbia, SC 29208, USA
11
Luo X, Qin F, Cai G, Xiao F. Integrating genomic correlation structure improves copy number variations detection. Bioinformatics 2021; 37:312-317. [PMID: 32805016 DOI: 10.1093/bioinformatics/btaa737] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/29/2020] [Revised: 07/23/2020] [Accepted: 08/12/2020] [Indexed: 12/16/2022] Open
Abstract
MOTIVATION Copy number variation plays important roles in human complex diseases. Detecting copy number variants (CNVs) involves identifying mean shifts in genetic intensities to locate chromosomal breakpoints, a step referred to as chromosomal segmentation. Many segmentation algorithms have been developed with a strong assumption of independent observations across genetic loci, and they assume each locus has an equal chance of being a breakpoint (i.e. a boundary of a CNV). However, this assumption is violated from a genetics perspective because of correlation among genomic positions, such as linkage disequilibrium (LD). Our study showed that the LD structure is related to the location distribution of CNVs, which indeed presents a non-random pattern on the genome. To generate more accurate CNVs, we proposed a novel algorithm, LDcnv, that models the CNV data with its biological characteristics relating to genetic dependence structure (i.e. LD). RESULTS We theoretically demonstrated the correlation structure of CNV data in SNP arrays, which further supports the necessity of integrating biological structure into statistical methods for CNV detection. Therefore, we developed LDcnv, which integrates the genomic correlation structure with a local search strategy into statistical modeling of the CNV intensities. To evaluate the performance of LDcnv, we conducted extensive simulations and analyzed large-scale HapMap datasets. We showed that LDcnv presented high accuracy, stability and robustness in CNV detection and higher precision in detecting short CNVs compared to existing methods. This new segmentation algorithm has a wide scope of potential application with data from various high-throughput technology platforms. AVAILABILITY AND IMPLEMENTATION https://github.com/FeifeiXiaoUSC/LDcnv. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xizhi Luo
  - Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina, Columbia, SC 29208, USA
- Fei Qin
  - Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina, Columbia, SC 29208, USA
- Guoshuai Cai
  - Department of Environmental Health Science, Arnold School of Public Health, University of South Carolina, Columbia, SC 29208, USA
- Feifei Xiao
  - Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina, Columbia, SC 29208, USA