1
Liu S, Zeng Y, Wang C, Zhang Q, Chen M, Wang X, Wang L, Lu Y, Guo H, Bu F. seGMM: A New Tool for Gender Determination From Massively Parallel Sequencing Data. Front Genet 2022; 13:850804. [PMID: 35309142; PMCID: PMC8930203; DOI: 10.3389/fgene.2022.850804] [Received: 01/08/2022] [Accepted: 02/10/2022]
Abstract
In clinical genetic testing, checking the concordance between self-reported gender and genotype-inferred gender from genomic data is a significant quality control measure because mismatched gender due to sex chromosomal abnormalities or misregistration of clinical information can significantly affect molecular diagnosis and treatment decisions. Targeted gene sequencing (TGS) is widely recommended as a first-tier diagnostic step in clinical genetic testing. However, the existing gender-inference tools are optimized for whole genome and whole exome data and are neither adequate nor accurate for analyzing TGS data. In this study, we validated a new gender-inference tool, seGMM, which uses unsupervised clustering (Gaussian mixture model) to determine the gender of a sample. The seGMM tool can also identify sex chromosomal abnormalities in samples by aligning the sequencing reads from the genotype data. The seGMM tool consistently demonstrated >99% gender-inference accuracy in a publicly available 1,000-gene panel dataset from the 1000 Genomes Project, an in-house 785-gene hearing loss panel dataset of 16,387 samples, and a 187-gene autism risk panel dataset from the Autism Clinical and Genetic Resources in China (ACGC) database. The performance and accuracy of seGMM were significantly higher than those of existing gender-inference tools such as PLINK, seXY, and XYalign on TGS, whole exome sequencing (WES), and whole genome sequencing (WGS) datasets. The results of seGMM were confirmed by short tandem repeat analysis of the sex chromosome marker gene amelogenin. Furthermore, our data showed that seGMM accurately identified sex chromosomal abnormalities in the samples. In conclusion, the seGMM tool shows great potential in clinical genetics by determining the sex chromosomal karyotypes of samples from massively parallel sequencing data with high accuracy.
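To make the clustering idea in this abstract concrete, the following is a minimal sketch of a two-component Gaussian mixture fitted by expectation-maximisation to one per-sample feature. It is not the seGMM implementation. The feature (a Y-to-X read-depth ratio), the function names, and the toy values are all assumptions chosen for illustration, and the abstract does not state which features seGMM actually uses.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_component_gmm(values, n_iter=100, min_var=1e-6):
    """Fit a 1-D, two-component Gaussian mixture with plain EM."""
    n = len(values)
    grand_mean = sum(values) / n
    overall_var = max(sum((v - grand_mean) ** 2 for v in values) / n, min_var)
    means = [min(values), max(values)]      # start the components at the extremes
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        resp = []
        for v in values:
            p = [weights[k] * normal_pdf(v, means[k], variances[k]) for k in range(2)]
            total = sum(p) or 1e-300        # guard against numerical underflow
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue                    # leave an empty component unchanged
            weights[k] = nk / n
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            spread = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk
            variances[k] = max(spread, min_var)
    return weights, means, variances


def infer_sex(y_to_x_ratios):
    """Label each sample by its most probable mixture component.

    Assumption: the component with the higher mean Y-to-X ratio is male.
    """
    weights, means, variances = fit_two_component_gmm(y_to_x_ratios)
    male = 0 if means[0] > means[1] else 1
    labels = []
    for v in y_to_x_ratios:
        p = [weights[k] * normal_pdf(v, means[k], variances[k]) for k in range(2)]
        total = sum(p) or 1e-300
        labels.append("male" if p[male] / total >= 0.5 else "female")
    return labels


# Hypothetical Y-to-X depth ratios for eight samples.
ratios = [0.01, 0.9, 0.02, 1.0, 0.015, 0.95, 0.03, 1.05]
labels = infer_sex(ratios)
```

A single feature with two components cannot flag sex chromosomal abnormalities. Detecting those would require additional features or outlier handling, which this sketch leaves out.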
Affiliation(s)
- Sihan Liu
  - Institute of Rare Diseases, West China Hospital of Sichuan University, Chengdu, China
- Yuanyuan Zeng
  - School of Medicine, National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, China
- Chao Wang
  - Institute of Rare Diseases, West China Hospital of Sichuan University, Chengdu, China
- Qian Zhang
  - Institute of Rare Diseases, West China Hospital of Sichuan University, Chengdu, China
- Meilin Chen
  - Institute of Rare Diseases, West China Hospital of Sichuan University, Chengdu, China
- Xiaolu Wang
  - Institute of Rare Diseases, West China Hospital of Sichuan University, Chengdu, China
- Lanchen Wang
  - Institute of Rare Diseases, West China Hospital of Sichuan University, Chengdu, China
- Yu Lu
  - Institute of Rare Diseases, West China Hospital of Sichuan University, Chengdu, China
- Hui Guo
  - Center for Medical Genetics and Hunan Provincial Key Laboratory of Medical Genetics, School of Life Sciences, Central South University, Changsha, China
- Fengxiao Bu
  - Institute of Rare Diseases, West China Hospital of Sichuan University, Chengdu, China
- *Correspondence: Yu Lu; Hui Guo; Fengxiao Bu
2
Jung CH, Park DJ, Georgeson P, Mahmood K, Milne RL, Southey MC, Pope BJ. sEst: Accurate Sex-Estimation and Abnormality Detection in Methylation Microarray Data. Int J Mol Sci 2018; 19:ijms19103172. [PMID: 30326623; PMCID: PMC6213967; DOI: 10.3390/ijms19103172] [Received: 09/20/2018] [Revised: 10/08/2018] [Accepted: 10/09/2018]
Abstract
DNA methylation influences predisposition, development and prognosis for many diseases, including cancer. However, it is not uncommon to encounter samples with incorrect sex labelling or atypical sex chromosome arrangement. Sex is one of the strongest influencers of the genomic distribution of DNA methylation and, therefore, correct assignment of sex and filtering of abnormal samples are essential for the quality control of study data. Differences in sex chromosome copy numbers between sexes and X-chromosome inactivation in females result in distinctive sex-specific patterns in the distribution of DNA methylation levels. In this study, we present a software tool, sEst, which incorporates clustering analysis to infer sex and to detect sex-chromosome abnormalities from DNA methylation microarray data. Testing with two publicly available datasets demonstrated that sEst not only correctly inferred the sex of the test samples, but also identified mislabelled samples and samples with potential sex-chromosome abnormalities, such as Klinefelter syndrome and Turner syndrome, the latter being a feature not offered by existing methods. Considering that sex and the sex-chromosome abnormalities can have large effects on many phenotypes, including diseases, our method can make a significant contribution to DNA methylation studies that are based on microarray platforms.
Affiliation(s)
- Chol-Hee Jung
  - Melbourne Bioinformatics, The University of Melbourne, Parkville, VIC 3010, Australia.
- Daniel J Park
  - Melbourne Bioinformatics, The University of Melbourne, Parkville, VIC 3010, Australia.
- Peter Georgeson
  - Melbourne Bioinformatics, The University of Melbourne, Parkville, VIC 3010, Australia.
  - Department of Clinical Pathology, The University of Melbourne, Parkville, VIC 3010, Australia.
- Khalid Mahmood
  - Melbourne Bioinformatics, The University of Melbourne, Parkville, VIC 3010, Australia.
- Roger L Milne
  - Cancer Epidemiology & Intelligence Division, Cancer Council Victoria, Melbourne, VIC 3004, Australia.
  - Centre for Epidemiology and Biostatistics, Melbourne School of Population and Global Health, The University of Melbourne, Parkville, VIC 3010, Australia.
  - Precision Medicine, School of Clinical Sciences at Monash Health, Monash University, Clayton, VIC 3168, Australia.
- Melissa C Southey
  - Cancer Epidemiology & Intelligence Division, Cancer Council Victoria, Melbourne, VIC 3004, Australia.
  - Precision Medicine, School of Clinical Sciences at Monash Health, Monash University, Clayton, VIC 3168, Australia.
  - Genetic Epidemiology Laboratory, The University of Melbourne, Parkville, VIC 3010, Australia.
- Bernard J Pope
  - Melbourne Bioinformatics, The University of Melbourne, Parkville, VIC 3010, Australia.
  - Department of Clinical Pathology, The University of Melbourne, Parkville, VIC 3010, Australia.
  - Centre for Epidemiology and Biostatistics, Melbourne School of Population and Global Health, The University of Melbourne, Parkville, VIC 3010, Australia.
3
Jeong S, Kim J, Park W, Jeon H, Kim N. SEXCMD: Development and validation of sex marker sequences for whole-exome/genome and RNA sequencing. PLoS One 2017; 12:e0184087. [PMID: 28886064; PMCID: PMC5590872; DOI: 10.1371/journal.pone.0184087] [Received: 05/15/2017] [Accepted: 08/17/2017]
Abstract
Over the last decade, a large number of nucleotide sequences have been generated by next-generation sequencing technologies and deposited to public databases. However, most of these datasets do not specify the sex of individuals sampled because researchers typically ignore or hide this information. Male and female genomes in many species have distinctive sex chromosomes, XX/XY and ZW/ZZ, and expression levels of many sex-related genes differ between the sexes. Herein, we describe how to develop sex marker sequences from syntenic regions of sex chromosomes and use them to quickly identify the sex of individuals being analyzed. Array-based technologies routinely use either known sex markers or the B-allele frequency of X or Z chromosomes to deduce the sex of an individual. The same strategy has been used with whole-exome/genome sequence data; however, all reads must be aligned onto a reference genome to determine the B-allele frequency of the X or Z chromosomes. SEXCMD is a pipeline that can extract sex marker sequences from reference sex chromosomes and rapidly identify the sex of individuals from whole-exome/genome and RNA sequencing after training with a known dataset through a simple machine learning approach. The pipeline counts total numbers of hits from sex-specific marker sequences and identifies the sex of the individuals sampled based on the fact that XX/ZZ samples do not have Y or W chromosome hits. We have successfully validated our pipeline with mammalian (Homo sapiens; XY) and avian (Gallus gallus; ZW) genomes. Typical calculation time when applying SEXCMD to human whole-exome or RNA sequencing datasets is a few minutes, and analyzing human whole-genome datasets takes about 10 minutes. Another important application of SEXCMD is as a quality control measure to avoid mixing samples before bioinformatics analysis. SEXCMD comprises simple Python and R scripts and is freely available at https://github.com/lovemun/SEXCMD.
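The hit-counting rule described in this abstract can be sketched as follows. This is an illustrative toy, not the SEXCMD pipeline: the marker sequences, the exact-substring matching, the function names, and the `min_y_to_x_ratio` threshold are all hypothetical. According to the abstract, the real tool derives its markers from syntenic regions of the sex chromosomes and trains on a known dataset. Only the XY system is covered here.

```python
def count_marker_hits(reads, markers):
    """Count reads that contain at least one marker as an exact substring."""
    return sum(1 for read in reads if any(marker in read for marker in markers))


def call_sex(reads, x_markers, y_markers, min_y_to_x_ratio=0.1):
    """Call XX or XY from the ratio of Y-marker hits to X-marker hits.

    XX samples are expected to give (almost) no Y hits, so a low ratio
    is called XX. The threshold value is an assumption for this sketch.
    """
    x_hits = count_marker_hits(reads, x_markers)
    y_hits = count_marker_hits(reads, y_markers)
    if x_hits == 0:
        # Without X hits the ratio is undefined, so no call is made.
        return "undetermined", x_hits, y_hits
    call = "XY" if y_hits / x_hits >= min_y_to_x_ratio else "XX"
    return call, x_hits, y_hits


# Hypothetical marker sequences and reads.
X_MARKERS = ["ACGTTGCA"]
Y_MARKERS = ["TTGACCGA"]

female_reads = ["GGACGTTGCATT", "CCACGTTGCAGG", "AAAAAAAAAAAA"]
male_reads = ["GGACGTTGCATT", "CCTTGACCGAGG", "AAAAAAAAAAAA"]

female_call = call_sex(female_reads, X_MARKERS, Y_MARKERS)
male_call = call_sex(male_reads, X_MARKERS, Y_MARKERS)
```

Using a ratio rather than requiring exactly zero Y hits leaves some tolerance for stray matches. A ZW system would need the labels reversed, because there the female is the heterogametic sex.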
Affiliation(s)
- Seongmun Jeong
  - Personalized Genomic Medicine Research Center, Korea Research Institute of Bioscience and Biotechnology, Daejeon, Korea
- Jiwoong Kim
  - Quantitative Biomedical Research Center, Department of Clinical Sciences, University of Texas Southwestern Medical Center, Dallas, TX, United States of America
- Won Park
  - Personalized Genomic Medicine Research Center, Korea Research Institute of Bioscience and Biotechnology, Daejeon, Korea
- Hongmin Jeon
  - Personalized Genomic Medicine Research Center, Korea Research Institute of Bioscience and Biotechnology, Daejeon, Korea
- Namshin Kim
  - Personalized Genomic Medicine Research Center, Korea Research Institute of Bioscience and Biotechnology, Daejeon, Korea
  - *Corresponding author