1
Sona P, Hong JH, Lee S, Kim BJ, Hong WY, Jung J, Kim HN, Kim HL, Christopher D, Herviou L, Im YH, Lee KY, Kim TS, Jung J. Integrated genome sizing (IGS) approach for the parallelization of whole genome analysis. BMC Bioinformatics 2018; 19:462. [PMID: 30509173 PMCID: PMC6276166 DOI: 10.1186/s12859-018-2499-1] [Received: 03/08/2018] [Accepted: 11/16/2018]
Abstract
BACKGROUND The use of whole genome sequences has increased recently with the rapid progression of next-generation sequencing (NGS) technologies. However, storing raw sequence reads to perform large-scale genome analysis poses hardware challenges. Despite advances in genome analytic platforms, efficient approaches remain relevant, especially as applied to the human genome. In this study, an Integrated Genome Sizing (IGS) approach is adopted to speed up the analysis of multiple whole genomes in a high-performance computing (HPC) environment. The approach splits a genome (GRCh37) into 630 chunks (fragments) so that multiple chunks can be analyzed in parallel across cohorts.
RESULTS IGS was integrated on the Maha-Fs HPC system to provide the parallelization required to analyze 2504 whole genomes. Using a single reference pilot genome, NA12878, we compared the NGS process time between Maha-Fs (NFS SATA hard disk drive) and SGI-UV300 (solid state drive memory). SGI-UV300 was faster, with a process time of 32.5 min compared with 55.2 min for Maha-Fs.
CONCLUSIONS The implementation of IGS can leverage the ability of HPC systems to analyze multiple genomes simultaneously. We believe this approach will accelerate research in personalized genomic medicine. Our method is comparable to the fastest methods for sequence alignment.
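The chunk-and-parallelize idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the 630 chunks are defined, so the fixed-size windows, the toy chromosome lengths, and the names `split_into_chunks` and `analyze_chunk` are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(chrom_lengths, chunk_size):
    """Return half-open (chrom, start, end) windows covering every chromosome."""
    chunks = []
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, chunk_size):
            chunks.append((chrom, start, min(start + chunk_size, length)))
    return chunks


def analyze_chunk(chunk):
    """Hypothetical placeholder for per-chunk work (alignment, variant calling)."""
    chrom, start, end = chunk
    return chrom, end - start  # here we only report the chunk length


# Toy lengths for illustration, NOT real GRCh37 chromosome sizes.
toy_genome = {"chrA": 1_000, "chrB": 450}
chunks = split_into_chunks(toy_genome, chunk_size=200)

# Each chunk is independent, so chunks can be processed concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(analyze_chunk, chunks))

covered = sum(length for _, length in results)
print(len(chunks), covered)
```

Because the windows are half-open and non-overlapping, the chunk lengths sum to the total genome length, which is a quick sanity check that no base is dropped or counted twice.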
Affiliation(s)
- Peter Sona
  - Genome Data Integration Center, Syntekabio Incorporated, Techno-2ro B-512, Yuseong-gu, Daejeon, Republic of Korea, 34025
- Jong Hui Hong
  - Genome Data Integration Center, Syntekabio Incorporated, Techno-2ro B-512, Yuseong-gu, Daejeon, Republic of Korea, 34025
- Sunho Lee
  - Genome Data Integration Center, Syntekabio Incorporated, Techno-2ro B-512, Yuseong-gu, Daejeon, Republic of Korea, 34025
- Byong Joon Kim
  - Genome Data Integration Center, Syntekabio Incorporated, Techno-2ro B-512, Yuseong-gu, Daejeon, Republic of Korea, 34025
- Woon-Young Hong
  - Genome Data Integration Center, Syntekabio Incorporated, Techno-2ro B-512, Yuseong-gu, Daejeon, Republic of Korea, 34025
- Jongcheol Jung
  - Genome Data Integration Center, Syntekabio Incorporated, Techno-2ro B-512, Yuseong-gu, Daejeon, Republic of Korea, 34025
- Han-Na Kim
  - PGM21 (Personalized Genomic Medicine 21), Ewha Womans University Medical Center, 1071, Anyang Cheon-ro, Yangcheon-gu, Seoul, 158-710, Korea
- Hyung-Lae Kim
  - PGM21 (Personalized Genomic Medicine 21), Ewha Womans University Medical Center, 1071, Anyang Cheon-ro, Yangcheon-gu, Seoul, 158-710, Korea
- David Christopher
  - Bioinformatics Solutions, 900 N McCarthy Blvd., Milpitas, CA, 95035, USA
- Laurent Herviou
  - Bioinformatics Solutions, 900 N McCarthy Blvd., Milpitas, CA, 95035, USA
- Young Hwan Im
  - Bioinformatics Solutions, 900 N McCarthy Blvd., Milpitas, CA, 95035, USA
- Kwee-Yum Lee
  - Genome Data Integration Center, Syntekabio Incorporated, Techno-2ro B-512, Yuseong-gu, Daejeon, Republic of Korea, 34025
  - Faculty of Medicine, University of Queensland, QLD, Brisbane, 4072, Australia
- Tae Soon Kim
  - Genome Data Integration Center, Syntekabio Incorporated, Techno-2ro B-512, Yuseong-gu, Daejeon, Republic of Korea, 34025
  - Department of Clinical Medical Sciences, Seoul National University College of Medicine, 71 Ihwajang-gil, Jongno-gu, Seoul, 03087, South Korea
- Jongsun Jung
  - Genome Data Integration Center, Syntekabio Incorporated, Techno-2ro B-512, Yuseong-gu, Daejeon, Republic of Korea, 34025
2
Javed R. Current research status, databases and application of single nucleotide polymorphism. Pak J Biol Sci 2010; 13:657-663. [PMID: 21717869 DOI: 10.3923/pjbs.2010.657.663]
Abstract
Single Nucleotide Polymorphisms (SNPs) are the most frequent form of DNA variation in the genome. SNPs are bi-allelic genetic markers, and the number of known SNPs is growing very fast. Current genomic databases contain information on several million SNPs. More than 6 million SNPs have been identified, and the information is publicly available through the efforts of the SNP Consortium and other databases. The NCBI plays a major role in facilitating the identification and cataloging of SNPs through the creation and maintenance of the public SNP database (dbSNP), which is used by the biomedical community worldwide and stimulates many areas of biological research, including the identification of the genetic components of disease. In this review article, we compile the existing SNP databases, their research status, and their applications.
Affiliation(s)
- R Javed
  - DNA Sequencing Lab, National Bureau of Animal Genetic Resources, Karnal-132001, Haryana, India
3
Abstract
It is known that cancers are caused by accumulated mutations in various genes and consequent functional alterations of proteins that are important for maintenance of normal cellular functions. The changes in nucleotide sequences and expression patterns of cancer-related genes are being extensively studied to better understand the mechanisms of tumorigenesis and to develop methods for DNA/protein [corrected] diagnosis and drug discovery. At present, a number of computer databases for molecular information on cancer-related genes are available publicly through the internet. These databases deal with familial cancer and sporadic cancer at the levels of germline mutation or somatic mutation, genomic or chromosomal abnormalities, and changes in the expression levels of relevant genes. Previously, we constructed a human gene mutation database named MutationView (http://mutview.dmb.med.keio.ac.jp/) and have accumulated mutation data for approximately 300 genes that are involved mainly in monogenic diseases. Forty-two genes are cancer-related and therefore a separate cancer database named KMcancerDB was constructed. MutationView/KMcancerDB utilizes a graphic display function for both queries and search results much more often than other existing databases, making the system quite user friendly. MutationView/KMcancerDB provides a highly sophisticated search function for all genes through a single internet URL. In the present paper, we briefly review various useful databases for cancer-related genes, and describe MutationView/KMcancerDB in more detail.
Affiliation(s)
- Nobuyoshi Shimizu
  - Department of Molecular Biology, Keio University School of Medicine, 35 Shinanomachi, Shinjuku-ku, Tokyo, Japan
5
Verzilli CJ, Stallard N, Whittaker JC. Bayesian modelling of multivariate quantitative traits using seemingly unrelated regressions. Genet Epidemiol 2005; 28:313-25. [PMID: 15789447 DOI: 10.1002/gepi.20072]
Abstract
We investigate a Bayesian approach to modelling the statistical association between markers at multiple loci and multivariate quantitative traits. In particular, we describe the use of Bayesian Seemingly Unrelated Regressions (SUR) whereby genotypes at the different loci are allowed to have non-simultaneous effects on the phenotypes considered with residuals from each regression assumed correlated. We present results from simulations showing that, under rather general conditions that are likely to hold in real situations, the Bayesian SUR approach has increased probability of selecting the true model compared to univariate analyses. Finally, we apply our methods to data from subjects genotyped for 12 SNPs in the apolipoprotein E (APOE) gene. Phenotypes relate to response to treatment with atorvastatin and include changes in total cholesterol, low-density lipoprotein cholesterol, and triglycerides. Missing genotype data are naturally accommodated in our Bayesian framework by imputing them using a nested haplotype phasing algorithm.
Affiliation(s)
- Claudio J Verzilli
  - Department of Epidemiology and Public Health, Imperial College London, London, United Kingdom
6
Dvornyk V, Long JR, Xiong DH, Liu PY, Zhao LJ, Shen H, Zhang YY, Liu YJ, Rocha-Sanchez S, Xiao P, Recker RR, Deng HW. Current limitations of SNP data from the public domain for studies of complex disorders: a test for ten candidate genes for obesity and osteoporosis. BMC Genet 2004; 5:4. [PMID: 15113403 PMCID: PMC395827 DOI: 10.1186/1471-2156-5-4] [Received: 11/24/2003] [Accepted: 02/25/2004]
Abstract
BACKGROUND Public SNP databases are frequently used to choose SNPs for candidate genes in the association and linkage studies of complex disorders. However, their utility for such studies of diseases with ethnic-dependent background has never been evaluated.
RESULTS To estimate the accuracy and completeness of SNP public databases, we analyzed the allele frequencies of 41 SNPs in 10 candidate genes for obesity and/or osteoporosis in a large American-Caucasian sample (1,873 individuals from 405 nuclear families) by PCR-invader assay. We compared our results with those from the databases and other published studies. Of the 41 SNPs, 8 were monomorphic in our sample. Twelve were reported for the first time for Caucasians and the other 29 SNPs in our sample essentially confirmed the respective allele frequencies for Caucasians in the databases and previous studies. The comparison of our data with other ethnic groups showed significant differentiation between the three major world ethnic groups at some SNPs (Caucasians and Africans differed at 3 of the 18 shared SNPs, and Caucasians and Asians differed at 13 of the 22 shared SNPs). This genetic differentiation may have an important implication for studying the well-known ethnic differences in the prevalence of obesity and osteoporosis, and complex disorders in general.
CONCLUSION A comparative analysis of the SNP data of the candidate genes obtained in the present study, as well as those retrieved from the public domain, suggests that the databases may currently have serious limitations for studying complex disorders with an ethnic-dependent background due to the incomplete and uneven representation of the candidate SNPs in the databases for the major ethnic groups. This conclusion attests to the imperative necessity of large-scale and accurate characterization of these SNPs in different ethnic groups.
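A comparison of allele frequencies between two populations, as reported in this abstract, can be sketched as a test on a 2x2 table of allele counts. The abstract does not state which statistical test the authors used, so the Pearson chi-square test below is an assumption, and the counts and the name `allele_chi_square` are invented for illustration.

```python
def allele_chi_square(pop1, pop2):
    """Pearson chi-square statistic (1 d.f.) for a 2x2 table of allele counts.

    pop1 and pop2 are (count_allele_A, count_allele_B) tuples.
    """
    a, b = pop1
    c, d = pop2
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        # e.g. a SNP that is monomorphic in both samples has an empty column
        raise ValueError("an empty row or column makes the test undefined")
    return n * (a * d - b * c) ** 2 / margins


CRITICAL_1DF_05 = 3.841  # chi-square critical value, 1 d.f., alpha = 0.05

# Made-up allele counts for one SNP in two populations.
stat = allele_chi_square((30, 70), (50, 50))
differs = stat > CRITICAL_1DF_05
print(round(stat, 3), differs)
```

A SNP that is monomorphic in both samples has a zero column total, so the statistic is undefined; the sketch raises an error in that case rather than returning a misleading value.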
Affiliation(s)
- Volodymyr Dvornyk
  - Osteoporosis Research Center and Department of Biomedical Sciences, Creighton University, 601 N. 30St., Suite 6730, Omaha, NE 68131, USA
- Ji-Rong Long
  - Osteoporosis Research Center and Department of Biomedical Sciences, Creighton University, 601 N. 30St., Suite 6730, Omaha, NE 68131, USA
- Dong-Hai Xiong
  - Osteoporosis Research Center and Department of Biomedical Sciences, Creighton University, 601 N. 30St., Suite 6730, Omaha, NE 68131, USA
- Peng-Yuan Liu
  - Osteoporosis Research Center and Department of Biomedical Sciences, Creighton University, 601 N. 30St., Suite 6730, Omaha, NE 68131, USA
- Lan-Juan Zhao
  - Osteoporosis Research Center and Department of Biomedical Sciences, Creighton University, 601 N. 30St., Suite 6730, Omaha, NE 68131, USA
- Hui Shen
  - Osteoporosis Research Center and Department of Biomedical Sciences, Creighton University, 601 N. 30St., Suite 6730, Omaha, NE 68131, USA
- Yuan-Yuan Zhang
  - Osteoporosis Research Center and Department of Biomedical Sciences, Creighton University, 601 N. 30St., Suite 6730, Omaha, NE 68131, USA
- Yong-Jun Liu
  - Osteoporosis Research Center and Department of Biomedical Sciences, Creighton University, 601 N. 30St., Suite 6730, Omaha, NE 68131, USA
- Sonia Rocha-Sanchez
  - Osteoporosis Research Center and Department of Biomedical Sciences, Creighton University, 601 N. 30St., Suite 6730, Omaha, NE 68131, USA
- Peng Xiao
  - Osteoporosis Research Center and Department of Biomedical Sciences, Creighton University, 601 N. 30St., Suite 6730, Omaha, NE 68131, USA
- Robert R Recker
  - Osteoporosis Research Center and Department of Biomedical Sciences, Creighton University, 601 N. 30St., Suite 6730, Omaha, NE 68131, USA
- Hong-Wen Deng
  - Osteoporosis Research Center and Department of Biomedical Sciences, Creighton University, 601 N. 30St., Suite 6730, Omaha, NE 68131, USA
  - Laboratory of Molecular and Statistical Genetics, College of Life Sciences, Hunan Normal University, Changsha, Hunan 410081, P. R. China
7
Marsh S, Kwok P, McLeod HL. SNP databases and pharmacogenetics: great start, but a long way to go. Hum Mutat 2002; 20:174-9. [PMID: 12203989 DOI: 10.1002/humu.10115]
Abstract
With the recent publication of the human genome project there has been an explosion of data available for pharmacogenetic research. Web-based databases containing information on single nucleotide polymorphisms (SNPs) are readily accessible to researchers, but there has been little comment on their utility. We used seven major international databases to identify SNPs in 74 genes involved in drug pathways. Very little overlap was seen among the databases, with only eight out of a putative 893 SNPs (approximately 1%) common to the most commonly used databases. Problems with false positives, secondary to a high degree of homology in gene families, were also observed. These studies suggest that researchers limiting their studies to one database would miss a great deal of information. Efforts to update compilation databases, such as HGVbase, GeneSNP, PharmGKB, and HOWDY, and the aggressive removal of false positives from all databases are required if these resources are to facilitate the intended growth in pharmacogenetics research.
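The overlap figure quoted in this abstract (eight shared SNPs out of a putative 893) is the kind of quantity a set intersection gives directly. The sketch below is illustrative only: the database names and SNP identifiers are made up, and `overlap_summary` is a hypothetical helper, not a tool from the paper.

```python
def overlap_summary(databases):
    """Return (n_common, n_total, fraction) for SNP IDs across all databases."""
    id_sets = [set(ids) for ids in databases.values()]
    common = set.intersection(*id_sets)  # IDs present in every database
    total = set.union(*id_sets)          # IDs present in at least one database
    return len(common), len(total), len(common) / len(total)


# Invented identifiers and database names, for illustration only.
toy_databases = {
    "db_one": ["rs1", "rs2", "rs3"],
    "db_two": ["rs2", "rs3", "rs4"],
    "db_three": ["rs3", "rs5"],
}
n_common, n_total, fraction = overlap_summary(toy_databases)
print(n_common, n_total, fraction)

# The abstract's own figures: 8 shared SNPs out of a putative 893.
reported_fraction = 8 / 893
print(f"{reported_fraction:.1%}")
```

Applied to the abstract's figures, the same ratio comes to about 0.9%, consistent with the "approximately 1%" reported.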
Affiliation(s)
- Sharon Marsh
  - Department of Medicine, Washington University School of Medicine, St. Louis, Missouri, USA