1
Magdy Mohamed Abdelaziz Barakat S, Sallehuddin R, Yuhaniz SS, R. Khairuddin RF, Mahmood Y. Genome assembly composition of the String "ACGT" array: a review of data structure accuracy and performance challenges. PeerJ Comput Sci 2023; 9:e1180. [PMID: 37547391 PMCID: PMC10403225 DOI: 10.7717/peerj-cs.1180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/07/2022] [Accepted: 04/27/2023] [Indexed: 08/08/2023]
Abstract
Background The development of sequencing technology has increased the number of genomes being sequenced. However, obtaining a high-quality genome sequence remains a challenge, because genome assembly must piece together a massive number of short strings (reads) in the presence of repetitive sequences (repeats). Computer algorithms for genome assembly construct the entire genome from reads using one of two approaches. The de novo approach concatenates reads based on exact matches between the suffix of one read and the prefix of another (overlapping). The reference-guided approach orders reads based on their offsets in a well-known reference genome (read alignment). The presence of repeats adds ambiguity, leaving the algorithm unable to distinguish between reads, which results in misassembly and reduces assembly accuracy. In addition, the massive number of reads poses a major performance challenge. Method Repeat identification methods address misassembly by identifying repetitive sequences in advance and creating a repeat knowledge base that reduces ambiguity during assembly, thus enhancing the accuracy of the assembled genome. Hybridization of the assembly approaches also lowers the degree of misassembly with the aid of the reference genome. Assembly performance is optimized through data structure indexing and parallelization. The primary aim and contribution of this article is an extensive review that eases researchers' search for genome assembly studies. The study also highlights the most recent developments and limitations in genome assembly accuracy and performance optimization. Results Our findings show the limitations of the available repeat identification methods, which can detect only repeats of specific lengths and may not perform well when various types of repeats are present in a genome.
We also found that most hybrid assembly approaches, whether they start with de novo or reference-guided assembly, have limitations in handling repetitive sequences, as doing so is more computationally costly and time-intensive. Although the hybrid approach was found to outperform the individual assembly approaches, optimizing its performance remains a challenge. In addition, parallelization of overlapping and read alignment has yet to be fully implemented in the hybrid assembly approach. Conclusion We suggest combining multiple repeat identification methods to improve the accuracy of repeat identification as an initial step of the hybrid assembly approach, and combining genome indexing with parallelization to better optimize its performance.
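The suffix-prefix overlap step that the abstract above attributes to de novo assembly can be sketched in a few lines. This is a minimal illustration and not the method of any assembler covered by the review; the function name `overlap`, the `min_length` parameter, and the example reads are all assumptions made for the sketch.

```python
def overlap(a: str, b: str, min_length: int = 3) -> int:
    """Return the length of the longest suffix of `a` that exactly
    matches a prefix of `b`, or 0 if it is shorter than `min_length`."""
    start = 0
    while True:
        # Look for the first min_length characters of b somewhere in a.
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0  # no candidate position left
        # A candidate counts only if the rest of a, from this position
        # to its end, is a prefix of b.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1  # try the next candidate position


# Hypothetical reads: a unique overlap of length 3.
print(overlap("TTACGT", "CGTGTGC"))            # 3

# Hypothetical reads sharing the repeat "GTAC": both candidates overlap
# the first read equally well, so the exact match alone cannot decide
# which read should extend it.
print(overlap("ACGTAC", "GTACGG", min_length=4))  # 4
print(overlap("ACGTAC", "GTACTT", min_length=4))  # 4
```

The last two calls show, on toy data, the kind of ambiguity the abstract describes: when a repeat sits at the end of a read, several reads can be equally valid extensions.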
Affiliation(s)
- Roselina Sallehuddin
  - Computer Science, School of Computing, Faculty of Engineering, Universiti Teknologi Malaysia, Skudai, Johor, Malaysia
- Siti Sophiayati Yuhaniz
  - Advanced Informatics Department, Razak Faculty of Technology and Informatics, Universiti Teknologi Malaysia, Kuala Lumpur, Kuala Lumpur, Malaysia
- Yasir Mahmood
  - Faculty of Information Technology, The University of Lahore, Lahore, Lahore, Pakistan
2
Tangaro MA, Mandreoli P, Chiara M, Donvito G, Antonacci M, Parisi A, Bianco A, Romano A, Bianchi DM, Cangelosi D, Uva P, Molineris I, Nosi V, Calogero RA, Alessandri L, Pedrini E, Mordenti M, Bonetti E, Sangiorgi L, Pesole G, Zambelli F. Laniakea@ReCaS: exploring the potential of customisable Galaxy on-demand instances as a cloud-based service. BMC Bioinformatics 2021; 22:544. [PMID: 34749633 PMCID: PMC8574934 DOI: 10.1186/s12859-021-04401-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 09/12/2021] [Accepted: 09/24/2021] [Indexed: 11/16/2022]
Abstract
BACKGROUND Improving the availability and usability of data and analytical tools is a critical precondition for further advancing modern biological and biomedical research. For instance, one of the many ramifications of the COVID-19 global pandemic has been to make even more evident the importance of having bioinformatics tools and data readily actionable by researchers through convenient access points and supported by adequate IT infrastructures. One of the most successful efforts in improving the availability and usability of bioinformatics tools and data is represented by the Galaxy workflow manager and its thriving community. In 2020 we introduced Laniakea, a software platform conceived to streamline the configuration and deployment of "on-demand" Galaxy instances over the cloud. By facilitating the set-up and configuration of Galaxy web servers, Laniakea provides researchers with a powerful and highly customisable platform for executing complex bioinformatics analyses. The system can be accessed through a dedicated and user-friendly web interface that allows the Galaxy web server's initial configuration and deployment. RESULTS "Laniakea@ReCaS", the first instance of a Laniakea-based service, is managed by ELIXIR-IT and was officially launched in February 2020, after about one year of development and testing that involved several users. Researchers can request access to Laniakea@ReCaS through an open-ended call for use-cases. Ten project proposals have been accepted since then, totalling 18 Galaxy on-demand virtual servers that employ ~ 100 CPUs, ~ 250 GB of RAM and ~ 5 TB of storage and serve several different communities and purposes. Herein, we present eight use cases demonstrating the versatility of the platform. 
CONCLUSIONS During this first year of activity, the Laniakea-based service emerged as a flexible platform that facilitated the rapid development of bioinformatics tools, the efficient delivery of training activities, and the provision of public bioinformatics services in different settings, including food safety and clinical research. Laniakea@ReCaS provides a proof of concept of how enabling access to appropriate, reliable IT resources and ready-to-use bioinformatics tools can considerably streamline researchers' work.
Affiliation(s)
- Marco Antonio Tangaro
  - Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (CNR), Via Giovanni Amendola 122/O, 70126, Bari, Italy
  - National Institute for Nuclear Physics (INFN), Section of Bari, Via Orabona 4, 70126, Bari, Italy
- Pietro Mandreoli
  - Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (CNR), Via Giovanni Amendola 122/O, 70126, Bari, Italy
  - Department of Biosciences, University of Milan, Via Celoria 26, 20133, Milano, Italy
- Matteo Chiara
  - Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (CNR), Via Giovanni Amendola 122/O, 70126, Bari, Italy
  - Department of Biosciences, University of Milan, Via Celoria 26, 20133, Milano, Italy
- Giacinto Donvito
  - National Institute for Nuclear Physics (INFN), Section of Bari, Via Orabona 4, 70126, Bari, Italy
- Marica Antonacci
  - National Institute for Nuclear Physics (INFN), Section of Bari, Via Orabona 4, 70126, Bari, Italy
- Antonio Parisi
  - Istituto Zooprofilattico Sperimentale Della Puglia e Della Basilicata, Via Manfredonia 20, 71121, Foggia, Italy
- Angelica Bianco
  - Istituto Zooprofilattico Sperimentale Della Puglia e Della Basilicata, Via Manfredonia 20, 71121, Foggia, Italy
- Angelo Romano
  - National Reference Laboratory for Coagulase-Positive Staphylococci Including Staphylococcus Aureus, Istituto Zooprofilattico Sperimentale del Piemonte, Liguria e Valle d'Aosta, Via Bologna 148, 10154, Turin, Italy
- Daniela Manila Bianchi
  - National Reference Laboratory for Coagulase-Positive Staphylococci Including Staphylococcus Aureus, Istituto Zooprofilattico Sperimentale del Piemonte, Liguria e Valle d'Aosta, Via Bologna 148, 10154, Turin, Italy
- Davide Cangelosi
  - Clinical Bioinformatics Unit, Scientific Direction, IRCCS Istituto Giannina Gaslini, Via Gerolamo Gaslini 5, 16147, Genova, Italy
- Paolo Uva
  - Clinical Bioinformatics Unit, Scientific Direction, IRCCS Istituto Giannina Gaslini, Via Gerolamo Gaslini 5, 16147, Genova, Italy
  - Italian Institute of Technology, Via Morego 30, 16163, Genova, Italy
- Ivan Molineris
  - Department of Life Science and System Biology, University of Turin, Via Accademia Albertina, 13-1023, Turin, Italy
- Vladimir Nosi
  - Department of Computer Science, University of Turin, Via Pessinetto 12, 10049, Turin, Italy
- Raffaele A Calogero
  - Department of Molecular Biotechnology and Health Sciences, Via Nizza 52, 10126, Turin, Italy
- Luca Alessandri
  - Department of Molecular Biotechnology and Health Sciences, Via Nizza 52, 10126, Turin, Italy
- Elena Pedrini
  - Department of Rare Skeletal Disorders, IRCCS Istituto Ortopedico Rizzoli, Via di Barbiano 1/10, 40136, Bologna, Italy
- Marina Mordenti
  - Department of Rare Skeletal Disorders, IRCCS Istituto Ortopedico Rizzoli, Via di Barbiano 1/10, 40136, Bologna, Italy
- Emanuele Bonetti
  - Department of Rare Skeletal Disorders, IRCCS Istituto Ortopedico Rizzoli, Via di Barbiano 1/10, 40136, Bologna, Italy
  - Department of Experimental Oncology, European Institute of Oncology, Via Adamello 16, 20139, Milan, Italy
- Luca Sangiorgi
  - Department of Rare Skeletal Disorders, IRCCS Istituto Ortopedico Rizzoli, Via di Barbiano 1/10, 40136, Bologna, Italy
- Graziano Pesole
  - Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (CNR), Via Giovanni Amendola 122/O, 70126, Bari, Italy
  - Department of Biosciences, Biotechnologies and Biopharmaceutics, University of Bari, Via Orabona 4, 70126, Bari, Italy
- Federico Zambelli
  - Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (CNR), Via Giovanni Amendola 122/O, 70126, Bari, Italy
  - Department of Biosciences, University of Milan, Via Celoria 26, 20133, Milano, Italy
3
Ferrara M, Gallo A, Perrone G, Magistà D, Baker SE. Comparative Genomic Analysis of Ochratoxin A Biosynthetic Cluster in Producing Fungi: New Evidence of a Cyclase Gene Involvement. Front Microbiol 2020; 11:581309. [PMID: 33391201 PMCID: PMC7775548 DOI: 10.3389/fmicb.2020.581309] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 07/08/2020] [Accepted: 11/30/2020] [Indexed: 12/13/2022]
Abstract
The widespread use of Next-Generation Sequencing has opened a new era in the study of biological systems by significantly increasing the catalog of fungal genome sequences and identifying gene clusters for known secondary metabolites as well as novel cryptic ones. However, most of these clusters still need to be examined in detail to completely understand the pathway steps and the regulation of the biosynthesis of metabolites. The genome sequencing approach led to the identification of the biosynthetic gene cluster of ochratoxin A (OTA) in a number of producing fungal species. Ochratoxin A is a potent pentaketide nephrotoxin produced by Aspergillus and Penicillium species and found as a widespread contaminant in food, beverages and feed. The increasing availability of new genome sequences of OTA-producing species in the JGI Mycocosm and/or GenBank databanks led us to analyze and update the gene cluster structure in 19 Aspergillus and 2 Penicillium OTA-producing species, revealing a well-conserved organization of OTA core genes among the species. Furthermore, our comparative genome analyses revealed the presence of an additional, previously undescribed gene located between the polyketide and non-ribosomal synthase genes in the cluster of all the species analyzed. The presence of a SnoaL cyclase domain in the sequence of this gene supports its putative role in the polyketide cyclization reaction during the initial steps of the OTA biosynthesis pathway. The phylogenetic analysis showed a clustering of OTA SnoaL domains in accordance with the phylogeny of OTA-producing species at species and section levels. The characterization of this new OTA gene, its putative role and the evidence of its expression in three important representative producing species are reported here for the first time.
Affiliation(s)
- Massimo Ferrara
  - Institute of Sciences of Food Production (ISPA), National Research Council (CNR), Bari, Italy
- Antonia Gallo
  - Institute of Sciences of Food Production (ISPA), National Research Council (CNR), Lecce, Italy
- Giancarlo Perrone
  - Institute of Sciences of Food Production (ISPA), National Research Council (CNR), Bari, Italy
- Donato Magistà
  - Institute of Sciences of Food Production (ISPA), National Research Council (CNR), Bari, Italy
- Scott E Baker
  - Functional and Systems Biology Group, Environmental Molecular Sciences Division, Pacific Northwest National Laboratory, Richland, WA, United States
  - DOE Joint Bioenergy Institute, Emeryville, CA, United States
4
Tiwari P, Colborn KL, Smith DE, Xing F, Ghosh D, Rosenberg MA. Assessment of a Machine Learning Model Applied to Harmonized Electronic Health Record Data for the Prediction of Incident Atrial Fibrillation. JAMA Netw Open 2020; 3:e1919396. [PMID: 31951272 PMCID: PMC6991266 DOI: 10.1001/jamanetworkopen.2019.19396] [Citation(s) in RCA: 58] [Impact Index Per Article: 14.5] [Indexed: 01/17/2023]
Abstract
IMPORTANCE Atrial fibrillation (AF) is the most common sustained cardiac arrhythmia, and its early detection could lead to significant improvements in outcomes through the appropriate prescription of anticoagulation medication. Although a variety of methods exist for screening for AF, a targeted approach, which requires an efficient method for identifying patients at risk, would be preferred. OBJECTIVE To examine machine learning approaches applied to electronic health record data that have been harmonized to the Observational Medical Outcomes Partnership Common Data Model for identifying risk of AF. DESIGN, SETTING, AND PARTICIPANTS This diagnostic study used data from 2 252 219 individuals cared for in the UCHealth hospital system, which comprises 3 large hospitals in Colorado, from January 1, 2011, to October 1, 2018. Initial analysis was performed in December 2018; follow-up analysis was performed in July 2019. EXPOSURES All Observational Medical Outcomes Partnership Common Data Model-harmonized electronic health record features, including diagnoses, procedures, medications, age, and sex. MAIN OUTCOMES AND MEASURES Classification of incident AF in designated 6-month intervals, adjudicated retrospectively, based on area under the receiver operating characteristic curve and F1 statistic. RESULTS Of 2 252 219 individuals (1 225 533 [54.4%] women; mean [SD] age, 42.9 [22.3] years), 28 036 (1.2%) developed incident AF during a designated 6-month interval. The machine learning model that used the 200 most common electronic health record features, including age and sex, and random oversampling with a single-layer, fully connected neural network provided the optimal prediction of 6-month incident AF, with an area under the receiver operating characteristic curve of 0.800 and an F1 score of 0.110. 
This model performed only slightly better than a more basic logistic regression model composed of known clinical risk factors for AF, which had an area under the receiver operating characteristic curve of 0.794 and an F1 score of 0.079. CONCLUSIONS AND RELEVANCE Machine learning approaches to electronic health record data offer a promising method for improving risk prediction for incident AF, but more work is needed to show improvement beyond standard risk factors.
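The F1 statistic quoted in the abstract above is the harmonic mean of precision and recall. The sketch below shows the arithmetic only; the function name `f1_score` and the confusion-matrix counts are invented for illustration and are not taken from the study.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2 * precision * recall / (precision + recall),
    which simplifies to 2*TP / (2*TP + FP + FN)."""
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        return 0.0  # no positives predicted or present; define F1 as 0
    return 2 * tp / denominator


# Invented counts for a rare outcome: 120 true cases among 10,000 people,
# of which a hypothetical classifier flags 1,500 people and catches 90.
tp, fp, fn = 90, 1410, 30
precision = tp / (tp + fp)   # 0.06
recall = tp / (tp + fn)      # 0.75
print(round(precision, 3), round(recall, 3), round(f1_score(tp, fp, fn), 3))
# 0.06 0.75 0.111
```

With a rare outcome, precision stays low even when recall is high, so F1 can be small while a rank-based measure such as the area under the receiver operating characteristic curve looks considerably better.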
Affiliation(s)
- Premanand Tiwari
  - Colorado Center for Personalized Medicine, University of Colorado School of Medicine, Aurora
- Kathryn L. Colborn
  - Colorado School of Public Health, Department of Biostatistics and Informatics, University of Colorado Denver, Aurora
- Derek E. Smith
  - Children’s Hospital Colorado, Cancer Center Biostatistics Core, Department of Pediatrics, University of Colorado, Aurora
- Fuyong Xing
  - Colorado School of Public Health, Department of Biostatistics and Informatics, University of Colorado Denver, Aurora
- Debashis Ghosh
  - Colorado School of Public Health, Department of Biostatistics and Informatics, University of Colorado Denver, Aurora
- Michael A. Rosenberg
  - Colorado Center for Personalized Medicine, University of Colorado School of Medicine, Aurora
  - Division of Cardiology and Cardiac Electrophysiology, University of Colorado School of Medicine, Aurora
5
Bhattacharya A, Cui Y. A GPU-accelerated algorithm for biclustering analysis and detection of condition-dependent coexpression network modules. Sci Rep 2017. [PMID: 28646174 PMCID: PMC5482832 DOI: 10.1038/s41598-017-04070-4] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Indexed: 02/02/2023]
Abstract
In the analysis of large-scale gene expression data, it is important to identify groups of genes with common expression patterns under certain conditions. Many biclustering algorithms have been developed to address this problem. However, comprehensive discovery of functionally coherent biclusters from large datasets remains a challenging problem. Here we propose a GPU-accelerated biclustering algorithm based on searching for the largest Condition-dependent Correlation Subgroups (CCS) for each gene in the gene expression dataset. We compared CCS with thirteen widely used biclustering algorithms. CCS consistently outperformed all thirteen on both synthetic and real gene expression datasets. As a correlation-based biclustering method, CCS can also be used to find condition-dependent coexpression network modules. We implemented the CCS algorithm in C and the parallelized CCS algorithm in CUDA C for GPU computing. The source code of CCS is available from https://github.com/abhatta3/Condition-dependent-Correlation-Subgroups-CCS.
Affiliation(s)
- Anindya Bhattacharya
  - Department of Microbiology, Immunology and Biochemistry, Memphis, TN, 38163, USA
  - Center for Integrative and Translational Genomics, University of Tennessee Health Science Center, Memphis, TN, 38163, USA
  - Department of Computer Science and Engineering, University of California, San Diego, CA, 92093, USA
- Yan Cui
  - Department of Microbiology, Immunology and Biochemistry, Memphis, TN, 38163, USA
  - Center for Integrative and Translational Genomics, University of Tennessee Health Science Center, Memphis, TN, 38163, USA
6
Raja K, Patrick M, Gao Y, Madu D, Yang Y, Tsoi LC. A Review of Recent Advancement in Integrating Omics Data with Literature Mining towards Biomedical Discoveries. Int J Genomics 2017; 2017:6213474. [PMID: 28331849 PMCID: PMC5346376 DOI: 10.1155/2017/6213474] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.1] [Received: 10/07/2016] [Accepted: 02/09/2017] [Indexed: 12/13/2022]
Abstract
In the past decade, the volume of "omics" data generated by the different high-throughput technologies has expanded exponentially. Managing, storing, and analyzing this big data has been a great challenge for researchers, especially when moving towards the goal of generating testable data-driven hypotheses, which has been the promise of high-throughput experimental techniques. Different bioinformatics approaches have been developed to streamline the downstream analyses by providing independent information to interpret the data and support biological inference. Text mining (also known as literature mining) is one of the commonly used approaches for automated generation of biological knowledge from the huge number of published articles. In this review paper, we discuss recent advances in approaches that integrate results from omics data with information generated by text mining to uncover novel biomedical information.
Affiliation(s)
- Kalpana Raja
  - Department of Dermatology, University of Michigan Medical School, Ann Arbor, MI, USA
- Matthew Patrick
  - Department of Dermatology, University of Michigan Medical School, Ann Arbor, MI, USA
- Yilin Gao
  - Department of Dermatology, University of Michigan Medical School, Ann Arbor, MI, USA
- Desmond Madu
  - Department of Dermatology, University of Michigan Medical School, Ann Arbor, MI, USA
- Yuyang Yang
  - Department of Dermatology, University of Michigan Medical School, Ann Arbor, MI, USA
- Lam C. Tsoi
  - Department of Dermatology, University of Michigan Medical School, Ann Arbor, MI, USA
  - Department of Computational Medicine & Bioinformatics, University of Michigan Medical School, Ann Arbor, MI, USA
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
7
Te Pas MFW, Madsen O, Calus MPL, Smits MA. The Importance of Endophenotypes to Evaluate the Relationship between Genotype and External Phenotype. Int J Mol Sci 2017; 18:E472. [PMID: 28241430 PMCID: PMC5344004 DOI: 10.3390/ijms18020472] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Received: 10/24/2016] [Revised: 02/02/2017] [Accepted: 02/13/2017] [Indexed: 02/06/2023]
Abstract
With the exception of a few Mendelian traits, almost all phenotypes (traits) in livestock science are quantitative or complex traits regulated by the expression of many genes. For most of the complex traits, differential expression of genes, rather than genomic variation in the gene coding sequences, is associated with the genotype of a trait. The expression profiles of the animal's transcriptome, proteome and metabolome represent endophenotypes that influence/regulate the externally-observed phenotype. These expression profiles are generated by interactions between the animal's genome and its environment that range from the cellular, up to the husbandry environment. Thus, understanding complex traits requires knowledge about not only genomic variation, but also environmental effects that affect genome expression. Gene products act together in physiological pathways and interaction networks (of pathways). Due to the lack of annotation of the functional genome and ontologies of genes, our knowledge about the various biological systems that contribute to the development of external phenotypes is sparse. Furthermore, interaction with the animals' microbiome, especially in the gut, greatly influences the external phenotype. We conclude that a detailed understanding of complex traits requires not only understanding of variation in the genome, but also its expression at all functional levels.
Affiliation(s)
- Marinus F W Te Pas
  - Animal Breeding and Genomics Centre, Wageningen UR Livestock Research, 6700AH Wageningen, The Netherlands
- Ole Madsen
  - Animal Breeding and Genomics, Wageningen University, 6700AH Wageningen, The Netherlands
- Mario P L Calus
  - Animal Breeding and Genomics Centre, Wageningen UR Livestock Research, 6700AH Wageningen, The Netherlands
- Mari A Smits
  - Animal Breeding and Genomics Centre, Wageningen UR Livestock Research, 6700AH Wageningen, The Netherlands
8
Pan H, Holbrook JD, Karnani N, Kwoh CK. Gene, Environment and Methylation (GEM): a tool suite to efficiently navigate large scale epigenome wide association studies and integrate genotype and interaction between genotype and environment. BMC Bioinformatics 2016; 17:299. [PMID: 27480116 PMCID: PMC4970299 DOI: 10.1186/s12859-016-1161-z] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.8] [Received: 03/19/2016] [Accepted: 07/21/2016] [Indexed: 12/19/2022]
Abstract
BACKGROUND The interplay among genetic, environmental and epigenetic variation is not fully understood. Advances in high-throughput genotyping methods, high-density DNA methylation detection and well-characterized sample collections enable epigenetic association studies at the genomic and population levels (EWAS). The field has extended to interrogate the interaction of environmental and genetic (GxE) influences on epigenetic variation. Also, the detection of methylation quantitative trait loci (methQTLs) and their association with health status has enhanced our knowledge of epigenetic mechanisms in disease trajectory. However, analysis of this type of data brings computational challenges, and there are few practical solutions to enable large-scale studies in standard computational environments. RESULTS GEM is a highly efficient R tool suite for performing epigenome-wide association studies (EWAS). GEM provides three major functions, named GEM_Emodel, GEM_Gmodel and GEM_GxEmodel, to study the interplay of Gene, Environment and Methylation (GEM). Within GEM, the pre-existing "Matrix eQTL" package is utilized and extended to study methylation quantitative trait loci (methQTL) and the interaction of genotype and environment (GxE) to determine DNA methylation variation, using matrix-based iterative correlation and memory-efficient data analysis. Benchmarking presented here on a publicly available dataset demonstrated that GEM can facilitate reliable genome-wide methQTL and GxE analysis on a standard laptop computer within minutes. CONCLUSIONS The GEM package facilitates efficient EWAS in large cohorts. It is written in R and can be freely downloaded from Bioconductor at https://www.bioconductor.org/packages/GEM/.
Affiliation(s)
- Hong Pan
  - Singapore Institute for Clinical Sciences (SICS), Agency for Science Technology and Research (A*STAR), Singapore, 117609, Singapore
  - School of Computer Science and Engineering, Nanyang Technological University (NTU), Singapore, 639798, Singapore
- Joanna D Holbrook
  - Singapore Institute for Clinical Sciences (SICS), Agency for Science Technology and Research (A*STAR), Singapore, 117609, Singapore
- Neerja Karnani
  - Singapore Institute for Clinical Sciences (SICS), Agency for Science Technology and Research (A*STAR), Singapore, 117609, Singapore
  - Yong Loo Lin School of Medicine, National University of Singapore (NUS), Singapore, 119228, Singapore
- Chee Keong Kwoh
  - School of Computer Science and Engineering, Nanyang Technological University (NTU), Singapore, 639798, Singapore