1
Al-Aamri A, Kamarul Azman S, Daw Elbait G, Alsafar H, Henschel A. Critical assessment of on-premise approaches to scalable genome analysis. BMC Bioinformatics 2023; 24:354. [PMID: 37735350 PMCID: PMC10512525 DOI: 10.1186/s12859-023-05470-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2023] [Accepted: 09/08/2023] [Indexed: 09/23/2023]
Abstract
BACKGROUND Plummeting DNA sequencing costs in recent years have enabled genome sequencing projects to scale up by several orders of magnitude, which is transforming genomics into a highly data-intensive field of research. This development provides the much-needed statistical power required for genotype-phenotype predictions in complex diseases. METHODS To efficiently leverage this wealth of information, we assessed several genomic data science tools. We focused on on-premise installations to cope with situations where data confidentiality, compliance regulations, and similar constraints rule out cloud-based solutions. We established a comprehensive qualitative and quantitative comparison between BCFtools, SnpSift, Hail, GEMINI, and OpenCGA. The tools were compared in terms of data storage technology, query speed, scalability, annotation, data manipulation, visualization, data output representation, and availability. RESULTS Tools that leverage sophisticated data structures proved the most suitable for large-scale projects, with varying degrees of scalability, compared with flat-file manipulation tools (e.g., BCFtools and SnpSift). Remarkably, for small to mid-size projects, even lightweight relational database. CONCLUSION The assessment criteria provide insights into the typical questions posed in scalable genomics and serve as guidance for the development of scalable computational infrastructure in genomics.
Affiliation(s)
- Amira Al-Aamri
  - Department of Electrical Engineering and Computer Science, College of Engineering, Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates
- Syafiq Kamarul Azman
  - Department of Electrical Engineering and Computer Science, College of Engineering, Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates
- Gihan Daw Elbait
  - Department of Biology, College of Arts and Sciences, Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates
  - Center for Biotechnology (BTC), Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates
- Habiba Alsafar
  - Center for Biotechnology (BTC), Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates
  - Department of Biomedical Engineering, Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates
- Andreas Henschel
  - Department of Electrical Engineering and Computer Science, College of Engineering, Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates
  - Center for Biotechnology (BTC), Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates
2
Magdy Mohamed Abdelaziz Barakat S, Sallehuddin R, Yuhaniz SS, R. Khairuddin RF, Mahmood Y. Genome assembly composition of the String "ACGT" array: a review of data structure accuracy and performance challenges. PeerJ Comput Sci 2023; 9:e1180. [PMID: 37547391 PMCID: PMC10403225 DOI: 10.7717/peerj-cs.1180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/07/2022] [Accepted: 04/27/2023] [Indexed: 08/08/2023]
Abstract
Background Advances in sequencing technology are increasing the number of genomes being sequenced. However, obtaining a high-quality genome sequence remains a challenge, because genome assembly must piece together a massive number of short strings (reads) in the presence of repetitive sequences (repeats). Computer algorithms for genome assembly construct the entire genome from reads using two approaches. The de novo approach concatenates the reads based on exact matches between their suffixes and prefixes (overlapping). The reference-guided approach orders the reads based on their offsets in a well-known reference genome (read alignment). The presence of repeats increases ambiguity, leaving the algorithm unable to distinguish between reads, which results in misassembly and reduces assembly accuracy. In addition, the massive number of reads poses a major challenge for assembly performance. Method Repeat identification methods were introduced to address misassembly by identifying repetitive sequences in advance and creating a repeat knowledge base that reduces ambiguity during the assembly process, thus enhancing the accuracy of the assembled genome. Hybridization between assembly approaches also resulted in a lower degree of misassembly with the aid of the reference genome. Assembly performance is optimized through data structure indexing and parallelization. The primary aim and contribution of this article is an extensive review that eases researchers' search for genome assembly studies. The study also highlights the most recent developments and limitations in genome assembly accuracy and performance optimization. Results Our findings show the limitations of the available repeat identification methods, which can detect only repeats of specific lengths and may not perform well when various types of repeats are present in a genome. We also found that most hybrid assembly approaches, whether starting with de novo or reference-guided assembly, have limitations in handling repetitive sequences because doing so is computationally costly and time-intensive. Although the hybrid approach was found to outperform individual assembly approaches, optimizing its performance remains a challenge. Moreover, parallelization of overlapping and read alignment has yet to be fully implemented in the hybrid assembly approach. Conclusion We suggest combining multiple repeat identification methods to identify repeats more accurately as an initial step in the hybrid assembly approach, and combining genome indexing with parallelization to better optimize its performance.
Affiliation(s)
- Roselina Sallehuddin
  - Computer Science, School of Computing, Faculty of Engineering, Universiti Teknologi Malaysia, Skudai, Johor, Malaysia
- Siti Sophiayati Yuhaniz
  - Advanced Informatics Department, Razak Faculty of Technology and Informatics, Universiti Teknologi Malaysia, Kuala Lumpur, Kuala Lumpur, Malaysia
- Yasir Mahmood
  - Faculty of Information Technology, The University of Lahore, Lahore, Lahore, Pakistan
3
Best S, Long JC, Braithwaite J, Taylor N. Standardizing variation: Scaling up clinical genomics in Australia. Genet Med 2023; 25:100109. [PMID: 35115231 DOI: 10.1016/j.gim.2022.01.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/17/2021] [Revised: 01/03/2022] [Accepted: 01/06/2022] [Indexed: 02/07/2023]
Abstract
PURPOSE Clinical genomics demands close interaction of physicians, laboratory scientists, and genetic professionals. Taking genomics to scale requires an understanding of the underlying processes from the perspective of nongenetic physicians who are new to the field. We identified components of the processes amenable to adaptation when scaling up clinical genomics. METHODS Semistructured interviews informed by the Theoretical Domains Framework with nongenetic physicians, who were using clinical genomics in practice, were guided by an annotated process map with 7 steps following the patient's journey. Findings from the individual maps were synthesized into an overview process map and a series of individual maps by common location and specialty. Interviews were analyzed using the Theoretical Domains Framework. RESULTS In total, 16 nongenetic physicians (eg, nephrologists, immunologists) participated, generating 1 overview and 10 individual process maps. Sixteen common steps were identified across clinical specialties and locations, with variations over 9 steps. We report the potential for standardization across these 9 steps. CONCLUSION When scaling up complex interventions, it is essential to identify steps where variation can be accommodated. With these results we show how process mapping can be used to identify steps where variation is acceptable during scale up to accommodate adaptation to local context, allowing for the inevitable evolution of factors influencing ongoing implementation and sustainability.
Affiliation(s)
- Stephanie Best
  - Australian Institute of Health Innovation, Faculty of Medicine, Health and Human Sciences, Macquarie University, Sydney, New South Wales, Australia
  - Australian Genomics, Murdoch Children's Research Institute, Melbourne, Victoria, Australia
- Janet C Long
  - Australian Institute of Health Innovation, Faculty of Medicine, Health and Human Sciences, Macquarie University, Sydney, New South Wales, Australia
- Jeffrey Braithwaite
  - Australian Institute of Health Innovation, Faculty of Medicine, Health and Human Sciences, Macquarie University, Sydney, New South Wales, Australia
- Natalie Taylor
  - School of Population Health, UNSW Sydney, Sydney, New South Wales, Australia
4
Merhi G, Koweyes J, Salloum T, Khoury CA, Haidar S, Tokajian S. SARS-CoV-2 genomic epidemiology: data and sequencing infrastructure. Future Microbiol 2022; 17:1001-1007. [PMID: 35899481 PMCID: PMC9332909 DOI: 10.2217/fmb-2021-0207] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/29/2022]
Abstract
Background: Genomic surveillance of SARS-CoV-2 is critical in monitoring viral lineages. Available data reveal a significant gap between low- and middle-income countries and the rest of the world. Methods: The SARS-CoV-2 sequencing costs using the Oxford Nanopore MinION device and hardware prices for data computation in Lebanon were estimated and compared with those in developed countries. SARS-CoV-2 genomes deposited on the Global Initiative on Sharing All Influenza Data per 1000 COVID-19 cases were determined per country. Results: Sequencing costs in Lebanon were significantly higher compared with those in developed countries. Low- and middle-income countries showed limited sequencing capabilities linked to the lack of support, high prices, long delivery delays and limited availability of trained personnel. Conclusion: The authors recommend the mobilization of funds to develop whole-genome sequencing-based surveillance platforms and the implementation of genomic epidemiology to better identify and track outbreaks, leading to appropriate and mindful interventions.
Lebanon and other low- and middle-income countries have limited sequencing capabilities. Sequencing costs using MinION in Lebanon were higher than the approximate sequencing costs in developed countries. The challenges faced by low- and middle-income countries include lack of support, few established sequencing facilities, high prices, long delivery delays and the limited availability of trained personnel. There is a need to focus on the development of whole-genome sequencing-based surveillance platforms and the implementation of genomic epidemiology to improve sequencing efforts in many resource-limited settings and to contain and prevent future pandemic-level outbreaks.
Sequencing costs of #SARS-CoV-2 in Lebanon are higher than those in developed countries. #LMICs have limited #sequencing capabilities. Whole-genome sequencing-based surveillance platforms and the implementation of genomic epidemiology could improve sequencing efforts.
Affiliation(s)
- Georgi Merhi
  - Department of Natural Sciences, School of Arts & Sciences, Lebanese American University, Byblos, Lebanon
- Jad Koweyes
  - Department of Natural Sciences, School of Arts & Sciences, Lebanese American University, Byblos, Lebanon
- Tamara Salloum
  - Department of Natural Sciences, School of Arts & Sciences, Lebanese American University, Byblos, Lebanon
- Charbel Al Khoury
  - Department of Natural Sciences, School of Arts & Sciences, Lebanese American University, Byblos, Lebanon
- Siwar Haidar
  - Department of Natural Sciences, School of Arts & Sciences, Lebanese American University, Byblos, Lebanon
- Sima Tokajian
  - Department of Natural Sciences, School of Arts & Sciences, Lebanese American University, Byblos, Lebanon
5
Alharbi WS, Rashid M. A review of deep learning applications in human genomics using next-generation sequencing data. Hum Genomics 2022; 16:26. [PMID: 35879805 PMCID: PMC9317091 DOI: 10.1186/s40246-022-00396-x] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 11/24/2021] [Accepted: 07/12/2022] [Indexed: 12/02/2022]
Abstract
Genomics is advancing towards data-driven science. With the advent of high-throughput data-generating technologies in human genomics, we are overwhelmed by the volume of genomic data. Artificial intelligence, especially deep learning, has been instrumental in extracting knowledge and patterns from these data. In the current review, we address the development and application of deep learning methods and models in different subareas of human genomics. We assessed which areas of genomics are over- and under-charted by deep learning techniques. The deep learning algorithms underlying the genomic tools are discussed briefly in the later part of this review. Finally, we briefly discuss the latest applications of deep learning tools in genomics. In conclusion, this review is timely for biotechnology and genomics scientists, guiding them on why, when and how to use deep learning methods to analyse human genomic data.
Affiliation(s)
- Wardah S Alharbi
  - Department of AI and Bioinformatics, King Abdullah International Medical Research Center (KAIMRC), King Saud Bin Abdulaziz University for Health Sciences (KSAU-HS), King Abdulaziz Medical City, Ministry of National Guard Health Affairs, P.O. Box 22490, Riyadh, 11426, Saudi Arabia
- Mamoon Rashid
  - Department of AI and Bioinformatics, King Abdullah International Medical Research Center (KAIMRC), King Saud Bin Abdulaziz University for Health Sciences (KSAU-HS), King Abdulaziz Medical City, Ministry of National Guard Health Affairs, P.O. Box 22490, Riyadh, 11426, Saudi Arabia
6
Auwerx C, Sadler MC, Reymond A, Kutalik Z. From Pharmacogenetics to Pharmaco-Omics: Milestones and Future Directions. HGG ADVANCES 2022; 3:100100. [PMID: 35373152 PMCID: PMC8971318 DOI: 10.1016/j.xhgg.2022.100100] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Indexed: 11/04/2022]
Abstract
The origins of pharmacogenetics date back to the 1950s, when it was established that inter-individual differences in drug response are partially determined by genetic factors. Since then, pharmacogenetics has grown into its own field, motivated by the translation of identified gene-drug interactions into therapeutic applications. Despite numerous challenges ahead, our understanding of the human pharmacogenetic landscape has greatly improved thanks to the integration of tools originating from disciplines as diverse as biochemistry, molecular biology, statistics, and computer sciences. In this review, we discuss past, present, and future developments of pharmacogenetics methodology, focusing on three milestones: how early research established the genetic basis of drug responses, how technological progress made it possible to assess the full extent of pharmacological variants, and how multi-dimensional omics datasets can improve the identification, functional validation, and mechanistic understanding of the interplay between genes and drugs. We outline novel strategies to repurpose and integrate molecular and clinical data originating from biobanks to gain insights analogous to those obtained from randomized controlled trials. Emphasizing the importance of increased diversity, we envision future directions for the field that should pave the way to the clinical implementation of pharmacogenetics.
7
Abstract
The risk of emergence and spread of novel human pathogens originating from an animal reservoir has increased in the past decades. However, the unpredictable nature of disease emergence makes surveillance and preparedness challenging. Knowledge of general risk factors for emergence and spread, combined with local level data is needed to develop a risk-based methodology for early detection. This involves the implementation of the One Health approach, integrating human, animal and environmental health sectors, as well as social sciences, bioinformatics and more. Recent technical advances, such as metagenomic sequencing, will aid the rapid detection of novel pathogens on the human-animal interface.
8
Abstract
Genomics is both a data- and compute-intensive discipline. The success of genomics depends on an adequate informatics infrastructure that can address growing data demands and enable a diverse range of resource-intensive computational activities. Designing a suitable infrastructure is a challenging task, and its success largely depends on its adoption by users. In this article, we take a user-centric view of genomics, where users are bioinformaticians, computational biologists, and data scientists. We consider, from their point of view, how traditional computational activities for genomics are expanding due to data growth, as well as the introduction of big data and cloud technologies. The changing landscape of computational activities and new user requirements will influence the design of future genomics infrastructures.
Affiliation(s)
- Ritesh Krishna
  - IBM Research Europe, The Hartree Centre STFC Laboratory, Warrington WA4 4AD, UK
- Vadim Elisseev
  - IBM Research Europe, The Hartree Centre STFC Laboratory, Warrington WA4 4AD, UK