1. Betschart RO, Riccio C, Aguilera-Garcia D, Blankenberg S, Guo L, Moch H, Seidl D, Solleder H, Thalén F, Thiéry A, Twerenbold R, Zeller T, Zoche M, Ziegler A. Biostatistical Aspects of Whole Genome Sequencing Studies: Preprocessing and Quality Control. Biom J 2024; 66:e202300278. [PMID: 38988195 DOI: 10.1002/bimj.202300278] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2023] [Revised: 03/21/2024] [Accepted: 05/14/2024] [Indexed: 07/12/2024]
Abstract
Rapid advances in high-throughput DNA sequencing technologies have enabled large-scale whole genome sequencing (WGS) studies. Before performing association analysis between phenotypes and genotypes, preprocessing and quality control (QC) of the raw sequence data need to be performed. Because many biostatisticians have not worked with WGS data so far, we first sketch Illumina's short-read sequencing technology. Second, we explain the general preprocessing pipeline for WGS studies. Third, we provide an overview of important QC metrics, which are applied to WGS data: on the raw data, after mapping and alignment, after variant calling, and after multisample variant calling. Fourth, we illustrate the QC with the data from the GENEtic SequencIng Study Hamburg-Davos (GENESIS-HD), a study involving more than 9000 human whole genomes. All samples were sequenced on an Illumina NovaSeq 6000 with an average coverage of 35× using a PCR-free protocol. For QC, one Genome in a Bottle (GIAB) trio was sequenced in four replicates, and one GIAB sample was successfully sequenced 70 times in different runs. Fifth, we provide empirical data on the compression of raw data using the DRAGEN original read archive (ORA). The most important quality metrics in the application were genetic similarity, sample cross-contamination, deviations from the expected Het/Hom ratio, relatedness, and coverage. The compression ratio of the raw files using DRAGEN ORA was 5.6:1, and compression time was linear in genome coverage. In summary, the preprocessing, joint calling, and QC of large WGS studies are feasible within a reasonable time, and efficient QC procedures are readily available.
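Of the QC metrics named in this abstract, the Het/Hom ratio is simple enough to illustrate. The following is a minimal Python sketch, not the GENESIS-HD pipeline: it assumes genotypes arrive as VCF-style GT strings and uses one common definition of the ratio (heterozygous calls divided by homozygous-alternate calls). The function name `het_hom_ratio` and the toy genotypes are hypothetical.

```python
from typing import Iterable


def het_hom_ratio(genotypes: Iterable[str]) -> float:
    """Ratio of heterozygous to homozygous-alternate calls for one sample.

    Assumes VCF-style GT strings such as "0/1" or "1|1". Missing calls
    ("./.") and homozygous-reference calls ("0/0") are ignored.
    """
    het = 0
    hom_alt = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if len(alleles) != 2 or "." in alleles:
            continue  # skip missing or non-diploid calls
        first, second = alleles
        if first != second:
            het += 1          # two different alleles: heterozygous
        elif first != "0":
            hom_alt += 1      # same non-reference allele twice
    if hom_alt == 0:
        raise ValueError("no homozygous-alternate calls; ratio undefined")
    return het / hom_alt


# Toy genotypes (made up for illustration, not study data)
sample = ["0/1", "1/1", "0/0", "0|1", "./.", "1/2", "1|1"]
print(het_hom_ratio(sample))
```

In a QC setting, a per-sample value far from the cohort's typical ratio is what would be flagged; the threshold is study-specific and is not given in the abstract.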
Affiliation(s)
- Domingo Aguilera-Garcia
  - Institute of Pathology and Molecular Pathology, University Hospital Zurich, Zurich, Switzerland
- Stefan Blankenberg
  - Cardio-CARE, Medizincampus Davos, Davos, Switzerland
  - Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
  - Center for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Linlin Guo
  - Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Holger Moch
  - Institute of Pathology and Molecular Pathology, University Hospital Zurich, Zurich, Switzerland
- Dagmar Seidl
  - Institute of Pathology and Molecular Pathology, University Hospital Zurich, Zurich, Switzerland
- Hugo Solleder
  - Cardio-CARE, Medizincampus Davos, Davos, Switzerland
- Felix Thalén
  - Cardio-CARE, Medizincampus Davos, Davos, Switzerland
- Raphael Twerenbold
  - Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
  - Center for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
  - German Center for Cardiovascular Research (DZHK), partner site Hamburg/Kiel/Lübeck, Hamburg, Germany
- Tanja Zeller
  - Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
  - Center for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
  - German Center for Cardiovascular Research (DZHK), partner site Hamburg/Kiel/Lübeck, Hamburg, Germany
- Martin Zoche
  - Institute of Pathology and Molecular Pathology, University Hospital Zurich, Zurich, Switzerland
- Andreas Ziegler
  - Cardio-CARE, Medizincampus Davos, Davos, Switzerland
  - Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
  - Center for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
  - School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, South Africa
2. Ayalew W, Xiaoyun W, Tarekegn GM, Naboulsi R, Sisay Tessema T, Van Damme R, Bongcam-Rudloff E, Chu M, Liang C, Edea Z, Enquahone S, Ping Y. Whole genome sequences of 70 indigenous Ethiopian cattle. Sci Data 2024; 11:584. [PMID: 38839789 PMCID: PMC11153504 DOI: 10.1038/s41597-024-03342-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/14/2024] [Accepted: 05/02/2024] [Indexed: 06/07/2024] Open
Abstract
Indigenous animal genetic resources play a crucial role in preserving global genetic diversity and supporting the livelihoods of millions of people. In Ethiopia, the majority of the cattle population consists of indigenous breeds. Understanding the genetic architecture of these cattle breeds is essential for effective management and conservation efforts. In this study, we sequenced DNA samples from 70 animals from seven indigenous cattle breeds, generating about two terabytes of paired-end reads with an average coverage of 14×. The sequencing data were pre-processed and mapped to the cattle reference genome (ARS-UCD1.2) with an alignment rate of 99.2%. Finally, the variant calling process produced approximately 35 million high-quality SNPs. These data provide a deeper understanding of the genetic landscape, facilitate the identification of causal mutations, and enable the exploration of evolutionary patterns to assist cattle improvement and sustainable utilization, particularly in the face of unpredictable climate changes.
Affiliation(s)
- Wondossen Ayalew
  - Key Laboratory of Animal Genetics and Breeding on Tibetan Plateau, Ministry of Agriculture and Rural Affairs, Key Laboratory of Yak Breeding Engineering, Lanzhou Institute of Husbandry and Pharmaceutical Sciences, Chinese Academy of Agricultural Sciences, Lanzhou, 730050, P.R. China
  - Institute of Biotechnology, Addis Ababa University, Addis Ababa P.O. Box 1176, Addis Ababa, Ethiopia
- Wu Xiaoyun
  - Key Laboratory of Animal Genetics and Breeding on Tibetan Plateau, Ministry of Agriculture and Rural Affairs, Key Laboratory of Yak Breeding Engineering, Lanzhou Institute of Husbandry and Pharmaceutical Sciences, Chinese Academy of Agricultural Sciences, Lanzhou, 730050, P.R. China
- Getinet Mekuriaw Tarekegn
  - Institute of Biotechnology, Addis Ababa University, Addis Ababa P.O. Box 1176, Addis Ababa, Ethiopia
  - Scotland's Rural College (SRUC), Roslin Institute Building, University of Edinburgh, Edinburgh, EH25 9RG, UK
- Rakan Naboulsi
  - Childhood Cancer Research Unit, Department of Women's and Children's Health, Karolinska Institute, Tomtebodavägen 18A, 17177, Stockholm, Sweden
- Tesfaye Sisay Tessema
  - Institute of Biotechnology, Addis Ababa University, Addis Ababa P.O. Box 1176, Addis Ababa, Ethiopia
- Renaud Van Damme
  - Department of Animal Biosciences, Swedish University of Agricultural Sciences, 75007, Uppsala, Sweden
- Erik Bongcam-Rudloff
  - Department of Animal Biosciences, Swedish University of Agricultural Sciences, 75007, Uppsala, Sweden
- Min Chu
  - Key Laboratory of Animal Genetics and Breeding on Tibetan Plateau, Ministry of Agriculture and Rural Affairs, Key Laboratory of Yak Breeding Engineering, Lanzhou Institute of Husbandry and Pharmaceutical Sciences, Chinese Academy of Agricultural Sciences, Lanzhou, 730050, P.R. China
- Chunnian Liang
  - Key Laboratory of Animal Genetics and Breeding on Tibetan Plateau, Ministry of Agriculture and Rural Affairs, Key Laboratory of Yak Breeding Engineering, Lanzhou Institute of Husbandry and Pharmaceutical Sciences, Chinese Academy of Agricultural Sciences, Lanzhou, 730050, P.R. China
- Zewdu Edea
  - Ethiopian Bio and Emerging Technology Institute, Addis Ababa, Ethiopia
- Solomon Enquahone
  - Scotland's Rural College (SRUC), Roslin Institute Building, University of Edinburgh, Edinburgh, EH25 9RG, UK
- Yan Ping
  - Key Laboratory of Animal Genetics and Breeding on Tibetan Plateau, Ministry of Agriculture and Rural Affairs, Key Laboratory of Yak Breeding Engineering, Lanzhou Institute of Husbandry and Pharmaceutical Sciences, Chinese Academy of Agricultural Sciences, Lanzhou, 730050, P.R. China
3. Ramirez-Ramirez AR, Bidot-Martínez I, Mirzaei K, Rasoamanalina Rivo OL, Menéndez-Grenot M, Clapé-Borges P, Espinosa-Lopez G, Bertin P. Comparing the performances of SSR and SNP markers for population analysis in Theobroma cacao L., as alternative approach to validate a new ddRADseq protocol for cacao genotyping. PLoS One 2024; 19:e0304753. [PMID: 38820504 PMCID: PMC11142705 DOI: 10.1371/journal.pone.0304753] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2024] [Accepted: 05/18/2024] [Indexed: 06/02/2024] Open
Abstract
Proper cacao (Theobroma cacao L.) plant genotyping is mandatory for the conservation and use of the species' genetic resources. A set of 15 international standard SSR markers was adopted as the universal cacao genotyping system. Recently, different SNPs and SNP genotyping techniques have been exploited in cacao. However, a consensus on which to use has not been reached yet, driving the search for new approaches. To validate a new ddRADseq protocol for cacao genotyping, we compared the performances for population analysis of a dataset with 7,880 SNPs obtained from ddRADseq and the genotypic data from the aforementioned SSR set, using 158 cacao plants from productive farms and a gene bank. Four genetic groups were identified with the STRUCTURE and ADMIXTURE software using SSR and SNP data, respectively. Similarities of cacao ancestries among these groups allowed the identification of analogous pairs of groups of individuals, referred to as G1SSR/G1SNP, G2SSR/G2SNP, G3SSR/G3SNP, and G4SSR/G4SNP, whether SSRs or SNPs were used. Both marker systems identified Amelonado and Criollo as the most abundant cacao ancestries among all samples. Genetic distance matrices from both data types were significantly similar to each other according to the Mantel test (p < 0.0001). PCoA and UPGMA clustering mostly confirmed the identified genetic groups. AMOVA and pairwise FST comparisons revealed moderate to very large genetic differentiation among the groups identified from SSR and SNP data. Genetic diversity parameters from SSR data (Hobs = 0.616, Hexp = 0.524 and PIC = 0.544) were higher than those from SNP data (0.288, 0.264, 0.230). In both cases, genetic groups carrying the highest Amelonado proportion (G1SSR and G1SNP) had the lowest genetic diversity parameters among the identified groups. The high congruence among population analysis results using both systems validated the ddRADseq protocol employed for cacao SNP genotyping. These results could provide new ways for developing a universal SNP-based genotyping system very much needed for cacao genetic studies.
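The diversity parameters reported in this abstract (Hexp and PIC) are functions of the allele frequencies at a locus. As a hedged illustration, not the authors' code, the Python sketch below uses the textbook definitions: expected heterozygosity He = 1 − Σ p_i², and the standard polymorphism information content PIC = 1 − Σ p_i² − Σ_{i<j} 2 p_i² p_j². The function names and the allele frequencies are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def expected_heterozygosity(freqs: Sequence[float]) -> float:
    """He = 1 - sum(p_i^2) for the allele frequencies at one locus."""
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs: Sequence[float]) -> float:
    """Polymorphism information content for one locus.

    PIC = 1 - sum(p_i^2) - sum over allele pairs i<j of 2 * p_i^2 * p_j^2
    """
    homozygosity = sum(p * p for p in freqs)
    cross_term = sum(2.0 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - cross_term


# Hypothetical loci: a biallelic SNP and a four-allele SSR, all alleles equally frequent
snp = [0.5, 0.5]
ssr = [0.25, 0.25, 0.25, 0.25]
print(expected_heterozygosity(snp), pic(snp))
print(expected_heterozygosity(ssr), pic(ssr))
```

A biallelic locus cannot exceed He = 0.5 and PIC = 0.375 under these definitions, which is one general reason multi-allelic SSRs tend to show higher per-locus values than SNPs.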
Affiliation(s)
- Angel Rafael Ramirez-Ramirez
  - Faculty of Agroforestry, University of Guantánamo, Guantánamo, Cuba
  - Earth and Life Institute, Université catholique de Louvain, Louvain-la-neuve, Belgium
- Khaled Mirzaei
  - Earth and Life Institute, Université catholique de Louvain, Louvain-la-neuve, Belgium
- Miguel Menéndez-Grenot
  - Instituto de Investigaciones Agroforestales, Unidad de Ciencia y Técnica de Base—Baracoa, Baracoa, Guantánamo, Cuba
- Pablo Clapé-Borges
  - Instituto de Investigaciones Agroforestales, Unidad de Ciencia y Técnica de Base—Baracoa, Baracoa, Guantánamo, Cuba
- Pierre Bertin
  - Earth and Life Institute, Université catholique de Louvain, Louvain-la-neuve, Belgium
4. Malamon JS, Farrell JJ, Xia LC, Dombroski BA, Das RG, Way J, Kuzma AB, Valladares O, Leung YY, Scanlon AJ, Lopez IAB, Brehony J, Worley KC, Zhang NR, Wang LS, Farrer LA, Schellenberg GD, Lee WP, Vardarajan BN. A comparative study of structural variant calling in WGS from Alzheimer's disease families. Life Sci Alliance 2024; 7:e202302181. [PMID: 38418088 PMCID: PMC10902710 DOI: 10.26508/lsa.202302181] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2023] [Revised: 02/07/2024] [Accepted: 02/08/2024] [Indexed: 03/01/2024] Open
Abstract
Detecting structural variants (SVs) in whole-genome sequencing poses significant challenges. We present a protocol for variant calling, merging, genotyping, sensitivity analysis, and laboratory validation for generating a high-quality SV call set in whole-genome sequencing from the Alzheimer's Disease Sequencing Project comprising 578 individuals from 111 families. Employing two complementary pipelines, Scalpel and Parliament, for SV/indel calling, we assessed sensitivity through sample replicates (N = 9) with in silico variant spike-ins. We developed a novel metric, D-score, to evaluate caller specificity for deletions. The accuracy of deletions was evaluated by Sanger sequencing. We generated a high-quality call set of 152,301 deletions of diverse sizes. Sanger sequencing validated 114 of 146 detected deletions (78.1%). Scalpel excelled in accuracy for deletions ≤100 bp, whereas Parliament was optimal for deletions >900 bp. Overall, 83.0% and 72.5% of calls by Scalpel and Parliament were validated, respectively, including all 11 deletions called by both Parliament and Scalpel between 101 and 900 bp. Our flexible protocol successfully generated a high-quality deletion call set and a truth set of Sanger sequencing-validated deletions with precise breakpoints spanning 1-17,000 bp.
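The validation figure in this abstract (114 of 146 deletions, 78.1%) is a simple proportion. The sketch below reproduces that arithmetic in Python and, as an addition that is not in the source, attaches a Wilson score interval to indicate how much sampling uncertainty a proportion of this size carries. The function names and the choice of z = 1.96 are assumptions made for illustration.

```python
from math import sqrt
from typing import Tuple


def validation_rate(validated: int, tested: int) -> float:
    """Fraction of tested calls that were confirmed."""
    if tested <= 0:
        raise ValueError("tested must be positive")
    return validated / tested


def wilson_interval(validated: int, tested: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 ~ 95%)."""
    p = validation_rate(validated, tested)
    denom = 1.0 + z * z / tested
    centre = (p + z * z / (2.0 * tested)) / denom
    half_width = z * sqrt(p * (1.0 - p) / tested + z * z / (4.0 * tested * tested)) / denom
    return centre - half_width, centre + half_width


# Counts taken from the abstract: 114 validated out of 146 tested deletions
rate = validation_rate(114, 146)
low, high = wilson_interval(114, 146)
print(f"validated: {rate:.1%}")
print(f"Wilson interval (not reported in the abstract): {low:.1%} to {high:.1%}")
```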
Affiliation(s)
- John S Malamon
  - Department of Pathology and Laboratory Medicine, Penn Neurodegeneration Genomics Center, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- John J Farrell
  - Biomedical Genetics Section, Department of Medicine, Boston University School of Medicine, Boston University, Boston, MA, USA
- Li Charlie Xia
  - https://ror.org/03mtd9a03 Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA
  - Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA, USA
- Beth A Dombroski
  - Department of Pathology and Laboratory Medicine, Penn Neurodegeneration Genomics Center, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Rueben G Das
  - Department of Pathology and Laboratory Medicine, Penn Neurodegeneration Genomics Center, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Jessica Way
  - Broad Institute, Massachusetts Institute of Technology, Cambridge, MA, USA
- Amanda B Kuzma
  - Department of Pathology and Laboratory Medicine, Penn Neurodegeneration Genomics Center, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Otto Valladares
  - Department of Pathology and Laboratory Medicine, Penn Neurodegeneration Genomics Center, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Yuk Yee Leung
  - Department of Pathology and Laboratory Medicine, Penn Neurodegeneration Genomics Center, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Allison J Scanlon
  - Department of Pathology and Laboratory Medicine, Penn Neurodegeneration Genomics Center, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Irving Antonio Barrera Lopez
  - Department of Pathology and Laboratory Medicine, Penn Neurodegeneration Genomics Center, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Jack Brehony
  - Department of Pathology and Laboratory Medicine, Penn Neurodegeneration Genomics Center, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Kim C Worley
  - https://ror.org/02pttbw34 Human Genome Sequencing Center, and Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
- Nancy R Zhang
  - Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA, USA
- Li-San Wang
  - Department of Pathology and Laboratory Medicine, Penn Neurodegeneration Genomics Center, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Lindsay A Farrer
  - Biomedical Genetics Section, Department of Medicine, Boston University School of Medicine, Boston University, Boston, MA, USA
  - Departments of Neurology and Ophthalmology, Boston University School of Medicine, Boston University, Boston, MA, USA
  - Departments of Epidemiology and Biostatistics, Boston University School of Public Health, Boston, MA, USA
- Gerard D Schellenberg
  - Department of Pathology and Laboratory Medicine, Penn Neurodegeneration Genomics Center, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Wan-Ping Lee
  - Department of Pathology and Laboratory Medicine, Penn Neurodegeneration Genomics Center, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Badri N Vardarajan
  - https://ror.org/01esghr10 Gertrude H. Sergievsky Center and Taub Institute of Aging Brain, Department of Neurology, Columbia University Medical Center, New York, NY, USA
5. Ramirez-Ramirez AR, Mirzaei K, Menéndez-Grenot M, Clapé-Borges P, Espinosa-Lopéz G, Bidot-Martínez I, Bertin P. Using ddRADseq to assess the genetic diversity of in-farm and gene bank cacao resources in the Baracoa region, eastern Cuba, for use and conservation purposes. Front Plant Sci 2024; 15:1367632. [PMID: 38504901 PMCID: PMC10948478 DOI: 10.3389/fpls.2024.1367632] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2024] [Accepted: 02/12/2024] [Indexed: 03/21/2024]
Abstract
The Baracoa region, eastern Cuba, hosts around 80% of the country's cacao (Theobroma cacao L.) plantations. Cacao plants in farms are diverse in origin and propagation, with grafted and hybrid plants being the most common. Less frequent are plants from cuttings, TSH progeny, and traditional Cuban cacao. A national cacao gene bank is also present in Baracoa, with 282 accessions either prospected in Cuba or introduced from other countries. A breeding program associated with the gene bank started in the 1990s based on agro-morphological descriptors. The genetic diversity of cacao resources in Baracoa has been poorly described, except for traditional Cuban cacao, affecting the proper development of the breeding program and the cacao planting policies in the region. To assess the population structure and genetic diversity of cacao resources in the Baracoa region, we genotyped plants from both the cacao gene bank (CG) and cacao farms (CF) applying a new ddRADseq protocol for cacao. After data processing, two SNP datasets containing 11,425 and 6,481 high-quality SNPs were generated with 238 CG and 135 CF plants, respectively. SNPs were unevenly distributed along the 10 cacao chromosomes and lay mainly in noncoding regions of the genome. Population structure analysis with these SNP datasets identified seven and four genetic groups in CG and CF samples, respectively. Clustering using UPGMA and principal component analysis mostly agreed with the population structure results. Amelonado was the predominant cacao ancestry, accounting for 49.22% (CG) and 57.73% (CF) of the total. Criollo, Contamana, Iquitos, and Nanay ancestries were detected in both CG and CF samples, while Nacional and Marañon backgrounds were only identified in CG. Genetic differentiation among CG genetic groups (FST ranging from 0.071 to 0.407) was higher than among CF genetic groups (FST: 0.093-0.282). Genetic diversity parameters showed similar values for CG and CF samples. The CG and CF genetic groups with the lowest genetic diversity parameters had the highest proportion of Amelonado ancestry. These results should contribute to reinforcing the ongoing breeding program and updating the planting policies on cacao farms, with an impact on the social and economic life of the region.
Affiliation(s)
- Angel Rafael Ramirez-Ramirez
  - Faculty of Agroforestry, University of Guantánamo, Guantánamo, Cuba
  - Earth and Life Institute, Université catholique de Louvain (UCLouvain), Louvain-la-neuve, Belgium
- Khaled Mirzaei
  - Earth and Life Institute, Université catholique de Louvain (UCLouvain), Louvain-la-neuve, Belgium
- Miguel Menéndez-Grenot
  - Unidad de Ciencia y Técnica de Base-Baracoa / Instituto de Investigaciones Agroforestales (UCTBBaracoa / INAF), Baracoa, Cuba
- Pablo Clapé-Borges
  - Unidad de Ciencia y Técnica de Base-Baracoa / Instituto de Investigaciones Agroforestales (UCTBBaracoa / INAF), Baracoa, Cuba
- Pierre Bertin
  - Earth and Life Institute, Université catholique de Louvain (UCLouvain), Louvain-la-neuve, Belgium
6. Boumajdi N, Bendani H, Kartti S, Alouane T, Belyamani L, Ibrahimi A. A Comprehensive Analysis of 3 Moroccan Genomes Revealed Contributions From Both African and European Ancestries. Evol Bioinform Online 2024; 20:11769343241229278. [PMID: 38327511 PMCID: PMC10848790 DOI: 10.1177/11769343241229278] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2023] [Accepted: 01/12/2024] [Indexed: 02/09/2024] Open
Abstract
Genetic variations in the human genome represent the differences in DNA sequence within individuals. This highlights the important role of whole human genome sequencing, which has become the keystone for precision medicine and disease prediction. Morocco is an important hub for studying human population migration and mixing history. This study presents the analysis of 3 Moroccan genomes; the variant analysis revealed 6 379 606 single nucleotide variants (SNVs) and 1 050 577 small InDels. Of those identified SNVs, 219 152 were novel, with 1233 occurring in coding regions, and 5580 non-synonymous single nucleotide variants (nsSNPs) were predicted to affect protein functions. The InDels produced 1055 coding variants and 454 non-3n length variants, and their size ranged from -49 to 49 bp. We further analysed the gene pathways of 8 novel coding variants found in the 3 genomes and revealed 5 genes involved in various diseases and biological pathways. We found that the Moroccan genomes share 92.78% of African ancestry, and 92.86% of Non-Finnish European ancestry, according to the gnomAD database. Then, population structure inference, by admixture analysis and network-based approach, revealed that the studied genomes form a mixed population structure, highlighting the increased genetic diversity in Morocco.
Affiliation(s)
- Nasma Boumajdi
  - Laboratory of Biotechnology, Medical and Pharmacy School, Mohammed V University, Rabat, Morocco
  - Mohammed VI Center for Research & Innovation (CM6), Rabat, Morocco
- Houda Bendani
  - Laboratory of Biotechnology, Medical and Pharmacy School, Mohammed V University, Rabat, Morocco
  - Mohammed VI Center for Research & Innovation (CM6), Rabat, Morocco
- Souad Kartti
  - Laboratory of Biotechnology, Medical and Pharmacy School, Mohammed V University, Rabat, Morocco
  - Mohammed VI Center for Research & Innovation (CM6), Rabat, Morocco
- Tarek Alouane
  - Laboratory of Biotechnology, Medical and Pharmacy School, Mohammed V University, Rabat, Morocco
- Lahcen Belyamani
  - Mohammed VI Center for Research & Innovation (CM6), Rabat, Morocco
  - Mohammed VI University of Health Sciences (UM6SS), Casablanca, Morocco
  - Emergency Department, Military Hospital Mohammed V, Rabat Medical and Pharmacy School, Mohammed V University, Rabat, Morocco
- Azeddine Ibrahimi
  - Laboratory of Biotechnology, Medical and Pharmacy School, Mohammed V University, Rabat, Morocco
  - Mohammed VI Center for Research & Innovation (CM6), Rabat, Morocco
  - Mohammed VI University of Health Sciences (UM6SS), Casablanca, Morocco
7. Belay S, Belay G, Nigussie H, Jian-Lin H, Tijjani A, Ahbara AM, Tarekegn GM, Woldekiros HS, Mor S, Dobney K, Lebrasseur O, Hanotte O, Mwacharo JM. Whole-genome resource sequences of 57 indigenous Ethiopian goats. Sci Data 2024; 11:139. [PMID: 38287052 PMCID: PMC10825132 DOI: 10.1038/s41597-024-02973-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 10/05/2023] [Accepted: 01/16/2024] [Indexed: 01/31/2024] Open
Abstract
Domestic goats are distributed worldwide, with approximately 35% of the one billion world goat population occurring in Africa. Ethiopia has 52.5 million goats, ~99.9% of which are considered indigenous landraces deriving from animals introduced to the Horn of Africa in the distant past by nomadic herders. They have continued to be managed by smallholder farmers and semi-mobile pastoralists throughout the region. We report here 57 goat genomes from 12 Ethiopian goat populations sampled from different agro-climates. The data were generated by sequencing DNA samples on the Illumina NovaSeq 6000 platform at a mean depth of 9.71× with 150 bp paired-end reads. In total, ~2 terabytes of raw data were generated, and 99.8% of the clean reads mapped successfully against the goat reference genome assembly at a coverage of 99.6%. About 24.76 million SNPs were generated. These SNPs can be used to study the population structure and genome dynamics of goats at the country, regional, and global levels to shed light on the species' evolutionary trajectory.
Affiliation(s)
- Shumuye Belay
  - Tigray Agricultural Research Institute, Mekelle, Tigray, Ethiopia
  - Addis Ababa University, Department of Microbial, Cellular and Molecular Biology, Addis Ababa, Ethiopia
  - LiveGene Program, International Livestock Research Institute (ILRI), Addis Ababa, Ethiopia
- Gurja Belay
  - Addis Ababa University, Department of Microbial, Cellular and Molecular Biology, Addis Ababa, Ethiopia
- Helen Nigussie
  - Addis Ababa University, Department of Microbial, Cellular and Molecular Biology, Addis Ababa, Ethiopia
- Han Jian-Lin
  - ILRI-CAAS Joint Laboratory on Livestock and Forage Genetic Resources, Beijing, China
- Abdulfatai Tijjani
  - LiveGene Program, International Livestock Research Institute (ILRI), Addis Ababa, Ethiopia
- Abulgasim M Ahbara
  - Animal and Veterinary Sciences, Scotland's Rural College (SRUC), Roslin Institute Building, Midlothian, UK
  - Department of Zoology, Misurata University, Misurata, Libya
- Getinet M Tarekegn
  - Animal and Veterinary Sciences, Scotland's Rural College (SRUC), Roslin Institute Building, Midlothian, UK
  - Institute of Biotechnology, Addis Ababa University, Addis Ababa, Ethiopia
- Helina S Woldekiros
  - Department of Anthropology, Washington University in St. Louis, St. Louis, Missouri, USA
- Siobhan Mor
  - LiveGene Program, International Livestock Research Institute (ILRI), Addis Ababa, Ethiopia
  - Institute of Infection, Veterinary and Ecological Sciences, University of Liverpool, Liverpool, UK
- Keith Dobney
  - Department of Archaeology, Classics and Egyptology, University of Liverpool, Liverpool, UK
  - School of Philosophical and Historical Inquiry, University of Sydney, Sydney, Australia
- Ophelie Lebrasseur
  - Department of Archaeology, Classics and Egyptology, University of Liverpool, Liverpool, UK
- Olivier Hanotte
  - LiveGene Program, International Livestock Research Institute (ILRI), Addis Ababa, Ethiopia
  - School of Life Sciences, University of Nottingham, Nottingham, UK
- Joram M Mwacharo
  - Animal and Veterinary Sciences, Scotland's Rural College (SRUC), Roslin Institute Building, Midlothian, UK
  - Small Ruminant Genomics, International Centre for Agricultural Research in the Dry Areas (ICARDA), Addis Ababa, Ethiopia
8. Barbitoff YA, Ushakov MO, Lazareva TE, Nasykhova YA, Glotov AS, Predeus AV. Bioinformatics of germline variant discovery for rare disease diagnostics: current approaches and remaining challenges. Brief Bioinform 2024; 25:bbad508. [PMID: 38271481 PMCID: PMC10810331 DOI: 10.1093/bib/bbad508] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/09/2023] [Revised: 11/18/2023] [Accepted: 12/12/2023] [Indexed: 01/27/2024] Open
Abstract
Next-generation sequencing (NGS) has revolutionized the field of rare disease diagnostics. Whole exome and whole genome sequencing are now routinely used for diagnostic purposes; however, the overall diagnosis rate remains lower than expected. In this work, we review current approaches used for calling and interpretation of germline genetic variants in the human genome, and discuss the most important challenges that persist in the bioinformatic analysis of NGS data in medical genetics. We describe and attempt to quantitatively assess the remaining problems, such as the quality of the reference genome sequence, reproducible coverage biases, or variant calling accuracy in complex regions of the genome. We also discuss the prospects of switching to the complete human genome assembly or the human pan-genome and important caveats associated with such a switch. We touch on arguably the hardest problem of NGS data analysis for medical genomics, namely, the annotation of genetic variants and their subsequent interpretation. We highlight the most challenging aspects of annotation and prioritization of both coding and non-coding variants. Finally, we demonstrate the persistent prevalence of pathogenic variants in the coding genome, and outline research directions that may enhance the efficiency of NGS-based disease diagnostics.
Affiliation(s)
- Yury A Barbitoff
  - Dpt. of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology, and Reproductology, Mendeleevskaya line 3, 199034, St. Petersburg, Russia
  - Bioinformatics Institute, Kentemirovskaya st. 2A, 197342, St. Petersburg, Russia
- Mikhail O Ushakov
  - Dpt. of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology, and Reproductology, Mendeleevskaya line 3, 199034, St. Petersburg, Russia
- Tatyana E Lazareva
  - Dpt. of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology, and Reproductology, Mendeleevskaya line 3, 199034, St. Petersburg, Russia
- Yulia A Nasykhova
  - Dpt. of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology, and Reproductology, Mendeleevskaya line 3, 199034, St. Petersburg, Russia
- Andrey S Glotov
  - Dpt. of Genomic Medicine, D.O. Ott Research Institute of Obstetrics, Gynaecology, and Reproductology, Mendeleevskaya line 3, 199034, St. Petersburg, Russia
- Alexander V Predeus
  - Bioinformatics Institute, Kentemirovskaya st. 2A, 197342, St. Petersburg, Russia
9
|
He Q, Sun C, Pan Y. Whole‑exome sequencing reveals Lewis lung carcinoma is a hypermutated Kras/Nras-mutant cancer with extensive regional mutation clusters in its genome. Sci Rep 2024; 14:100. [PMID: 38167599 PMCID: PMC10762126 DOI: 10.1038/s41598-023-50703-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2023] [Accepted: 12/23/2023] [Indexed: 01/05/2024] [Open Access]
Abstract
Lewis lung carcinoma (LLC), a widely used preclinical cancer model, has still not been genetically and genomically characterized. Here, we performed a whole-exome sequencing analysis on the LLC cell line to elucidate its molecular characteristics and etiologies. Our data showed that LLC originated from a male mouse belonging to C57BL/6L (a transitional strain between C57BL/6J and C57BL/6N) and contains substantial somatic SNV and InDel mutations (>20,000). Extensive regional mutation clusters are present in its genome, which were caused mainly by the mutational processes underlying the SBS1, SBS5, SBS15, SBS17a, and SBS21 signatures during frequent structural rearrangements. Thirty-three deleterious mutations are present in 30 cancer genes including Kras, Nras, Trp53, Dcc, and Cacna1d. Cdkn2a and Cdkn2b are biallelically deleted from the genome. Five pathways (RTK/RAS, p53, cell cycle, TGFB, and Hippo) are oncogenically deregulated or affected. The major mutational processes in LLC include chromosomal instability, exposure to metabolic mutagens, spontaneous 5-methylcytosine deamination, defective DNA mismatch repair, and reactive oxygen species. Our data also suggest that LLC is a lung cancer similar to human lung adenocarcinoma. This study lays a molecular basis for the more targeted application of LLC in preclinical research.
Affiliation(s)
- Quan He
  - Department of Chemistry, Zhejiang University, Hangzhou, 310058, Zhejiang, China
- Cuirong Sun
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China
- Yuanjiang Pan
  - Department of Chemistry, Zhejiang University, Hangzhou, 310058, Zhejiang, China
|
10
|
Lindtke D, Seefried FR, Drögemüller C, Neuditschko M. Increased heterozygosity in low-pass sequencing data allows identification of blood chimeras in cattle. Anim Genet 2023; 54:613-618. [PMID: 37313694 DOI: 10.1111/age.13334] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2023] [Revised: 05/31/2023] [Accepted: 06/02/2023] [Indexed: 06/15/2023]
Abstract
In about 90% of multiple pregnancies in cattle, shared blood circulation between fetuses leads to genetic chimerism in peripheral blood and can reduce reproductive performance in heterosexual co-twins. However, the early detection of heterosexual chimeras requires specialized tests. Here, we used low-pass sequencing data with a median coverage of 0.64× generated from blood samples of 322 F1 crosses between beef and dairy cattle and identified 20 putative blood chimeras through increased levels of genome-wide heterozygosity. In contrast, for 77 samples with routine SNP microarray data generated from hair bulbs of the same F1s, we found no evidence of chimerism, simultaneously observing high levels of genotype discordance with sequencing data. Fifteen out of 18 reported twins showed signs of blood chimerism, in line with previous reports, whereas the presence of five alleged singletons with strong signs of chimerism suggests that the in-utero death rate of co-twins is at the upper limit of former estimates. Together, our results show that low-pass sequencing data allow reliable screening for blood chimeras. They further affirm that blood is not recommended as a source of DNA for the detection of germline variants.
|
11
|
Vijayarathna S, Oon CE, Al-Zahrani M, Abualreesh MH, Chen Y, Kanwar JR, Sahreen S, Ghazanfar S, Adnan M, Sasidharan S. Standardized Polyalthia longifolia leaf extract induces the apoptotic HeLa cells death via microRNA regulation: identification, validation, and therapeutic potential. Front Pharmacol 2023; 14:1198425. [PMID: 37693900 PMCID: PMC10483226 DOI: 10.3389/fphar.2023.1198425] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/01/2023] [Accepted: 08/02/2023] [Indexed: 09/12/2023] [Open Access]
Abstract
Polyalthia longifolia var. angustifolia Thw. (Annonaceae) is a famous traditional medicinal plant in Asia. Ample data indicate that the medicinal plant P. longifolia has anticancer activity; however, the detailed mechanisms of action still need to be well studied. Recent studies have revealed the cytotoxic potential of P. longifolia leaf against HeLa cells. Therefore, the current study was conducted to examine the regulation of miRNAs in HeLa cancer cells treated with the standardized P. longifolia methanolic leaf extract (PLME). The regulation of miRNAs in HeLa cancer cells treated with the standardized PLME was studied using the Illumina HiSeq 2000 next-generation sequencing (NGS) platform and various in silico bioinformatics tools. The PLME treatment regulated a subset of miRNAs in HeLa cells. Interestingly, the PLME treatment of HeLa cancer cells identified 10 upregulated and 43 downregulated (p < 0.05) miRNAs associated with apoptosis induction. Gene ontology (GO) term analysis indicated that PLME induces cell death in HeLa cells by inducing the pro-apoptotic genes. Moreover, the downregulated oncomiRs modulated by PLME treatment in HeLa cells were identified, targeting apoptosis-related genes through gene ontology and pathway analysis. The LC-ESI-MS/MS analysis identified the presence of the compounds vidarabine and anandamide, which were previously reported to exhibit anticancer activity. The findings of this study clearly linked the cytotoxic effect of PLME treatment on HeLa cells with the regulation of the expression of various miRNAs related to apoptosis induction. PLME treatment induced apoptotic HeLa cell death by regulating multiple miRNAs. The identified miRNAs regulated by PLME may provide further insight into the mechanisms that play a critical role in cervical cancer, as well as novel ideas regarding gene therapeutic strategies.
Affiliation(s)
- Soundararajan Vijayarathna
  - Institute for Research in Molecular Medicine (INFORMM), Universiti Sains Malaysia, Pulau Pinang, Malaysia
- Chern Ein Oon
  - Institute for Research in Molecular Medicine (INFORMM), Universiti Sains Malaysia, Pulau Pinang, Malaysia
- Majid Al-Zahrani
  - Biological Sciences Department, College of Science and Arts, King Abdulaziz University, Rabigh, Saudi Arabia
- Muyassar H. Abualreesh
  - Department of Marine Biology, Faculty of Marine Sciences, King Abdulaziz University, Jeddah, Saudi Arabia
- Yeng Chen
  - Department of Oral and Craniofacial Sciences, Faculty of Dentistry, University of Malaya, Kuala Lumpur, Malaysia
- Jagat R. Kanwar
  - Department of Biochemistry, All India Institute of Medical Sciences (AIIMS), Bilaspur, India
- Sumaira Sahreen
  - Institute for Research in Molecular Medicine (INFORMM), Universiti Sains Malaysia, Pulau Pinang, Malaysia
- Shakira Ghazanfar
  - National Institute of Genomics and Advanced Biotechnology (NIGAB), National Agriculture Research Centre (NARC), Islamabad, Pakistan
- Mohd Adnan
  - Department of Biology, College of Science, University of Ha’il, Ha’il, Saudi Arabia
- Sreenivasan Sasidharan
  - Institute for Research in Molecular Medicine (INFORMM), Universiti Sains Malaysia, Pulau Pinang, Malaysia
|
12
|
Moudgil A, Sobti RC, Kaur T. In-silico identification and comparison of transcription factor binding sites cluster in anterior-posterior patterning genes in Drosophila melanogaster and Tribolium castaneum. PLoS One 2023; 18:e0290035. [PMID: 37590227 PMCID: PMC10434971 DOI: 10.1371/journal.pone.0290035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/16/2023] [Accepted: 07/26/2023] [Indexed: 08/19/2023] [Open Access]
Abstract
The cis-regulatory information that supports transcriptional regulation is arranged into modular pieces of a few hundred base pairs called cis-regulatory modules (CRMs), and numerous binding sites for multiple transcription factors are a prominent characteristic of these modules. The present study was designed to localize transcription factor binding site (TFBS) clusters on twelve anterior-posterior (A-P) patterning genes in Tribolium castaneum and compare them to their orthologous gene enhancers in Drosophila melanogaster. Of the twelve A-P patterning genes, six were gap genes (Kruppel, Knirps, Tailless, Hunchback, Giant, and Caudal) and six were pair-rule genes (Hairy, Runt, Even-skipped, Fushi-tarazu, Paired, and Odd-skipped). The genes, along with 20 kb upstream and downstream regions, were scanned for TFBS clusters using the Motif Cluster Alignment Search Tool (MCAST), a bioinformatics tool that searches a set of nucleotide sequences for statistically significant clusters of non-overlapping occurrences of a given set of motifs. The motifs used in the current study were Hunchback, Caudal, Giant, Kruppel, Knirps, and Even-skipped. The results of the MCAST analysis revealed the maximum number of TFBS for Hunchback, Knirps, Caudal, and Kruppel in both D. melanogaster and T. castaneum, while Bicoid TFBS clusters were found only in D. melanogaster. The size of all the predicted TFBS clusters was less than 1 kb in both insect species. These sequences revealed more transversional sites (Tv) than transitional sites (Ti), and the average Ti/Tv ratio was 0.75.
Affiliation(s)
- Anshika Moudgil
  - Department of Zoology, DAV University, Jalandhar, Punjab, India
- Tejinder Kaur
  - Department of Zoology, DAV University, Jalandhar, Punjab, India
|
13
|
Das S, Biswas NK, Basu A. Mapinsights: deep exploration of quality issues and error profiles in high-throughput sequence data. Nucleic Acids Res 2023; 51:e75. [PMID: 37378434 PMCID: PMC10415152 DOI: 10.1093/nar/gkad539] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2022] [Revised: 05/16/2023] [Accepted: 06/27/2023] [Indexed: 06/29/2023] [Open Access]
Abstract
High-throughput sequencing (HTS) has revolutionized science by enabling super-fast detection of genomic variants at base-pair resolution. Consequently, it poses the challenging problem of identification of technical artifacts, i.e. hidden non-random error patterns. Understanding the properties of sequencing artifacts holds the key in separating true variants from false positives. Here, we develop Mapinsights, a toolkit that performs quality control (QC) analysis of sequence alignment files, capable of detecting outliers based on sequencing artifacts of HTS data at a deeper resolution compared with existing methods. Mapinsights performs a cluster analysis based on novel and existing QC features derived from the sequence alignment for outlier detection. We applied Mapinsights on community standard open-source datasets and identified various quality issues including technical errors related to sequencing cycles, sequencing chemistry, sequencing libraries and across various orthogonal sequencing platforms. Mapinsights also enables identification of anomalies related to sequencing depth. A logistic regression-based model built on the features of Mapinsights shows high accuracy in detecting 'low-confidence' variant sites. Quantitative estimates and probabilistic arguments provided by Mapinsights can be utilized in identifying errors, bias and outlier samples, and also aid in improving the authenticity of variant calls.
Affiliation(s)
- Subrata Das
  - National Institute of Biomedical Genomics, Kalyani, 741251, West Bengal, India
- Nidhan K Biswas
  - National Institute of Biomedical Genomics, Kalyani, 741251, West Bengal, India
- Analabha Basu
  - National Institute of Biomedical Genomics, Kalyani, 741251, West Bengal, India
|
14
|
Chun YW, Miyamoto M, Williams CH, Neitzel LR, Silver-Isenstadt M, Cadar AG, Fuller DT, Fong DC, Liu H, Lease R, Kim S, Katagiri M, Durbin MD, Wang KC, Feaster TK, Sheng CC, Neely MD, Sreenivasan U, Cortes-Gutierrez M, Finn AV, Schot R, Mancini GMS, Ament SA, Ess KC, Bowman AB, Han Z, Bichell DP, Su YR, Hong CC. Impaired Reorganization of Centrosome Structure Underlies Human Infantile Dilated Cardiomyopathy. Circulation 2023; 147:1291-1303. [PMID: 36970983 PMCID: PMC10133173 DOI: 10.1161/circulationaha.122.060985] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 05/24/2022] [Accepted: 02/22/2023] [Indexed: 03/29/2023]
Abstract
BACKGROUND: During cardiomyocyte maturation, the centrosome, which functions as a microtubule organizing center in cardiomyocytes, undergoes dramatic structural reorganization where its components reorganize from being localized at the centriole to the nuclear envelope. This developmentally programmed process, referred to as centrosome reduction, has been previously associated with cell cycle exit. However, how this process influences cardiomyocyte cell biology, and whether its disruption results in human cardiac disease, remain unknown. We studied this phenomenon in an infant with a rare case of infantile dilated cardiomyopathy (iDCM) who presented with left ventricular ejection fraction of 18% and disrupted sarcomere and mitochondria structure.
METHODS: We performed an analysis beginning with an infant who presented with a rare case of iDCM. We derived induced pluripotent stem cells from the patient to model iDCM in vitro. We performed whole exome sequencing on the patient and his parents for causal gene analysis. CRISPR/Cas9-mediated gene knockout and correction in vitro were used to confirm whole exome sequencing results. Zebrafish and Drosophila models were used for in vivo validation of the causal gene. Matrigel mattress technology and single-cell RNA sequencing were used to characterize iDCM cardiomyocytes further.
RESULTS: Whole exome sequencing and CRISPR/Cas9 gene knockout/correction identified RTTN, the gene encoding the centrosomal protein RTTN (rotatin), as the causal gene underlying the patient's condition, representing the first time a centrosome defect has been implicated in a nonsyndromic dilated cardiomyopathy. Genetic knockdowns in zebrafish and Drosophila confirmed an evolutionarily conserved requirement of RTTN for cardiac structure and function. Single-cell RNA sequencing of iDCM cardiomyocytes showed impaired maturation of iDCM cardiomyocytes, which underlies the observed cardiomyocyte structural and functional deficits. We also observed persistent localization of the centrosome at the centriole, contrasting with expected programmed perinuclear reorganization, which led to subsequent global microtubule network defects. In addition, we identified a small molecule that restored centrosome reorganization and improved the structure and contractility of iDCM cardiomyocytes.
CONCLUSIONS: This study is the first to demonstrate a case of human disease caused by a defect in centrosome reduction. We also uncovered a novel role for RTTN in perinatal cardiac development and identified a potential therapeutic strategy for centrosome-related iDCM. Future study aimed at identifying variants in centrosome components may uncover additional contributors to human cardiac disease.
Affiliation(s)
- Young Wook Chun
  - Division of Cardiovascular Medicine, Department of Medicine, University of Maryland Medical Center, Baltimore, MD 21201
- Matthew Miyamoto
  - Division of Cardiovascular Medicine, Department of Medicine, University of Maryland Medical Center, Baltimore, MD 21201
- Charles H. Williams
  - Division of Cardiovascular Medicine, Department of Medicine, University of Maryland Medical Center, Baltimore, MD 21201
- Leif R. Neitzel
  - Division of Cardiovascular Medicine, Department of Medicine, University of Maryland Medical Center, Baltimore, MD 21201
- Maya Silver-Isenstadt
  - Division of Cardiovascular Medicine, Department of Medicine, University of Maryland Medical Center, Baltimore, MD 21201
- Adrian G. Cadar
  - Division of Cardiovascular Medicine, Vanderbilt University Medical Center, Nashville, TN 37201
- Daniela T. Fuller
  - Division of Cardiovascular Medicine, Department of Medicine, University of Maryland Medical Center, Baltimore, MD 21201
- Daniel C. Fong
  - Division of Cardiovascular Medicine, Department of Medicine, University of Maryland Medical Center, Baltimore, MD 21201
- Hanhan Liu
  - Division of Cardiovascular Medicine, Department of Medicine, University of Maryland Medical Center, Baltimore, MD 21201
- Robert Lease
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
- Sungseek Kim
  - Division of Cardiovascular Medicine, Vanderbilt University Medical Center, Nashville, TN 37201
- Mikako Katagiri
  - Division of Cardiovascular Medicine, Vanderbilt University Medical Center, Nashville, TN 37201
- Matthew D. Durbin
  - Division of Neonatology-Perinatology, Department of Pediatrics, Indiana University School of Medicine, Indianapolis, IN 26202
- Kuo-Chen Wang
  - Division of Cardiovascular Medicine, Department of Medicine, University of Maryland Medical Center, Baltimore, MD 21201
- Tromondae K. Feaster
  - Division of Cardiovascular Medicine, Vanderbilt University Medical Center, Nashville, TN 37201
- Calvin C. Sheng
  - Division of Cardiovascular Medicine, Vanderbilt University Medical Center, Nashville, TN 37201
- M. Diana Neely
  - Department of Cell and Developmental Biology, Vanderbilt University School of Medicine, Nashville, TN 37201
- Urmila Sreenivasan
  - Division of Cardiovascular Medicine, Department of Medicine, University of Maryland Medical Center, Baltimore, MD 21201
- Marcia Cortes-Gutierrez
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
- Aloke V. Finn
  - Division of Cardiovascular Medicine, Department of Medicine, University of Maryland Medical Center, Baltimore, MD 21201
- Rachel Schot
  - Division of Neonatology-Perinatology, Department of Pediatrics, Indiana University School of Medicine, Indianapolis, IN 26202
- Grazia M. S. Mancini
  - Department of Clinical Genetics, Erasmus University Medical Center (Erasmus MC), P.O. Box 2040, 3000 CA Rotterdam, The Netherlands
- Seth A. Ament
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
- Kevin C. Ess
  - Department of Pediatrics, Vanderbilt University Medical Center, Nashville, TN 37201
- Aaron B. Bowman
  - School of Health Sciences, Purdue University, West Lafayette, IN 47906
- Zhe Han
  - Division of Cardiovascular Medicine, Department of Medicine, University of Maryland Medical Center, Baltimore, MD 21201
- David P. Bichell
  - Department of Pediatric Cardiac Surgery, Vanderbilt University Medical Center, Nashville, TN 37201
- Yan Ru Su
  - Division of Cardiovascular Medicine, Vanderbilt University Medical Center, Nashville, TN 37201
- Charles C. Hong
  - Division of Cardiovascular Medicine, Department of Medicine, University of Maryland Medical Center, Baltimore, MD 21201
|
15
|
Filipović I. Genomic resources for population analyses of an invasive insect pest Oryctes rhinoceros. Sci Data 2023; 10:199. [PMID: 37041187 PMCID: PMC10090205 DOI: 10.1038/s41597-023-02109-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2022] [Accepted: 03/27/2023] [Indexed: 04/13/2023] [Open Access]
Abstract
Over the last few years, various types of NGS data have been accumulating for the coconut rhinoceros beetle (CRB, Oryctes rhinoceros), reflecting the growing interest in curtailing this invasive pest of palm trees. Whilst reference-free analyses of RNA-seq and RAD-seq datasets have been done for different CRB collections, recent availability of the CRB's genome assembly provides an opportunity to collate diverse data and create a reference-based population dataset. Here, I release such a dataset containing 6,725,935 SNPs and genotypes called across 393 individual samples from 16 populations, using the previously published raw sequences generated in 9 different experiments (RAD-Seq, RNA-Seq, WGS). I also provide reference-based datasets for the CRB's mitochondrial variants and for variants of its viral biocontrol agent Oryctes rhinoceros nudivirus. SNP data provide high resolution for determining the geographic origin of invasive CRB. With these genomic resources, new data can be analysed without re-processing the published samples and then integrated to expand the reference datasets.
Affiliation(s)
- Igor Filipović
  - The University of Queensland, School of Biological Sciences, St. Lucia, Australia
  - QIMR Berghofer Medical Research Institute, Herston, Australia
|
16
|
Zhai Y, Bardel C, Vallée M, Iwaz J, Roy P. Performance comparisons between clustering models for reconstructing NGS results from technical replicates. Front Genet 2023; 14:1148147. [PMID: 37007945 PMCID: PMC10060969 DOI: 10.3389/fgene.2023.1148147] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/19/2023] [Accepted: 03/06/2023] [Indexed: 03/18/2023] [Open Access]
Abstract
To improve the performance of individual DNA sequencing results, researchers often use replicates from the same individual and various statistical clustering models to reconstruct a high-performance callset. Here, three technical replicates of genome NA12878 were considered and five model types were compared (consensus, latent class, Gaussian mixture, Kamila–adapted k-means, and random forest) regarding four performance indicators: sensitivity, precision, accuracy, and F1-score. In comparison with no use of a combination model, i) the consensus model improved precision by 0.1%; ii) the latent class model brought 1% precision improvement (97%–98%) without compromising sensitivity (= 98.9%); iii) the Gaussian mixture model and random forest provided callsets with higher precisions (both >99%) but lower sensitivities; iv) Kamila increased precision (>99%) and kept a high sensitivity (98.8%); it showed the best overall performance. According to precision and F1-score indicators, the compared non-supervised clustering models that combine multiple callsets are able to improve sequencing performance vs. previously used supervised models. Among the models compared, the Gaussian mixture model and Kamila offered non-negligible precision and F1-score improvements. These models may be thus recommended for callset reconstruction (from either biological or technical replicates) for diagnostic or precision medicine purposes.
Affiliation(s)
- Yue Zhai
  - Université Lyon 1, Lyon, France
  - Université de Lyon, Lyon, France
  - Laboratoire de Biométrie et Biologie Évolutive, Villeurbanne, France
  - *Correspondence: Yue Zhai,
- Claire Bardel
  - Université Lyon 1, Lyon, France
  - Université de Lyon, Lyon, France
  - Laboratoire de Biométrie et Biologie Évolutive, Villeurbanne, France
  - Service de Biostatistique-Bioinformatique, Hospices Civils de Lyon, Lyon, France
  - Service de Génétique, Hospices Civils de Lyon, Bron, France
- Maxime Vallée
  - Cellule Bioinformatique de La Plateforme de Séquençage Haut Débit NGS-HCL, Hospices Civils de Lyon, Bron, France
- Jean Iwaz
  - Université Lyon 1, Lyon, France
  - Université de Lyon, Lyon, France
  - Laboratoire de Biométrie et Biologie Évolutive, Villeurbanne, France
  - Service de Biostatistique-Bioinformatique, Hospices Civils de Lyon, Lyon, France
- Pascal Roy
  - Université Lyon 1, Lyon, France
  - Université de Lyon, Lyon, France
  - Laboratoire de Biométrie et Biologie Évolutive, Villeurbanne, France
  - Service de Biostatistique-Bioinformatique, Hospices Civils de Lyon, Lyon, France
|
17
|
Asmare S, Alemayehu K, Mwacharo J, Haile A, Abegaz S, Ahbara A. Genetic diversity and within-breed variation in three indigenous Ethiopian sheep based on whole-genome analysis. Heliyon 2023; 9:e14863. [PMID: 37089312 PMCID: PMC10119558 DOI: 10.1016/j.heliyon.2023.e14863] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/17/2022] [Revised: 03/18/2023] [Accepted: 03/20/2023] [Indexed: 03/30/2023] [Open Access]
Abstract
The objective of this work was to study genetic diversity by comparing whole genome sequence data of Rutana, Gumuz and Washera sheep found in the Amhara and Benishangul-Gumuz regional states of Ethiopia. We employed VCFtools (variant call format tools) version 0.1.15 to calculate genetic diversity indices such as observed heterozygosity, expected heterozygosity, inbreeding coefficient, and nucleotide diversity. The results revealed that observed heterozygosity ranged from 0.33 in Gumuz to 0.34 in Rutana and Washera sheep. Expected heterozygosity ranged from 0.37 in Rutana to 0.38 in Gumuz and Washera sheep. Expected heterozygosity was found to be higher than observed heterozygosity. A higher inbreeding coefficient (0.12) was recorded for Gumuz sheep compared to 0.09 for Rutana and Washera sheep. Mean nucleotide diversity values were 0.0029, 0.0030 and 0.0028 for Gumuz, Rutana and Washera sheep, respectively. Higher values of nucleotide diversity were recorded. Population structure analysis using principal component analysis revealed no clear separation between Gumuz, Rutana and Washera sheep populations, with the possibility of gene flow attributed to geographical proximity. The smaller population size, closed breeding system, genetic drift and uncontrolled (non-random) mating might lead to a higher rate of inbreeding in Gumuz, Rutana and Washera sheep, requiring timely intervention. This intervention helps to prevent inbreeding depression and extinction of these valuable breeds of sheep, which helps in sustaining the livelihood of sheep keepers in lowlands and highlands. Nevertheless, the whole-genome analysis revealed high within-breed variation. Unexplored areas of study, such as mapping quantitative trait loci and identifying genes underpinning productivity traits such as carcass quantity and meat quality, could be pursued on the diversified sheep resources identified by the current study. Identifying the genomic regions and biological pathways that contribute to explaining variability in these traits is of great importance for selection purposes. Designing conservation-based within-breed selective breeding programs for sheep is recommended, taking economically important traits into account.
Affiliation(s)
- Sisay Asmare
  - Debre Markos University, Burie Campus, P.O. Box 18, Ethiopia
  - Bahir Dar University, College of Agriculture and Environmental Sciences, Department of Animal Production and Technology, Bahir Dar, Ethiopia
  - Biotechnology Research Institute of Bahir Dar University, Ethiopia
  - Corresponding author. Debre Markos University, Burie Campus, P.O. Box 18, Ethiopia.
- Kefyalew Alemayehu
  - Bahir Dar University, College of Agriculture and Environmental Sciences, Department of Animal Production and Technology, Bahir Dar, Ethiopia
  - Biotechnology Research Institute of Bahir Dar University, Ethiopia
- Joram Mwacharo
  - Small Ruminant Genomics, International Centre for Agricultural Research in the Dry Areas (ICARDA), Addis Ababa, Ethiopia
- Aynalem Haile
  - International Center for Agricultural Research in the Dry Areas (ICARDA), Addis Ababa, Ethiopia
- Solomon Abegaz
  - Ethiopian Institute of Agricultural Research, Addis Ababa, Ethiopia
- Abulgasim Ahbara
  - Animal and Veterinary Sciences, SRUC, The Roslin Institute Building, Midlothian, Edinburgh, UK
  - Departments of Zoology, Faculty of Sciences, Misurata University, Misurata, Libya
|
18
|
Chu JT, Gu H, Sun W, Fan RL, Nicholls JM, Valkenburg SA, Poon LL. Heterosubtypic immune pressure accelerates emergence of influenza A virus escape phenotypes in mice. Virus Res 2023; 323:198991. [PMID: 36302472 PMCID: PMC10194115 DOI: 10.1016/j.virusres.2022.198991] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2022] [Revised: 10/21/2022] [Accepted: 10/22/2022] [Indexed: 11/07/2022]
Abstract
Rapid antigenic evolution of the influenza A virus surface antigen hemagglutinin undermines protection conferred by seasonal vaccines. Protective correlates targeted by universal vaccines, such as cytotoxic T cells or HA stem-directed broadly neutralizing antibodies, have been shown to select for immune escape mutants during infection. We developed an in vivo serial passage mouse model for viral adaptation and used next-generation sequencing to evaluate full genome viral evolution in the context of broadly protective immunity. Heterosubtypic immune pressure increased the incidence of genome-wide single nucleotide variants, though mutations found in early adapted populations were predominantly stochastic in nature. Prolonged adaptation under heterosubtypic immune selection resulted in the manifestation of highly virulent phenotypes that ablated vaccine-mediated protection from mortality. High-frequency mutations unique to escape phenotypes were identified within the polymerase-encoding segments. These findings suggest that suboptimal usage of a population-wide universal influenza vaccine may drive formation of escape variants attributed to polygenic changes.
Collapse
Affiliation(s)
- Julie Ts Chu
- Division of Public Health Laboratory Sciences, School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong Special Administrative Region, China
| | - Haogao Gu
- Division of Public Health Laboratory Sciences, School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong Special Administrative Region, China
| | - Wanying Sun
- Division of Public Health Laboratory Sciences, School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong Special Administrative Region, China
| | - Rebecca Ly Fan
- Division of Public Health Laboratory Sciences, School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong Special Administrative Region, China
| | - John M Nicholls
- Department of Pathology, The University of Hong Kong, Hong Kong Special Administrative Region, China
| | - Sophie A Valkenburg
- Department of Microbiology and Immunology, Peter Doherty Institute for Infection and Immunity, University of Melbourne, Melbourne, Australia
| | - Leo Lm Poon
- Division of Public Health Laboratory Sciences, School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong Special Administrative Region, China; HKU-Pasteur Research Pole, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong Special Administrative Region, China; Centre for Immunology & Infection, Hong Kong Science Park, Hong Kong Special Administrative Region, China.
| |
|
19
|
Zhou Y, Lauschke VM. Challenges Related to the Use of Next-Generation Sequencing for the Optimization of Drug Therapy. Handb Exp Pharmacol 2023; 280:237-260. [PMID: 35792943 DOI: 10.1007/164_2022_596] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Over the last decade, next-generation sequencing (NGS) methods have become increasingly used in various areas of human genomics. In routine clinical care, their use is already implemented in oncology to profile the mutational landscape of a tumor, as well as in rare disease diagnostics. However, their utilization in pharmacogenomics is largely lagging behind. Recent population-scale genome data has revealed that human pharmacogenes carry a plethora of rare genetic variations that are not interrogated by conventional array-based profiling methods, and it is estimated that these variants could explain around 30% of the genetically encoded functional pharmacogenetic variability. To interpret the impact of such variants on drug response, a multitude of computational tools have been developed, but, while there have been major advancements, it remains to be shown whether their accuracy is sufficient to improve personalized pharmacogenetic recommendations in robust trials. In addition, conventional short-read sequencing methods face difficulties in the interrogation of complex pharmacogenes, and high NGS test costs require stringent evaluations of cost-effectiveness to decide about reimbursement by national healthcare programs. Here, we illustrate current challenges and discuss future directions toward the clinical implementation of NGS to inform genotype-guided decision-making.
Affiliation(s)
- Yitian Zhou
- Department of Physiology and Pharmacology, Karolinska Institutet, Stockholm, Sweden
| | - Volker M Lauschke
- Department of Physiology and Pharmacology, Karolinska Institutet, Stockholm, Sweden.
- Dr Margarete Fischer-Bosch Institute of Clinical Pharmacology, Stuttgart, Germany.
- University of Tuebingen, Tuebingen, Germany.
| |
|
20
|
Rohmah L, Darwati S, Ulupi N, Khaerunnisa I, Sumantri C. Polymorphism of prolactin (PRL) gene exon 5 and its association with egg production in IPB-D1 chickens. Arch Anim Breed 2022; 65:449-455. [PMID: 36643022 PMCID: PMC9832302 DOI: 10.5194/aab-65-449-2022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2022] [Accepted: 11/28/2022] [Indexed: 12/24/2022] Open
Abstract
The prolactin (PRL) gene regulates egg production and incubation in laying chickens. In local chickens, incubation activity disrupts the reproductive system, so that hens lay fewer eggs. This study aimed to determine the prolactin gene polymorphism in IPB-D1 hens and its relationship to egg production. The polymorphism of exon 5 of the prolactin gene was examined in 112 samples from the IPB-D1 chicken DNA collection of the Division of Animal Genetics and Breeding, Faculty of Animal Sciences, IPB University. Genomic DNA was obtained using the phenol-chloroform method. DNA amplification by polymerase chain reaction (PCR) produced a product of 557 bp. Three single-nucleotide polymorphisms (SNPs), g.7835A > G, g.7886A > T, and g.8052T > C, were found in exon 5 of the PRL gene. Each mutation was polymorphic and in Hardy-Weinberg equilibrium. According to the SNP association analysis, the point mutation g.8052T > C significantly affected the egg production of IPB-D1 chickens and may serve as a marker to enhance selection for egg production traits in IPB-D1 chickens.
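The Hardy-Weinberg check mentioned in this abstract can be illustrated with a minimal sketch. It assumes a biallelic SNP and a Pearson chi-square goodness-of-fit statistic with one degree of freedom; the abstract does not state which test the authors used, and the function name and genotype counts below are hypothetical.

```python
def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Pearson chi-square statistic for Hardy-Weinberg equilibrium.

    n_aa, n_ab, n_bb are observed genotype counts at a biallelic locus.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotype must be observed")
    # Allele frequency of A: each AA carries two copies, each AB one.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    # Expected genotype counts under HWE: p^2, 2pq, q^2 times n.
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic locus).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Hypothetical counts, not taken from the study.
stat = hwe_chi_square(30, 50, 32)
# 3.841 is the 5% critical value of the chi-square distribution with 1 df.
in_equilibrium = stat < 3.841
```

For small samples or rare alleles, an exact test is usually preferred over the chi-square approximation.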
Affiliation(s)
- Lailatul Rohmah
- Department of Animal Production and Technology, Faculty of Animal Sciences, IPB University, Bogor 16680, Indonesia
| | - Sri Darwati
- Department of Animal Production and Technology, Faculty of Animal Sciences, IPB University, Bogor 16680, Indonesia
| | - Niken Ulupi
- Department of Animal Production and Technology, Faculty of Animal Sciences, IPB University, Bogor 16680, Indonesia
| | - Isyana Khaerunnisa
- Research Center for Applied Zoology, National Research and Innovation Agency, Bogor 16911, Indonesia
| | - Cece Sumantri
- Department of Animal Production and Technology, Faculty of Animal Sciences, IPB University, Bogor 16680, Indonesia
| |
|
21
|
Turba R, Richmond JQ, Fitz-Gibbon S, Morselli M, Fisher RN, Swift CC, Ruiz-Campos G, Backlin AR, Dellith C, Jacobs DK. Genetic structure and historic demography of endangered unarmoured threespine stickleback at southern latitudes signals a potential new management approach. Mol Ecol 2022; 31:6515-6530. [PMID: 36205603 PMCID: PMC10092051 DOI: 10.1111/mec.16722] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2022] [Revised: 09/05/2022] [Accepted: 09/29/2022] [Indexed: 01/13/2023]
Abstract
Habitat loss, flood control infrastructure, and drought have left most of southern California and northern Baja California's native freshwater fish near extinction, including the endangered unarmoured threespine stickleback (Gasterosteus aculeatus williamsoni). This subspecies, an unusual morph lacking the typical lateral bony plates of the G. aculeatus complex, occurs at arid southern latitudes in the eastern Pacific Ocean and survives in only three inland locations. Managers have lacked molecular data to answer basic questions about the ancestry and genetic distinctiveness of unarmoured populations. These data could be used to prioritize conservation efforts. We sampled G. aculeatus from 36 localities and used microsatellites and whole genome data to place unarmoured populations within the broader evolutionary context of G. aculeatus across southern California/northern Baja California. We identified three genetic groups with none consisting solely of unarmoured populations. Unlike G. aculeatus at northern latitudes, where Pleistocene glaciation has produced similar historical demographic profiles across populations, we found markedly different demographics depending on sampling location, with inland unarmoured populations showing steeper population declines and lower heterozygosity compared to low armoured populations in coastal lagoons. One exception involved the only high elevation population in the region, where the demography and alleles of unarmoured fish were similar to low armoured populations near the coast, exposing one of several cases of artificial translocation. Our results suggest that the current "management-by-phenotype" approach, based on lateral plates, is incidentally protecting the most imperilled populations; however, redirecting efforts toward evolutionary units, regardless of phenotype, may more effectively preserve adaptive potential.
Affiliation(s)
- Rachel Turba
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles, California, USA
| | | | - Sorel Fitz-Gibbon
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles, California, USA
| | - Marco Morselli
- Department of Molecular, Cell and Developmental Biology, University of California, Los Angeles, California, USA
| | | | - Camm C Swift
- Emeritus, Section of Fishes, Natural History Museum of Los Angeles County, Los Angeles, California, USA
| | - Gorgonio Ruiz-Campos
- Ichthyological Collection, Facultad de Ciencias, Universidad Autónoma de Baja California, Ensenada, Baja California, Mexico
| | - Adam R Backlin
- U.S. Geological Survey, Western Ecological Research Center, San Diego Field Station-Santa Ana Office, Santa Ana, California, USA
| | - Chris Dellith
- U.S. Fish and Wildlife Service, Ventura, California, USA
| | - David K Jacobs
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles, California, USA
| |
|
22
|
Høy Hansen M, Steensboe Lang C, Abildgaard N, Nyvold CG. Comparative evaluation of the heterozygous variant standard deviation as a quality measure for next-generation sequencing. J Biomed Inform 2022; 135:104234. [DOI: 10.1016/j.jbi.2022.104234] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/22/2022] [Revised: 09/15/2022] [Accepted: 10/17/2022] [Indexed: 11/30/2022]
|
23
|
Rao W, Guo L, Ling Y, Dong L, Li W, Ying J, Li W. Developing an effective quality evaluation strategy of next-generation sequencing for accurate detecting non-small cell lung cancer samples with variable characteristics: a real-world clinical practice. J Cancer Res Clin Oncol 2022:10.1007/s00432-022-04388-1. [DOI: 10.1007/s00432-022-04388-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2022] [Accepted: 09/29/2022] [Indexed: 10/31/2022]
|
24
|
Formenti G, Rhie A, Walenz BP, Thibaud-Nissen F, Shafin K, Koren S, Myers EW, Jarvis ED, Phillippy AM. Merfin: improved variant filtering, assembly evaluation and polishing via k-mer validation. Nat Methods 2022; 19:696-704. [PMID: 35361932 PMCID: PMC9745813 DOI: 10.1038/s41592-022-01445-y] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Received: 07/13/2021] [Accepted: 03/07/2022] [Indexed: 12/15/2022]
Abstract
Variant calling has been widely used for genotyping and for improving the consensus accuracy of long-read assemblies. Variant calls are commonly hard-filtered with user-defined cutoffs. However, it is impossible to define a single set of optimal cutoffs, as the calls heavily depend on the quality of the reads, the variant caller of choice and the quality of the unpolished assembly. Here, we introduce Merfin, a k-mer based variant-filtering algorithm for improved accuracy in genotyping and genome assembly polishing. Merfin evaluates each variant based on the expected k-mer multiplicity in the reads, independently of the quality of the read alignment and variant caller's internal score. Merfin increased the precision of genotyped calls in several benchmarks, improved consensus accuracy and reduced frameshift errors when applied to human and nonhuman assemblies built from Pacific Biosciences HiFi and continuous long reads or Oxford Nanopore reads, including the first complete human genome. Moreover, we introduce assembly quality and completeness metrics that account for the expected genomic copy numbers.
Affiliation(s)
- Giulio Formenti
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA.
- Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA.
- Howard Hughes Medical Institute, Chevy Chase, MD, USA.
| | - Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.
| | - Brian P Walenz
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Françoise Thibaud-Nissen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | | | - Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| | - Eugene W Myers
- Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
| | - Erich D Jarvis
- Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
- Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
| | - Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
| |
|
25
|
Ebler J, Ebert P, Clarke WE, Rausch T, Audano PA, Houwaart T, Mao Y, Korbel JO, Eichler EE, Zody MC, Dilthey AT, Marschall T. Pangenome-based genome inference allows efficient and accurate genotyping across a wide spectrum of variant classes. Nat Genet 2022; 54:518-525. [PMID: 35410384 PMCID: PMC9005351 DOI: 10.1038/s41588-022-01043-w] [Citation(s) in RCA: 76] [Impact Index Per Article: 38.0] [Received: 02/15/2021] [Accepted: 03/03/2022] [Indexed: 12/30/2022]
Abstract
Typical genotyping workflows map reads to a reference genome before identifying genetic variants. Generating such alignments introduces reference biases and comes with substantial computational burden. Furthermore, short-read lengths limit the ability to characterize repetitive genomic regions, which are particularly challenging for fast k-mer-based genotypers. In the present study, we propose a new algorithm, PanGenie, that leverages a haplotype-resolved pangenome reference together with k-mer counts from short-read sequencing data to genotype a wide spectrum of genetic variation, a process we refer to as genome inference. Compared with mapping-based approaches, PanGenie is more than 4 times faster at 30-fold coverage and achieves better genotype concordances for almost all variant types and coverages tested. Improvements are especially pronounced for large insertions (≥50 bp) and variants in repetitive regions, enabling the inclusion of these classes of variants in genome-wide association studies. PanGenie efficiently leverages the increasing amount of haplotype-resolved assemblies to unravel the functional impact of previously inaccessible variants while being faster compared with alignment-based workflows.
Affiliation(s)
- Jana Ebler
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Peter Ebert
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | | | - Tobias Rausch
- European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany
- European Molecular Biology Laboratory, GeneCore, Heidelberg, Germany
| | - Peter A Audano
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Torsten Houwaart
- Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Yafei Mao
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Jan O Korbel
- European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
| | | | - Alexander T Dilthey
- Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Institute of Medical Statistics and Computational Biology, University of Cologne, Cologne, Germany
- Cologne Excellence Cluster on Cellular Stress Responses in Aging-Associated Diseases, University of Cologne, Cologne, Germany
| | - Tobias Marschall
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany.
| |
|
26
|
Corominas J, Smeekens SP, Nelen MR, Yntema HG, Kamsteeg EJ, Pfundt R, Gilissen C. Clinical exome sequencing - mistakes and caveats. Hum Mutat 2022; 43:1041-1055. [PMID: 35191116 PMCID: PMC9541396 DOI: 10.1002/humu.24360] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Received: 07/10/2021] [Revised: 01/11/2022] [Accepted: 02/18/2022] [Indexed: 11/30/2022]
Abstract
Massively parallel sequencing technology has become the predominant technique for genetic diagnostics and research. Many genetic laboratories have wrestled with the challenges of setting up genetic testing workflows based on a completely new technology. The learning curve we went through as a laboratory was accompanied by growing pains while we gained new knowledge and expertise. Here we discuss some important mistakes that were made in our laboratory over 10 years of clinical exome sequencing but that gave us important new insights into how to adapt our working methods. We provide these examples and the lessons that we learned to help other laboratories avoid making the same mistakes.
Affiliation(s)
- Jordi Corominas
- Department of Human Genetics, Radboud University Medical Center, Nijmegen, The Netherlands
| | - Sanne P Smeekens
- Department of Human Genetics, Radboud University Medical Center, Nijmegen, The Netherlands
| | - Marcel R Nelen
- Department of Human Genetics, Radboud University Medical Center, Nijmegen, The Netherlands
| | - Helger G Yntema
- Department of Human Genetics, Radboud University Medical Center, Nijmegen, The Netherlands; Donders Institute for Brain, Cognition and Behaviour, Radboud University Medical Center, Nijmegen, The Netherlands
| | - Erik-Jan Kamsteeg
- Department of Human Genetics, Radboud University Medical Center, Nijmegen, The Netherlands; Donders Institute for Brain, Cognition and Behaviour, Radboud University Medical Center, Nijmegen, The Netherlands
| | - Rolph Pfundt
- Department of Human Genetics, Radboud University Medical Center, Nijmegen, The Netherlands; Donders Institute for Brain, Cognition and Behaviour, Radboud University Medical Center, Nijmegen, The Netherlands
| | - Christian Gilissen
- Department of Human Genetics, Radboud University Medical Center, Nijmegen, The Netherlands; Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen, The Netherlands
| |
|
27
|
Zhao S, Jiang L, Yu H, Guo Y. GTQC: Automated Genotyping Array Quality Control and Report. J Genomics 2022; 10:39-44. [PMID: 35300047 PMCID: PMC8922302 DOI: 10.7150/jgen.69860] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2021] [Accepted: 01/26/2022] [Indexed: 12/16/2022] Open
Abstract
Genotyping arrays are the most economical approach for conducting large-scale genome-wide genetic association studies. Thorough quality control is key to generating high-integrity genotyping data and robust results. Quality control of genotyping arrays is generally a complicated process, as it requires intensive manual labor in implementing the established protocols and curating a comprehensive quality report. There is an urgent need to reduce manual intervention via an automated quality control process. Based on previously established protocols and strategies, we developed an R package, GTQC (GenoTyping Quality Control), to automate a majority of the quality control steps for general array genotyping data. GTQC covers a comprehensive spectrum of genotype data quality metrics and produces a detailed HTML report comprising tables and figures. Here, we describe the concepts underpinning GTQC and demonstrate its effectiveness using a real genotyping dataset. The R package GTQC streamlines a majority of the quality control steps and produces a detailed HTML report on a plethora of quality control metrics, thus enabling a swift and rigorous data quality inspection prior to downstream GWAS and related analyses. By significantly cutting down the time spent on genotyping quality control procedures, GTQC ensures maximum utilization of available resources and minimizes waste and inefficient allocation of manual effort. The GTQC tool can be accessed at https://github.com/slzhao/GTQC.
Affiliation(s)
- Shilin Zhao
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN
| | - Limin Jiang
- Department Internal Medicine, University of New Mexico, Comprehensive Cancer Center, Albuquerque, NM
| | - Hui Yu
- Department Internal Medicine, University of New Mexico, Comprehensive Cancer Center, Albuquerque, NM
| | - Yan Guo
- Department Internal Medicine, University of New Mexico, Comprehensive Cancer Center, Albuquerque, NM
| |
|
28
|
Avila E, Speransa PA, Lindholz CG, Kahmann A, Alho CS. Haplotype distribution in a forensic full mtDNA genome database of admixed Southern Brazilians and its association with self-declared ancestry and pigmentation traits. Forensic Sci Int Genet 2021; 57:102650. [PMID: 34972071 DOI: 10.1016/j.fsigen.2021.102650] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/22/2021] [Accepted: 12/01/2021] [Indexed: 11/04/2022]
Abstract
BACKGROUND: The advent of massively parallel sequencing (MPS) applications focused on the generation of forensic-quality full mitochondrial genome sequences led to a popularization of the technique on a global scale. However, the lack of forensic-grade population databases has hindered wider adoption of full genome sequences as the industry standard, despite their better capacity to discriminate individual maternal lineages. PURPOSE: This work describes a forensic-oriented full mtDNA genome database comprising 480 samples from a Southern Brazilian population. METHODS: A collection of mitochondrial sequences was obtained from low-pass, full genome DNA sequencing results. The complete sample set was evaluated regarding haplotype composition and distribution. Summary statistics and forensic parameters were calculated and are presented for the database, with detailed information concerning the impact of removing genetic information in the form of specific variants or increasingly larger genomic regions. Interpopulational analysis comparing haplotype diversity in Brazilian and 26 worldwide populations was also performed. The association between mitochondrial genetic variability and phenotypic diversity was also evaluated, with self-declared ancestry and three distinct phenotypic pigmentation traits (eye, skin and hair color) as parameters. RESULTS: The presented database can be used to evaluate mitochondrial-related genetic evidence, providing LR values of up to 20,465 for unobserved haplotypes. Haplotype distribution in Southern Brazil seems to differ from that in the rest of the country, with a larger contribution of maternal lines of European origin. Although associations can be found between lighter and darker phenotypes or self-declared ancestry and haplotype distribution, prediction models cannot be reliably proposed due to the admixed nature of the Brazilian population.
CONCLUSIONS: The proposed database provides a basis for statistical calculation and frequency estimation of full mitochondrial genomes, and can be part of an integrated, representative, national database comprising most of the genetic diversity of maternal lineages in the country.
Affiliation(s)
- Eduardo Avila
- Forensic Genetics Laboratory, School of Health and Life Sciences, Pontifical Catholic University of Rio Grande do Sul, Porto Alegre, RS, Brazil; Technical Scientific Section, Federal Police Department in Rio Grande do Sul State, Porto Alegre, RS, Brazil; National Institute of Science and Technology - Forensic Science, Porto Alegre, RS, Brazil.
| | - Pietro Augusto Speransa
- Forensic Genetics Laboratory, School of Health and Life Sciences, Pontifical Catholic University of Rio Grande do Sul, Porto Alegre, RS, Brazil.
| | - Catieli Gobetti Lindholz
- Forensic Genetics Laboratory, School of Health and Life Sciences, Pontifical Catholic University of Rio Grande do Sul, Porto Alegre, RS, Brazil.
| | - Alessandro Kahmann
- National Institute of Science and Technology - Forensic Science, Porto Alegre, RS, Brazil; Institute of Mathematics, Statistics and Physics, Federal University of Rio Grande, Rio Grande, RS, Brazil.
| | - Clarice Sampaio Alho
- Forensic Genetics Laboratory, School of Health and Life Sciences, Pontifical Catholic University of Rio Grande do Sul, Porto Alegre, RS, Brazil; National Institute of Science and Technology - Forensic Science, Porto Alegre, RS, Brazil.
| |
|
29
|
Christensen KA, Rondeau EB, Sakhrani D, Biagi CA, Johnson H, Joshi J, Flores AM, Leelakumari S, Moore R, Pandoh PK, Withler RE, Beacham TD, Leggatt RA, Tarpey CM, Seeb LW, Seeb JE, Jones SJM, Devlin RH, Koop BF. The pink salmon genome: Uncovering the genomic consequences of a two-year life cycle. PLoS One 2021; 16:e0255752. [PMID: 34919547 PMCID: PMC8682878 DOI: 10.1371/journal.pone.0255752] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/21/2021] [Accepted: 12/02/2021] [Indexed: 12/30/2022] Open
Abstract
Pink salmon (Oncorhynchus gorbuscha) adults are the smallest of the five Pacific salmon native to the western Pacific Ocean. Pink salmon are also the most abundant of these species and account for a large proportion of the commercial value of the salmon fishery worldwide. A two-year life history of pink salmon generates temporally isolated populations that spawn either in even-years or odd-years. To uncover the influence of this genetic isolation, reference genome assemblies were generated for each year-class and whole genome re-sequencing data was collected from salmon of both year-classes. The salmon were sampled from six Canadian rivers and one Japanese river. At multiple centromeres we identified peaks of Fst between year-classes that were millions of base-pairs long. The largest Fst peak was also associated with a million base-pair chromosomal polymorphism found in the odd-year genome near a centromere. These Fst peaks may be the result of a centromere drive or a combination of reduced recombination and genetic drift, and they could influence speciation. Other regions of the genome influenced by odd-year and even-year temporal isolation and tentatively under selection were mostly associated with genes related to immune function, organ development/maintenance, and behaviour.
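As a rough illustration of the Fst statistic reported between year-classes, the sketch below computes Wright's Fst at a single biallelic site from the allele frequencies of two equally weighted populations. The abstract does not say which estimator the authors used, so this is only the textbook definition; the function name and frequencies are hypothetical.

```python
def wright_fst(p1: float, p2: float) -> float:
    """Wright's Fst for one biallelic site and two equally weighted populations.

    p1 and p2 are the frequencies of the same allele in each population.
    """
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("allele frequencies must lie in [0, 1]")
    p_bar = (p1 + p2) / 2.0
    # Expected heterozygosity of the pooled population.
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        # Site is fixed for the same allele in both populations.
        return 0.0
    # Mean expected heterozygosity within populations.
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total


# Hypothetical allele frequencies for odd-year and even-year fish.
fst_site = wright_fst(0.2, 0.8)
```

Genome scans typically average such per-site values over sliding windows and use sample-size-corrected estimators, which this sketch omits.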
Affiliation(s)
- Kris A. Christensen
- West Vancouver, Fisheries and Oceans Canada, British Columbia, Canada
- Department of Biology, University of Victoria, Victoria, British Columbia, Canada
| | - Eric B. Rondeau
- West Vancouver, Fisheries and Oceans Canada, British Columbia, Canada
- Department of Biology, University of Victoria, Victoria, British Columbia, Canada
- Pacific Biological Station, Fisheries and Oceans Canada, Nanaimo, British Columbia, Canada
| | - Dionne Sakhrani
- West Vancouver, Fisheries and Oceans Canada, British Columbia, Canada
| | - Carlo A. Biagi
- West Vancouver, Fisheries and Oceans Canada, British Columbia, Canada
| | - Hollie Johnson
- Department of Biology, University of Victoria, Victoria, British Columbia, Canada
| | - Jay Joshi
- Department of Biology, University of Victoria, Victoria, British Columbia, Canada
| | - Anne-Marie Flores
- Department of Biology, University of Victoria, Victoria, British Columbia, Canada
| | - Sreeja Leelakumari
- Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
| | - Richard Moore
- Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
| | - Pawan K. Pandoh
- Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
| | - Ruth E. Withler
- Pacific Biological Station, Fisheries and Oceans Canada, Nanaimo, British Columbia, Canada
| | - Terry D. Beacham
- Pacific Biological Station, Fisheries and Oceans Canada, Nanaimo, British Columbia, Canada
| | | | - Carolyn M. Tarpey
- School of Aquatic and Fishery Sciences, University of Washington, Seattle, Washington, United States of America
| | - Lisa W. Seeb
- School of Aquatic and Fishery Sciences, University of Washington, Seattle, Washington, United States of America
| | - James E. Seeb
- School of Aquatic and Fishery Sciences, University of Washington, Seattle, Washington, United States of America
| | - Steven J. M. Jones
- Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
| | - Robert H. Devlin
- West Vancouver, Fisheries and Oceans Canada, British Columbia, Canada
| | - Ben F. Koop
- Department of Biology, University of Victoria, Victoria, British Columbia, Canada
| |
|
30
|
Durward-Akhurst SA, Schaefer RJ, Grantham B, Carey WK, Mickelson JR, McCue ME. Genetic Variation and the Distribution of Variant Types in the Horse. Front Genet 2021; 12:758366. [PMID: 34925451 PMCID: PMC8676274 DOI: 10.3389/fgene.2021.758366] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 08/16/2021] [Accepted: 11/10/2021] [Indexed: 11/13/2022] Open
Abstract
Genetic variation is a key contributor to health and disease. Understanding the link between an individual's genotype and the corresponding phenotype is a major goal of medical genetics. Whole genome sequencing (WGS) within and across populations enables highly efficient variant discovery and elucidation of the molecular nature of virtually all genetic variation. Here, we report the largest catalog of genetic variation for the horse, a species of importance as a model for human athletic and performance related traits, using WGS of 534 horses. We show the extent of agreement between two commonly used variant callers. In data from ten target breeds that represent major breed clusters in the domestic horse, we demonstrate the distribution of variants, their allele frequencies across breeds, and identify variants that are unique to a single breed. We investigate variants with no homozygotes that may be potential embryonic lethal variants, as well as variants present in all individuals that likely represent regions of the genome with errors, poor annotation or where the reference genome carries a variant. Finally, we show regions of the genome that have higher or lower levels of genetic variation compared to the genome average. This catalog can be used for variant prioritization for important equine diseases and traits, and to provide key information about regions of the genome where the assembly and/or annotation need to be improved.
Affiliation(s)
- S. A. Durward-Akhurst
- Department of Veterinary Population Medicine, University of Minnesota, Minneapolis, MN, United States
| | - R. J. Schaefer
- Department of Veterinary Population Medicine, University of Minnesota, Minneapolis, MN, United States
| | - B. Grantham
- Interval Bio LLC, Mountain View, CA, United States
| | - W. K. Carey
- Interval Bio LLC, Mountain View, CA, United States
| | - J. R. Mickelson
- Department of Veterinary and Biomedical Sciences, University of Minnesota, Minneapolis, MN, United States
| | - M. E. McCue
- Department of Veterinary Population Medicine, University of Minnesota, Minneapolis, MN, United States
| |
Collapse
|
31
|
Kovalchik KA, Ma Q, Wessling L, Saab F, Despault J, Kubiniok P, Hamelin DJ, Faridi P, Li C, Purcell AW, Jang A, Paramithiotis E, Tognetti M, Reiter L, Bruderer R, Lanoix J, Bonneil É, Courcelles M, Thibault P, Caron E, Sirois I. MhcVizPipe: A Quality Control Software for Rapid Assessment of Small- to Large-Scale Immunopeptidome Data Sets. Mol Cell Proteomics 2021; 21:100178. [PMID: 34798331 PMCID: PMC8717601 DOI: 10.1016/j.mcpro.2021.100178] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 05/21/2021] [Revised: 10/28/2021] [Accepted: 11/01/2021] [Indexed: 12/12/2022] Open
Abstract
Mass spectrometry (MS)-based immunopeptidomics is maturing into an automatized, high-throughput technology, producing small- to large-scale datasets of clinically relevant MHC class I- and II-associated peptides. Consequently, the development of quality control (QC) and quality assurance (QA) systems capable of detecting sample and/or measurement issues is important for instrument operators and scientists in charge of downstream data interpretation. Here, we created MhcVizPipe (MVP), a semi-automated QC software tool that enables rapid and simultaneous assessment of multiple MHC class I and II immunopeptidomic datasets generated by MS, including datasets generated from large sample cohorts. In essence, MVP provides a rapid and consolidated view of sample quality, composition and MHC-specificity to greatly accelerate the 'pass-fail' QC decision-making process toward data interpretation. MVP parallelizes the use of well-established immunopeptidomic algorithms (NetMHCpan, NetMHCIIpan and GibbsCluster) and rapidly generates organized and easy-to-understand reports in HTML format. The reports are fully portable and can be viewed on any computer with a modern web browser. MVP is intuitive to use and will find utility in any specialized immunopeptidomic laboratory and proteomics core facility that provides immunopeptidomic services to the community.
Affiliation(s)
- Qing Ma
  - School of Electrical Engineering and Computer Science, Faculty of Engineering, University of Ottawa, ON K1N 6N5, Canada
- Laura Wessling
  - CHU Sainte-Justine Research Center, Montreal, QC H3T 1C5, Canada
- Frederic Saab
  - CHU Sainte-Justine Research Center, Montreal, QC H3T 1C5, Canada
- Jérôme Despault
  - CHU Sainte-Justine Research Center, Montreal, QC H3T 1C5, Canada
- Peter Kubiniok
  - CHU Sainte-Justine Research Center, Montreal, QC H3T 1C5, Canada
- David J Hamelin
  - CHU Sainte-Justine Research Center, Montreal, QC H3T 1C5, Canada
- Pouya Faridi
  - Infection and Immunity Program and Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Clayton, Victoria 3800, Australia
- Chen Li
  - Infection and Immunity Program and Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Clayton, Victoria 3800, Australia
- Anthony W Purcell
  - Infection and Immunity Program and Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Clayton, Victoria 3800, Australia
- Anne Jang
  - CellCarta, Montreal, QC H2X 3Y7, Canada
- Lukas Reiter
  - Biognosys, Wagistrasse 21, 8952 Schlieren, Switzerland
- Joël Lanoix
  - Institute of Research in Immunology and Cancer, Montreal, QC H3T 1J4, Canada
- Éric Bonneil
  - Institute of Research in Immunology and Cancer, Montreal, QC H3T 1J4, Canada
- Mathieu Courcelles
  - Institute of Research in Immunology and Cancer, Montreal, QC H3T 1J4, Canada
- Pierre Thibault
  - Institute of Research in Immunology and Cancer, Montreal, QC H3T 1J4, Canada
  - Department of Chemistry, Université de Montréal, Montreal, QC H3T 1J4, Canada
- Etienne Caron
  - CHU Sainte-Justine Research Center, Montreal, QC H3T 1C5, Canada
  - Department of Pathology and Cellular Biology, Faculty of Medicine, Université de Montréal, QC H3T 1J4, Canada
- Isabelle Sirois
  - CHU Sainte-Justine Research Center, Montreal, QC H3T 1C5, Canada
|
32
|
Jiang L, Guo Y, Yu H, Hoff K, Ding X, Zhou W, Edwards J. Detecting SARS-CoV-2 and its variant strains with a full genome tiling array. Brief Bioinform 2021; 22:bbab213. [PMID: 34097003 PMCID: PMC8344516 DOI: 10.1093/bib/bbab213] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/12/2021] [Revised: 05/04/2021] [Accepted: 05/15/2021] [Indexed: 11/13/2022] Open
Abstract
The coronavirus disease 2019 pandemic is the most damaging pandemic in recent human history. Rapid detection of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and variant strains is paramount for recovery from this pandemic. Conventional SARS-CoV-2 tests interrogate only limited regions of the whole SARS-CoV-2 genome, which are subject to low specificity and miss the opportunity of detecting variant strains. In this work, we developed the first SARS-CoV-2 tiling array that captures the entire SARS-CoV-2 genome at single nucleotide resolution and offers the opportunity to detect point mutations. A thorough bioinformatics protocol of two base calling methods has been developed to accompany this array. To demonstrate the effectiveness of the tiling array, we genotyped all genomic positions of eight SARS-CoV-2 samples. Using high-throughput sequencing as the benchmark, we show that the tiling array had a genome-wide accuracy of at least 99.5%. From the tiling array analysis results, we identified the D614G mutation in the spike protein in four of the eight samples, suggesting the widespread distribution of this variant at the early stage of the outbreak in the United States. Two additional nonsynonymous mutations were identified in one sample in the nucleocapsid protein (P13L and S197L), which may complicate future vaccine development. With around $5 per array, supreme accuracy, and an ultrafast bioinformatics protocol, the SARS-CoV-2 tiling array makes an invaluable toolkit for combating current and future pandemics. Our SARS-CoV-2 tiling array is currently utilized by Molecular Vision, a CLIA-certified lab for SARS-CoV-2 diagnosis.
Affiliation(s)
- Limin Jiang
  - University of New Mexico, Albuquerque, NM 87131, USA
- Yan Guo
  - Department of Internal Medicine, University of New Mexico, Albuquerque, NM 87131, USA
- Hui Yu
  - Department of Internal Medicine, University of New Mexico, Albuquerque, NM 87131, USA
- Kendal Hoff
  - Centrillion Biosciences, Albuquerque, NM 87131, USA
- Xun Ding
  - Centrillion Biosciences, Albuquerque, NM 87131, USA
- Wei Zhou
  - Centrillion Biosciences, Albuquerque, NM 87131, USA
- Jeremy Edwards
  - Department of Chemistry, University of New Mexico, Albuquerque, NM 87131, USA
|
33
|
Sprang M, Krüger M, Andrade-Navarro MA, Fontaine JF. Statistical guidelines for quality control of next-generation sequencing techniques. Life Sci Alliance 2021; 4:4/11/e202101113. [PMID: 34462322 PMCID: PMC8408346 DOI: 10.26508/lsa.202101113] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/05/2021] [Revised: 08/17/2021] [Accepted: 08/10/2021] [Indexed: 12/24/2022] Open
Abstract
More and more next-generation sequencing (NGS) data are made available every day. However, the quality of this data is not always guaranteed. Available quality control tools require profound knowledge to correctly interpret the multiplicity of quality features. Moreover, it is usually difficult to know if quality features are relevant in all experimental conditions. Therefore, the NGS community would highly benefit from condition-specific data-driven guidelines derived from many publicly available experiments, which reflect routinely generated NGS data. In this work, we have characterized well-known quality guidelines and related features in big datasets and concluded that they are too limited for assessing the quality of a given NGS file accurately. Therefore, we present new data-driven guidelines derived from the statistical analysis of many public datasets using quality features calculated by common bioinformatics tools. Thanks to this approach, we confirm the high relevance of genome mapping statistics to assess the quality of the data, and we demonstrate the limited scope of some quality features that are not relevant in all conditions. Our guidelines are available at https://cbdm.uni-mainz.de/ngs-guidelines.
Affiliation(s)
- Maximilian Sprang
  - Faculty of Biology, Johannes Gutenberg-Universität Mainz, Biozentrum I, Mainz, Germany
- Matteo Krüger
  - Faculty of Biology, Johannes Gutenberg-Universität Mainz, Biozentrum I, Mainz, Germany
- Jean-Fred Fontaine
  - Faculty of Biology, Johannes Gutenberg-Universität Mainz, Biozentrum I, Mainz, Germany
|
34
|
Possible Protective Effect of LOXL1 Variant in the Cohort of Chernobyl Catastrophe Clean-Up Workers. Genes (Basel) 2021; 12:genes12081231. [PMID: 34440405 PMCID: PMC8392314 DOI: 10.3390/genes12081231] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2021] [Revised: 08/06/2021] [Accepted: 08/08/2021] [Indexed: 11/16/2022] Open
Abstract
Ionising radiation (IR) is an environmental factor known to alter genomes and therefore challenge organisms to adapt. Lithuanian clean-up workers of the Chernobyl nuclear disaster (LCWC) experienced high doses of IR, leading to different consequences. This study aims to characterise a unique protective genomic variation in a relatively healthy LCWC group. This variation influenced their individual reaction to IR and potentially protects against certain diseases such as exfoliation syndrome and glaucoma. Clinical and IR dosage data were collected using a questionnaire to characterise the cohort of 93 LCWC. Genome-wide genotyping using Illumina beadchip technology was performed. The control group included 466 unrelated, self-reported healthy individuals of Lithuanian descent. Genotypes were filtered out from the microarray dataset using a catalogue of SNPs. The data were used to perform association, linkage disequilibrium, and epistasis analysis. Phenotype data analysis showed the distribution of the most common disease groups among the LCWC. A genomic variant of statistical significance (Fisher's exact test, p = 0.019), rs3825942, was identified in LOXL1 (NM_005576.4:c.458G>A). Linkage disequilibrium and epistasis analysis for this variant identified the genes LHFPL3, GALNT6, PIH1D1, ANKS1B, and METRNL as potentially involved in the etiopathogenesis of exfoliation syndrome and glaucoma, which were not previously associated with the disease. The LOXL1 variant is mostly considered a risk factor in the development of exfoliation syndrome and glaucoma. The influence of recent positive selection, the phenomenon of allele-flipping, and the fact that only individuals with the homozygous reference allele have glaucoma in the cohort of the LCWC suggest otherwise. The identification of rs3825942 and other potentially protective genomic variants may be useful for further analysis of the genetic architecture and etiopathogenetic mechanisms of other multifactorial diseases.
|
35
|
Liu J, Deng Y, Yu B, Mo B, Luo L, Yang J, Zhang X, Wang Z, Wang Y, Zhu J, Yang H, Fang S, Cheng Z, Li J, Shu Y, Luo G, Xiong W, Wei J, Li Z. Targeted resequencing showing novel common and rare genetic variants increases the risk of asthma in the Chinese Han population. J Clin Lab Anal 2021; 35:e23813. [PMID: 33969541 DOI: 10.1002/jcla.23813] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/07/2021] [Revised: 04/16/2021] [Indexed: 11/11/2022] Open
Abstract
BACKGROUND: Although studies have identified hundreds of genetic variants associated with asthma risk, a large fraction of heritability remains unexplained, especially in Chinese individuals. METHODS: To identify genetic risk factors for asthma in a Han Chinese population, 211 asthma-related genes were first selected based on database searches. The genes were then sequenced for subjects in a Discovery Cohort (284 asthma patients and 205 older healthy controls) using targeted next-generation sequencing. Bioinformatics analysis and statistical association analyses were performed to reveal the associations between rare/common variants and asthma, respectively. The identified common risk variants underwent a validation analysis using a Replication Cohort (664 patients and 650 controls). RESULTS: First, we identified 18 potentially functional rare loss-of-function (LOF) variants in 21/284 (7.4%) of the asthma cases. Second, using burden tests, we found that the asthma group had nominally significant (p < 0.05) burdens of rare nonsynonymous variants in 10 genes. Third, 23 common single-nucleotide polymorphisms were associated with the risk of asthma, 7/23 (30.4%) and 9/23 (39.1%) of which were modestly significant (p < 9.1 × 10⁻⁴) in the Replication Cohort and Combined Cohort, respectively. According to our cumulative risk model involving the modestly associated alleles, middle- and high-risk subjects had a 2.0-fold (95% CI: 1.621-2.423, p = 2.624 × 10⁻¹¹) and 6.0-fold (95% CI: 3.623-10.156, p = 7.086 × 10⁻¹²) increased risk of asthma, respectively, compared with low-risk subjects. CONCLUSION: This study revealed novel rare and common genetic risk factors for asthma, and provided a cumulative risk model for asthma risk prediction and stratification in Han Chinese individuals.
Affiliation(s)
- Juan Liu
  - Department of Respiratory and Critical Care Medicine, Key Laboratory of Pulmonary Diseases of Health Ministry, Key Cite of National Clinical Research Center for Respiratory Disease, Wuhan Clinical Medical Research Center for Chronic Airway Diseases, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
- Yanhan Deng
  - Department of Respiratory and Critical Care Medicine, Key Laboratory of Pulmonary Diseases of Health Ministry, Key Cite of National Clinical Research Center for Respiratory Disease, Wuhan Clinical Medical Research Center for Chronic Airway Diseases, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
- Bo Yu
  - Division of Cardiology, Departments of Internal Medicine and Genetic Diagnosis Center, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
- Biwen Mo
  - Department of Respiratory Medicine, Affiliated Hospital of Guilin Medical University, Guilin, China
- Liman Luo
  - Department of Pediatrics, The 306 Hospital of People's Liberation Army, Beijing, China
- Jingping Yang
  - Department of Respiratory and Critical Care Medicine, The Third Affiliated Hospital of Inner Mongolia Medical University, Baotou, China
- Xiaoju Zhang
  - Department of Respiratory Medicine, Henan Provincial People's Hospital & the People's Hospital of Zhengzhou University, Zhengzhou, China
- Zheng Wang
  - Department of Respiratory Medicine, Henan Provincial People's Hospital & the People's Hospital of Zhengzhou University, Zhengzhou, China
- Yingnan Wang
  - Department of Respiratory and Critical Care Medicine, Renmin Hospital of Three Gorges University, Yichang, China
- Jing Zhu
  - Department of Respiratory and Critical Care Medicine, Renmin Hospital of Three Gorges University, Yichang, China
- Hua Yang
  - Department of Respiratory Medicine, University Hospital of Hubei University for Nationalities, Enshi, China
- Shirong Fang
  - Department of Respiratory Medicine, University Hospital of Hubei University for Nationalities, Enshi, China
- Zhenshun Cheng
  - Department of Respiratory Medicine, Zhongnan Hospital of Wuhan University, Wuhan University, Wuhan, China
- Jingping Li
  - Department of Respiratory Medicine, Qianjiang Central Hospital, Qianjiang, China
- Ying Shu
  - Department of Respiratory Medicine, Qianjiang Central Hospital, Qianjiang, China
- Guangwei Luo
  - Department of Respiratory Medicine, Wuhan No. 1 Hospital, Wuhan, China
- Weining Xiong
  - Department of Respiratory and Critical Care Medicine, Key Laboratory of Pulmonary Diseases of Health Ministry, Key Cite of National Clinical Research Center for Respiratory Disease, Wuhan Clinical Medical Research Center for Chronic Airway Diseases, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
  - Department of Respiratory Medicine, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Jianghong Wei
  - Department of Respiratory Medicine, Affiliated Hospital of Guilin Medical University, Guilin, China
- Zongzhe Li
  - Division of Cardiology, Departments of Internal Medicine and Genetic Diagnosis Center, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
|
36
|
Soleimani-Delfan A, Bouzari M, Wang R. A rapid competitive method for bacteriophage genomic DNA extraction. J Virol Methods 2021; 293:114148. [PMID: 33831496 DOI: 10.1016/j.jviromet.2021.114148] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 10/29/2020] [Revised: 03/30/2021] [Accepted: 03/30/2021] [Indexed: 10/21/2022]
Abstract
Bacteriophage (phage) DNA extraction for genomic analysis is a critical and time-consuming process. Hence, a rapid and cost-effective method for DNA extraction of phages is favorable for phage biologists. In the present study, a cost-effective, simple and rapid procedure for phage genome extraction in less than 10 min is introduced. Highly concentrated phage lysates were prepared using acetone precipitation followed by extraction using various methods such as commercial kits, TES lysis buffer, potassium iodide, and sodium iodide. The quality of the extracted DNA was analyzed by agarose gel electrophoresis and UV absorbance of DNA at 260 and 280 nm. Finally, the extracted DNA was subjected to restriction digestion and next-generation sequencing to confirm the efficiency of the method. Based on the time, cost, and quality of the obtained DNA, acetone precipitation of phages followed by extraction with potassium iodide or sodium iodide was determined to be the best of the phage DNA extraction methods tested in this study. Moreover, genomic DNA extracted using this method is suitable for phage genomic analyses such as restriction enzyme studies, DNA library preparation, and next-generation sequencing.
Affiliation(s)
- Abbas Soleimani-Delfan
  - Department of Cell and Molecular Biology & Microbiology, Faculty of Biological Science and Technology, University of Isfahan, HezarJereeb Street, 81746-73441, Isfahan, Iran
- Majid Bouzari
  - Department of Cell and Molecular Biology & Microbiology, Faculty of Biological Science and Technology, University of Isfahan, HezarJereeb Street, 81746-73441, Isfahan, Iran
- Ran Wang
  - Institute of Food Safety and Nutrition, Jiangsu Academy of Agricultural Sciences, Nanjing, Jiangsu, China
|
37
|
Garcia-Garcia S, Cortese MF, Rodríguez-Algarra F, Tabernero D, Rando-Segura A, Quer J, Buti M, Rodríguez-Frías F. Next-generation sequencing for the diagnosis of hepatitis B: current status and future prospects. Expert Rev Mol Diagn 2021; 21:381-396. [PMID: 33880971 DOI: 10.1080/14737159.2021.1913055] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 02/07/2023]
Abstract
INTRODUCTION: Hepatitis B virus (HBV) causes a complex and persistent infection with a major impact on patients' health. Viral-genome sequencing can provide valuable information for characterizing virus genotype, infection dynamics and drug and vaccine resistance. AREAS COVERED: This article reviews the current literature to describe the next-generation sequencing progress that facilitated a more comprehensive study of HBV quasispecies in diagnosis and clinical monitoring. EXPERT OPINION: HBV variability plays a key role in liver disease progression and treatment efficacy. Second-generation sequencing improved the sensitivity for detecting and quantifying mutations, mixed genotypes and viral recombination. Third-generation sequencing enables the analysis of the entire HBV genome, although the high error rate limits its use in clinical practice.
Affiliation(s)
- Selene Garcia-Garcia
  - Liver Pathology Unit, Departments of Biochemistry and Microbiology, Hospital Universitari Vall d'Hebron, Universitat Autònoma de Barcelona, Barcelona, Spain
  - Clinical Biochemistry Research Group, Vall d'Hebron Institut Recerca (VHIR), Hospital Universitari Vall d'Hebron, Universitat Autònoma de Barcelona, Barcelona, Spain
- Maria Francesca Cortese
  - Liver Pathology Unit, Departments of Biochemistry and Microbiology, Hospital Universitari Vall d'Hebron, Universitat Autònoma de Barcelona, Barcelona, Spain
  - Clinical Biochemistry Research Group, Vall d'Hebron Institut Recerca (VHIR), Hospital Universitari Vall d'Hebron, Universitat Autònoma de Barcelona, Barcelona, Spain
- Francisco Rodríguez-Algarra
  - Blizard Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, UK
- David Tabernero
  - Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas, Instituto de Salud Carlos III, Madrid, Spain
- Ariadna Rando-Segura
  - Liver Pathology Unit, Departments of Biochemistry and Microbiology, Hospital Universitari Vall d'Hebron, Universitat Autònoma de Barcelona, Barcelona, Spain
- Josep Quer
  - Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas, Instituto de Salud Carlos III, Madrid, Spain
  - Liver Unit, Liver Disease Laboratory-Viral Hepatitis, Vall d'Hebron Institut Recerca-Hospital Universitari Vall d'Hebron, Universitat Autònoma de Barcelona, Barcelona, Spain
- Maria Buti
  - Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas, Instituto de Salud Carlos III, Madrid, Spain
  - Liver Unit, Department of Internal Medicine, Hospital Universitari Vall d'Hebron, Universitat Autònoma de Barcelona, Barcelona, Spain
- Francisco Rodríguez-Frías
  - Liver Pathology Unit, Departments of Biochemistry and Microbiology, Hospital Universitari Vall d'Hebron, Universitat Autònoma de Barcelona, Barcelona, Spain
  - Clinical Biochemistry Research Group, Vall d'Hebron Institut Recerca (VHIR), Hospital Universitari Vall d'Hebron, Universitat Autònoma de Barcelona, Barcelona, Spain
  - Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas, Instituto de Salud Carlos III, Madrid, Spain
|
38
|
A streamlined solution for processing, elucidating and quality control of cyclobutane pyrimidine dimer sequencing data. Nat Protoc 2021; 16:2190-2212. [PMID: 33731963 DOI: 10.1038/s41596-021-00496-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 05/29/2020] [Accepted: 01/06/2021] [Indexed: 01/13/2023]
Abstract
UV radiation may lead to melanoma and nonmelanoma skin cancers by causing helix-distorting DNA damage such as cyclobutane pyrimidine dimers (CPDs). These DNA lesions, if located in important genes and not repaired promptly, are mutagenic and may eventually result in carcinogenesis. Examining CPD formation and repair processes across the genome can shed light on the mutagenesis mechanisms associated with UV damage in relevant cancers. We recently developed CPD-Seq, a high-throughput and single-nucleotide resolution sequencing technique that can specifically capture UV-induced CPD lesions across the genome. This novel technique has been increasingly used in studies of UV damage and can be adapted to sequence other clinically relevant DNA lesions. Although the library preparation protocol has been established, a systematic protocol to analyze CPD-Seq data has not been described yet. To streamline the various general or specific analysis steps, we developed a protocol named CPDSeqer to assist researchers with CPD-Seq data processing. CPDSeqer can accommodate both a single- and multiple-sample experimental design, and it allows both genome-wide analyses and regional scrutiny (such as of suspected UV damage hotspots). The runtime of CPDSeqer scales with raw data size and takes roughly 4 h per sample with the possibility of acceleration by parallel computing. Various guiding graphics are generated to help diagnose the performance of the experiment and inform regional enrichment of CPD formation. UV damage comparison analyses are set forth in three analysis scenarios, and the resulting HTML pages report damage directional trends and statistical significance. CPDSeqer can be accessed at https://github.com/shengqh/cpdseqer.
|
39
|
Blackman A, Morrison B, Maruri F, van der Heijden Y, Nochowicz CH, Guo Y, Scholz M, Rustad T, Sherman DR, Sterling TR. Re-evaluation of a novel resistance mutation in eccC5 of the ESX-5 secretion system in ofloxacin-resistant Mycobacterium tuberculosis. J Antimicrob Chemother 2021; 76:820-822. [PMID: 33367727 DOI: 10.1093/jac/dkaa507] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022] Open
Affiliation(s)
- Amondrea Blackman
  - Division of Infectious Diseases, Department of Medicine, Vanderbilt University School of Medicine, Nashville, TN, USA
  - Vanderbilt Tuberculosis Center, Vanderbilt University School of Medicine, Nashville, TN, USA
- Fernanda Maruri
  - Division of Infectious Diseases, Department of Medicine, Vanderbilt University School of Medicine, Nashville, TN, USA
  - Vanderbilt Tuberculosis Center, Vanderbilt University School of Medicine, Nashville, TN, USA
- Yuri van der Heijden
  - Division of Infectious Diseases, Department of Medicine, Vanderbilt University School of Medicine, Nashville, TN, USA
  - Vanderbilt Tuberculosis Center, Vanderbilt University School of Medicine, Nashville, TN, USA
  - The Aurum Institute, Johannesburg, South Africa
- Cindy Hager Nochowicz
  - Division of Infectious Diseases, Department of Medicine, Vanderbilt University School of Medicine, Nashville, TN, USA
  - Vanderbilt Tuberculosis Center, Vanderbilt University School of Medicine, Nashville, TN, USA
- Yan Guo
  - University of New Mexico, Albuquerque, NM, USA
- Matthew Scholz
  - Vanderbilt Technologies for Advanced Genomics (VANTAGE) Core, Vanderbilt University Medical Center, Nashville, TN, USA
- Tige Rustad
  - Seattle Children's Hospital, Seattle, WA, USA
- David R Sherman
  - Seattle Children's Hospital, Seattle, WA, USA
  - Department of Microbiology, University of Washington, Seattle, WA, USA
- Timothy R Sterling
  - Division of Infectious Diseases, Department of Medicine, Vanderbilt University School of Medicine, Nashville, TN, USA
  - Vanderbilt Tuberculosis Center, Vanderbilt University School of Medicine, Nashville, TN, USA
|
40
|
Kairov U, Molkenov A, Rakhimova S, Kozhamkulov U, Sharip A, Karabayev D, Daniyarov A, Lee JH, Terwilliger JD, Akilzhanova A, Zhumadilov Z. Whole-genome sequencing data of Kazakh individuals. BMC Res Notes 2021; 14:45. [PMID: 33541395 PMCID: PMC7863413 DOI: 10.1186/s13104-021-05464-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/02/2020] [Accepted: 01/28/2021] [Indexed: 11/22/2022] Open
Abstract
Objectives: Kazakhstan is a Central Asian crossroad of European and Asian populations situated along the Great Silk Road. The territory of Kazakhstan has historically been inhabited by nomadic tribes and today is a multi-ethnic country with a dominant Kazakh ethnic group. We sequenced and analyzed the whole genomes of five healthy ethnic Kazakh individuals at high coverage using a next-generation sequencing platform. These whole-genome sequence data of healthy Kazakh individuals can be a valuable reference for biomedical studies investigating disease associations and for population-wide genomic studies of the ethnically diverse Central Asian region. Data description: Blood samples were collected from five healthy ethnic Kazakh individuals living in Kazakhstan. Genomic DNA was extracted from blood and sequenced on the Illumina HiSeq2000 next-generation sequencing platform. Coverage of the whole genomes ranged from 26 to 32X. Between 98.85% and 99.58% of base pairs were mapped and aligned to the human reference genome GRCh37 (hg19). Het/Hom and Ts/Tv ratios for each whole genome ranged from 1.35 to 1.49 and from 2.07 to 2.08, respectively. Sequencing data are available in the National Center for Biotechnology Information SRA database under the accession number PRJNA374772.
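The Het/Hom and Ts/Tv ratios reported in this abstract are standard per-sample QC metrics. As a minimal sketch of how they are conventionally defined (not the authors' pipeline; the function names and toy inputs below are hypothetical), Ts/Tv counts purine-purine or pyrimidine-pyrimidine substitutions against all other single-nucleotide substitutions, and Het/Hom counts heterozygous genotypes against homozygous non-reference genotypes:

```python
# Illustrative only: toy definitions of two common variant-level QC ratios.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """A transition swaps a purine for a purine or a pyrimidine for a pyrimidine."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(snvs: list[tuple[str, str]]) -> float:
    """Ratio of transitions to transversions over (ref, alt) SNV pairs."""
    ts = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    tv = len(snvs) - ts
    return ts / tv if tv else float("inf")


def het_hom_ratio(genotypes: list[tuple[int, int]]) -> float:
    """Ratio of heterozygous calls to homozygous non-reference calls.

    Genotypes are allele-index pairs, with 0 denoting the reference allele.
    Homozygous-reference calls (0, 0) are ignored.
    """
    het = sum(1 for a, b in genotypes if a != b)
    hom_alt = sum(1 for a, b in genotypes if a == b and a != 0)
    return het / hom_alt if hom_alt else float("inf")


# Toy example: three transitions and one transversion.
print(ts_tv_ratio([("A", "G"), ("C", "T"), ("A", "C"), ("G", "A")]))  # 3.0
# Toy example: two heterozygous calls and one homozygous-alternate call.
print(het_hom_ratio([(0, 1), (0, 1), (1, 1), (0, 0)]))  # 2.0
```

In practice these counts would be taken from a sample's variant calls; the toy lists only illustrate the arithmetic.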
Affiliation(s)
- Ulykbek Kairov
  - Laboratory of Bioinformatics and Systems Biology, Center for Life Sciences, National Laboratory Astana, Nazarbayev University, Nur-Sultan, Kazakhstan
- Askhat Molkenov
  - Laboratory of Bioinformatics and Systems Biology, Center for Life Sciences, National Laboratory Astana, Nazarbayev University, Nur-Sultan, Kazakhstan
- Saule Rakhimova
  - Laboratory of Genomic and Personalized Medicine, Center for Life Sciences, National Laboratory Astana, Nazarbayev University, Nur-Sultan, Kazakhstan
- Ulan Kozhamkulov
  - Laboratory of Genomic and Personalized Medicine, Center for Life Sciences, National Laboratory Astana, Nazarbayev University, Nur-Sultan, Kazakhstan
- Aigul Sharip
  - Laboratory of Bioinformatics and Systems Biology, Center for Life Sciences, National Laboratory Astana, Nazarbayev University, Nur-Sultan, Kazakhstan
- Daniyar Karabayev
  - Laboratory of Bioinformatics and Systems Biology, Center for Life Sciences, National Laboratory Astana, Nazarbayev University, Nur-Sultan, Kazakhstan
- Asset Daniyarov
  - Laboratory of Bioinformatics and Systems Biology, Center for Life Sciences, National Laboratory Astana, Nazarbayev University, Nur-Sultan, Kazakhstan
- Ainur Akilzhanova
  - Laboratory of Genomic and Personalized Medicine, Center for Life Sciences, National Laboratory Astana, Nazarbayev University, Nur-Sultan, Kazakhstan
|
41
|
Global Autozygosity Is Associated with Cancer Risk, Mutational Signature and Prognosis. Cancers (Basel) 2020; 12:cancers12123646. [PMID: 33291726 PMCID: PMC7761949 DOI: 10.3390/cancers12123646] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/03/2020] [Revised: 11/25/2020] [Accepted: 12/01/2020] [Indexed: 11/16/2022] Open
Abstract
Simple Summary: Global autozygosity in the form of runs of homozygosity is associated with various diseases. Heterozygosity ratio, an alternative measure of global autozygosity, is used to assess cancer risk in this study. Our analysis shows strong and consistent associations between heterozygosity ratios and various cancer types. Further analysis reveals the heterozygosity ratio’s potential connections to mutational signatures and cancer prognosis. Abstract: Global autozygosity quantifies the genome-wide levels of homozygous and heterozygous variants. It is the signature of non-random reproduction, though it can also be driven by other factors, and has been used to assess risk in various diseases. However, the association between global autozygosity and cancer risk has not been studied. From 4057 cancer subjects and 1668 healthy controls, we found strong associations between global autozygosity and risk in ten different cancer types. For example, the heterozygosity ratio was found to be significantly associated with breast invasive carcinoma in Blacks and with male skin cutaneous melanoma in Caucasians. We also discovered eleven associations between global autozygosity and mutational signatures which can explain a portion of the etiology. Furthermore, four significant associations for heterozygosity ratio were revealed in disease-specific survival analyses. This study demonstrates that global autozygosity is effective for cancer risk assessment.
|
42
|
Dhorne-Pollet S, Barrey E, Pollet N. A new method for long-read sequencing of animal mitochondrial genomes: application to the identification of equine mitochondrial DNA variants. BMC Genomics 2020; 21:785. [PMID: 33176683 PMCID: PMC7661214 DOI: 10.1186/s12864-020-07183-9] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 12/20/2019] [Accepted: 10/26/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Mitochondrial DNA is remarkably polymorphic. This is why animal geneticists survey mitochondrial genome variations for fundamental and applied purposes. We present here an approach to sequence whole mitochondrial genomes using nanopore long-read sequencing. Our method relies on the selective elimination of nuclear DNA using an exonuclease treatment and on the amplification of circular mitochondrial DNA using a multiple displacement amplification step.
RESULTS: We optimized each preparative step to obtain a 100 million-fold enrichment of horse mitochondrial DNA relative to nuclear DNA. We sequenced this amplified mitochondrial DNA using nanopore sequencing technology and obtained mitochondrial DNA reads that represented up to half of the sequencing output. The sequence reads had a mean length of 2.3 kb and provided an even coverage of the mitochondrial genome. Long reads spanning half or more of the whole mtDNA provided a coverage that varied between 118X and 488X. We evaluated SNPs identified using these long reads, with Sanger sequencing as ground truth, and found a precision of 100.0%, a recall of 93.1%, and an F1-score of 0.964 using the Twilight horse mtDNA reference. The choice of the mtDNA reference impacted variant calling efficiency, with F1-scores varying between 0.947 and 0.964.
CONCLUSIONS: Our method to amplify mtDNA and to sequence it using the nanopore technology is usable for mitochondrial DNA variant analysis. With minor modifications, this approach could easily be applied to other large circular DNA molecules.
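The reported F1-score can be checked from the precision and recall given in the abstract, since F1 is the harmonic mean of the two. The snippet below is a minimal sketch of that arithmetic; the helper name `f1_score` is illustrative and is not part of the authors' pipeline.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as fractions)."""
    if precision + recall == 0:
        return 0.0  # convention: F1 is 0 when both inputs are 0
    return 2 * precision * recall / (precision + recall)


# Values from the abstract for the Twilight horse mtDNA reference:
# precision 100.0 %, recall 93.1 %.
f1 = f1_score(1.000, 0.931)
print(round(f1, 3))  # 0.964
```

With perfect precision, the F1-score is driven entirely by recall, so the spread of 0.947 to 0.964 across references reflects differences in how many true variants were recovered.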
Affiliation(s)
- Sophie Dhorne-Pollet
  - Université Paris-Saclay, INRAE, AgroParisTech, GABI, 78350, Jouy-en-Josas, France
- Eric Barrey
  - Université Paris-Saclay, INRAE, AgroParisTech, GABI, 78350, Jouy-en-Josas, France
- Nicolas Pollet
  - Université Paris-Saclay, CNRS, IRD, UMR Évolution, Génomes, Comportement et Écologie, 91198, Gif-sur-Yvette, France
|
43
|
Samuels DC, Below JE, Ness S, Yu H, Leng S, Guo Y. Alternative Applications of Genotyping Array Data Using Multivariant Methods. Trends Genet 2020; 36:857-867. [PMID: 32773169 PMCID: PMC7572808 DOI: 10.1016/j.tig.2020.07.006] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 03/30/2020] [Revised: 07/08/2020] [Accepted: 07/09/2020] [Indexed: 10/23/2022]
Abstract
One of the forerunners that pioneered the revolution of high-throughput genomic technologies is the genotyping microarray technology, which can genotype millions of single-nucleotide variants simultaneously. Owing to apparent benefits, such as high speed, low cost, and high throughput, the genotyping array has gained lasting applications in genome-wide association studies (GWAS) and thus accumulated an enormous amount of data. Empowered by continuous manufactural upgrades and analytical innovation, unconventional applications of genotyping array data have emerged to address more diverse genetic problems, holding promise of boosting genetic research into human diseases through the re-mining of the rich accumulated data. Here, we review several unconventional genotyping array analysis techniques that have been built on the idea of large-scale multivariant analysis and provide empirical application examples. These unconventional outcomes of genotyping arrays include polygenic score, runs of homozygosity (ROH)/heterozygosity ratio, distant pedigree computation, and mitochondrial DNA (mtDNA) copy number inference.
Affiliation(s)
- David C Samuels
  - Department of Molecular Physiology and Biophysics, Vanderbilt Genetics Institute, Vanderbilt University, Nashville, TN 37232, USA
- Jennifer E Below
  - Division of Genetic Medicine, Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN 37232, USA
- Scott Ness
  - Department of Internal Medicine, Comprehensive Cancer Center, University of New Mexico, Albuquerque, NM 87109, USA
- Hui Yu
  - Department of Internal Medicine, Comprehensive Cancer Center, University of New Mexico, Albuquerque, NM 87109, USA
- Shuguang Leng
  - Department of Internal Medicine, Comprehensive Cancer Center, University of New Mexico, Albuquerque, NM 87109, USA
- Yan Guo
  - Department of Internal Medicine, Comprehensive Cancer Center, University of New Mexico, Albuquerque, NM 87109, USA
|
44
|
The sockeye salmon genome, transcriptome, and analyses identifying population defining regions of the genome. PLoS One 2020; 15:e0240935. [PMID: 33119641 PMCID: PMC7595290 DOI: 10.1371/journal.pone.0240935] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 05/01/2020] [Accepted: 10/06/2020] [Indexed: 12/12/2022] Open
Abstract
Sockeye salmon (Oncorhynchus nerka) is a commercially and culturally important species to the people that live along the northern Pacific Ocean coast. There are two main sockeye salmon ecotypes—the ocean-going (anadromous) ecotype and the fresh-water ecotype known as kokanee. The goal of this study was to better understand the population structure of sockeye salmon and identify possible genomic differences among populations and between the two ecotypes. In pursuit of this goal, we generated the first reference sockeye salmon genome assembly and an RNA-seq transcriptome data set to better annotate features of the assembly. Resequenced whole-genomes of 140 sockeye salmon and kokanee were analyzed to understand population structure and identify genomic differences between ecotypes. Three distinct geographic and genetic groups were identified from analyses of the resequencing data. Nucleotide variants in an immunoglobulin heavy chain variable gene cluster on chromosome 26 were found to differentiate the northwestern group from the southern and upper Columbia River groups. Several candidate genes were found to be associated with the kokanee ecotype. Many of these genes were related to ammonia tolerance or vision. Finally, the sex chromosomes of this species were better characterized, and an alternative sex-determination mechanism was identified in a subset of upper Columbia River kokanee.
|
45
|
Naranpanawa DNU, Chandrasekara CHWMRB, Bandaranayake PCG, Bandaranayake AU. Raw transcriptomics data to gene specific SSRs: a validated free bioinformatics workflow for biologists. Sci Rep 2020; 10:18236. [PMID: 33106560 PMCID: PMC7588437 DOI: 10.1038/s41598-020-75270-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 12/08/2019] [Accepted: 09/21/2020] [Indexed: 02/07/2023] Open
Abstract
Recent advances in next-generation sequencing technologies have paved the path for a considerable amount of sequencing data at a relatively low cost. This has revolutionized genomics and transcriptomics studies. However, handling such data with available bioinformatics platforms now poses challenges, both in assembly and in the downstream analysis performed to infer correct biological meaning. Though there are a handful of commercial software tools for some of the procedures, their cost has made them prohibitive for most research laboratories. While individual open-source or free software tools are available for most bioinformatics applications, those components usually operate standalone and are not combined into a user-friendly workflow. Therefore, beginners in bioinformatics might find analysis procedures starting from raw sequence data too complicated and time-consuming, given the associated learning curve. Here, we outline a procedure for de novo transcriptome assembly and Simple Sequence Repeats (SSR) primer design based solely on tools that are available online for free use. For validation of the developed workflow, we used Illumina HiSeq reads of different tissue samples of Santalum album (sandalwood), generated from a previous transcriptomics project. A portion of the designed primers were tested in the lab with relevant samples and all of them successfully amplified the targeted regions. The presented bioinformatics workflow can accurately assemble quality transcriptomes and develop gene specific SSRs. Beginner biologists and researchers in bioinformatics can easily utilize this workflow for research purposes.
Affiliation(s)
- D N U Naranpanawa
  - Agricultural Biotechnology Centre, Faculty of Agriculture, University of Peradeniya, Peradeniya, 20400, Sri Lanka
  - Postgraduate Institute of Science, University of Peradeniya, Peradeniya, 20400, Sri Lanka
- C H W M R B Chandrasekara
  - Agricultural Biotechnology Centre, Faculty of Agriculture, University of Peradeniya, Peradeniya, 20400, Sri Lanka
- P C G Bandaranayake
  - Agricultural Biotechnology Centre, Faculty of Agriculture, University of Peradeniya, Peradeniya, 20400, Sri Lanka
- A U Bandaranayake
  - Department of Computer Engineering, Faculty of Engineering, University of Peradeniya, Peradeniya, 20400, Sri Lanka
|
46
|
Miao X, Li B, Shen Y, Yu H, Zhu G, Liang C, Fu X, Wang C, Li S, Zhang B. Development and Verification of an Economical Method of Custom Target Library Construction. ACS Omega 2020; 5:13087-13095. [PMID: 32548494 PMCID: PMC7288555 DOI: 10.1021/acsomega.0c01014] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/06/2020] [Accepted: 05/21/2020] [Indexed: 05/13/2023]
Abstract
Although technological advances have greatly reduced the cost of DNA sequencing, sample preparation time and reagent costs remain the limiting factors for many studies. Based on low-cost targeted amplification, we developed an economical method for custom target library construction based on DNA nanoball (DNB) technology and two-step polymerase chain reaction (PCR). Here, we refer to this method as the two-step PCR, which was compared to traditional multiplex PCR methods in three aspects: data quality, efficiency, and specificity to humans. The results confirmed that the two-step PCR can finish 128 sequencing libraries in only 2 h 24 min 59 s of total PCR time, with a data utilization rate of 0.44 and a cost of approximately $1.70 per sample for targeted sequencing. The replacement of traditional multiplex PCR methods with this strategy makes the sample preparation process before sequencing relatively more cost-effective and further reduces the cost of next-generation sequencing (NGS). This method may also be free from the interference of other species and the limitations of sample type and DNA content. These findings reveal possibilities for broad applications of this approach in forensic research.
Affiliation(s)
- Xinyao Miao
  - School of Forensic Sciences, Xi’an Jiaotong University, 710049 Xi’an, P. R. China
- Bowen Li
  - School of Life Sciences, Sichuan University, 610207 Chengdu, P. R. China
- Yuesheng Shen
  - School of Life Sciences, Northwest University, 710069 Xi’an, P. R. China
- Huiyun Yu
  - School of Life Sciences, Northwest A&F University, 712100 Yangling, P. R. China
- Guoqiang Zhu
  - Key Laboratory of Bio-Resources and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, 610065 Chengdu, P. R. China
- Chen Liang
  - School of Mechanical Engineering, Xi’an Jiaotong University, 710049 Xi’an, P. R. China
- Xiao Fu
  - The Beijing Genomics Institute (BGI)—Tianjin, 301700 Tianjin, P. R. China
- Chu Wang
  - School of Life Sciences, Xiamen Medical College, 361023 Xiamen, P. R. China
- Shengbin Li
  - School of Forensic Sciences, Xi’an Jiaotong University, 710049 Xi’an, P. R. China
- Bao Zhang
  - School of Forensic Sciences, Xi’an Jiaotong University, 710049 Xi’an, P. R. China
|
47
|
He X, Chen S, Li R, Han X, He Z, Yuan D, Zhang S, Duan X, Niu B. Comprehensive fundamental somatic variant calling and quality management strategies for human cancer genomes. Brief Bioinform 2020; 22:5854402. [PMID: 32510555 DOI: 10.1093/bib/bbaa083] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 02/04/2020] [Revised: 04/19/2020] [Accepted: 04/21/2020] [Indexed: 12/21/2022] Open
Abstract
Next-generation sequencing (NGS) technology has revolutionised human cancer research, particularly via detection of genomic variants with its ultra-high-throughput sequencing and increasing affordability. However, the inundation of rich cancer genomics data has resulted in significant challenges in its exploration and translation into biological insights. One of the difficulties in cancer genome sequencing is software selection. Currently, multiple tools are widely used to process NGS data in four stages: raw sequence data pre-processing and quality control (QC), sequence alignment, variant calling and annotation and visualisation. However, the differences between these NGS tools, including their installation, merits, drawbacks and application, have not been fully appreciated. Therefore, a systematic review of the functionality and performance of NGS tools is required to provide cancer researchers with guidance on software and strategy selection. Another challenge is the multidimensional QC of sequencing data because QC can not only report varied sequence data characteristics but also reveal deviations in diverse features and is essential for a meaningful and successful study. However, monitoring of QC metrics in specific steps including alignment and variant calling is neglected in certain pipelines such as the 'Best Practices Workflows' in GATK. In this review, we investigated the most widely used software for the fundamental analysis and QC of cancer genome sequencing data and provided instructions for selecting the most appropriate software and pipelines to ensure precise and efficient conclusions. We further discussed the prospects and new research directions for cancer genomics.
|
48
|
Blanco C, Verbanic S, Seelig B, Chen IA. High throughput sequencing of in vitro selections of mRNA-displayed peptides: data analysis and applications. Phys Chem Chem Phys 2020; 22:6492-6506. [PMID: 31967131 PMCID: PMC8219182 DOI: 10.1039/c9cp05912a] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 12/14/2022]
Abstract
In vitro selection using mRNA display is currently a widely used method to isolate functional peptides with desired properties. The analysis of high throughput sequencing (HTS) data from in vitro evolution experiments has proven to be a powerful technique but only recently has it been applied to mRNA display selections. In this Perspective, we introduce aspects of mRNA display and HTS that may be of interest to physical chemists. We highlight the potential of HTS to analyze in vitro selections of peptides and review recent advances in the application of HTS analysis to mRNA display experiments. We discuss some possible issues involved with HTS analysis and summarize some strategies to alleviate them. Finally, the potential for future impact of advancing HTS analysis on mRNA display experiments is discussed.
Affiliation(s)
- Celia Blanco
  - Department of Chemistry and Biochemistry, University of California, Santa Barbara, CA 93106, USA
|
49
|
Laissue P, Vaiman D. Exploring the Molecular Aetiology of Preeclampsia by Massive Parallel Sequencing of DNA. Curr Hypertens Rep 2020; 22:31. [PMID: 32172383 DOI: 10.1007/s11906-020-01039-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 12/12/2022]
Abstract
PURPOSE OF REVIEW: This manuscript aims to review (for the first time) studies describing NGS of DNA from women with preeclampsia (PE).
RECENT FINDINGS: Describing markers for the early detection of PE is an essential task because, although associated molecular dysfunction begins early on during pregnancy, the disease's clinical signs usually appear late in pregnancy. Although several biochemical biomarkers have been proposed, their use in clinical environments is still limited, thereby encouraging research into PE's genetic origin. Hundreds of genes involved in numerous implantation- and placentation-related biological processes may be coherent candidates for PE aetiology. Next-generation sequencing (NGS) offers new technical possibilities for studying PE, as it enables large genomic regions to be analysed at affordable cost. This technique has facilitated the description of genes contributing to the molecular origin of a significant number of monogenic and complex diseases. Regarding PE, NGS of DNA has been used in familial and isolated cases, thereby enabling new genes potentially related to the phenotype to be proposed. For a better understanding of NGS, technical aspects, applications and limitations are presented initially. Thereafter, NGS studies of DNA in familial and non-familial cases are described, including pitfalls and positive findings. The information given here should enable scientists and clinicians to analyse and design new studies permitting the identification of novel clinically useful molecular PE markers.
Affiliation(s)
- Paul Laissue
  - Biopas Laboratoires, Biopas Group, Bogotá, Colombia
  - Inserm U1016, CNRS UMR8104, Institut Cochin, équipe FGTB, 24, rue du faubourg Saint-Jacques, 75014, Paris, France
  - CIGGUR Genetics Group, School of Medicine and Health Sciences, El Rosario University, Bogotá, Colombia
- Daniel Vaiman
  - Inserm U1016, CNRS UMR8104, Institut Cochin, équipe FGTB, 24, rue du faubourg Saint-Jacques, 75014, Paris, France
|
50
|
Wercelens P, da Silva W, Hondo F, Castro K, Walter ME, Araújo A, Lifschitz S, Holanda M. Bioinformatics Workflows With NoSQL Database in Cloud Computing. Evol Bioinform Online 2019; 15:1176934319889974. [PMID: 31839702 PMCID: PMC6896126 DOI: 10.1177/1176934319889974] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 10/27/2019] [Accepted: 10/29/2019] [Indexed: 12/29/2022] Open
Abstract
Scientific workflows can be understood as arrangements of managed activities executed by different processing entities. Applying workflows to solve problems in Molecular Biology, notably those related to sequence analyses, is a regular Bioinformatics approach. Due to the nature of the raw data and the in silico environment of Molecular Biology experiments, apart from the research subject, 2 practical and closely related problems have been studied: reproducibility and computational environment. When aiming to enhance the reproducibility of Bioinformatics experiments, various aspects should be considered. The reproducibility requirements comprise the data provenance, which enables the acquisition of knowledge about the trajectory of data over a defined workflow, the settings of the programs, and the entire computational environment. Cloud computing is a booming alternative that can provide this computational environment, hiding technical details and delivering a more affordable, accessible, and configurable on-demand environment for researchers. Considering this specific scenario, we proposed a solution to improve the reproducibility of Bioinformatics workflows in a cloud computing environment using both Infrastructure as a Service (IaaS) and Not only SQL (NoSQL) database systems. To meet the goal, we built 3 typical Bioinformatics workflows and ran them on 1 private and 2 public clouds, using different types of NoSQL database systems to persist the provenance data according to the Provenance Data Model (PROV-DM). We present here the results and a guide for the deployment of a cloud environment for Bioinformatics, exploring the characteristics of various NoSQL database systems to persist provenance data.
Affiliation(s)
- Polyane Wercelens
  - Department of Computer Science, University of Brasília, Brasília, Brazil
- Waldeyr da Silva
  - Department of Computer Science, University of Brasília, Brasília, Brazil
  - NEPBIO (Group of Biological Studies and Research on Cerrado), Federal Institute of Goiás (IFG), Formosa, Goiás, Brazil
- Fernanda Hondo
  - Department of Computer Science, University of Brasília, Brasília, Brazil
- Klayton Castro
  - Department of Computer Science, University of Brasília, Brasília, Brazil
- Aletéia Araújo
  - Department of Computer Science, University of Brasília, Brasília, Brazil
- Sergio Lifschitz
  - Pontifical Catholic University of Rio de Janeiro, Rio de Janeiro, Brazil
- Maristela Holanda
  - Department of Computer Science, University of Brasília, Brasília, Brazil
|