1
Rather MA, Agarwal D, Bhat TA, Khan IA, Zafar I, Kumar S, Amin A, Sundaray JK, Qadri T. Bioinformatics approaches and big data analytics opportunities in improving fisheries and aquaculture. Int J Biol Macromol 2023; 233:123549. [PMID: 36740117 DOI: 10.1016/j.ijbiomac.2023.123549] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 07/15/2022] [Revised: 01/30/2023] [Accepted: 01/31/2023] [Indexed: 02/05/2023]
Abstract
Aquaculture has grown rapidly over the last two decades and offers huge potential to provide nutritional as well as livelihood security. Genomic research has contributed significantly toward the development of beneficial technologies for aquaculture. Existing high-throughput technologies such as next-generation sequencing generate vast amounts of data, which require extensive analysis using appropriate tools. Bioinformatics is a rapidly evolving science that integrates gene-based information and computational technology to produce new knowledge for the benefit of aquaculture. It provides new opportunities as well as challenges for information and data processing in new-generation aquaculture. Rapid technical advancements have opened up a world of possibilities for using current genomics to improve aquaculture performance. Understanding the genes that govern economically relevant characteristics necessitates a significant amount of additional research. The data sources include next-generation DNA sequencing, protein sequencing, RNA sequencing, gene expression profiles, metabolic pathways, molecular markers, and so on. Appropriate bioinformatics tools are being developed to mine biologically relevant and commercially useful results. The purpose of this scoping review is to present the various branches of bioinformatics tools, with special emphasis on practical translation to the aquaculture industry.
Affiliation(s)
- Mohd Ashraf Rather
  - Division of Fish Genetics and Biotechnology, Faculty of Fisheries Ganderbal, Sher-e-Kashmir University of Agricultural Science and Technology, Kashmir, India
- Deepak Agarwal
  - Institute of Fisheries Post Graduation Studies OMR Campus, Vaniyanchavadi, Chennai, India
- Irfan Ahamd Khan
  - Division of Fish Genetics and Biotechnology, Faculty of Fisheries Ganderbal, Sher-e-Kashmir University of Agricultural Science and Technology, Kashmir, India
- Imran Zafar
  - Department of Bioinformatics and Computational Biology, Virtual University Punjab, Pakistan
- Sujit Kumar
  - Department of Bioinformatics and Computational Biology, Virtual University Punjab, Pakistan
- Adnan Amin
  - Postgraduate Institute of Fisheries Education and Research Kamdhenu University, Gandhinagar-India University of Kurasthra, India; Department of Aquatic Environmental Management, Faculty of Fisheries Rangil- Ganderbel -SKUAST-K, India
- Jitendra Kumar Sundaray
  - ICAR-Central Institute of Freshwater Aquaculture, Kausalyaganga, Bhubaneswar, Odisha 751002, India
- Tahiya Qadri
  - Division of Food Science and Technology, SKUAST-K, Shalimar, India
2
Genome sequence assembly algorithms and misassembly identification methods. Mol Biol Rep 2022; 49:11133-11148. [PMID: 36151399 DOI: 10.1007/s11033-022-07919-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2022] [Accepted: 09/05/2022] [Indexed: 10/14/2022]
Abstract
The sequence assembly algorithms have rapidly evolved with the vigorous growth of genome sequencing technology over the past two decades. Assembly mainly uses the iterative expansion of overlap relationships between sequences to construct the target genome. The assembly algorithms can be typically classified into several categories, such as the Greedy strategy, Overlap-Layout-Consensus (OLC) strategy, and de Bruijn graph (DBG) strategy. In particular, due to the rapid development of third-generation sequencing (TGS) technology, some prevalent assembly algorithms have been proposed to generate high-quality chromosome-level assemblies. However, due to the genome complexity, the length of short reads, and the high error rate of long reads, contigs produced by assembly may contain misassemblies adversely affecting downstream data analysis. Therefore, several read-based and reference-based methods for misassembly identification have been developed to improve assembly quality. This work primarily reviewed the development of DNA sequencing technologies and summarized sequencing data simulation methods, sequencing error correction methods, various mainstream sequence assembly algorithms, and misassembly identification methods. A large amount of computation makes the sequence assembly problem more challenging, and therefore, it is necessary to develop more efficient and accurate assembly algorithms and alternative algorithms.
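As an illustration of the de Bruijn graph (DBG) strategy named in this abstract, the following minimal Python sketch builds a graph of (k-1)-mers from error-free reads and extends a contig while the path is unambiguous. It is a toy under stated assumptions, not the method of any assembler covered by the review: the function names `build_de_bruijn` and `extend_contig` are hypothetical, reads are assumed error-free and single-stranded, and only out-degree is checked when walking the graph.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the distinct (k-1)-mer suffixes that follow it."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            prefix, suffix = kmer[:-1], kmer[1:]
            if suffix not in graph[prefix]:
                graph[prefix].append(suffix)
    return graph


def extend_contig(graph, start):
    """Walk from `start` while the current node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, [])) == 1:
        nxt = graph[node][0]
        if nxt in visited:  # stop on cycles instead of looping forever
            break
        contig += nxt[-1]   # each step adds one new base
        visited.add(nxt)
        node = nxt
    return contig


if __name__ == "__main__":
    reads = ["ATGGCG", "GGCGTA", "CGTACC"]  # hypothetical overlapping reads
    graph = build_de_bruijn(reads, k=4)
    print(extend_contig(graph, "ATG"))
```

Real assemblers additionally handle sequencing errors, reverse complements, and branching nodes caused by repeats, which is where the misassemblies discussed in the abstract tend to arise.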
3
Lei Y, Meng Y, Guo X, Ning K, Bian Y, Li L, Hu Z, Anashkina AA, Jiang Q, Dong Y, Zhu X. Overview of structural variation calling: Simulation, identification, and visualization. Comput Biol Med 2022; 145:105534. [DOI: 10.1016/j.compbiomed.2022.105534] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/20/2022] [Revised: 04/09/2022] [Accepted: 04/14/2022] [Indexed: 12/11/2022]
4
Ray A. Machine learning in postgenomic biology and personalized medicine. Wiley Interdisciplinary Reviews. Data Mining and Knowledge Discovery 2022; 12:e1451. [PMID: 35966173 PMCID: PMC9371441 DOI: 10.1002/widm.1451] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2020] [Accepted: 12/22/2021] [Indexed: 06/15/2023]
Abstract
In recent years, artificial intelligence in the form of machine learning has been revolutionizing biology, biomedical sciences, and gene-based agricultural technology capabilities. Massive data generated in biological sciences by rapid and deep gene sequencing and protein or other molecular structure determination, on the one hand, requires data analysis capabilities using machine learning that are distinctly different from classical statistical methods; on the other, these large datasets are enabling the adoption of novel data-intensive machine learning algorithms for the solution of biological problems that until recently had relied on mechanistic model-based approaches that are computationally expensive. This review provides a bird's eye view of the applications of machine learning in post-genomic biology. An attempt is also made to indicate, as far as possible, the areas of research that are poised to make further impacts in these areas, including the importance of explainable artificial intelligence (XAI) in human health. Further contributions of machine learning are expected to transform medicine, public health, and agricultural technology, as well as to provide invaluable gene-based guidance for the management of complex environments in this age of global warming.
Affiliation(s)
- Animesh Ray
  - Riggs School of Applied Life Sciences, Keck Graduate Institute, 535 Watson Drive, Claremont, CA 91711, USA
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California, USA
5
Tariq MU, Haseeb M, Aledhari M, Razzak R, Parizi RM, Saeed F. Methods for Proteogenomics Data Analysis, Challenges, and Scalability Bottlenecks: A Survey. IEEE Access: Practical Innovations, Open Solutions 2020; 9:5497-5516. [PMID: 33537181 PMCID: PMC7853650 DOI: 10.1109/access.2020.3047588] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 05/17/2023]
Abstract
Big Data Proteogenomics lies at the intersection of high-throughput Mass Spectrometry (MS) based proteomics and Next Generation Sequencing based genomics. The combined and integrated analysis of these two high-throughput technologies can help discover novel proteins using genomic and transcriptomic data. Due to the biological significance of integrated analysis, the recent past has seen an influx of proteogenomic tools that perform various tasks, including mapping proteins to the genomic data, searching experimental MS spectra against a six-frame translation genome database, and automating the process of annotating genome sequences. To date, most such tools have not focused on the scalability issues inherent in proteogenomic data analysis, where the size of the database is much larger than a typical protein database. These state-of-the-art tools can take more than half a month to process a small-scale dataset of one million spectra against a genome of 3 GB. In this article, we provide an up-to-date review of tools that can analyze proteogenomic datasets, providing a critical analysis of the techniques' relative merits and potential pitfalls. We also point out potential bottlenecks and recommendations that can be incorporated in the future design of these workflows to ensure scalability with the increasing size of proteogenomic data. Lastly, we make a case for how high-performance computing (HPC) solutions may be the best bet to ensure the scalability of future big data proteogenomic data analysis.
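The six-frame translation database mentioned in this abstract can be illustrated with a short Python sketch. It translates a DNA string in the three forward reading frames and the three frames of its reverse complement using the standard genetic code. The function names (`reverse_complement`, `translate`, `six_frame_translation`) and the frame labels `+1` to `-3` are assumptions made for illustration and are not taken from any tool surveyed in the article; unrecognized codons are rendered as `X`.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from the start of seq; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translation("ATGGCCTAA").items():
        print(label, protein)
```

Because every genomic position is translated six times, the resulting search space is far larger than a curated protein database, which is the scalability concern the abstract raises.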
Affiliation(s)
- Muhammad Usman Tariq
  - School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
- Muhammad Haseeb
  - School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
- Mohammed Aledhari
  - College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA
- Rehma Razzak
  - College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA
- Reza M Parizi
  - College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA
- Fahad Saeed
  - School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
6
Padovani de Souza K, Setubal JC, Ponce de Leon F de Carvalho AC, Oliveira G, Chateau A, Alves R. Machine learning meets genome assembly. Brief Bioinform 2020; 20:2116-2129. [PMID: 30137230 DOI: 10.1093/bib/bby072] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 05/18/2018] [Revised: 07/11/2018] [Accepted: 07/22/2018] [Indexed: 12/23/2022]
Abstract
MOTIVATION With the recent advances in DNA sequencing technologies, the study of the genetic composition of living organisms has become more accessible for researchers. Several advances have been achieved because of it, especially in the health sciences. However, many challenges which emerge from the complexity of sequencing projects remain unsolved. Among them is the task of assembling DNA fragments from previously unsequenced organisms, which is classified as an NP-hard (nondeterministic polynomial time hard) problem, for which no efficient computational solution with reasonable execution time exists. However, several tools that produce approximate solutions have been used with results that have facilitated scientific discoveries, although there is ample room for improvement. As with other NP-hard problems, machine learning algorithms have been one of the approaches used in recent years in an attempt to find better solutions to the DNA fragment assembly problem, although still at a low scale. RESULTS This paper presents a broad review of pioneering literature comprising artificial intelligence-based DNA assemblers, particularly the ones that use machine learning, to provide an overview of state-of-the-art approaches and to serve as a starting point for further study in this field.
Affiliation(s)
- João Carlos Setubal
  - University of São Paulo, Brazil; Department of Computer Science, University of São Paulo, Brazil
- Annie Chateau
  - Vale Technology Institute-Sustainable Development, Brazil
- Ronnie Alves
  - Federal University of Pará, Brazil; University of Montpellier, LIRMM, France
7
Singh A, Masih A, Monroy-Nieto J, Singh PK, Bowers J, Travis J, Khurana A, Engelthaler DM, Meis JF, Chowdhary A. A unique multidrug-resistant clonal Trichophyton population distinct from Trichophyton mentagrophytes/Trichophyton interdigitale complex causing an ongoing alarming dermatophytosis outbreak in India: Genomic insights and resistance profile. Fungal Genet Biol 2019; 133:103266. [DOI: 10.1016/j.fgb.2019.103266] [Citation(s) in RCA: 50] [Impact Index Per Article: 10.0] [Received: 06/14/2019] [Revised: 08/29/2019] [Accepted: 08/29/2019] [Indexed: 01/09/2023]
8
Vilne B, Meistere I, Grantiņa-Ieviņa L, Ķibilds J. Machine Learning Approaches for Epidemiological Investigations of Food-Borne Disease Outbreaks. Front Microbiol 2019; 10:1722. [PMID: 31447800 PMCID: PMC6691741 DOI: 10.3389/fmicb.2019.01722] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 03/07/2019] [Accepted: 07/12/2019] [Indexed: 12/14/2022]
Abstract
Foodborne diseases (FBDs) are infections of the gastrointestinal tract caused by foodborne pathogens (FBPs) such as bacteria [Salmonella, Listeria monocytogenes and Shiga toxin-producing E. coli (STEC)] and several viruses, but also parasites and some fungi. Artificial intelligence (AI) and its sub-discipline machine learning (ML) are re-emerging and gaining an ever increasing popularity in the scientific community and industry, and could lead to actionable knowledge in diverse ranges of sectors including epidemiological investigations of FBD outbreaks and antimicrobial resistance (AMR). As genotyping using whole-genome sequencing (WGS) is becoming more accessible and affordable, it is increasingly used as a routine tool for the detection of pathogens, and has the potential to differentiate between outbreak strains that are closely related, identify virulence/resistance genes and provide improved understanding of transmission events within hours to days. In most cases, the computational pipeline of WGS data analysis can be divided into four (though, not necessarily consecutive) major steps: de novo genome assembly, genome characterization, comparative genomics, and inference of phylogeny or phylogenomics. In each step, ML could be used to increase the speed and potentially the accuracy (provided increasing amounts of high-quality input data) of identification of the source of ongoing outbreaks, leading to more efficient treatment and prevention of additional cases. In this review, we explore whether ML or any other form of AI algorithms have already been proposed for the respective tasks and compare those with mechanistic model-based approaches.
Affiliation(s)
- Baiba Vilne
  - Institute of Food Safety, Animal Health and Environment—“BIOR”, Riga, Latvia
  - SIA net-OMICS, Riga, Latvia
- Irēna Meistere
  - Institute of Food Safety, Animal Health and Environment—“BIOR”, Riga, Latvia
- Juris Ķibilds
  - Institute of Food Safety, Animal Health and Environment—“BIOR”, Riga, Latvia
9
Khan AR, Pervez MT, Babar ME, Naveed N, Shoaib M. A Comprehensive Study of De Novo Genome Assemblers: Current Challenges and Future Prospective. Evol Bioinform Online 2018; 14:1176934318758650. [PMID: 29511353 PMCID: PMC5826002 DOI: 10.1177/1176934318758650] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 08/02/2017] [Accepted: 01/19/2018] [Indexed: 12/21/2022]
Abstract
BACKGROUND Current advancements in next-generation sequencing technology have made it possible to sequence whole genomes, but assembling a large number of short sequence reads is still a big challenge. In this article, we present a comparative study of seven assemblers, namely, ABySS, Velvet, Edena, SGA, Ray, SSAKE, and Perga, using prokaryotic and eukaryotic paired-end as well as single-end data sets from the Illumina platform. RESULTS Results showed that for single-end data sets, Velvet and ABySS outperformed the other assemblers, with comparatively low assembly time and high genome fraction. Velvet consumed less memory than any other assembler. For paired-end data sets, Velvet consumed the least time and produced a high genome fraction after ABySS and Ray. In terms of low memory usage, SGA and Edena outperformed all the other assemblers. Ray also showed good genome fraction; however, its extremely high assembly time might make it prohibitively slow on larger single-end and paired-end data sets. CONCLUSIONS Our comparison study will assist scientists in selecting a suitable assembler for their data sets and will also assist developers in upgrading existing assemblers or developing new ones for de novo assembly.
Affiliation(s)
- Abdul Rafay Khan
  - Department of Bioinformatics and Computational Biology, Virtual University of Pakistan, Lahore, Pakistan
- Muhammad Tariq Pervez
  - Department of Bioinformatics and Computational Biology, Virtual University of Pakistan, Lahore, Pakistan
- Nasir Naveed
  - Department of Computer Science, Virtual University of Pakistan, Lahore, Pakistan
- Muhammad Shoaib
  - Department of Computer Science and Engineering, University of Engineering and Technology, Lahore, Pakistan
10
Acuña-Amador L, Primot A, Cadieu E, Roulet A, Barloy-Hubler F. Genomic repeats, misassembly and reannotation: a case study with long-read resequencing of Porphyromonas gingivalis reference strains. BMC Genomics 2018; 19:54. [PMID: 29338683 PMCID: PMC5771137 DOI: 10.1186/s12864-017-4429-4] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 09/26/2017] [Accepted: 12/29/2017] [Indexed: 12/15/2022]
Abstract
BACKGROUND Without knowledge of their genomic sequences, it is impossible to make functional models of the bacteria that make up human and animal microbiota. Unfortunately, the vast majority of publicly available genomes are only working drafts, an incompleteness that causes numerous problems and constitutes a major obstacle to genotypic and phenotypic interpretation. In this work, we began with an example from the class Bacteroidia in the phylum Bacteroidetes, which is preponderant among human orodigestive microbiota. We successfully identify the genetic loci responsible for assembly breaks and misassemblies and demonstrate the importance and usefulness of long-read sequencing and curated reannotation. RESULTS We showed that the fragmentation in Bacteroidia draft genomes assembled from massively parallel sequencing linearly correlates with genomic repeats of the same or greater size than the reads. We also demonstrated that some of these repeats, especially the long ones, correspond to misassembled loci in three reference Porphyromonas gingivalis genomes marked as circularized (thus complete or finished). We prove that even at modest coverage (30X), long-read resequencing together with PCR contiguity verification (rrn operons and an integrative and conjugative element or ICE) can be used to identify and correct the wrongly combined or assembled regions. Finally, although time-consuming and labor-intensive, consistent manual biocuration of three P. gingivalis strains allowed us to compare and correct the existing genomic annotations, resulting in a more accurate interpretation of the genomic differences among these strains. CONCLUSIONS In this study, we demonstrate the usefulness and importance of long-read sequencing in verifying published genomes (even when complete) and generating assemblies for new bacterial strains/species with high genomic plasticity. 
We also show that when combined with biological validation processes and diligent biocurated annotation, this strategy helps reduce the propagation of errors in shared databases, thus limiting false conclusions based on incomplete or misleading information.
Affiliation(s)
- Luis Acuña-Amador
  - Institut de Génétique et Développement de Rennes, CNRS, UMR6290, Université de Rennes 1, Rennes, France; Laboratorio de Investigación en Bacteriología Anaerobia, Centro de Investigación en Enfermedades Tropicales, Facultad de Microbiología, Universidad de Costa Rica, San José, Costa Rica
- Aline Primot
  - Institut de Génétique et Développement de Rennes, CNRS, UMR6290, Université de Rennes 1, Rennes, France
- Edouard Cadieu
  - Institut de Génétique et Développement de Rennes, CNRS, UMR6290, Université de Rennes 1, Rennes, France
- Alain Roulet
  - GenoToul Genome & Transcriptome (GeT-PlaGe), INRA, US1426, Castanet-Tolosan, France
- Frédérique Barloy-Hubler
  - Institut de Génétique et Développement de Rennes, CNRS, UMR6290, Université de Rennes 1, Rennes, France
11
Quainoo S, Coolen JPM, van Hijum SAFT, Huynen MA, Melchers WJG, van Schaik W, Wertheim HFL. Whole-Genome Sequencing of Bacterial Pathogens: the Future of Nosocomial Outbreak Analysis. Clin Microbiol Rev 2017; 30:1015-1063. [PMID: 28855266 PMCID: PMC5608882 DOI: 10.1128/cmr.00016-17] [Citation(s) in RCA: 228] [Impact Index Per Article: 32.6] [Indexed: 12/14/2022]
Abstract
Outbreaks of multidrug-resistant bacteria present a frequent threat to vulnerable patient populations in hospitals around the world. Intensive care unit (ICU) patients are particularly susceptible to nosocomial infections due to indwelling devices such as intravascular catheters, drains, and intratracheal tubes for mechanical ventilation. The increased vulnerability of infected ICU patients demonstrates the importance of effective outbreak management protocols to be in place. Understanding the transmission of pathogens via genotyping methods is an important tool for outbreak management. Recently, whole-genome sequencing (WGS) of pathogens has become more accessible and affordable as a tool for genotyping. Analysis of the entire pathogen genome via WGS could provide unprecedented resolution in discriminating even highly related lineages of bacteria and revolutionize outbreak analysis in hospitals. Nevertheless, clinicians have long been hesitant to implement WGS in outbreak analyses due to the expensive and cumbersome nature of early sequencing platforms. Recent improvements in sequencing technologies and analysis tools have rapidly increased the output and analysis speed as well as reduced the overall costs of WGS. In this review, we assess the feasibility of WGS technologies and bioinformatics analysis tools for nosocomial outbreak analyses and provide a comparison to conventional outbreak analysis workflows. Moreover, we review advantages and limitations of sequencing technologies and analysis tools and present a real-world example of the implementation of WGS for antimicrobial resistance analysis. We aimed to provide health care professionals with a guide to WGS outbreak analysis that highlights its benefits for hospitals and assists in the transition from conventional to WGS-based outbreak analysis.
Affiliation(s)
- Scott Quainoo
  - Department of Microbiology, Radboud University, Nijmegen, The Netherlands
- Jordy P M Coolen
  - Department of Medical Microbiology, Radboud University Medical Centre, Nijmegen, The Netherlands
- Sacha A F T van Hijum
  - Centre for Molecular and Biomolecular Informatics, Radboud University Medical Centre, Nijmegen, The Netherlands
  - NIZO, Ede, The Netherlands
- Martijn A Huynen
  - Centre for Molecular and Biomolecular Informatics, Radboud University Medical Centre, Nijmegen, The Netherlands
- Willem J G Melchers
  - Department of Medical Microbiology, Radboud University Medical Centre, Nijmegen, The Netherlands
- Willem van Schaik
  - Institute of Microbiology and Infection, University of Birmingham, Birmingham, United Kingdom
- Heiman F L Wertheim
  - Department of Medical Microbiology, Radboud University Medical Centre, Nijmegen, The Netherlands
12
Wang L, Xia Q, Zhang Y, Zhu X, Zhu X, Li D, Ni X, Gao Y, Xiang H, Wei X, Yu J, Quan Z, Zhang X. Updated sesame genome assembly and fine mapping of plant height and seed coat color QTLs using a new high-density genetic map. BMC Genomics 2016; 17:31. [PMID: 26732604 PMCID: PMC4702397 DOI: 10.1186/s12864-015-2316-4] [Citation(s) in RCA: 67] [Impact Index Per Article: 8.4] [Received: 09/11/2015] [Accepted: 12/15/2015] [Indexed: 12/23/2022]
Abstract
Background Sesame is an important high-quality oil seed crop. The sesame genome was de novo sequenced and assembled in 2014 (version 1.0); however, the number of anchored pseudomolecules was higher than the chromosome number (2n = 2x = 26) due to the lack of a high-density genetic map with 13 linkage groups. Results We resequenced a permanent population consisting of 430 recombinant inbred lines and constructed a genetic map to improve the sesame genome assembly. We successfully anchored 327 scaffolds onto 13 pseudomolecules. The new genome assembly (version 2.0) included 97.5 % of the scaffolds greater than 150 kb in size present in assembly version 1.0 and increased the total pseudomolecule length from 233.7 to 258.4 Mb with 94.3 % of the genome assembled and 97.2 % of the predicted gene models anchored. Based on the new genome assembly, a bin map including 1,522 bins spanning 1090.99 cM was generated and used to identify 41 quantitative trait loci (QTLs) for sesame plant height and 9 for seed coat color. The plant height-related QTLs explained 3–24 % of the phenotypic variation (mean value, 8 %), and 29 of them were detected in at least two field trials. Two major loci (qPH-8.2 and qPH-3.3) that contributed 23 and 18 % of the plant height were located in 350 and 928-kb spaces on Chr8 and Chr3, respectively. qPH-3.3 is predicted to be responsible for the semi-dwarf sesame plant phenotype and contains 102 candidate genes. This is the first report of a sesame semi-dwarf locus and provides an interesting opportunity for a plant architecture study of sesame. For the sesame seed coat color, the QTLs of the color spaces L*, a*, and b* were detected with contribution rates of 3–46 %. qSCb-4.1 contributed approximately 39 % of the b* value and was located on Chr4 in a 199.9-kb space. A list of 32 candidate genes for the locus, including a predicted black seed coat-related gene, was determined by screening the newly anchored genome. 
Conclusions This study offers a high-density genetic map and an improved assembly of the sesame genome. The number of linkage groups and pseudomolecules in this assembly equals the number of sesame chromosomes for the first time. The map and updated genome assembly are expected to serve as a platform for future comparative genomics and genetic studies. Electronic supplementary material The online version of this article (doi:10.1186/s12864-015-2316-4) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Linhai Wang
  - Oil Crops Research Institute of the Chinese Academy of Agricultural Sciences, Key Laboratory of Biology and Genetic Improvement of Oil Crops of the Ministry of Agriculture, Wuhan, 430062, China
- Qiuju Xia
  - Shenzhen Engineering Laboratory of Crop Molecular Design Breeding, BGI-agro, 518083, Shenzhen, China
- Yanxin Zhang
  - Oil Crops Research Institute of the Chinese Academy of Agricultural Sciences, Key Laboratory of Biology and Genetic Improvement of Oil Crops of the Ministry of Agriculture, Wuhan, 430062, China
- Xiaodong Zhu
  - Oil Crops Research Institute of the Chinese Academy of Agricultural Sciences, Key Laboratory of Biology and Genetic Improvement of Oil Crops of the Ministry of Agriculture, Wuhan, 430062, China
- Xiaofeng Zhu
  - Oil Crops Research Institute of the Chinese Academy of Agricultural Sciences, Key Laboratory of Biology and Genetic Improvement of Oil Crops of the Ministry of Agriculture, Wuhan, 430062, China
- Donghua Li
  - Oil Crops Research Institute of the Chinese Academy of Agricultural Sciences, Key Laboratory of Biology and Genetic Improvement of Oil Crops of the Ministry of Agriculture, Wuhan, 430062, China
- Xuemei Ni
  - Shenzhen Engineering Laboratory of Crop Molecular Design Breeding, BGI-agro, 518083, Shenzhen, China
- Yuan Gao
  - Oil Crops Research Institute of the Chinese Academy of Agricultural Sciences, Key Laboratory of Biology and Genetic Improvement of Oil Crops of the Ministry of Agriculture, Wuhan, 430062, China
- Haitao Xiang
  - Shenzhen Engineering Laboratory of Crop Molecular Design Breeding, BGI-agro, 518083, Shenzhen, China
- Xin Wei
  - Oil Crops Research Institute of the Chinese Academy of Agricultural Sciences, Key Laboratory of Biology and Genetic Improvement of Oil Crops of the Ministry of Agriculture, Wuhan, 430062, China
- Jingyin Yu
  - Oil Crops Research Institute of the Chinese Academy of Agricultural Sciences, Key Laboratory of Biology and Genetic Improvement of Oil Crops of the Ministry of Agriculture, Wuhan, 430062, China
- Zhiwu Quan
  - Shenzhen Engineering Laboratory of Crop Molecular Design Breeding, BGI-agro, 518083, Shenzhen, China
- Xiurong Zhang
  - Oil Crops Research Institute of the Chinese Academy of Agricultural Sciences, Key Laboratory of Biology and Genetic Improvement of Oil Crops of the Ministry of Agriculture, Wuhan, 430062, China
13
Antipov D, Korobeynikov A, McLean JS, Pevzner PA. hybridSPAdes: an algorithm for hybrid assembly of short and long reads. Bioinformatics 2015; 32:1009-15. [PMID: 26589280 DOI: 10.1093/bioinformatics/btv688] [Citation(s) in RCA: 364] [Impact Index Per Article: 40.4] [Received: 08/20/2015] [Accepted: 11/13/2015] [Indexed: 12/27/2022]
Abstract
MOTIVATION Recent advances in single molecule real-time (SMRT) and nanopore sequencing technologies have enabled high-quality assemblies from long and inaccurate reads. However, these approaches require high coverage by long reads and remain expensive. On the other hand, inexpensive short-read technologies produce accurate but fragmented assemblies. Thus, a hybrid approach that assembles long reads (with low coverage) and short reads has the potential to generate high-quality assemblies at reduced cost. RESULTS We describe the hybridSPAdes algorithm for assembling short and long reads and benchmark it on a variety of bacterial assembly projects. Our results demonstrate that hybridSPAdes generates accurate assemblies (even in projects with relatively low coverage by long reads), thus reducing the overall cost of genome sequencing. We further present the first complete assembly of a genome from single cells using SMRT reads. AVAILABILITY AND IMPLEMENTATION hybridSPAdes is implemented in C++ as a part of the SPAdes genome assembler and is publicly available at http://bioinf.spbau.ru/en/spades CONTACT d.antipov@spbu.ru SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Dmitry Antipov
  - Center for Algorithmic Biotechnology, Institute for Translational Biomedicine
- Anton Korobeynikov
  - Center for Algorithmic Biotechnology, Institute for Translational Biomedicine, Department of Statistical Modelling, St. Petersburg State University, St. Petersburg, Russia
- Jeffrey S McLean
  - Department of Periodontics, University of Washington, Seattle, WA 98195, USA
- Pavel A Pevzner
  - Center for Algorithmic Biotechnology, Institute for Translational Biomedicine, Department of Computer Science and Engineering, University of California, San Diego, USA
|
14
|
Zhu X, Leung HCM, Wang R, Chin FYL, Yiu SM, Quan G, Li Y, Zhang R, Jiang Q, Liu B, Dong Y, Zhou G, Wang Y. misFinder: identify mis-assemblies in an unbiased manner using reference and paired-end reads. BMC Bioinformatics 2015; 16:386. [PMID: 26573684 PMCID: PMC4647709 DOI: 10.1186/s12859-015-0818-3] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 04/25/2015] [Accepted: 11/06/2015] [Indexed: 11/10/2022] Open
Abstract
Background: Because of the short read length of high-throughput sequencing data, errors are introduced during genome assembly, which may have an adverse impact on downstream data analysis. Several tools have been developed to eliminate these errors by either 1) comparing the assembled sequences with a similar reference genome, or 2) analyzing paired-end reads aligned to the assembled sequences and identifying inconsistent features along mis-assembled sequences. However, the former approach cannot distinguish real structural variations between the target genome and the reference genome, while the latter approach can produce many false positive detections (correctly assembled sequences being considered mis-assembled). Results: We present misFinder, a tool that aims to identify assembly errors with high accuracy in an unbiased way and to correct these errors at their mis-assembled positions, improving assembly accuracy for downstream analysis. It combines information from a reference (or closely related reference) genome with paired-end reads aligned to the assembled sequence. Assembly errors and correct assemblies corresponding to structural variations can be detected by comparing the reference genome and the assembled sequence. Different types of assembly errors can then be distinguished within the mis-assembled sequence by analyzing the aligned paired-end reads using multiple features derived from coverage and consistency of insert distance, yielding high-confidence error calls. Conclusions: We tested the performance of misFinder on both simulated and real paired-end read data, and misFinder gave accurate error calls with very few miscalls. We further compared misFinder with QUAST and REAPR. misFinder outperformed QUAST and REAPR by 1) identifying more true positive mis-assemblies with very few false positives and false negatives, and 2) distinguishing correct assemblies corresponding to structural variations from mis-assembled sequence.
misFinder can be freely downloaded from https://github.com/hitbio/misFinder. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0818-3) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Xiao Zhu
  - College of Computer Sciences and Information Engineering, Harbin Normal University, Harbin, Heilongjiang, China; Center for Bioinformatics, School of Computer Sciences and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China.
- Henry C M Leung
  - Department of Computer Science, University of Hong Kong, Pokfulam Road, Hong Kong, China.
- Rongjie Wang
  - Center for Bioinformatics, School of Computer Sciences and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China.
- Francis Y L Chin
  - Department of Computer Science, University of Hong Kong, Pokfulam Road, Hong Kong, China.
- Siu Ming Yiu
  - Department of Computer Science, University of Hong Kong, Pokfulam Road, Hong Kong, China.
- Guangri Quan
  - Center for Bioinformatics, School of Computer Sciences and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China.
- Yajie Li
  - The Fourth Affiliated Hospital of Harbin Medical University, Harbin, Heilongjiang, China.
- Rui Zhang
  - The Fourth Affiliated Hospital of Harbin Medical University, Harbin, Heilongjiang, China.
- Qinghua Jiang
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China.
- Bo Liu
  - Center for Bioinformatics, School of Computer Sciences and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China.
- Yucui Dong
  - Department of Immunology, Harbin Medical University, Harbin, Heilongjiang, China.
- Guohui Zhou
  - College of Computer Sciences and Information Engineering, Harbin Normal University, Harbin, Heilongjiang, China.
- Yadong Wang
  - Center for Bioinformatics, School of Computer Sciences and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China.
|
15
|
Vasilinetc I, Prjibelski AD, Gurevich A, Korobeynikov A, Pevzner PA. Assembling short reads from jumping libraries with large insert sizes. Bioinformatics 2015; 31:3262-8. [PMID: 26040456 DOI: 10.1093/bioinformatics/btv337] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.7] [Received: 02/12/2015] [Accepted: 05/26/2015] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: Advances in next-generation sequencing technologies and sample preparation have recently enabled the generation of high-quality jumping libraries that have the potential to significantly improve short-read assemblies. However, assembly algorithms have to catch up with experimental innovations to benefit from them and to produce high-quality assemblies. RESULTS: We present a new algorithm that extends the recently described exSPAnder universal repeat resolution approach to enable its application to several challenging data types, including jumping libraries generated by the recently developed Illumina Nextera Mate Pair protocol. We demonstrate that, with these improvements, bacterial genomes can often be assembled into a few contigs using only a single Nextera Mate Pair library of short reads. AVAILABILITY AND IMPLEMENTATION: The described algorithms are implemented in C++ as a part of the SPAdes genome assembler, which is freely available at bioinf.spbau.ru/en/spades. CONTACT: ap@bioinf.spbau.ru. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Irina Vasilinetc
  - Algorithmic Biology Lab, St. Petersburg Academic University 194021
- Andrey D Prjibelski
  - Algorithmic Biology Lab, St. Petersburg Academic University 194021, Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, 199004
- Alexey Gurevich
  - Algorithmic Biology Lab, St. Petersburg Academic University 194021, Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, 199004
- Anton Korobeynikov
  - Algorithmic Biology Lab, St. Petersburg Academic University 194021, Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, 199004, Department of Mathematics and Mechanics, St. Petersburg State University, St. Petersburg, 198504, Russia
- Pavel A Pevzner
  - Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, 199004, Department of Computer Science and Engineering, University of California, San Diego, CA 92093-0404, USA
|