1
|
Anthony WE, Allison SD, Broderick CM, Chavez Rodriguez L, Clum A, Cross H, Eloe-Fadrosh E, Evans S, Fairbanks D, Gallery R, Gontijo JB, Jones J, McDermott J, Pett-Ridge J, Record S, Rodrigues JLM, Rodriguez-Reillo W, Shek KL, Takacs-Vesbach T, Blanchard JL. From soil to sequence: filling the critical gap in genome-resolved metagenomics is essential to the future of soil microbial ecology. Environmental Microbiome 2024; 19:56. [PMID: 39095861 PMCID: PMC11295382 DOI: 10.1186/s40793-024-00599-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/12/2024] [Accepted: 07/22/2024] [Indexed: 08/04/2024]
Abstract
Soil microbiomes are heterogeneous, complex microbial communities. Metagenomic analysis is generating vast amounts of data, creating immense challenges in sequence assembly and analysis. Although advances in technology have resulted in the ability to easily collect large amounts of sequence data, soil samples containing thousands of unique taxa are often poorly characterized. These challenges reduce the usefulness of genome-resolved metagenomic (GRM) analysis seen in other fields of microbiology, such as the creation of high quality metagenomic assembled genomes and the adoption of genome scale modeling approaches. The absence of these resources restricts the scale of future research, limiting hypothesis generation and the predictive modeling of microbial communities. Creating publicly available databases of soil MAGs, similar to databases produced for other microbiomes, has the potential to transform scientific insights about soil microbiomes without requiring the computational resources and domain expertise for assembly and binning.
Affiliation(s)
- Steven D Allison
  - University of California Irvine, Irvine, CA, USA
  - Department of Earth System Science, University of California, Irvine, CA, USA
- Caitlin M Broderick
  - W.K. Kellogg Biological Station, Michigan State University, Hickory Corners, MI, USA
- Alicia Clum
  - Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Hugh Cross
  - National Ecological Observatory Network - Battelle, Boulder, CO, USA
- Sarah Evans
  - W.K. Kellogg Biological Station, Michigan State University, Hickory Corners, MI, USA
- Dawson Fairbanks
  - University of California Riverside, Riverside, CA, USA
  - The University of Arizona, Tucson, AZ, USA
- Jennifer Jones
  - W.K. Kellogg Biological Station, Michigan State University, Hickory Corners, MI, USA
- Jason McDermott
  - Pacific Northwest National Laboratory, Richland, WA, 99354, USA
- Jennifer Pett-Ridge
  - Lawrence Livermore National Laboratory, Livermore, CA, USA
  - Life & Environmental Sciences Department, University of California Merced, Merced, CA, 95343, USA
|
2
|
Yu YW. On Minimizers and Convolutional Filters: Theoretical Connections and Applications to Genome Analysis. J Comput Biol 2024; 31:381-395. [PMID: 38687333 DOI: 10.1089/cmb.2024.0483] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/02/2024]
Abstract
Minimizers and convolutional neural networks (CNNs) are two quite distinct popular techniques that have both been employed to analyze categorical biological sequences. At face value, the methods seem entirely dissimilar. Minimizers use min-wise hashing on a rolling window to extract a single important k-mer feature per window. CNNs start with a wide array of randomly initialized convolutional filters, paired with a pooling operation, and then multiple additional neural layers to learn both the filters themselves and how they can be used to classify the sequence. In this study, our main result is a careful mathematical analysis of hash function properties showing that for sequences over a categorical alphabet, random Gaussian initialization of convolutional filters with max-pooling is equivalent to choosing a minimizer ordering such that selected k-mers are (in Hamming distance) far from the k-mers within the sequence but close to other minimizers. In empirical experiments, we find that this property manifests as decreased density in repetitive regions, both in simulation and on real human telomeres. We additionally train from scratch a CNN embedding of synthetic short-reads from the SARS-CoV-2 genome into 3D Euclidean space that locally recapitulates the linear sequence distance of the read origins, a modest step toward building a deep learning assembler, although it is at present too slow to be practical. In total, this article provides a partial explanation for the effectiveness of CNNs in categorical sequence analysis.
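The minimizer step described in this abstract (min-wise hashing over a rolling window, keeping one k-mer per window) can be sketched as follows. This is a minimal illustration and not the paper's code: the function names `kmer_hash` and `minimizers`, the choice of BLAKE2b as the hash, and the leftmost tie-breaking rule are all assumptions made for the example.

```python
import hashlib


def kmer_hash(kmer: str) -> int:
    # Assumed ordering: a fixed pseudo-random hash of the k-mer.
    # Any deterministic ordering of k-mers could be substituted here.
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq: str, k: int, w: int) -> list[tuple[int, str]]:
    """Return (position, k-mer) pairs, one minimizer per window of w
    consecutive k-mers, with repeats from adjacent windows collapsed."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected: list[tuple[int, str]] = []
    for start in range(len(kmers) - w + 1):
        # Smallest hash wins; ties are broken by the leftmost position.
        best = min(range(start, start + w),
                   key=lambda i: (kmer_hash(kmers[i]), i))
        # Adjacent windows often share a minimizer; record it only once.
        if not selected or selected[-1][0] != best:
            selected.append((best, kmers[best]))
    return selected


if __name__ == "__main__":
    for pos, kmer in minimizers("ACGTACGTTGCAACGT", k=3, w=4):
        print(pos, kmer)
```

Every window of `w` consecutive k-mers contains at least one selected position, which is the property that makes the selected k-mers usable as a sparse sketch of the sequence.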
Affiliation(s)
- Yun William Yu
  - Department of Mathematics, University of Toronto, Toronto, Ontario, Canada
  - Department of Ray and Stephanie Lane Computational Biology, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
|
3
|
Zhang T, Zhou J, Gao W, Jia Y, Wei Y, Wang G. Complex genome assembly based on long-read sequencing. Brief Bioinform 2022; 23:6657663. [PMID: 35940845 DOI: 10.1093/bib/bbac305] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 05/06/2022] [Revised: 06/20/2022] [Accepted: 07/06/2022] [Indexed: 11/12/2022]
Abstract
High-quality genome chromosome-scale sequences provide an important basis for genomics downstream analysis, especially the construction of haplotype-resolved and complete genomes, which plays a key role in genome annotation, mutation detection, evolutionary analysis, gene function research, comparative genomics and other aspects. However, genome-wide short-read sequencing is difficult to produce a complete genome in the face of a complex genome with high duplication and multiple heterozygosity. The emergence of long-read sequencing technology has greatly improved the integrity of complex genome assembly. We review a variety of computational methods for complex genome assembly and describe in detail the theories, innovations and shortcomings of collapsed, semi-collapsed and uncollapsed assemblers based on long reads. Among the three methods, uncollapsed assembly is the most correct and complete way to represent genomes. In addition, genome assembly is closely related to haplotype reconstruction, that is uncollapsed assembly realizes haplotype reconstruction, and haplotype reconstruction promotes uncollapsed assembly. We hope that gapless, telomere-to-telomere and accurate assembly of complex genomes can be truly routinely achieved using only a simple process or a single tool in the future.
Affiliation(s)
- Tianjiao Zhang
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
- Jie Zhou
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
- Wentao Gao
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
- Yuran Jia
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
- Yanan Wei
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
- Guohua Wang
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
|
4
|
Bonidia RP, Domingues DS, Sanches DS, de Carvalho ACPLF. MathFeature: feature extraction package for DNA, RNA and protein sequences based on mathematical descriptors. Brief Bioinform 2022; 23:bbab434. [PMID: 34750626 PMCID: PMC8769707 DOI: 10.1093/bib/bbab434] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 06/29/2021] [Revised: 09/18/2021] [Accepted: 09/20/2021] [Indexed: 12/24/2022]
Abstract
One of the main challenges in applying machine learning algorithms to biological sequence data is how to numerically represent a sequence in a numeric input vector. Feature extraction techniques capable of extracting numerical information from biological sequences have been reported in the literature. However, many of these techniques are not available in existing packages, such as mathematical descriptors. This paper presents a new package, MathFeature, which implements mathematical descriptors able to extract relevant numerical information from biological sequences, i.e. DNA, RNA and proteins (prediction of structural features along the primary sequence of amino acids). MathFeature makes available 20 numerical feature extraction descriptors based on approaches found in the literature, e.g. multiple numeric mappings, genomic signal processing, chaos game theory, entropy and complex networks. MathFeature also allows the extraction of alternative features, complementing the existing packages. To ensure that our descriptors are robust and to assess their relevance, experimental results are presented in nine case studies. According to these results, the features extracted by MathFeature showed high performance (0.6350-0.9897, accuracy), both applying only mathematical descriptors, but also hybridization with well-known descriptors in the literature. Finally, through MathFeature, we overcame several studies in eight benchmark datasets, exemplifying the robustness and viability of the proposed package. MathFeature has advanced in the area by bringing descriptors not available in other packages, as well as allowing non-experts to use feature extraction techniques.
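As a rough illustration of what an entropy-based descriptor might look like, the sketch below computes the Shannon entropy of a sequence's k-mer frequency distribution and turns it into a fixed-length numeric vector. It is not MathFeature's API or implementation: the function names `kmer_shannon_entropy` and `entropy_features`, and the choice of base-2 logarithms, are assumptions made for the example.

```python
import math
from collections import Counter


def kmer_shannon_entropy(seq: str, k: int) -> float:
    """Shannon entropy (in bits) of the k-mer frequency distribution of seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        # Sequence shorter than k: no k-mers, so define the entropy as 0.
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_features(seq: str, max_k: int = 3) -> list[float]:
    # One entropy value per k gives a fixed-length numeric vector
    # regardless of the sequence length.
    return [kmer_shannon_entropy(seq, k) for k in range(1, max_k + 1)]


if __name__ == "__main__":
    print(entropy_features("ACGTACGTTGCAACGT"))
```

A vector of this kind can be passed directly to a standard classifier, which is the sense in which such descriptors represent a sequence as a numeric input vector.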
Affiliation(s)
- Robson P Bonidia
  - Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos 13566-590, Brazil
- Douglas S Domingues
  - Group of Genomics and Transcriptomes in Plants, Institute of Biosciences, São Paulo State University (UNESP), Rio Claro 13506-900, Brazil
- Danilo S Sanches
  - Department of Computer Science, Federal University of Technology - Paraná, UTFPR, Cornélio Procópio 86300-000, Brazil
- André C P L F de Carvalho
  - Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos 13566-590, Brazil
|
5
|
Blanke M, Morgenstern B. App-SpaM: phylogenetic placement of short reads without sequence alignment. Bioinformatics Advances 2021; 1:vbab027. [PMID: 36700102 PMCID: PMC9710606 DOI: 10.1093/bioadv/vbab027] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/31/2021] [Revised: 09/27/2021] [Accepted: 10/11/2021] [Indexed: 01/28/2023]
Abstract
Motivation: Phylogenetic placement is the task of placing a query sequence of unknown taxonomic origin into a given phylogenetic tree of a set of reference sequences. A major field of application of such methods is, for example, the taxonomic identification of reads in metabarcoding or metagenomic studies. Several approaches to phylogenetic placement have been proposed in recent years. The most accurate of them requires a multiple sequence alignment of the references as input. However, calculating multiple alignments is not only time-consuming but also limits the applicability of these approaches.
Results: Herein, we propose Alignment-free phylogenetic placement algorithm based on Spaced-word Matches (App-SpaM), an efficient algorithm for the phylogenetic placement of short sequencing reads on a tree of a set of reference sequences. App-SpaM produces results of high quality that are on a par with the best available approaches to phylogenetic placement, while our software is two orders of magnitude faster than these existing methods. Our approach neither requires a multiple alignment of the reference sequences nor alignments of the queries to the references. This enables App-SpaM to perform phylogenetic placement on a broad variety of datasets.
Availability and implementation: The source code of App-SpaM is freely available on Github at https://github.com/matthiasblanke/App-SpaM together with detailed instructions for installation and settings. App-SpaM is furthermore available as a Conda-package on the Bioconda channel.
Contact: matthias.blanke@biologie.uni-goettingen.de.
Supplementary information: Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Matthias Blanke
  - Department of Bioinformatics, Institute of Microbiology and Genetics, Georg-August-University Göttingen, Göttingen 37077, Germany
  - International Max Planck Research School for Genome Science, Göttingen 37077, Germany
- Burkhard Morgenstern
  - Department of Bioinformatics, Institute of Microbiology and Genetics, Georg-August-University Göttingen, Göttingen 37077, Germany
  - Campus-Institute Data Science (CIDAS), Göttingen 37077, Germany
|
6
|
Advani D, Sharma S, Kumari S, Ambasta RK, Kumar P. Precision Oncology, Signaling and Anticancer Agents in Cancer Therapeutics. Anticancer Agents Med Chem 2021; 22:433-468. [PMID: 33687887 DOI: 10.2174/1871520621666210308101029] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/21/2020] [Revised: 01/05/2021] [Accepted: 01/12/2021] [Indexed: 11/22/2022]
Abstract
BACKGROUND: The global alliance for genomics and healthcare facilities provides innovational solutions to expedite research and clinical practices for complex and incurable health conditions. Precision oncology is an emerging field explicitly tailored to facilitate cancer diagnosis, prevention and treatment based on patients' genetic profile. Advancements in "omics" techniques, next-generation sequencing, artificial intelligence and clinical trial designs provide a platform for assessing the efficacy and safety of combination therapies and diagnostic procedures.
METHOD: Data were collected from Pubmed and Google scholar using keywords: "Precision medicine", "precision medicine and cancer", "anticancer agents in precision medicine" and reviewed comprehensively.
RESULTS: Personalized therapeutics including immunotherapy, cancer vaccines, serve as a groundbreaking solution for cancer treatment. Herein, we take a measurable view of precision therapies and novel diagnostic approaches targeting cancer treatment. The contemporary applications of precision medicine have also been described along with various hurdles identified in the successful establishment of precision therapeutics.
CONCLUSION: This review highlights the key breakthroughs related to immunotherapies, targeted anticancer agents, and target interventions related to cancer signaling mechanisms. The success story of this field in context to drug resistance, safety, patient survival and in improving quality of life is yet to be elucidated. We conclude that, in the near future, the field of individualized treatments may truly revolutionize the nature of cancer patient care.
Affiliation(s)
- Dia Advani
  - Molecular Neuroscience and Functional Genomics Laboratory, Shahbad Daulatpur, Bawana Road, Delhi 110042, India
- Sudhanshu Sharma
  - Molecular Neuroscience and Functional Genomics Laboratory, Shahbad Daulatpur, Bawana Road, Delhi 110042, India
- Smita Kumari
  - Molecular Neuroscience and Functional Genomics Laboratory, Shahbad Daulatpur, Bawana Road, Delhi 110042, India
- Rashmi K Ambasta
  - Molecular Neuroscience and Functional Genomics Laboratory, Shahbad Daulatpur, Bawana Road, Delhi 110042, India
- Pravir Kumar
  - Molecular Neuroscience and Functional Genomics Laboratory, Shahbad Daulatpur, Bawana Road, Delhi 110042, India
|
7
|
The Role of Artificial Intelligence and Machine Learning Techniques: Race for COVID-19 Vaccine. Archives of Clinical Infectious Diseases 2020. [DOI: 10.5812/archcid.103232] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Indexed: 12/20/2022]
|
8
|
Uddin M, Wang Y, Woodbury-Smith M. Artificial intelligence for precision medicine in neurodevelopmental disorders. NPJ Digit Med 2019; 2:112. [PMID: 31799421 PMCID: PMC6872596 DOI: 10.1038/s41746-019-0191-0] [Citation(s) in RCA: 76] [Impact Index Per Article: 15.2] [Received: 12/03/2018] [Accepted: 10/29/2019] [Indexed: 12/23/2022]
Abstract
The ambition of precision medicine is to design and optimize the pathway for diagnosis, therapeutic intervention, and prognosis by using large multidimensional biological datasets that capture individual variability in genes, function and environment. This offers clinicians the opportunity to more carefully tailor early interventions, whether treatment or preventative in nature, to each individual patient. Taking advantage of high performance computer capabilities, artificial intelligence (AI) algorithms can now achieve reasonable success in predicting risk in certain cancers and cardiovascular disease from available multidimensional clinical and biological data. In contrast, less progress has been made with the neurodevelopmental disorders, which include intellectual disability (ID), autism spectrum disorder (ASD), epilepsy and broader neurodevelopmental disorders. Much hope is pinned on the opportunity to quantify risk from patterns of genomic variation, including the functional characterization of genes and variants, but this ambition is confounded by phenotypic and etiologic heterogeneity, along with the rare and variable penetrant nature of the underlying risk variants identified so far. Structural and functional brain imaging and neuropsychological and neurophysiological markers may provide further dimensionality, but often require more development to achieve sensitivity for diagnosis. Herein, therefore, lies a precision medicine conundrum: can artificial intelligence offer a breakthrough in predicting risks and prognosis for neurodevelopmental disorders? In this review we will examine these complexities, and consider some of the strategies whereby artificial intelligence may overcome them.
Affiliation(s)
- Mohammed Uddin
  - Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai, UAE
  - The Centre for Applied Genomics, The Hospital for Sick Children, Toronto, ON, Canada
- Yujiang Wang
  - Institute of Neuroscience, Newcastle University, Newcastle upon Tyne, UK
  - School of Computing, Newcastle University, Newcastle upon Tyne, UK
- Marc Woodbury-Smith
  - The Centre for Applied Genomics, The Hospital for Sick Children, Toronto, ON, Canada
  - Institute of Neuroscience, Newcastle University, Newcastle upon Tyne, UK
|