1
Tay AP, Didi K, Wickramarachchi A, Bauer DC, Wilson LOW, Maselko M. Synsor: a tool for alignment-free detection of engineered DNA sequences. Front Bioeng Biotechnol 2024; 12:1375626. [PMID: 39070163 PMCID: PMC11272466 DOI: 10.3389/fbioe.2024.1375626] [Received: 01/24/2024] [Accepted: 06/18/2024] [Indexed: 07/30/2024]
Abstract
DNA sequences of nearly any desired composition, length, and function can be synthesized to alter the biology of an organism for purposes ranging from the bioproduction of therapeutic compounds to invasive pest control. Yet despite offering many great benefits, engineered DNA poses a risk due to its possible misuse or abuse by malicious actors, or its unintentional introduction into the environment. Monitoring the presence of engineered DNA in biological or environmental systems is therefore crucial for routine and timely detection of emerging biological threats, and for improving public acceptance of genetic technologies. To address this, we developed Synsor, a tool for identifying engineered DNA sequences in high-throughput sequencing data. Synsor leverages the k-mer signature differences between naturally occurring and engineered DNA sequences and uses an artificial neural network to classify whether a DNA sequence is natural or engineered. By querying suspected sequences against the model, Synsor can identify sequences that are likely to have been engineered. Using natural plasmid and engineered vector sequences, we showed that Synsor identifies engineered DNA with >99% accuracy. We demonstrate how Synsor can be used to detect potential genetically engineered organisms and locate where engineered DNA is being introduced into the environment by analysing genomic and metagenomic data from yeast and wastewater samples, respectively. Synsor is therefore a powerful tool that will streamline the process of identifying engineered DNA in poorly characterized biological or environmental systems, thereby allowing for enhanced monitoring of emerging biological threats.
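As a rough illustration of what a k-mer signature is, the sketch below computes normalized k-mer frequencies for a DNA string. It is not Synsor's implementation: the abstract does not state the value of k, the feature encoding, or the network architecture, so the function name `kmer_profile`, the default `k=3`, and the handling of ambiguous bases are all assumptions made for this example.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq: str, k: int = 3) -> dict[str, float]:
    """Return normalized frequencies for all 4**k DNA k-mers in seq.

    Hypothetical helper for illustration only; windows containing
    symbols other than A, C, G, T (for example N) are skipped.
    """
    seq = seq.upper()
    valid = set("ACGT")
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid
    )
    total = sum(counts.values())
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    # Every possible k-mer gets an entry, so profiles from different
    # sequences share the same keys and can be compared directly.
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


profile = kmer_profile("ACGTACGTNACGT", k=2)
```

A fixed-length vector like this is the kind of input a classifier could consume; which classifier and which k work best is a question the paper itself addresses.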
Affiliation(s)
- Aidan P. Tay
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Sydney, NSW, Australia
  - Applied Biosciences, Faculty of Science and Engineering, Macquarie University, Sydney, NSW, Australia
- Kieran Didi
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Sydney, NSW, Australia
- Anuradha Wickramarachchi
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Sydney, NSW, Australia
- Denis C. Bauer
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Sydney, NSW, Australia
  - Applied Biosciences, Faculty of Science and Engineering, Macquarie University, Sydney, NSW, Australia
- Laurence O. W. Wilson
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Sydney, NSW, Australia
  - Applied Biosciences, Faculty of Science and Engineering, Macquarie University, Sydney, NSW, Australia
- Maciej Maselko
  - Applied Biosciences, Faculty of Science and Engineering, Macquarie University, Sydney, NSW, Australia
  - Health and Biosecurity, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Sydney, NSW, Australia
2
Thuronyi BW, DeBenedictis EA, Barrick JE. No assembly required: Time for stronger, simpler publishing standards for DNA sequences. PLoS Biol 2023; 21:e3002376. [PMID: 37971964 PMCID: PMC10653517 DOI: 10.1371/journal.pbio.3002376] [Indexed: 11/19/2023]
Abstract
Uniformly accessible DNA sequences are needed to improve experimental reproducibility and automation. Rather than descriptions of how engineered DNA is assembled, publishers should require complete and empirically validated sequences.
Affiliation(s)
- B W. Thuronyi
  - Department of Chemistry, Williams College, Williamstown, Massachusetts, United States of America
- Jeffrey E. Barrick
  - Department of Molecular Biosciences, Center for Systems and Synthetic Biology, The University of Texas at Austin, Austin, Texas, United States of America
3
Spirgel R, Comolli J, Guido NJ. A Machine Learning Method for Genome Engineering Design Tool Attribution. Health Secur 2023; 21:407-414. [PMID: 37594776 DOI: 10.1089/hs.2022.0152] [Indexed: 08/19/2023]
Abstract
As the ability to engineer biological systems improves with increasingly advanced technology, the risk of accidental or intentional release of a dangerous genetically modified organism becomes greater. It is important that authorities can carry out attribution for the source of a genetically modified biological agent release. In the absence of evidence that ties a release directly to the individuals responsible, attribution can be carried out in part by discovering the in silico tools used to design the engineered genetic components, which can leave a signature in the DNA of the organism. Previous attribution methods have focused on identifying the laboratory of origin of an engineered organism using machine learning on plasmid signatures. The next logical step is to address attribution using signatures from the tools that are used to create the engineered modifications. A random forest classifier was developed that discriminates between design tools used to optimize coding regions for incorporation into the genome of another organism. To this end, tens of thousands of genes were optimized with 4 different codon optimization methods and relevant features from these sequences were generated for a machine learning classifier. This method achieves more than 97% accuracy in predicting which tools were used to design codon optimized genes for expression in other organisms. The methods presented here lay the groundwork for the creation of effective organism engineering attribution techniques. Such methods can act both as deterrents for future attempts at creating dangerous organisms as well as tools for forensic science.
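To make the idea of sequence-derived features concrete, here is a minimal sketch that turns a coding sequence into relative codon frequencies. The abstract does not say which features the authors generated, so codon usage is an assumed example feature, and the helper name `codon_usage` is hypothetical; it is not the paper's feature set or classifier.

```python
from collections import Counter


def codon_usage(cds: str) -> dict[str, float]:
    """Relative frequency of each codon in an in-frame coding sequence.

    Illustrative helper only: assumes the sequence starts in frame and
    silently drops a trailing partial codon.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    total = len(codons)
    if total == 0:
        return {}
    counts = Counter(codons)
    return {codon: n / total for codon, n in counts.items()}


# Five codons: ATG, GCT, GCT, GCC, TAA
usage = codon_usage("ATGGCTGCTGCCTAA")
```

Different codon optimization tools tend to pick different synonymous codons, so frequency vectors of this kind are one plausible input for a classifier that tries to tell the tools apart.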
Affiliation(s)
- Rebecca Spirgel
  - Rebecca Spirgel, MS, Associate Technical Staff, Group 23, MIT Lincoln Laboratory, Lexington, MA
- James Comolli
  - James Comolli, PhD, Technical Staff, Group 23, MIT Lincoln Laboratory, Lexington, MA
- Nicholas J Guido
  - Nicholas J. Guido, PhD, Technical Staff, Group 23, MIT Lincoln Laboratory, Lexington, MA
4
Crook OM, Warmbrod KL, Lipstein G, Chung C, Bakerlee CW, McKelvey TG, Holland SR, Swett JL, Esvelt KM, Alley EC, Bradshaw WJ. Analysis of the first genetic engineering attribution challenge. Nat Commun 2022; 13:7374. [PMID: 36450726 PMCID: PMC9712580 DOI: 10.1038/s41467-022-35032-8] [Received: 01/26/2022] [Accepted: 11/16/2022] [Indexed: 12/03/2022]
Abstract
The ability to identify the designer of engineered biological sequences, termed genetic engineering attribution (GEA), would help ensure due credit for biotechnological innovation, while holding designers accountable to the communities they affect. Here, we present the results of the first Genetic Engineering Attribution Challenge, a public data-science competition to advance GEA techniques. Top-scoring teams dramatically outperformed previous models at identifying the true lab-of-origin of engineered plasmid sequences, including an increase in top-1 and top-10 accuracy of 10 percentage points. A simple ensemble of prizewinning models further increased performance. New metrics, designed to assess a model's ability to confidently exclude candidate labs, also showed major improvements, especially for the ensemble. Most winning teams adopted CNN-based machine-learning approaches; however, one team achieved very high accuracy with an extremely fast neural-network-free approach. Future work, including future competitions, should further explore a wide diversity of approaches for bringing GEA technology into practical use.
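The top-1 and top-10 accuracy figures mentioned above follow the usual top-k definition: a prediction counts as correct when the true lab appears among the model's k highest-ranked candidates. A minimal sketch, with a hypothetical function name and made-up lab labels (none of this is taken from the challenge's scoring code):

```python
def top_k_accuracy(
    ranked_predictions: list[list[str]],
    true_labels: list[str],
    k: int,
) -> float:
    """Fraction of samples whose true label is among the top k predictions.

    ranked_predictions[i] lists candidate labels for sample i, ordered
    from most to least likely.
    """
    if len(ranked_predictions) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    if not true_labels:
        raise ValueError("at least one sample is required")
    hits = sum(
        truth in ranked[:k]
        for ranked, truth in zip(ranked_predictions, true_labels)
    )
    return hits / len(true_labels)


# Toy data: three sequences, all truly from "labA".
preds = [
    ["labA", "labB", "labC"],
    ["labB", "labA", "labC"],
    ["labC", "labB", "labA"],
]
truth = ["labA", "labA", "labA"]

top1 = top_k_accuracy(preds, truth, k=1)
top2 = top_k_accuracy(preds, truth, k=2)
top3 = top_k_accuracy(preds, truth, k=3)
```

Top-k accuracy can only stay the same or rise as k grows, which is why top-10 scores are reported alongside the stricter top-1 scores.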
Affiliation(s)
- Oliver M Crook
  - Oxford Protein Informatics Group, Department of Statistics, University of Oxford, Oxford, UK
- Kelsey Lane Warmbrod
  - Johns Hopkins Center for Health Security, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
  - Institute of Public Health Genetics, University of Washington, Seattle, WA, USA
- Kevin M Esvelt
  - altLabs Inc, Berkeley, CA, USA
  - Media Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Ethan C Alley
  - altLabs Inc, Berkeley, CA, USA
  - Media Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- William J Bradshaw
  - altLabs Inc, Berkeley, CA, USA
  - Media Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA