1
|
Giannantoni L, Bardini R, Savino A, Di Carlo S. Biology System Description Language (BiSDL): a modeling language for the design of multicellular synthetic biological systems. BMC Bioinformatics 2024; 25:166. [PMID: 38664639 PMCID: PMC11046772 DOI: 10.1186/s12859-024-05782-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/11/2024] [Accepted: 04/12/2024] [Indexed: 04/28/2024] Open
Abstract
BACKGROUND The Biology System Description Language (BiSDL) is an accessible, easy-to-use computational language for multicellular synthetic biology. It allows synthetic biologists to represent spatiality and multi-level cellular dynamics inherent to multicellular designs, filling a gap in the state of the art. Developed for designing and simulating spatial, multicellular synthetic biological systems, BiSDL integrates high-level conceptual design with detailed low-level modeling, fostering collaboration in the Design-Build-Test-Learn cycle. BiSDL descriptions directly compile into Nets-Within-Nets (NWNs) models, offering a unique approach to spatial and hierarchical modeling in biological systems. RESULTS BiSDL's effectiveness is showcased through three case studies on complex multicellular systems: a bacterial consortium, a synthetic morphogen system and a conjugative plasmid transfer process. These studies highlight the BiSDL proficiency in representing spatial interactions and multi-level cellular dynamics. The language facilitates the compilation of conceptual designs into detailed, simulatable models, leveraging the NWNs formalism. This enables intuitive modeling of complex biological systems, making advanced computational tools more accessible to a broader range of researchers. CONCLUSIONS BiSDL represents a significant step forward in computational languages for synthetic biology, providing a sophisticated yet user-friendly tool for designing and simulating complex biological systems with an emphasis on spatiality and cellular dynamics. Its introduction has the potential to transform research and development in synthetic biology, allowing for deeper insights and novel applications in understanding and manipulating multicellular systems.
Collapse
Affiliation(s)
- Leonardo Giannantoni
- Department of Control and Computer Engineering, Polytechnic University of Turin, Corso Duca degli Abruzzi, 24, 100129, Turin, TO, Italy
| | - Roberta Bardini
- Department of Control and Computer Engineering, Polytechnic University of Turin, Corso Duca degli Abruzzi, 24, 100129, Turin, TO, Italy.
| | - Alessandro Savino
- Department of Control and Computer Engineering, Polytechnic University of Turin, Corso Duca degli Abruzzi, 24, 100129, Turin, TO, Italy
| | - Stefano Di Carlo
- Department of Control and Computer Engineering, Polytechnic University of Turin, Corso Duca degli Abruzzi, 24, 100129, Turin, TO, Italy
| |
Collapse
|
2
|
Magdy Mohamed Abdelaziz Barakat S, Sallehuddin R, Yuhaniz SS, R. Khairuddin RF, Mahmood Y. Genome assembly composition of the String "ACGT" array: a review of data structure accuracy and performance challenges. PeerJ Comput Sci 2023; 9:e1180. [PMID: 37547391 PMCID: PMC10403225 DOI: 10.7717/peerj-cs.1180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/07/2022] [Accepted: 04/27/2023] [Indexed: 08/08/2023]
Abstract
Background The development of sequencing technology increases the number of genomes being sequenced. However, obtaining a quality genome sequence remains a challenge in genome assembly by assembling a massive number of short strings (reads) with the presence of repetitive sequences (repeats). Computer algorithms for genome assembly construct the entire genome from reads in two approaches. The de novo approach concatenates the reads based on the exact match between their suffix-prefix (overlapping). Reference-guided approach orders the reads based on their offsets in a well-known reference genome (reads alignment). The presence of repeats extends the technical ambiguity, making the algorithm unable to distinguish the reads resulting in misassembly and affecting the assembly approach accuracy. On the other hand, the massive number of reads causes a big assembly performance challenge. Method The repeat identification method was introduced for misassembly by prior identification of repetitive sequences, creating a repeat knowledge base to reduce ambiguity during the assembly process, thus enhancing the accuracy of the assembled genome. Also, hybridization between assembly approaches resulted in a lower misassembly degree with the aid of the reference genome. The assembly performance is optimized through data structure indexing and parallelization. This article's primary aim and contribution are to support the researchers through an extensive review to ease other researchers' search for genome assembly studies. The study also, highlighted the most recent developments and limitations in genome assembly accuracy and performance optimization. Results Our findings show the limitations of the repeat identification methods available, which only allow to detect of specific lengths of the repeat, and may not perform well when various types of repeats are present in a genome. We also found that most of the hybrid assembly approaches, either starting with de novo or reference-guided, have some limitations in handling repetitive sequences as it is more computationally costly and time intensive. Although the hybrid approach was found to outperform individual assembly approaches, optimizing its performance remains a challenge. Also, the usage of parallelization in overlapping and reads alignment for genome assembly is yet to be fully implemented in the hybrid assembly approach. Conclusion We suggest combining multiple repeat identification methods to enhance the accuracy of identifying the repeats as an initial step to the hybrid assembly approach and combining genome indexing with parallelization for better optimization of its performance.
Collapse
Affiliation(s)
| | - Roselina Sallehuddin
- Computer Science, School of Computing, Faculty of Engineering, Universiti Teknologi Malaysia, Skudai, Johor, Malaysia
| | - Siti Sophiayati Yuhaniz
- Advanced Informatics Department, Razak Faculty of Technology and Informatics, Universiti Teknologi Malaysia, Kuala Lumpur, Kuala Lumpur, Malaysia
| | | | - Yasir Mahmood
- Faculty of Information Technology, The University of Lahore, Lahore, Lahore, Pakistan
| |
Collapse
|
3
|
Rahman A, Medvedev P. Assembler artifacts include misassembly because of unsafe unitigs and underassembly because of bidirected graphs. Genome Res 2022; 32:1746-1753. [PMID: 35896425 PMCID: PMC9528984 DOI: 10.1101/gr.276601.122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2022] [Accepted: 07/26/2022] [Indexed: 11/24/2022]
Abstract
Recent assemblies by the T2T and VGP consortia have achieved significant accuracy but required a tremendous amount of effort and resources. More typical assembly efforts, on the other hand, still suffer both from misassemblies (joining sequences that should not be adjacent) and from underassemblies (not joining sequences that should be adjacent). To better understand the common algorithm-driven causes of these limitations, we investigated the unitig algorithm, which is a core algorithm at the heart of most assemblers. We prove that, contrary to popular belief, even when there are no sequencing errors, unitigs are not always safe (i.e., they are not guaranteed to be substrings of the sequenced genome). We also prove that the unitigs of a bidirected de Bruijn graph are different from those of a doubled de Bruijn graph and, contrary to our expectations, result in underassembly. Using experimental simulations, we then confirm that these two artifacts exist not only in theory but also in the output of widely used assemblers. In particular, when coverage is low, then even error-free data result in unsafe unitigs; also, unitigs may unnecessarily split palindromes in half if special care is not taken. To the best of our knowledge, this paper is the first to theoretically predict the existence of these assembler artifacts and confirm and measure the extent of their occurrence in practice.
Collapse
Affiliation(s)
- Amatur Rahman
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
| | - Paul Medvedev
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
| |
Collapse
|
4
|
Bhardwaj V, Pevzner PA, Rashtchian C, Safonova Y. Trace Reconstruction Problems in Computational Biology. IEEE TRANSACTIONS ON INFORMATION THEORY 2021; 67:3295-3314. [PMID: 34176957 PMCID: PMC8224466 DOI: 10.1109/tit.2020.3030569] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
The problem of reconstructing a string from its error-prone copies, the trace reconstruction problem, was introduced by Vladimir Levenshtein two decades ago. While there has been considerable theoretical work on trace reconstruction, practical solutions have only recently started to emerge in the context of two rapidly developing research areas: immunogenomics and DNA data storage. In immunogenomics, traces correspond to mutated copies of genes, with mutations generated naturally by the adaptive immune system. In DNA data storage, traces correspond to noisy copies of DNA molecules that encode digital data, with errors being artifacts of the data retrieval process. In this paper, we introduce several new trace generation models and open questions relevant to trace reconstruction for immunogenomics and DNA data storage, survey theoretical results on trace reconstruction, and highlight their connections to computational biology. Throughout, we discuss the applicability and shortcomings of known solutions and suggest future research directions.
Collapse
Affiliation(s)
- Vinnu Bhardwaj
- Electrical and Computer Engineering Department, University of California San Diego, La Jolla, USA
| | - Pavel A. Pevzner
- Computer Science and Engineering Department, University of California San Diego, La Jolla, USA
| | - Cyrus Rashtchian
- Computer Science and Engineering Department, University of California San Diego, La Jolla, USA
- Qualcomm Institute, University of California San Diego, La Jolla, USA
| | - Yana Safonova
- Computer Science and Engineering Department, University of California San Diego, La Jolla, USA
| |
Collapse
|
5
|
Abstract
Many students are taught about genome assembly using the dichotomy between the complexity of finding Eulerian and Hamiltonian cycles (easy versus hard, respectively). This dichotomy is sometimes used to motivate the use of de Bruijn graphs in practice. In this paper, we explain that while de Bruijn graphs have indeed been very useful, the reason has nothing to do with the complexity of the Hamiltonian and Eulerian cycle problems. We give 2 arguments. The first is that a genome reconstruction is never unique and hence an algorithm for finding Eulerian or Hamiltonian cycles is not part of any assembly algorithm used in practice. The second is that even if an arbitrary genome reconstruction was desired, one could do so in linear time in both the Eulerian and Hamiltonian paradigms.
Collapse
Affiliation(s)
- Paul Medvedev
- Department of Computer Science and Engineering, Pennsylvania State University, University Park, Pennsylvania, United States of America
- Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, Pennsylvania, United States of America
- Center for Computational Biology and Bioinformatics, Pennsylvania State University, University Park, Pennsylvania, United States of America
- * E-mail:
| | - Mihai Pop
- Department of Computer Science, University of Maryland, College Park, Maryland, United States of America
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America
| |
Collapse
|
6
|
Abstract
Given the popularity and elegance of k-mer-based tools, finding a space-efficient way to represent a set of k-mers is important for improving the scalability of bioinformatics analyses. One popular approach is to convert the set of k-mers into the more compact set of unitigs. We generalize this approach and formulate it as the problem of finding a smallest spectrum-preserving string set (SPSS) representation. We show that this problem is equivalent to finding a smallest path cover in a compacted de Bruijn graph. Using this reduction, we prove a lower bound on the size of the optimal SPSS and propose a greedy method called UST (Unitig-STitch) that results in a smaller representation than unitigs and is nearly optimal with respect to our lower bound. We demonstrate the usefulness of the SPSS formulation with two applications of UST. The first one is a compression algorithm, UST-Compress, which, we show, can store a set of k-mers by using an order-of-magnitude less disk space than other lossless compression tools. The second one is an exact static k-mer membership index, UST-FM, which, we show, improves index size by 10%-44% compared with other state-of-the-art low-memory indices.
Collapse
Affiliation(s)
- Amatur Rahman
- Department of Computer Science and Engineering, Penn State, University Park, State College, PA, USA
| | - Paul Medevedev
- Department of Computer Science and Engineering, Penn State, University Park, State College, PA, USA
- Department of Biochemistry and Molecular Biology, Penn State, University Park, State College, PA, USA
- Center for Computational Biology and Bioinformatics, Penn State, University Park, State College, PA, USA
| |
Collapse
|
7
|
|
8
|
Ten Simple Rules for writing algorithmic bioinformatics conference papers. PLoS Comput Biol 2020; 16:e1007742. [PMID: 32240173 PMCID: PMC7117650 DOI: 10.1371/journal.pcbi.1007742] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022] Open
Abstract
Conferences are great venues for disseminating algorithmic bioinformatics results, but they unfortunately do not offer an opportunity to make major revisions in the way that journals do. As a result, it is not possible for authors to fix mistakes that might be easily correctable but nevertheless can cause the paper to be rejected. As a reviewer, I wish that I had the opportunity to tell the authors, “Hey, you forgot to do this really important thing, without which it is hard to accept the paper, but if you could go back and fix it, you might have a great paper for the conference.” This lack of a back and forth can be especially problematic for first-time submitters or those from outside the field, e.g., biologists. In this article, I outline Ten Simple Rules to follow when writing an algorithmic bioinformatics conference paper to avoid having it rejected.
Collapse
|