1
Kundu P, Beura S, Mondal S, Das AK, Ghosh A. Machine learning for the advancement of genome-scale metabolic modeling. Biotechnol Adv 2024; 74:108400. [PMID: 38944218 DOI: 10.1016/j.biotechadv.2024.108400] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2023] [Revised: 05/13/2024] [Accepted: 06/23/2024] [Indexed: 07/01/2024]
Abstract
Constraint-based modeling (CBM) has evolved as the core systems biology tool to map the interrelations between genotype, phenotype, and external environment. The recent advancement of high-throughput experimental approaches and multi-omics strategies has generated a plethora of new and precise information from wide-ranging biological domains. On the other hand, the continuously growing field of machine learning (ML) and its specialized branch of deep learning (DL) provide essential computational architectures for decoding complex and heterogeneous biological data. In recent years, both multi-omics and ML have assisted in the escalation of CBM. Condition-specific omics data, such as transcriptomics and proteomics, helped contextualize the model prediction while analyzing a particular phenotypic signature. At the same time, the advanced ML tools have eased the model reconstruction and analysis to increase the accuracy and prediction power. However, the development of these multi-disciplinary methodological frameworks mainly occurs independently, which limits the concatenation of biological knowledge from different domains. Hence, we have reviewed the potential of integrating multi-disciplinary tools and strategies from various fields, such as synthetic biology, CBM, omics, and ML, to explore the biochemical phenomenon beyond the conventional biological dogma. How the integrative knowledge of these intersected domains has improved bioengineering and biomedical applications has also been highlighted. We categorically explained the conventional genome-scale metabolic model (GEM) reconstruction tools and their improvement strategies through ML paradigms. Further, the crucial role of ML and DL in omics data restructuring for GEM development has also been briefly discussed. Finally, the case-study-based assessment of the state-of-the-art method for improving biomedical and metabolic engineering strategies has been elaborated. 
Therefore, this review demonstrates how integrating experimental and in silico strategies can help map the ever-expanding knowledge of biological systems driven by condition-specific cellular information. This multiview approach will elevate the application of ML-based CBM in the biomedical and bioengineering fields for the betterment of society and the environment.
Affiliation(s)
- Pritam Kundu
  - School of Energy Science and Engineering, Indian Institute of Technology Kharagpur, West Bengal 721302, India
- Satyajit Beura
  - Department of Bioscience and Biotechnology, Indian Institute of Technology, Kharagpur, West Bengal 721302, India
- Suman Mondal
  - P.K. Sinha Centre for Bioenergy and Renewables, Indian Institute of Technology Kharagpur, West Bengal 721302, India
- Amit Kumar Das
  - Department of Bioscience and Biotechnology, Indian Institute of Technology, Kharagpur, West Bengal 721302, India
- Amit Ghosh
  - School of Energy Science and Engineering, Indian Institute of Technology Kharagpur, West Bengal 721302, India
  - P.K. Sinha Centre for Bioenergy and Renewables, Indian Institute of Technology Kharagpur, West Bengal 721302, India
2
Niya B, Yaakoubi K, Beraich FZ, Arouch M, Meftah Kadmiri I. Current status and future developments of assessing microbiome composition and dynamics in anaerobic digestion systems using metagenomic approaches. Heliyon 2024; 10:e28221. [PMID: 38560681 PMCID: PMC10979216 DOI: 10.1016/j.heliyon.2024.e28221] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/17/2023] [Revised: 03/12/2024] [Accepted: 03/13/2024] [Indexed: 04/04/2024]
Abstract
The metagenomic approach stands as a powerful technique for examining the composition of microbial communities and their involvement in various anaerobic digestion (AD) systems. Understanding the structure, function, and dynamics of microbial communities becomes pivotal for optimizing the biogas process, enhancing its stability and improving overall performance. Currently, taxonomic profiling of biogas-producing communities relies mainly on high-throughput 16S rRNA sequencing, offering insights into the bacterial and archaeal structures of AD assemblages and their correlations with fed substrates and process parameters. To delve even deeper, shotgun and genome-centric metagenomic approaches are employed to recover individual genomes from the metagenome. This provides a nuanced understanding of collective functionalities, interspecies interactions, and microbial associations with abiotic factors. The application of OMICs in AD systems holds the potential to revolutionize the field, leading to more efficient and sustainable waste management practices particularly through the implementation of precision anaerobic digestion systems. As ongoing research in this area progresses, anticipations are high for further exciting developments in the future. This review serves to explore the current landscape of metagenomic analyses, with focus on advancing our comprehension and critically evaluating biases and recommendations in the analysis of microbial communities in anaerobic digesters. Its objective is to explore how contemporary metagenomic approaches can be effectively applied to enhance our understanding and contribute to the refinement of the AD process. This marks a substantial stride towards achieving a more comprehensive understanding of anaerobic digestion systems.
Affiliation(s)
- Btissam Niya
  - Plant and Microbial Biotechnology Center, Moroccan Foundation of Advanced Science Innovation and Research MAScIR, Mohammed VI Polytechnic University (UM6P), Lot 660, Hay Moulay Rachid, 43150, Benguerir, Morocco
  - Engineering, Industrial Management & Innovation Laboratory IMII, Faculty of Science and Technics (FST), Hassan 1st University of Settat, Morocco
- Kaoutar Yaakoubi
  - Plant and Microbial Biotechnology Center, Moroccan Foundation of Advanced Science Innovation and Research MAScIR, Mohammed VI Polytechnic University (UM6P), Lot 660, Hay Moulay Rachid, 43150, Benguerir, Morocco
- Fatima Zahra Beraich
  - Biodome.sarl, Research and Development Design Office of Biogas Technology, Casablanca, Morocco
- Moha Arouch
  - Engineering, Industrial Management & Innovation Laboratory IMII, Faculty of Science and Technics (FST), Hassan 1st University of Settat, Morocco
- Issam Meftah Kadmiri
  - Plant and Microbial Biotechnology Center, Moroccan Foundation of Advanced Science Innovation and Research MAScIR, Mohammed VI Polytechnic University (UM6P), Lot 660, Hay Moulay Rachid, 43150, Benguerir, Morocco
3
Murillo Carrasco AG, Furuya TK, Uno M, Citrangulo Tortelli T, Chammas R. deltaXpress (ΔXpress): a tool for mapping differentially correlated genes using single-cell qPCR data. BMC Bioinformatics 2023; 24:402. [PMID: 37884889 PMCID: PMC10605457 DOI: 10.1186/s12859-023-05541-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2023] [Accepted: 10/20/2023] [Indexed: 10/28/2023]
Abstract
BACKGROUND High-throughput experiments provide deep insight into the molecular biology of different species, but more tools need to be developed to handle this type of data. At the transcriptomics level, quantitative Polymerase Chain Reaction technology (qPCR) can be affordably adapted to produce high-throughput results through a single-cell approach. In addition to comparative expression profiles between groups, single-cell approaches allow us to evaluate and propose new dependency relationships among markers. However, this alternative has not been explored before for large-scale qPCR-based experiments. RESULTS Herein, we present deltaXpress (ΔXpress), a web app for analyzing data from single-cell qPCR experiments using a combination of HTML and R programming languages in a friendly environment. This application uses cycle threshold (Ct) values and categorical information for each sample as input, allowing the best pair of housekeeping genes to be chosen to normalize the expression of target genes. ΔXpress emulates a bulk analysis by observing differentially expressed genes, but in addition, it allows the discovery of pairwise genes differentially correlated when comparing two experimental conditions. Researchers can download normalized data or use subsequent modules to map differentially correlated genes, perform conventional comparisons between experimental groups, obtain additional information about their genes (gene glossary), and generate ready-to-publication images (600 dots per inch). CONCLUSIONS ΔXpress web app is freely available to non-commercial users at https://alexismurillo.shinyapps.io/dXpress/ and can be used for different experiments in all technologies involving qPCR with at least one housekeeping region.
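The normalization step described in this abstract can be illustrated with the widely used ΔCt calculation. The sketch below is not taken from ΔXpress itself: it assumes that the Ct of a target gene is normalized against the arithmetic mean Ct of a pair of housekeeping genes and that amplification efficiency is close to 100%. The gene names and Ct values are hypothetical.

```python
from statistics import mean


def delta_ct(ct_target, ct_housekeeping):
    """Return dCt: the target Ct minus the mean Ct of the housekeeping genes."""
    return ct_target - mean(ct_housekeeping)


def relative_expression(ct_target, ct_housekeeping):
    """Return 2^(-dCt), assuming roughly 100% amplification efficiency."""
    return 2.0 ** (-delta_ct(ct_target, ct_housekeeping))


# Hypothetical Ct values for a single cell (illustrative only).
cell = {"GENE_A": 24.0, "HK1": 18.0, "HK2": 20.0}
housekeeping = [cell["HK1"], cell["HK2"]]

print(delta_ct(cell["GENE_A"], housekeeping))             # 5.0
print(relative_expression(cell["GENE_A"], housekeeping))  # 0.03125
```

A lower ΔCt means higher expression relative to the housekeeping pair, which is why the exponent is negated.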
Affiliation(s)
- Alexis Germán Murillo Carrasco
  - Center for Translational Research in Oncology (LIM24), Instituto Do Cancer Do Estado de Sao Paulo (ICESP), Hospital das Clinicas da Faculdade de Medicina da Universidade de Sao Paulo (HCFMUSP), São Paulo, SP, CEP 01246-000, Brazil
  - Comprehensive Center for Precision Oncology, Universidade de Sao Paulo, São Paulo, Brazil
- Tatiane Katsue Furuya
  - Center for Translational Research in Oncology (LIM24), Instituto Do Cancer Do Estado de Sao Paulo (ICESP), Hospital das Clinicas da Faculdade de Medicina da Universidade de Sao Paulo (HCFMUSP), São Paulo, SP, CEP 01246-000, Brazil
  - Comprehensive Center for Precision Oncology, Universidade de Sao Paulo, São Paulo, Brazil
- Miyuki Uno
  - Center for Translational Research in Oncology (LIM24), Instituto Do Cancer Do Estado de Sao Paulo (ICESP), Hospital das Clinicas da Faculdade de Medicina da Universidade de Sao Paulo (HCFMUSP), São Paulo, SP, CEP 01246-000, Brazil
  - Comprehensive Center for Precision Oncology, Universidade de Sao Paulo, São Paulo, Brazil
- Tharcisio Citrangulo Tortelli
  - Center for Translational Research in Oncology (LIM24), Instituto Do Cancer Do Estado de Sao Paulo (ICESP), Hospital das Clinicas da Faculdade de Medicina da Universidade de Sao Paulo (HCFMUSP), São Paulo, SP, CEP 01246-000, Brazil
  - Comprehensive Center for Precision Oncology, Universidade de Sao Paulo, São Paulo, Brazil
- Roger Chammas
  - Center for Translational Research in Oncology (LIM24), Instituto Do Cancer Do Estado de Sao Paulo (ICESP), Hospital das Clinicas da Faculdade de Medicina da Universidade de Sao Paulo (HCFMUSP), São Paulo, SP, CEP 01246-000, Brazil
  - Comprehensive Center for Precision Oncology, Universidade de Sao Paulo, São Paulo, Brazil
4
Singla D, Yadav IS. GAAP: A GUI-based Genome Assembly and Annotation Package. Curr Genomics 2022; 23:77-82. [PMID: 36778979 PMCID: PMC9878834 DOI: 10.2174/1389202923666220128155537] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/12/2021] [Revised: 12/13/2021] [Accepted: 12/23/2021] [Indexed: 11/22/2022]
Abstract
Background: Next-generation sequencing (NGS) technologies are being continuously used for high-throughput sequencing data generation, which requires easy-to-use GUI-based data analysis software. This kind of software could be used in parallel with sequencing for automatic data analysis. At present, very few such software packages are available and most of them are commercial, thus creating a gap between data generation and data analysis. Methods: GAAP is developed on the NodeJS platform and uses HTML and JavaScript as the front end for communication with users. We have implemented the FastQC and Trimmomatic tools for quality checking and control. Velvet and Prodigal are integrated for genome assembly and gene prediction. Annotation is done with the help of remote NCBI Blast and IPR-Scan. In the back end, we have used PERL and JavaScript for the processing of data. To evaluate the performance of GAAP, we have assembled a viral (SRR11621811), a bacterial (SRR17153353) and a human genome (SRR16845439). Results: We have used GAAP software to assemble and annotate a COVID-19 genome on a desktop computer, which resulted in a single contig of 27994 bp with 99.57% reference genome coverage. This assembly predicted 11 genes, of which 10 were annotated using the annotation module of GAAP. We have also assembled a bacterial and a human genome into 138 and 194281 contigs with N50 values of 100399 and 610, respectively. Conclusion: In this study, we have developed freely available, platform-independent genome assembly and annotation (GAAP) software (www.deepaklab.com/gaap). The software itself acts as a complete data analysis package with quality check, quality control, de-novo genome assembly, gene prediction and annotation (Blast, PFAM, GO-Term, pathway and enzyme mapping) modules.
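The N50 values reported in this abstract summarize assembly contiguity. As a minimal sketch (not GAAP's own code), N50 can be computed under the common definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The contig lengths below are hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Hypothetical contig lengths (illustrative only).
lengths = [2, 3, 4, 5, 6]
print(n50(lengths))  # 5
```

A single-contig assembly, such as the viral genome described above, has an N50 equal to the length of that contig.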
Affiliation(s)
- Deepak Singla
  - School of Agricultural Biotechnology, Punjab Agricultural University, Ludhiana, India. Address correspondence to this author at the School of Agricultural Biotechnology, Punjab Agricultural University, 141004, Ludhiana, India; Tel: +91-9582943705
- Inderjit Singh Yadav
  - School of Agricultural Biotechnology, Punjab Agricultural University, Ludhiana, India
5
Sinha R, Pal RK, De RK. GenSeg and MR-GenSeg: A Novel Segmentation Algorithm and its Parallel MapReduce Based Approach for Identifying Genomic Regions With Copy Number Variations. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:443-454. [PMID: 32750860 DOI: 10.1109/tcbb.2020.3000661] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/11/2023]
Abstract
Identifying intragenic as well as intergenic sequences of the DNA that have structural alterations is a significantly important research area, since these may be the root cause of many neurological and autoimmune diseases, including cancer. Working with whole genome NGS data has provided new insight in this regard, but has led to a huge explosion of data that is growing exponentially. Hence, the challenges lie in efficient means of storing and processing this big data. In this study, we have developed a novel segmentation algorithm, called GenSeg, and its parallel MapReduce based algorithm, called MR-GenSeg, for detecting copy number variations. In order to annotate CNVs (variants), segments formed by GenSeg/MR-GenSeg have been represented in a novel way using a binary tree, where each node is a CNV event. GenSeg considers each position specific data of whole genome DNA sequence, so that precise identification of breakpoints is possible. GenSeg/MR-GenSeg has been compared with twelve popular CNV detection algorithms, where it has outperformed the others in terms of sensitivity, and has achieved a good F-score value. MR-GenSeg has excelled in terms of SpeedUp, when compared with these algorithms. The effect of CNVs on immunoglobulin (IG) genes has also been analysed in this study. Availability: The source codes are available at https://github.com/rituparna-sinha/MapReduce-GENSEG.
6
Herrando‐Pérez S, Tobler R, Huber CD. smartsnp, an R package for fast multivariate analyses of big genomic data. Methods Ecol Evol 2021. [DOI: 10.1111/2041-210x.13684] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 01/04/2023]
Affiliation(s)
- Salvador Herrando‐Pérez
  - Australian Centre for Ancient DNA, School of Biological Sciences, The University of Adelaide, Adelaide, SA, Australia
  - Department of Biogeography and Global Change, Museo Nacional de Ciencias Naturales, Spanish National Research Council (CSIC), Madrid, Spain
- Raymond Tobler
  - Australian Centre for Ancient DNA, School of Biological Sciences, The University of Adelaide, Adelaide, SA, Australia
  - Evolution of Cultural Diversity Initiative, Australian National University, Canberra, ACT, Australia
- Christian D. Huber
  - Australian Centre for Ancient DNA, School of Biological Sciences, The University of Adelaide, Adelaide, SA, Australia
  - Department of Biology, The Pennsylvania State University, University Park, PA, USA
7
Jain S, Saxena A, Hesarur S, Bhadhadhara K, Bharti N, Kasibhatla SM, Sonavane U, Joshi R. GenoVault: a cloud based genomics repository. BioData Min 2021; 14:36. [PMID: 34325724 PMCID: PMC8319889 DOI: 10.1186/s13040-021-00268-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2020] [Accepted: 07/02/2021] [Indexed: 11/15/2022]
Abstract
GenoVault is a cloud-based repository for handling Next Generation Sequencing (NGS) data. It is developed using OpenStack-based private cloud with various services like keystone for authentication, cinder for block storage, neutron for networking and nova for managing compute instances for the Cloud. GenoVault uses object-based storage, which enables data to be stored as objects instead of files or blocks for faster retrieval from different distributed object nodes. Along with a web-based interface, a JavaFX-based desktop client has also been developed to meet the requirements of large file uploads that are usually seen in NGS datasets. Users can store files in their respective object-based storage areas and the metadata provided by the user during file uploads is used for querying the database. GenoVault repository is designed taking into account future needs and hence can scale both vertically and horizontally using OpenStack-based cloud features. Users have an option to make the data shareable to the public or restrict the access as private. Data security is ensured as every container is a separate entity in object-based storage architecture which is also supported by Secure File Transfer Protocol (SFTP) for data upload and download. The data is uploaded by the user in individual containers that include raw read files (fastq), processed alignment files (bam, sam, bed) and the output of variation detection (vcf). GenoVault architecture allows verification of the data in terms of integrity and authentication before making it available to collaborators as per the user’s permissions. GenoVault is useful for maintaining the organization-wide NGS data generated in various labs which is not yet published and submitted to public repositories like NCBI. GenoVault also provides support to share NGS data among the collaborating institutions. GenoVault can thus manage vast volumes of NGS data on any OpenStack-based private cloud.
Affiliation(s)
- Sankalp Jain
  - HPC-M&BA Group, Centre for Development of Advanced Computing (C-DAC), Pune, MH, 411008, India
- Amit Saxena
  - HPC-M&BA Group, Centre for Development of Advanced Computing (C-DAC), Pune, MH, 411008, India
- Suprit Hesarur
  - HPC-M&BA Group, Centre for Development of Advanced Computing (C-DAC), Pune, MH, 411008, India
- Kirti Bhadhadhara
  - HPC-M&BA Group, Centre for Development of Advanced Computing (C-DAC), Pune, MH, 411008, India
- Neeraj Bharti
  - HPC-M&BA Group, Centre for Development of Advanced Computing (C-DAC), Pune, MH, 411008, India
- Uddhavesh Sonavane
  - HPC-M&BA Group, Centre for Development of Advanced Computing (C-DAC), Pune, MH, 411008, India
- Rajendra Joshi
  - HPC-M&BA Group, Centre for Development of Advanced Computing (C-DAC), Pune, MH, 411008, India
8
Prasad A, Bhargava H, Gupta A, Shukla N, Rajagopal S, Gupta S, Sharma A, Valadi J, Nigam V, Suravajhala P. Next Generation Sequencing. Adv Bioinformatics 2021. [DOI: 10.1007/978-981-33-6191-1_14] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
9
Tripathi P, Singh J, Lal JA, Tripathi V. Next-Generation Sequencing: An Emerging Tool for Drug Designing. Curr Pharm Des 2020; 25:3350-3357. [PMID: 31544713 DOI: 10.2174/1381612825666190911155508] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 08/06/2019] [Accepted: 09/05/2019] [Indexed: 12/14/2022]
Abstract
BACKGROUND With the advent of high-throughput next-generation sequencing (NGS), the biological research of drug discovery has been directed towards the oncology and infectious disease therapeutic areas, with extensive use in biopharmaceutical development and vaccine production. METHOD In this review, an effort was made to address the basic background of NGS technologies and the potential applications of NGS in drug designing. Our purpose is also to provide a brief introduction to various next-generation sequencing techniques. DISCUSSIONS The high-throughput methods execute Large-scale Unbiased Sequencing (LUS), which comprises Massively Parallel Sequencing (MPS) or NGS technologies. These are related terms that describe a DNA sequencing technology which has revolutionized genomic research. Using NGS, an entire human genome can be sequenced within a single day. CONCLUSION Analysis of NGS data unravels important clues in the quest for the treatment of various life-threatening diseases and other scientific problems related to human welfare.
Affiliation(s)
- Pooja Tripathi
  - Department of Computational Biology and Bioinformatics, Jacob Institute of Biotechnology and Bioengineering, Sam Higginbottom University of Agriculture Technology and Sciences, Prayagraj, India
- Jyotsna Singh
  - Department of Molecular and Cellular Engineering, Jacob Institute of Biotechnology and Bioengineering, Sam Higginbottom University of Agriculture Technology and Sciences, Prayagraj, India
- Jonathan A Lal
  - Department of Molecular and Cellular Engineering, Jacob Institute of Biotechnology and Bioengineering, Sam Higginbottom University of Agriculture Technology and Sciences, Prayagraj, India
  - Institute for Public Health Genomics, Maastricht University, Maastricht, Netherlands
- Vijay Tripathi
  - Department of Molecular and Cellular Engineering, Jacob Institute of Biotechnology and Bioengineering, Sam Higginbottom University of Agriculture Technology and Sciences, Prayagraj, India
10
Yukselen O, Turkyilmaz O, Ozturk AR, Garber M, Kucukural A. DolphinNext: a distributed data processing platform for high throughput genomics. BMC Genomics 2020; 21:310. [PMID: 32306927 PMCID: PMC7168977 DOI: 10.1186/s12864-020-6714-x] [Citation(s) in RCA: 57] [Impact Index Per Article: 14.3] [Received: 10/22/2019] [Accepted: 04/01/2020] [Indexed: 12/22/2022]
Abstract
BACKGROUND The emergence of high throughput technologies that produce vast amounts of genomic data, such as next-generation sequencing (NGS), is transforming biological research. The dramatic increase in the volume of data, the variety and continuous change of data processing tools, algorithms and databases make analysis the main bottleneck for scientific discovery. The processing of high throughput datasets typically involves many different computational programs, each of which performs a specific step in a pipeline. Given the wide range of applications and organizational infrastructures, there is a great need for highly parallel, flexible, portable, and reproducible data processing frameworks. Several platforms currently exist for the design and execution of complex pipelines. Unfortunately, current platforms lack the necessary combination of parallelism, portability, flexibility and/or reproducibility that are required by the current research environment. To address these shortcomings, workflow frameworks that provide a platform to develop and share portable pipelines have recently arisen. We complement these new platforms by providing a graphical user interface to create, maintain, and execute complex pipelines. Such a platform will simplify robust and reproducible workflow creation for non-technical users as well as provide a robust platform to maintain pipelines for large organizations. RESULTS To simplify development, maintenance, and execution of complex pipelines we created DolphinNext. DolphinNext facilitates building and deployment of complex pipelines using a modular approach implemented in a graphical interface that relies on the powerful Nextflow workflow framework by providing: 1. A drag and drop user interface that visualizes pipelines and allows users to create pipelines without familiarity in underlying programming languages. 2. Modules to execute and monitor pipelines in distributed computing environments such as high-performance clusters and/or cloud. 3. Reproducible pipelines with version tracking and stand-alone versions that can be run independently. 4. Modular process design with process revisioning support to increase reusability and pipeline development efficiency. 5. Pipeline sharing with GitHub and automated testing. 6. Extensive reports with R-markdown and shiny support for interactive data visualization and analysis. CONCLUSION DolphinNext is a flexible, intuitive, web-based data processing and analysis platform that enables creating, deploying, sharing, and executing complex Nextflow pipelines with extensive revisioning and interactive reporting to enhance reproducible results.
Affiliation(s)
- Onur Yukselen
  - Bioinformatics Core, University of Massachusetts Medical School, Worcester, MA, 01605, USA
- Osman Turkyilmaz
  - RNA Therapeutics Institute, University of Massachusetts Medical School, Worcester, MA, 01605, USA
- Ahmet Rasit Ozturk
  - RNA Therapeutics Institute, University of Massachusetts Medical School, Worcester, MA, 01605, USA
- Manuel Garber
  - Bioinformatics Core, University of Massachusetts Medical School, Worcester, MA, 01605, USA
  - Program in Molecular Medicine, University of Massachusetts Medical School, Worcester, MA, 01605, USA
  - Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA, 01605, USA
- Alper Kucukural
  - Bioinformatics Core, University of Massachusetts Medical School, Worcester, MA, 01605, USA
  - Program in Molecular Medicine, University of Massachusetts Medical School, Worcester, MA, 01605, USA
  - Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA, 01605, USA
11
Tripathi R, Aier I, Chakraborty P, Varadwaj PK. Unravelling the role of long non-coding RNA - LINC01087 in breast cancer. Noncoding RNA Res 2019; 5:1-10. [PMID: 31989062 PMCID: PMC6965516 DOI: 10.1016/j.ncrna.2019.12.002] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 10/11/2019] [Revised: 12/17/2019] [Accepted: 12/17/2019] [Indexed: 02/09/2023]
Abstract
Apoptosis is a 'programmed fate' of all cells participating in diverse physiological and pathological conditions. The role of critical regulators and their involvement in this complex multi-stage process of apoptosis woven around non-coding RNAs (ncRNAs) is poorly deciphered in breast carcinoma (BC). Aberrant expression patterns of the ncRNAs and their interacting partners, either ncRNAs or coding RNAs or proteins at any point along these pathways, may lead to the malignant transformation of the affected cells, tumour metastasis and resistance to anticancer drugs. Long non-coding RNAs (lncRNAs), a type of ncRNA, have been considered critical factors for the development and progression of breast cancer. The aim of our study was to identify a set of novel lncRNAs interacting with microRNAs (miRNAs) or proteins that were significantly dysregulated in breast cancer using the RNA-Sequencing (RNA-Seq) technique in different samples, acting as oncogenic drivers contributing to the cancerous phenotype involved in post-transcriptional processing of RNAs. Four lncRNAs, LINC01087, lnc-CLSTN2-1:1, lnc-c7orf65-3:3 and LINC01559:2, were selected for further analysis. Gene expression analysis of over-expressed LINC01087 in vitro reduced both cell viability and apoptosis. We integrated miRNA and mRNA (hsa-miR-548 and AKT1) expression profiles with curated regulations with lncRNA (LINC01087), which has not been previously associated with any breast cancer type, using different computational tools. The network (lncRNA→ miRNA→ mRNA) is promising for the identification of carcinoma associated genes and apoptosis signaling path, highlighting the potential roles of LINC01087, hsa-miR548n and the AKT1 gene, which may play a crucial role in proliferation.
Affiliation(s)
- Rashmi Tripathi
  - Department of Bioinformatics and Applied Sciences, Indian Institute of Information Technology-Allahabad, Allahabad, India
- Imlimaong Aier
  - Department of Bioinformatics and Applied Sciences, Indian Institute of Information Technology-Allahabad, Allahabad, India
- Pavan Chakraborty
  - Department of Information Technology, Indian Institute of Information Technology-Allahabad, Allahabad, India
- Pritish Kumar Varadwaj
  - Department of Bioinformatics and Applied Sciences, Indian Institute of Information Technology-Allahabad, Allahabad, India
12
Calarco L, Barratt J, Ellis J. Detecting sequence variants in clinically important protozoan parasites. Int J Parasitol 2019; 50:1-18. [PMID: 31857072 DOI: 10.1016/j.ijpara.2019.10.004] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 06/19/2019] [Revised: 09/29/2019] [Accepted: 10/01/2019] [Indexed: 02/06/2023]
Abstract
Second and third generation sequencing methods are crucial for population genetic studies, and variant detection is a popular approach for exploiting this sequence data. While mini- and microsatellites are historically useful markers for studying important Protozoa such as Toxoplasma and Plasmodium spp., detecting non-repetitive variants such as those found in genes can be fundamental to investigating a pathogen's biology. These variants, namely single nucleotide polymorphisms and insertions and deletions, can help elucidate the genetic basis of an organism's pathogenicity, identify selective pressures, and resolve phylogenetic relationships. They also have the added benefit of possessing a comparatively low mutation rate, which contributes to their stability. However, there is a plethora of variant analysis tools with nuanced pipelines and conflicting recommendations for best practise, which can be confounding. This lack of standardisation means that variant analysis requires careful parameter optimisation, an understanding of its limitations, and the availability of high quality data. This review explores the value of variant detection when applied to non-model organisms such as clinically important protozoan pathogens. The limitations of current methods are discussed, including special considerations that require the end-users' attention to ensure that the results generated are reproducible, and the biological conclusions drawn are valid.
Affiliation(s)
- Larissa Calarco
  - School of Life Sciences, University of Technology Sydney, PO Box 123, Broadway, NSW 2007, Australia
- Joel Barratt
  - School of Life Sciences, University of Technology Sydney, PO Box 123, Broadway, NSW 2007, Australia
- John Ellis
  - School of Life Sciences, University of Technology Sydney, PO Box 123, Broadway, NSW 2007, Australia
|
13
|
FastqCleaner: an interactive Bioconductor application for quality-control, filtering and trimming of FASTQ files. BMC Bioinformatics 2019; 20:361. [PMID: 31253077 PMCID: PMC6599294 DOI: 10.1186/s12859-019-2961-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 09/05/2018] [Accepted: 06/20/2019] [Indexed: 11/10/2022] Open
Abstract
Background: Exploration and processing of FASTQ files are the first steps in state-of-the-art data analysis workflows of Next Generation Sequencing (NGS) platforms. The large amount of data generated by these technologies poses a challenge for rapid analysis and visualization of sequencing information. Recent integration of the R data analysis platform with web visual frameworks has stimulated the development of user-friendly, powerful, and dynamic NGS data analysis applications. Results: This paper presents FastqCleaner, a Bioconductor visual application for both quality-control (QC) and pre-processing of FASTQ files. The interface shows diagnostic information for the input and output data and allows the user to select a series of filtering and trimming operations in an interactive framework. FastqCleaner combines the technology of Bioconductor for NGS data analysis with the data visualization advantages of a web environment. Conclusions: FastqCleaner is a user-friendly, offline-capable tool that enables access to advanced Bioconductor infrastructure. The novel concept of a Bioconductor interactive application that can be used without the need for programming skills makes FastqCleaner a valuable resource for NGS data analysis. Electronic supplementary material: The online version of this article (10.1186/s12859-019-2961-8) contains supplementary material, which is available to authorized users.
|
14
|
Leivada E, D’Alessandro R, Grohmann KK. Eliciting Big Data From Small, Young, or Non-standard Languages: 10 Experimental Challenges. Front Psychol 2019; 10:313. [PMID: 30837922 PMCID: PMC6382742 DOI: 10.3389/fpsyg.2019.00313] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 10/05/2018] [Accepted: 02/01/2019] [Indexed: 11/17/2022] Open
Abstract
The aim of this work is to identify and analyze a set of challenges that are likely to be encountered when one embarks on fieldwork in linguistic communities that feature small, young, and/or non-standard languages with a goal to elicit big sets of rich data. For each challenge, we (i) explain its nature and implications, (ii) offer one or more examples of how it is manifested in actual linguistic communities, and (iii) where possible, offer recommendations for addressing it effectively. Our list of challenges involves static characteristics (e.g., absence of orthographic conventions and how it affects data collection), dynamic processes (e.g., speed of language change in small languages and how it affects longitudinal collection of big amounts of data), and interactive relations between non-dynamic features that are nevertheless subject to potentially rapid change (e.g., absence of standardized assessment tools or estimates for psycholinguistic variables). The identified challenges represent the domains of data collection and handling, participant recruitment, and experimental design. Among other issues, we discuss population limits and degree of power, inter- and intraspeaker variation, absence of metalanguage and its implications for the process of eliciting acceptability judgments, and challenges that arise from absence of local funding, conflicting regulations in relation to privacy issues, and exporting large samples of data across countries. Finally, the ten experimental challenges presented are relevant to languages from a broad typological spectrum, encompassing both spoken and sign, extant and nearly extinct languages.
Affiliation(s)
- Evelina Leivada
  - Department of Language and Culture, UiT The Arctic University of Norway, Tromsø, Norway
- Roberta D’Alessandro
  - Utrecht Institute of Linguistics, UiL-OTS, Utrecht University, Utrecht, Netherlands
|
15
|
Abstract
During the last decade, ncRNAs have been investigated intensively, revealing their regulatory role in various biological processes. Worldwide research efforts have identified numerous ncRNAs and multiple RNA subtypes with diverse functionalities, which are known to interact with different functional layers, from DNA and RNA to proteins. This makes the prediction of functions for newly identified ncRNAs challenging. Current bioinformatics and systems biology approaches show promising results in facilitating the identification of these diverse ncRNA functionalities. Here, we review (a) current experimental protocols, i.e., for Next Generation Sequencing, for the successful identification of ncRNAs; (b) sequencing data analysis workflows as well as available computational environments; and (c) state-of-the-art approaches to functionally characterize ncRNAs, e.g., by means of transcriptome-wide association studies, molecular network analyses, or artificial intelligence guided prediction. In addition, we present a strategy to cover the identification and functional characterization of unknown transcripts by using connective workflows.
|
16
|
Gao Y, Yurkovich JT, Seo SW, Kabimoldayev I, Dräger A, Chen K, Sastry AV, Fang X, Mih N, Yang L, Eichner J, Cho BK, Kim D, Palsson BO. Systematic discovery of uncharacterized transcription factors in Escherichia coli K-12 MG1655. Nucleic Acids Res 2018; 46:10682-10696. [PMID: 30137486 PMCID: PMC6237786 DOI: 10.1093/nar/gky752] [Citation(s) in RCA: 43] [Impact Index Per Article: 7.2] [Received: 04/13/2018] [Revised: 07/11/2018] [Accepted: 08/08/2018] [Indexed: 02/03/2023] Open
Abstract
Transcriptional regulation enables cells to respond to environmental changes. Of the estimated 304 candidate transcription factors (TFs) in Escherichia coli K-12 MG1655, 185 have been experimentally identified, but ChIP methods have been used to fully characterize only a few dozen. Identifying these remaining TFs is key to improving our knowledge of the E. coli transcriptional regulatory network (TRN). Here, we developed an integrated workflow for the computational prediction and comprehensive experimental validation of TFs using a suite of genome-wide experiments. We applied this workflow to (i) identify 16 candidate TFs from over a hundred uncharacterized genes; (ii) capture a total of 255 DNA binding peaks for ten candidate TFs resulting in six high-confidence binding motifs; (iii) reconstruct the regulons of these ten TFs by determining gene expression changes upon deletion of each TF and (iv) identify the regulatory roles of three TFs (YiaJ, YdcI, and YeiE) as regulators of l-ascorbate utilization, proton transfer and acetate metabolism, and iron homeostasis under iron-limited conditions, respectively. Together, these results demonstrate how this workflow can be used to discover, characterize, and elucidate regulatory functions of uncharacterized TFs in parallel.
Affiliation(s)
- Ye Gao
  - Division of Biological Sciences, University of California, San Diego, La Jolla, CA 92093, USA
  - Department of Bioengineering, University of California San Diego, La Jolla, CA 92093, USA
- James T Yurkovich
  - Department of Bioengineering, University of California San Diego, La Jolla, CA 92093, USA
  - Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA 92093, USA
- Sang Woo Seo
  - School of Chemical and Biological Engineering, Seoul National University, Seoul, Republic of Korea
- Ilyas Kabimoldayev
  - Department of Genetic Engineering and Graduate School of Biotechnology, College of Life Sciences, Kyung Hee University, Yongin, Republic of Korea
- Andreas Dräger
  - Computational Systems Biology of Infection and Antimicrobial-Resistant Pathogens, Center for Bioinformatics Tübingen (ZBIT), 72076 Tübingen, Germany
  - Department of Computer Science, University of Tübingen, 72076 Tübingen, Germany
- Ke Chen
  - Department of Bioengineering, University of California San Diego, La Jolla, CA 92093, USA
- Anand V Sastry
  - Department of Bioengineering, University of California San Diego, La Jolla, CA 92093, USA
- Xin Fang
  - Department of Bioengineering, University of California San Diego, La Jolla, CA 92093, USA
- Nathan Mih
  - Department of Bioengineering, University of California San Diego, La Jolla, CA 92093, USA
  - Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA 92093, USA
- Laurence Yang
  - Department of Bioengineering, University of California San Diego, La Jolla, CA 92093, USA
- Johannes Eichner
  - Computational Systems Biology of Infection and Antimicrobial-Resistant Pathogens, Center for Bioinformatics Tübingen (ZBIT), 72076 Tübingen, Germany
- Byung-Kwan Cho
  - Novo Nordisk Foundation Center for Biosustainability, 2800 Kongens Lyngby, Denmark
  - Department of Biological Sciences, Korea Advanced Institute of Science and Technology, Daejeon 34141, Republic of Korea
- Donghyuk Kim
  - Department of Genetic Engineering and Graduate School of Biotechnology, College of Life Sciences, Kyung Hee University, Yongin, Republic of Korea
  - School of Energy and Chemical Engineering, Ulsan National Institute of Science and Technology (UNIST), Ulsan, Republic of Korea
  - School of Biological Sciences, Ulsan National Institute of Science and Technology (UNIST), Ulsan, Republic of Korea
- Bernhard O Palsson
  - Department of Bioengineering, University of California San Diego, La Jolla, CA 92093, USA
  - Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA 92093, USA
  - Novo Nordisk Foundation Center for Biosustainability, 2800 Kongens Lyngby, Denmark
  - Department of Pediatrics, University of California, San Diego, La Jolla, CA 92093, USA
|
17
|
Demography in the Big Data Revolution: Changing the Culture to Forge New Frontiers. Population Research and Policy Review 2018. [DOI: 10.1007/s11113-018-9464-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 10/17/2022]
|
18
|
Samaddar S, Sinha R, De RK. A Model for Distributed Processing and Analyses of NGS Data under Map-Reduce Paradigm. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 16:827-840. [PMID: 29993814 DOI: 10.1109/tcbb.2018.2816022] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 06/08/2023]
Abstract
The massively parallel sequencing technique introduced by NGS technology has resulted in an exponential growth of sequencing data, with greatly reduced cost and increased throughput. This explosion of data has introduced new challenges with regard to its storage, integration, processing and analysis. In this paper, we propose a novel distributed model under the Map-Reduce paradigm to address the NGS big data problem. The architecture of the model follows a modularized, Map-Reduce based approach with 3 different phases that support various analytical pipelines. The first phase generates detailed base level information for individual genomes by granulating the alignment data. The other 2 phases independently process this base level information in parallel. One of these 2 phases provides an integrated DNA profile of multiple individuals, whereas the other generates contigs with similar features in an individual. Each of these 2 phases generates a repository of genomic information that will facilitate other analytical pipelines. Simulated and real experimental prototypes are provided as results to show the effectiveness of the model and its superiority over a few existing popular models and tools. A detailed description of the scope of applications of this model is also included in this article.
|
19
|
Gut Dysbiosis and Muscle Aging: Searching for Novel Targets against Sarcopenia. Mediators Inflamm 2018; 2018:7026198. [PMID: 29686533 PMCID: PMC5893006 DOI: 10.1155/2018/7026198] [Citation(s) in RCA: 91] [Impact Index Per Article: 15.2] [Received: 08/03/2017] [Revised: 11/28/2017] [Accepted: 12/05/2017] [Indexed: 12/12/2022] Open
Abstract
Advanced age is characterized by several changes, one of which is the impairment of the homeostasis of intestinal microbiota. These alterations critically influence host health and have been associated with morbidity and mortality in older adults. “Inflammaging,” an age-related chronic inflammatory process, is a common trait of several conditions, including sarcopenia. Interestingly, imbalanced intestinal microbial community has been suggested to contribute to inflammaging. Changes in gut microbiota accompanying sarcopenia may be attenuated by supplementation with pre- and probiotics. Although muscle aging has been increasingly recognized as a biomarker of aging, the pathophysiology of sarcopenia is to date only partially appreciated. Due to its development in the context of the age-related inflammatory milieu, several studies favor the hypothesis of a tight connection between sarcopenia and inflammaging. However, conclusive evidence describing the signaling pathways involved has not yet been produced. Here, we review the current knowledge of the changes in intestinal microbiota that occur in advanced age with a special emphasis on findings supporting the idea of a modulation of muscle physiology through alterations in gut microbial composition and activity.
|
20
|
Tripathi R, Chakraborty P, Varadwaj PK. Unraveling long non-coding RNAs through analysis of high-throughput RNA-sequencing data. Noncoding RNA Res 2017; 2:111-118. [PMID: 30159428 PMCID: PMC6096414 DOI: 10.1016/j.ncrna.2017.06.003] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 06/01/2017] [Revised: 06/19/2017] [Accepted: 06/21/2017] [Indexed: 01/01/2023] Open
Abstract
Extensive genome-wide transcriptome studies using high-throughput sequencing techniques have revolutionized the study of genetics and epigenetics at unprecedented resolution. This research has revealed that, besides protein-coding RNAs, a large proportion of the mammalian transcriptome consists of regulatory non-protein-coding RNAs, although the number encoded within the human genome remains unknown. In the past, these non-coding RNAs were dismissed as "dark matter" and "junk". Contrary to that view, RNA-seq, a recently developed experimental technique, is now widely used to study non-coding RNAs, which have gained attention due to their physiological and pathological significance. Long non-coding RNAs, the longest members of the ncRNA family, act as stable and functional parts of a genome and provide important clues about varied biological events, such as the cellular and structural processes governing the complexity of an organism. Here, we review the most recent and influential computational approaches developed to identify and quantify long non-coding RNAs, to help users choose appropriate tools for their specific research.
Affiliation(s)
- Rashmi Tripathi
  - Department of Bioinformatics, Indian Institute of Information Technology Allahabad, Allahabad, 211015, UP, India
- Pavan Chakraborty
  - Department of Information Technology, Indian Institute of Information Technology Allahabad, Allahabad, 211015, UP, India
- Pritish Kumar Varadwaj
  - Department of Bioinformatics, Indian Institute of Information Technology Allahabad, Allahabad, 211015, UP, India
|
21
|
Tripathi R, Patel S, Kumari V, Chakraborty P, Varadwaj PK. DeepLNC, a long non-coding RNA prediction tool using deep neural network. Network Modeling Analysis in Health Informatics and Bioinformatics 2016. [DOI: 10.1007/s13721-016-0129-2] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.9] [Indexed: 11/28/2022]
|