1
Dall'Alba G, Casa PL, Abreu FPD, Notari DL, de Avila E Silva S. A Survey of Biological Data in a Big Data Perspective. Big Data 2022; 10:279-297. [PMID: 35394342 DOI: 10.1089/big.2020.0383] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/14/2023]
Abstract
The amount of available data is continuously growing. This phenomenon promotes a new concept, named big data. The highlight technologies related to big data are cloud computing (infrastructure) and Not Only SQL (NoSQL; data storage). In addition, for data analysis, machine learning algorithms such as decision trees, support vector machines, artificial neural networks, and clustering techniques present promising results. In a biological context, big data has many applications due to the large number of biological databases available. Some limitations of biological big data are related to the inherent features of these data, such as high degrees of complexity and heterogeneity, since biological systems provide information from an atomic level to interactions between organisms or their environment. Such characteristics make most bioinformatic-based applications difficult to build, configure, and maintain. Although the rise of big data is relatively recent, it has contributed to a better understanding of the underlying mechanisms of life. The main goal of this article is to provide a concise and reliable survey of the application of big data-related technologies in biology. As such, some fundamental concepts of information technology, including storage resources, analysis, and data sharing, are described along with their relation to biological data.
Affiliation(s)
- Gabriel Dall'Alba
  - Computational Biology and Bioinformatics Laboratory, Biotechnology Institute, Department of Life Sciences, University of Caxias do Sul, Caxias do Sul, Brazil
  - Genome Science and Technology Program, Faculty of Science, The University of British Columbia, Vancouver, Canada
- Pedro Lenz Casa
  - Computational Biology and Bioinformatics Laboratory, Biotechnology Institute, Department of Life Sciences, University of Caxias do Sul, Caxias do Sul, Brazil
- Fernanda Pessi de Abreu
  - Computational Biology and Bioinformatics Laboratory, Biotechnology Institute, Department of Life Sciences, University of Caxias do Sul, Caxias do Sul, Brazil
- Daniel Luis Notari
  - Computational Biology and Bioinformatics Laboratory, Biotechnology Institute, Department of Life Sciences, University of Caxias do Sul, Caxias do Sul, Brazil
- Scheila de Avila E Silva
  - Computational Biology and Bioinformatics Laboratory, Biotechnology Institute, Department of Life Sciences, University of Caxias do Sul, Caxias do Sul, Brazil
2
Mou X, Jamil HM. Visual Life Sciences Workflow Design Using Distributed and Heterogeneous Resources. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:1459-1473. [PMID: 30561349 DOI: 10.1109/tcbb.2018.2886185] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
Programming or querying usually presupposes some degree of technical familiarity with the syntax of a language and the peculiarity of the objects it manipulates to produce useful information. The degree of abstractions supported in a language helps lessen the depth of such familiarity needed, and aids in improving access to and usability of these resources. To help biologists concentrate more on their science questions and not on how to compute it, several successful workflow orchestration languages and systems have been proposed. Despite their popularity, significant limitations reduce their usability and limit applicability in novel applications. In this paper, we present a visual language, called VisFlow, for workflow orchestration using heterogeneous and distributed resources. We advance the idea that once resources are minimally described and abstracted, arbitrary workflows can be designed solely using query primitives supported in VisFlow. Its capabilities can be augmented by including computational artifacts in the form of library functions written in R, Python, and Java, or even in SQL and XQuery, making it a truly extensible system. We discuss its salient features and illustrate its capabilities using a substantial set of examples.
3
Abstract
The development of next-generation sequencing platforms has substantially increased the capacity for data generation. In addition, the cost of whole-genome sequencing has fallen in recent years, making the technology easier to access. As a result, storing and analyzing the generated data became a challenge, prompting the development of bioinformatic tools, such as programs and programming languages, able to store, process, and analyze this huge amount of information. In this article, we present MELC genomics, a framework for genome assembly in a simple and fast workflow.
4
Chen S, Beltrán JF, Esteban-Jurado C, Franch-Expósito S, Castellví-Bel S, Lipkin S, Wei X, Yu H. GeMSTONE: orchestrated prioritization of human germline mutations in the cloud. Nucleic Acids Res 2017; 45:W207-W214. [PMID: 28521008 PMCID: PMC5556704 DOI: 10.1093/nar/gkx398] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/18/2017] [Accepted: 04/28/2017] [Indexed: 12/28/2022]
Abstract
Integrative analysis of whole-genome/exome-sequencing data has been challenging, especially for the non-programming research community, as it requires simultaneously managing a large number of computational tools. Even computational biologists find it unexpectedly difficult to reproduce results from others or optimize their strategies in an end-to-end workflow. We introduce Germline Mutation Scoring Tool fOr Next-generation sEquencing data (GeMSTONE), a cloud-based variant prioritization tool with high-level customization and a comprehensive collection of bioinformatics tools and data libraries (http://gemstone.yulab.org/). GeMSTONE generates and readily accepts a shareable 'recipe' file for each run to either replicate previous results or analyze new data with identical parameters and provides a centralized workflow for prioritizing germline mutations in human disease within a streamlined workflow rather than a pool of program executions.
Affiliation(s)
- Siwei Chen
  - Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, NY 14853, USA
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853, USA
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY 14853, USA
- Juan F Beltrán
  - Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, NY 14853, USA
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853, USA
- Clara Esteban-Jurado
  - Gastroenterology Department, Hospital Clínic, Institut d'Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBEREHD), University of Barcelona, 08036 Barcelona, Catalonia, Spain
- Sebastià Franch-Expósito
  - Gastroenterology Department, Hospital Clínic, Institut d'Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBEREHD), University of Barcelona, 08036 Barcelona, Catalonia, Spain
- Sergi Castellví-Bel
  - Gastroenterology Department, Hospital Clínic, Institut d'Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBEREHD), University of Barcelona, 08036 Barcelona, Catalonia, Spain
- Steven Lipkin
  - Department of Medicine, Weill Cornell College of Medicine, NY 10021, USA
- Xiaomu Wei
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853, USA
  - Department of Medicine, Weill Cornell College of Medicine, NY 10021, USA
- Haiyuan Yu
  - Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, NY 14853, USA
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853, USA
5
Cannon EKS, Birkett SM, Braun BL, Kodavali S, Jennewein DM, Yilmaz A, Antonescu V, Antonescu C, Harper LC, Gardiner JM, Schaeffer ML, Campbell DA, Andorf CM, Andorf D, Lisch D, Koch KE, McCarty DR, Quackenbush J, Grotewold E, Lushbough CM, Sen TZ, Lawrence CJ. POPcorn: An Online Resource Providing Access to Distributed and Diverse Maize Project Data. International Journal of Plant Genomics 2011; 2011:923035. [PMID: 22253616 PMCID: PMC3255282 DOI: 10.1155/2011/923035] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Received: 08/16/2011] [Accepted: 11/29/2011] [Indexed: 05/21/2023]
Abstract
The purpose of the online resource presented here, POPcorn (Project Portal for corn), is to enhance accessibility of maize genetic and genomic resources for plant biologists. Currently, many online locations are difficult to find, some are best searched independently, and individual project websites often degrade over time, sometimes disappearing entirely. The POPcorn site makes available (1) a centralized, web-accessible resource to search and browse descriptions of ongoing maize genomics projects, (2) a single, stand-alone tool that uses web services and minimal data warehousing to search for sequence matches in online resources of diverse offsite projects, and (3) a set of tools that enables researchers to migrate their data to the long-term model organism database for maize genetic and genomic information: MaizeGDB. Examples demonstrating POPcorn's utility are provided herein.
Affiliation(s)
- Ethalinda K. S. Cannon
  - Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011, USA
- Scott M. Birkett
  - Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011, USA
- Bremen L. Braun
  - USDA-ARS Corn Insects and Crop Genetics Research Unit, Iowa State University, Ames, IA 50011, USA
- Sateesh Kodavali
  - Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011, USA
- Douglas M. Jennewein
  - Department of Computer Science, University of South Dakota, Vermillion, SD 57069, USA
- Alper Yilmaz
  - Plant Biotechnology Center and Department of Molecular Genetics, The Ohio State University, Columbus, OH 43210, USA
- Valentin Antonescu
  - Department of Biostatistics and Computational Biology and Department of Cancer Biology, Dana-Farber Cancer Institute, 450 Brookline Avenue, Sm822, Boston, MA 02215, USA
- Corina Antonescu
  - Department of Biostatistics and Computational Biology and Department of Cancer Biology, Dana-Farber Cancer Institute, 450 Brookline Avenue, Sm822, Boston, MA 02215, USA
- Lisa C. Harper
  - USDA-ARS Corn Insects and Crop Genetics Research Unit, Iowa State University, Ames, IA 50011, USA
  - USDA-ARS Plant Gene Expression Center, Albany, CA 94710, USA
  - Department of Molecular and Cell Biology, University of California, Berkeley, CA 94720, USA
- Jack M. Gardiner
  - Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011, USA
  - School of Plant Sciences, University of Arizona, Tucson, AZ 85721, USA
- Mary L. Schaeffer
  - USDA-ARS Plant Genetics Research Unit, University of Missouri, Columbia, MO 65211, USA
  - Division of Plant Sciences, Department of Agronomy, University of Missouri, Columbia, MO 65211, USA
- Darwin A. Campbell
  - USDA-ARS Corn Insects and Crop Genetics Research Unit, Iowa State University, Ames, IA 50011, USA
- Carson M. Andorf
  - USDA-ARS Corn Insects and Crop Genetics Research Unit, Iowa State University, Ames, IA 50011, USA
- Destri Andorf
  - Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011, USA
- Damon Lisch
  - Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720, USA
- Karen E. Koch
  - Horticultural Sciences Department, University of Florida, Gainesville, FL 32611, USA
- Donald R. McCarty
  - Horticultural Sciences Department, University of Florida, Gainesville, FL 32611, USA
- John Quackenbush
  - Department of Biostatistics and Computational Biology and Department of Cancer Biology, Dana-Farber Cancer Institute, 450 Brookline Avenue, Sm822, Boston, MA 02215, USA
- Erich Grotewold
  - Plant Biotechnology Center and Department of Molecular Genetics, The Ohio State University, Columbus, OH 43210, USA
- Carol M. Lushbough
  - Department of Computer Science, University of South Dakota, Vermillion, SD 57069, USA
- Taner Z. Sen
  - Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011, USA
  - USDA-ARS Corn Insects and Crop Genetics Research Unit, Iowa State University, Ames, IA 50011, USA
- Carolyn J. Lawrence
  - Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011, USA
  - USDA-ARS Corn Insects and Crop Genetics Research Unit, Iowa State University, Ames, IA 50011, USA
6
Lushbough CM, Jennewein DM, Brendel VP. The BioExtract Server: a web-based bioinformatic workflow platform. Nucleic Acids Res 2011; 39:W528-32. [PMID: 21546552 PMCID: PMC3125737 DOI: 10.1093/nar/gkr286] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Indexed: 11/21/2022]
Abstract
The BioExtract Server (bioextract.org) is an open, web-based system designed to aid researchers in the analysis of genomic data by providing a platform for the creation of bioinformatic workflows. Scientific workflows are created within the system by recording tasks performed by the user. These tasks may include querying multiple, distributed data sources, saving query results as searchable data extracts, and executing local and web-accessible analytic tools. The series of recorded tasks can then be saved as a reproducible, sharable workflow available for subsequent execution with the original or modified inputs and parameter settings. Integrated data resources include interfaces to the National Center for Biotechnology Information (NCBI) nucleotide and protein databases, the European Molecular Biology Laboratory (EMBL-Bank) non-redundant nucleotide database, the Universal Protein Resource (UniProt), and the UniProt Reference Clusters (UniRef) database. The system offers access to numerous preinstalled, curated analytic tools and also provides researchers with the option of selecting computational tools from a large list of web services including the European Molecular Biology Open Software Suite (EMBOSS), BioMoby, and the Kyoto Encyclopedia of Genes and Genomes (KEGG). The system further allows users to integrate local command line tools residing on their own computers through a client-side Java applet.
Affiliation(s)
- Carol M Lushbough
  - Department of Computer Science, University of South Dakota, Vermillion, SD 57069, USA.
7
Néron B, Ménager H, Maufrais C, Joly N, Maupetit J, Letort S, Carrere S, Tuffery P, Letondal C. Mobyle: a new full web bioinformatics framework. Bioinformatics 2009; 25:3005-11. [PMID: 19689959 PMCID: PMC2773253 DOI: 10.1093/bioinformatics/btp493] [Citation(s) in RCA: 248] [Impact Index Per Article: 16.5] [Indexed: 11/13/2022]
Abstract
Motivation: For the biologist, running bioinformatics analyses involves a time-consuming management of data and tools. Users need support to organize their work, retrieve parameters and reproduce their analyses. They also need to be able to combine their analytic tools using a safe data flow software mechanism. Finally, given that scientific tools can be difficult to install, it is particularly helpful for biologists to be able to use these tools through a web user interface. However, providing a web interface for a set of tools raises the problem that a single web portal cannot offer all the existing and possible services: it is the user, again, who has to cope with data copy among a number of different services. A framework enabling portal administrators to build a network of cooperating services would therefore clearly be beneficial.
Results: We have designed a system, Mobyle, to provide a flexible and usable Web environment for defining and running bioinformatics analyses. It embeds simple yet powerful data management features that allow the user to reproduce analyses and to combine tools using a hierarchical typing system. Mobyle offers invocation of services distributed over remote Mobyle servers, thus enabling a federated network of curated bioinformatics portals without the user having to learn complex concepts or to install sophisticated software. While being focused on the end user, the Mobyle system also addresses the need, for the bioinformatician, to automate remote services execution: PlayMOBY is a companion tool that automates the publication of BioMOBY web services, using Mobyle program definitions.
Availability: The Mobyle system is distributed under the terms of the GNU GPLv2 on the project web site (http://bioweb2.pasteur.fr/projects/mobyle/). It is already deployed on three servers: http://mobyle.pasteur.fr, http://mobyle.rpbs.univ-paris-diderot.fr and http://lipm-bioinfo.toulouse.inra.fr/Mobyle. The PlayMOBY companion is distributed under the terms of the CeCILL license, and is available at http://lipm-bioinfo.toulouse.inra.fr/biomoby/PlayMOBY/.
Contact: mobyle-support@pasteur.fr; mobyle-support@rpbs.univ-paris-diderot.fr; letondal@pasteur.fr
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Bertrand Néron
  - Groupe Logiciels et Banques de Données, Institut Pasteur, 28, rue du Dr Roux, 75724 Paris Cedex, France.