1
Abstract
Background: Computational analysis of complex diseases involving multiple organs requires integrating multiple models into a unified model. Because these models are often constructed in heterogeneous formats, integration requires a standard language format that can effectively represent essential biological information. However, previously introduced formats have limitations that prevent them from adequately representing this information, particularly the specifications of biomolecules and biological contexts.
Results: We defined an XML-based markup language called context-oriented directed association markup language (CODA-ML), which better represents essential biological information. CODA-ML has two major strengths: designating molecular specifications and designating biological contexts. It can cover heterogeneous entity types involved in biological events (e.g., gene/protein, compound, cellular function, disease). Molecular entities can carry molecular specifications that include detailed information about a molecule, from isoforms to modifications, enabling high-resolution representation of molecules. In addition, CODA-ML can distinguish biological events that vary with biological context, such as cell type or disease condition. In particular, it can represent inter-cellular as well as intra-cellular events. These two strengths can resolve contradictory associations when different models are integrated into one unified model, which improves the accuracy of the model.
Conclusions: With CODA-ML, diverse models such as signaling pathways, metabolic pathways, and gene regulatory pathways can be represented in a unified language format. Because CODA-ML covers heterogeneous entity types, it enables detailed description of the mechanisms of diseases or drugs from multiple perspectives (e.g., molecule, function, or disease). CODA-ML is expected to help integrate different models into one systemic model in an efficient and effective manner. The unified model can be used to perform computational analysis not only for cancer but also for other complex diseases that involve multiple organs rather than a single cell.
Electronic supplementary material: The online version of this article (10.1186/s12859-019-2812-7) contains supplementary material, which is available to authorized users.
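The idea of context-specific directed associations can be illustrated with a minimal sketch. The abstract does not show the CODA-ML schema, so every tag and attribute name below (`model`, `association`, `source`, `target`, `context`, `modification`, `cellType`) is hypothetical, as are the entity and cell-type names. The sketch only shows how attaching a context to each association lets two seemingly contradictory associations coexist in one model.

```python
import xml.etree.ElementTree as ET


def make_association(source, target, effect, cell_type, modification=None):
    """Build one directed association element.

    All tag and attribute names are hypothetical; they are not taken
    from the actual CODA-ML schema.
    """
    assoc = ET.Element("association", effect=effect)
    src = ET.SubElement(assoc, "source", type="protein", name=source)
    if modification is not None:
        # Molecular specification: here, a single modification state.
        ET.SubElement(src, "modification", kind=modification)
    ET.SubElement(assoc, "target", type="protein", name=target)
    # Biological context in which the association holds.
    ET.SubElement(assoc, "context", cellType=cell_type)
    return assoc


def effects_by_context(model, source, target):
    """Map each cell type to the effect recorded for source -> target."""
    result = {}
    for assoc in model.findall("association"):
        same_source = assoc.find("source").get("name") == source
        same_target = assoc.find("target").get("name") == target
        if same_source and same_target:
            result[assoc.find("context").get("cellType")] = assoc.get("effect")
    return result


# Two associations that would contradict each other without a context.
model = ET.Element("model")
model.append(make_association("ProteinA", "ProteinB", "activation",
                              "cell_type_1", modification="phosphorylation"))
model.append(make_association("ProteinA", "ProteinB", "inhibition",
                              "cell_type_2"))

print(ET.tostring(model, encoding="unicode"))
print(effects_by_context(model, "ProteinA", "ProteinB"))
```

Keying the lookup on the context keeps both associations instead of forcing one to overwrite the other when models are merged.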
Affiliation(s)
- Mijin Kwon
- Department of Bio and Brain Engineering, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon, 34141, Republic of Korea
- Soorin Yim
- Department of Bio and Brain Engineering, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon, 34141, Republic of Korea
- Gwangmin Kim
- Department of Bio and Brain Engineering, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon, 34141, Republic of Korea
- Saehwan Lee
- Department of Bio and Brain Engineering, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon, 34141, Republic of Korea
- Chungsun Jeong
- Department of Bio and Brain Engineering, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon, 34141, Republic of Korea
- Doheon Lee
- Department of Bio and Brain Engineering, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon, 34141, Republic of Korea; Bio-Synergy Research Center, 291 Daehak-ro, Yuseong-gu, Daejeon, 34141, Republic of Korea
2
Urda D, Aragón F, Bautista R, Franco L, Veredas FJ, Claros MG, Jerez JM. BLASSO: integration of biological knowledge into a regularized linear model. BMC Syst Biol 2018; 12:94. [PMID: 30458775 PMCID: PMC6245593 DOI: 10.1186/s12918-018-0612-8] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/21/2022]
Abstract
Background: In RNA-Seq gene expression analysis, a genetic signature or biomarker is defined as a subset of genes that is probably involved in a given complex human trait and usually provides predictive capabilities for that trait. The discovery of new genetic signatures is challenging, as it entails the analysis of complex information encoded at the gene level. Moreover, biomarker selection becomes unstable because the thousands of genes included in each sample are usually highly correlated, which results in very low overlap between the genetic signatures proposed by different authors. To address this, this paper proposes BLASSO, a simple and highly interpretable linear model with l1-regularization that incorporates prior biological knowledge into the prediction of breast cancer outcomes. Two different approaches to integrating biological knowledge in BLASSO, Gene-specific and Gene-disease, are proposed, and their predictive performance and biomarker stability are tested on a public RNA-Seq gene expression dataset for breast cancer. The relevance of the genetic signature for the model is inspected by a functional analysis.
Results: BLASSO has been compared with a baseline LASSO model. Using 10-fold cross-validation with 100 repetitions for model assessment, average AUC values of 0.7 and 0.69 were obtained for the Gene-specific and the Gene-disease approaches, respectively. These values outperform the average AUC of 0.65 obtained with LASSO. With respect to the stability of the genetic signatures found, BLASSO outperformed the baseline model in terms of the robustness index (RI). The Gene-specific approach gave an RI of 0.15±0.03, compared to an RI of 0.09±0.03 given by LASSO, and was thus about 66% more robust. The functional analysis performed on the genetic signature obtained with the Gene-disease approach showed a significant presence of genes related to cancer, as well as one gene (IFNK) and one pseudogene (PCNAP1) that had not previously been described as related to cancer.
Conclusions: BLASSO has been shown to be a good choice in terms of both predictive efficacy and biomarker stability when compared to other similar approaches. Further functional analyses of the genetic signatures obtained with BLASSO have revealed not only genes with important roles in cancer, but also genes that may play an as-yet-unknown or collateral role in the studied disease.
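One common way to bring prior knowledge into an l1-regularized linear model is to give each gene its own penalty weight, so that genes with stronger prior support are penalized less. The abstract does not give the BLASSO objective, so the sketch below is an assumption rather than the authors' method: it uses a squared-error loss, plain coordinate descent, and made-up data and weights. The function names and the `penalty_weights` parameter are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the l1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def weighted_lasso(X, y, lam, penalty_weights, n_iter=100):
    """Coordinate descent for
        (1 / 2n) * ||y - X b||^2 + lam * sum_j w_j * |b_j|.

    penalty_weights[j] (w_j) is a hypothetical per-gene factor: a small
    value stands for strong prior biological support for gene j.
    Assumes no column of X is all zeros.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k]
                                     for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam * penalty_weights[j]) / (z / n)
    return beta


# Toy data: two orthogonal "genes" that contribute equally to the outcome.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [4.0, 0.0, 0.0, -4.0]

# Gene 0 has strong prior support (low penalty); gene 1 has none (high penalty).
beta = weighted_lasso(X, y, lam=1.0, penalty_weights=[0.1, 5.0])
print(beta)
```

With equal data support for both genes, only the gene with the lower penalty weight keeps a nonzero coefficient, which is the mechanism by which prior knowledge can steer which genes enter the signature.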
Affiliation(s)
- Daniel Urda
- Universidad de Cádiz, Departamento de Ingeniería Informática, Avda. de la Universidad de Cádiz n°10, Puerto Real, Cádiz, 11519, Spain
- Francisco Aragón
- Universidad de Málaga, Departamento de Lenguajes y Ciencias de la Computación, Bulevar Louis Pasteur, 35. Campus de Teatinos, Málaga, 29071, Spain
- Rocío Bautista
- Universidad de Málaga, Plataforma Andaluza de Bioinformática, Parque Tecnológico de Andalucía, Calle Severo Ochoa 34, Málaga, 29590, Spain
- Leonardo Franco
- Instituto de Investigación Biomédica de Málaga (IBIMA), Inteligencia Computacional en Biomedicina, Avda. Jorge Luis Borges n°15 Bl.3 Pl.3, Málaga, 29010, Spain; Universidad de Málaga, Departamento de Lenguajes y Ciencias de la Computación, Bulevar Louis Pasteur, 35. Campus de Teatinos, Málaga, 29071, Spain
- Francisco J Veredas
- Instituto de Investigación Biomédica de Málaga (IBIMA), Inteligencia Computacional en Biomedicina, Avda. Jorge Luis Borges n°15 Bl.3 Pl.3, Málaga, 29010, Spain; Universidad de Málaga, Departamento de Lenguajes y Ciencias de la Computación, Bulevar Louis Pasteur, 35. Campus de Teatinos, Málaga, 29071, Spain
- Manuel Gonzalo Claros
- Universidad de Málaga, Departamento de Biología Molecular y Bioquímica, Facultad de Ciencias, Campus Universitario de Teatinos, Málaga, 29071, Spain
- José Manuel Jerez
- Instituto de Investigación Biomédica de Málaga (IBIMA), Inteligencia Computacional en Biomedicina, Avda. Jorge Luis Borges n°15 Bl.3 Pl.3, Málaga, 29010, Spain; Universidad de Málaga, Departamento de Lenguajes y Ciencias de la Computación, Bulevar Louis Pasteur, 35. Campus de Teatinos, Málaga, 29071, Spain
3
Abstract
BACKGROUND: Molecular evolution studies involve many hard computational problems that are solved, in most cases, with heuristic algorithms that provide a nearly optimal solution. Hence, diverse software tools exist for the different stages of a molecular evolution workflow.
RESULTS: We present MEvoLib, the first molecular evolution library for Python, which provides a framework for working with the different tools and methods involved in the common tasks of molecular evolution workflows. In contrast to existing bioinformatics libraries, MEvoLib focuses on the stages involved in molecular evolution studies, wrapping the set of tools that share a common purpose in a single high-level interface with fast access to their frequent parameterizations. Gene clustering from partial or complete sequences has been improved with a new method that integrates accessible external information (e.g., GenBank's features data). Moreover, MEvoLib adjusts the fetching process from NCBI databases to optimize download bandwidth usage. In addition, it has been implemented using parallelization techniques to cope with even large-scale scenarios.
CONCLUSIONS: MEvoLib is the first library for Python designed to facilitate molecular evolution research for both expert and novice users. Its single interface for each common task comprises several tools with their most commonly used parameterizations. It also includes a method that takes advantage of biological knowledge to improve the gene partition of sequence datasets. Additionally, its implementation incorporates parallelization techniques to reduce computational costs when handling very large input datasets.
Affiliation(s)
- Jorge Álvarez-Jarreta
- Depto. de Informática e Ingeniería de Sistemas (DIIS), Universidad de Zaragoza, María de Luna 1, Zaragoza, 50018, Spain; Instituto de Investigación en Ingeniería de Aragón (I3A), Universidad de Zaragoza, Mariano Esquillor s/n, Zaragoza, 50018, Spain
- Eduardo Ruiz-Pesini
- Depto. de Bioquímica, Biología Molecular y Celular, Universidad de Zaragoza, Miguel Server 177, Zaragoza, 50013, Spain; Instituto de Investigación Sanitaria de Aragón (IIS Aragón), San Juan Bosco 13, Zaragoza, 50009, Spain; CIBER de enfermedades raras, Instituto de Salud Carlos III, Monforte de Lemos 5, Madrid, 28029, Spain; Fundación ARAID, María de Luna 11, Zaragoza, 50018, Spain
4
Meng J, Li R, Luan Y. Classification by integrating plant stress response gene expression data with biological knowledge. Math Biosci 2015; 266:65-72. [PMID: 26092610 DOI: 10.1016/j.mbs.2015.06.005] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 03/24/2015] [Revised: 05/03/2015] [Accepted: 06/05/2015] [Indexed: 12/01/2022]
Abstract
Classification of microarray data has always been a challenging task because of the enormous number of genes. In this study, a clustering method that integrates plant stress response gene expression data with biological knowledge is presented. Clustering is one of the promising tools for attribute reduction, but the resulting gene clusters are often biologically uninformative. We therefore integrated biological knowledge into the genomic analysis to improve the interpretation of the results. Biological similarity based on gene ontology (GO) semantic similarity was combined with gene expression data to find biologically meaningful clusters. The affinity propagation clustering algorithm was chosen to analyze the impact of the biological similarity on the results. Based on the clustering result, a neighborhood rough set was used to select representative genes for each cluster. The prediction accuracy of classifiers built on the reduced gene subsets indicated that our approach outperformed other classical methods. Quantitative analysis showed the information fusion to be effective, as it could select significant genes and gene subsets with high biological significance.
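The fusion step can be sketched as a weighted average of two pairwise similarities. The abstract does not say how GO semantic similarity and expression similarity were combined, so the weighted average, the `alpha` parameter, the Pearson correlation as expression similarity, and all gene names and values below are assumptions for illustration only. The clustering and rough-set selection steps are not shown.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length expression profiles."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def combined_similarity(expression, go_similarity, alpha=0.5):
    """Blend expression similarity with GO semantic similarity.

    alpha is a hypothetical mixing weight: 1.0 uses expression only,
    0.0 uses GO similarity only. Gene pairs missing from go_similarity
    are treated as having GO similarity 0.0.
    """
    genes = sorted(expression)
    combined = {}
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            expr_sim = pearson(expression[g1], expression[g2])
            go_sim = go_similarity.get((g1, g2), 0.0)
            combined[(g1, g2)] = alpha * expr_sim + (1 - alpha) * go_sim
    return combined


# Made-up expression profiles and GO semantic similarities.
expression = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [2.0, 4.0, 6.0],
    "geneC": [3.0, 2.0, 1.0],
}
go_similarity = {
    ("geneA", "geneB"): 0.2,
    ("geneA", "geneC"): 0.9,
    ("geneB", "geneC"): 0.1,
}

similarity = combined_similarity(expression, go_similarity, alpha=0.5)
for pair, value in similarity.items():
    print(pair, round(value, 3))
```

The resulting pairwise similarities could then be passed to a similarity-based clustering algorithm such as affinity propagation.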
Affiliation(s)
- Jun Meng
- School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116023, China.
- Rui Li
- School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116023, China.
- Yushi Luan
- School of Life Science and Biotechnology, Dalian University of Technology, Dalian, Liaoning 116023, China.
5
Luu B, Rosnay MD, Harris PL. Five-year-olds are willing, but 4-year-olds refuse, to trust informants who offer new and unfamiliar labels for parts of the body. J Exp Child Psychol 2013; 116:234-46. [PMID: 23872524 DOI: 10.1016/j.jecp.2013.06.003] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 05/24/2012] [Revised: 05/31/2013] [Accepted: 06/11/2013] [Indexed: 11/12/2022]
Abstract
This study employed the selective trust paradigm to examine how children interpret novel labels when compared with labels they already know to be accurate or inaccurate within the biological domain. The participants--3-, 4-, and 5-year-olds (N=144)--were allocated to one of three conditions. In the accurate versus inaccurate condition, one informant labeled body parts correctly, whereas the other labeled them incorrectly (e.g., calling an eye an "arm"). In the accurate versus novel condition, one informant labeled body parts accurately, whereas the other provided novel labels (e.g., calling an eye a "roke"). Finally, in the inaccurate versus novel condition, one informant labeled body parts incorrectly, whereas the other offered novel labels. In subsequent test trials, the two informants provided conflicting labels for unfamiliar internal organs. In the accurate versus inaccurate condition, children sought and endorsed labels from the accurate informant. In the accurate versus novel condition, only 4- and 5-year-olds preferred the accurate informant, whereas 3-year-olds did not selectively prefer either informant. In the inaccurate versus novel condition, only 5-year-olds preferred the novel informant, whereas 3- and 4-year-olds did not demonstrate a selective preference. Results are supportive of previous studies suggesting that 3-year-olds are sensitive to inaccuracy and that 4-year-olds privilege accuracy. However, 3- and 4-year-olds appear to be unsure as to how the novel informant should be construed. In contrast, 5-year-olds appreciate that speakers offering new information are more trustworthy than those offering inaccurate information, but they are cautious in judging such informants as being "better" at providing that information.
Affiliation(s)
- Betty Luu
- School of Psychology, The University of Sydney, New South Wales 2006, Australia.