1
Patrício A, Costa RS, Henriques R. Pattern-centric transformation of omics data grounded on discriminative gene associations aids predictive tasks in TCGA while ensuring interpretability. Biotechnol Bioeng 2024; 121:2881-2892. [PMID: 38859573 DOI: 10.1002/bit.28758] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/04/2023] [Revised: 02/07/2024] [Accepted: 05/18/2024] [Indexed: 06/12/2024]
Abstract
The increasing prevalence of omics data sources is pushing the study of regulatory mechanisms underlying complex diseases such as cancer. However, the vast quantities of molecular features produced and the inherent interplay between them lead to a level of complexity that hampers both descriptive and predictive tasks, requiring custom-built algorithms that can extract relevant information from these sources of data. We propose a transformation that moves data centered on molecules (e.g., transcripts and proteins) to a new data space focused on putative regulatory modules given by statistically relevant co-expression patterns. To this end, the proposed transformation extracts patterns from the data through biclustering and uses them to create new variables with guarantees of interpretability and discriminative power. The transformation is shown to achieve dimensionality reductions of up to 99% and to increase the predictive performance of various classifiers across multiple omics layers. Results suggest that omics data transformations from gene-centric to pattern-centric data support both prediction tasks and human interpretation, notably contributing to precision medicine applications.
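The gene-centric to pattern-centric mapping described in this abstract can be illustrated with a minimal sketch. It assumes that each pattern is given as a set of gene indices plus an expected expression profile, and it scores a sample by its negative mean absolute deviation from that profile. Both the pattern representation and the scoring rule are assumptions made for illustration; they are not the authors' biclustering-based procedure, and all values are toy data.

```python
def pattern_features(samples, patterns):
    """Map gene-centric samples to one feature per pattern.

    samples:  list of per-sample expression lists (one value per gene).
    patterns: list of (gene_indices, expected_profile) pairs (hypothetical format).
    The feature is the negative mean absolute deviation from the expected
    profile, so values closer to 0 indicate a better match to the pattern.
    """
    features = []
    for sample in samples:
        row = []
        for genes, profile in patterns:
            deviation = sum(abs(sample[g] - p) for g, p in zip(genes, profile))
            row.append(-deviation / len(genes))
        features.append(row)
    return features


# Toy data: 3 samples described by 5 genes (hypothetical values).
samples = [
    [1.0, 2.0, 0.0, 5.0, 3.0],
    [1.0, 2.0, 4.0, 0.0, 3.0],
    [0.0, 0.0, 4.0, 4.0, 0.0],
]
patterns = [
    ([0, 1], [1.0, 2.0]),          # hypothetical pattern over genes 0 and 1
    ([2, 3, 4], [4.0, 4.0, 0.0]),  # hypothetical pattern over genes 2, 3 and 4
]
Z = pattern_features(samples, patterns)  # 3 samples x 2 pattern features
```

In this toy case five gene-level variables become two pattern-level variables, and each new variable can be read back as "how well does this sample match that pattern", which is the interpretability argument the abstract makes.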
Affiliation(s)
- André Patrício
  - INESC-ID and Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
  - LAQV-REQUIMTE, Department of Chemistry, NOVA School of Science and Technology, NOVA University Lisbon, Caparica, Portugal
- Rafael S Costa
  - LAQV-REQUIMTE, Department of Chemistry, NOVA School of Science and Technology, NOVA University Lisbon, Caparica, Portugal
- Rui Henriques
  - INESC-ID and Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal

2
Zhao H, Sun R, Wu L, Huang P, Liu W, Ma Q, Liao Q, Du J. Bioinformatics Identification and Experimental Validation of a Prognostic Model for the Survival of Lung Squamous Cell Carcinoma Patients. Biochem Genet 2024:10.1007/s10528-024-10828-z. [PMID: 38806973 DOI: 10.1007/s10528-024-10828-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/14/2024] [Accepted: 05/08/2024] [Indexed: 05/30/2024]
Abstract
Lung squamous cell carcinoma (LUSC) kills more than four million people yearly. Creating more trustworthy tumor molecular markers for LUSC early detection, diagnosis, prognosis, and customized treatment is essential. Cuproptosis, a novel form of cell death, opened up a new field of study for searching for trustworthy tumor indicators. Our goal was to build a risk model to assess drug sensitivity, monitor immune function, and predict prognosis in LUSC patients. Nineteen cuproptosis-related genes were identified from the literature, and patient genomic and clinical information was collected from The Cancer Genome Atlas (TCGA) database. The LUSC patients were grouped using unsupervised clustering techniques, and 7626 differentially expressed genes were identified. Using univariate Cox analysis, LASSO regression analysis, and multivariate Cox analysis, a prognostic model for LUSC patients was developed. Tumor immune escape was evaluated using the Tumor Immune Dysfunction and Exclusion (TIDE) method. The R packages 'pRRophetic,' 'ggpubr,' and 'ggplot2' were utilized to examine drug sensitivity. For modeling, a six-gene cuproptosis-based signature was found. High-risk LUSC patients had significantly worse survival rates than low-risk patients. The possibility of tumor immune escape was increased in patients with higher risk scores due to more immune cell inactivation. For high-risk LUSC patients, we identified seven potentially potent drugs (AZD6482, CHIR.99021, CMK, Embelin, FTI.277, Imatinib, and Pazopanib). In conclusion, the cuproptosis-based gene risk model can be utilized to predict outcomes, track immune function, and evaluate medication sensitivity in LUSC patients.
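Prognostic signatures of this kind are commonly applied as a linear risk score: each gene's expression is weighted by its Cox regression coefficient, and patients are then split into high-risk and low-risk groups. The sketch below illustrates that general idea only. The gene names, coefficients, and the median cutoff are hypothetical placeholders, since the abstract reports neither the six genes, their weights, nor the threshold used.

```python
from statistics import median

# Hypothetical coefficients: the abstract does not list the signature genes
# or their Cox regression weights.
COEFFICIENTS = {"GENE_A": 0.5, "GENE_B": -0.25, "GENE_C": 1.0}


def risk_score(expression, coefficients=COEFFICIENTS):
    """Linear risk score: sum of coefficient * expression over signature genes."""
    return sum(coef * expression[gene] for gene, coef in coefficients.items())


def stratify(patients, coefficients=COEFFICIENTS):
    """Score every patient and split at the median score (assumed cutoff)."""
    scores = {pid: risk_score(expr, coefficients) for pid, expr in patients.items()}
    cutoff = median(scores.values())
    groups = {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}
    return scores, groups


# Toy expression values for four hypothetical patients.
patients = {
    "P1": {"GENE_A": 2.0, "GENE_B": 4.0, "GENE_C": 1.0},
    "P2": {"GENE_A": 4.0, "GENE_B": 0.0, "GENE_C": 2.0},
    "P3": {"GENE_A": 0.0, "GENE_B": 4.0, "GENE_C": 0.0},
    "P4": {"GENE_A": 2.0, "GENE_B": 0.0, "GENE_C": 3.0},
}
scores, groups = stratify(patients)
```

With these toy values, P2 and P4 fall above the median score and are labelled high-risk, while P1 and P3 are labelled low-risk.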
Affiliation(s)
- Hongtao Zhao
  - Department of Immunology, College of Basic Medicine, Guilin Medical University, Guilin, 541199, Guangxi, China
- Ruonan Sun
  - Department of Immunology, College of Basic Medicine, Guilin Medical University, Guilin, 541199, Guangxi, China
- Lei Wu
  - College of Department of Information and Library Science, Guilin Medical University, Guilin, 541004, China
- Peiluo Huang
  - Department of Immunology, College of Basic Medicine, Guilin Medical University, Guilin, 541199, Guangxi, China
- Wenjing Liu
  - Department of Immunology, College of Basic Medicine, Guilin Medical University, Guilin, 541199, Guangxi, China
- Qiuhong Ma
  - Department of Clinical Laboratory, Zibo Central Hospital, Zibo, 255036, China
- Qinyuan Liao
  - Department of Immunology, College of Basic Medicine, Guilin Medical University, Guilin, 541199, Guangxi, China
- Juan Du
  - Department of Immunology, College of Basic Medicine, Guilin Medical University, Guilin, 541199, Guangxi, China

3
Rostami M, Forouzandeh S, Berahmand K, Soltani M, Shahsavari M, Oussalah M. Gene selection for microarray data classification via multi-objective graph theoretic-based method. Artif Intell Med 2022; 123:102228. [PMID: 34998517 DOI: 10.1016/j.artmed.2021.102228] [Citation(s) in RCA: 28] [Impact Index Per Article: 14.0] [Received: 10/31/2020] [Revised: 11/23/2021] [Accepted: 11/27/2021] [Indexed: 12/20/2022]
Abstract
In recent decades, advances in computer technology have accelerated the growth of high-dimensional microarray data. As a result, data mining methods for DNA microarray data classification usually deal with samples described by thousands of genes. One efficient strategy for handling this dimensionality is gene selection, which improves the accuracy of microarray data classification and also decreases computational complexity. In this paper, a novel social network analysis-based gene selection approach is proposed. The proposed method has two main objectives: maximizing the relevance and minimizing the redundancy of the selected genes. In each iteration, a maximum community is selected. Then, among the genes in this community, the appropriate genes are selected using a node centrality-based criterion. The reported results indicate that the developed gene selection algorithm increases the classification accuracy of microarray data while also decreasing the time complexity.
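The two objectives named in this abstract, relevance maximization and redundancy minimization, can be illustrated with a much simpler greedy filter. The sketch below is a swapped-in simplification, not the community-detection and node-centrality method of the paper: it assumes relevance is the absolute Pearson correlation between a gene and the class labels, and redundancy is the mean absolute correlation with the genes already selected. Gene names and values are toy data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def select_genes(genes, labels, k):
    """Greedily pick k genes by relevance minus mean redundancy with picked genes."""
    relevance = {g: abs(pearson(values, labels)) for g, values in genes.items()}
    selected = []
    while len(selected) < min(k, len(genes)):
        def score(g):
            if not selected:
                return relevance[g]
            redundancy = sum(
                abs(pearson(genes[g], genes[s])) for s in selected
            ) / len(selected)
            return relevance[g] - redundancy

        best = max((g for g in genes if g not in selected), key=score)
        selected.append(best)
    return selected


# Toy data: four samples, two classes, three hypothetical genes.
labels = [0, 0, 1, 1]
genes = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [1.0, 2.0, 3.0, 5.0],  # nearly a copy of g1, so highly redundant
    "g3": [1.0, 0.0, 1.0, 2.0],
}
chosen = select_genes(genes, labels, 2)
```

Here `g1` is chosen first as the most relevant gene; `g2` is individually more relevant than `g3` but is almost a duplicate of `g1`, so the redundancy penalty makes `g3` the second choice.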
Affiliation(s)
- Mehrdad Rostami
  - Centre of Machine Vision and Signal Processing, Faculty of Information Technology, University of Oulu, Oulu, Finland
- Saman Forouzandeh
  - Department of Computer Engineering, University of Applied Science and Technology, Center of Tehran Municipality ICT org., Tehran, Iran
- Kamal Berahmand
  - School of Computer Sciences, Science and Engineering Faculty, Queensland University of Technology (QUT), Brisbane, Australia
- Mina Soltani
  - Department of Nutrition, Kashan University of Medical Sciences, Kashan, Iran
- Meisam Shahsavari
  - Department of engineering physics, Tsinghua University, Beijing, China
- Mourad Oussalah
  - Centre of Machine Vision and Signal Processing, Faculty of Information Technology, University of Oulu, Oulu, Finland
  - Research Unit of Medical Imaging, Physics, and Technology, Faculty of Medicine, University of Oulu, Finland

4
Wang Y, Ma Z, Wong KC, Li X. Evolving Multiobjective Cancer Subtype Diagnosis From Cancer Gene Expression Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:2431-2444. [PMID: 32086219 DOI: 10.1109/tcbb.2020.2974953] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/10/2023]
Abstract
Detection and diagnosis of cancer are essential for early prevention and effective treatment. Many methods have been proposed to tackle the subtype diagnosis problem using cancer gene expression data, but they often suffer from low diagnostic ability and poor generalization. This article studies a multiobjective PSO-based hybrid algorithm (MOPSOHA) that simultaneously optimizes four objectives: the number of features, the accuracy, and two entropy-based measures (relevance and redundancy), so as to diagnose cancer data with high classification power and robustness. First, we propose a novel binary encoding strategy to choose informative gene subsets to optimize those objective functions. Second, a mutation operator is designed to enhance the exploration capability of the swarm. Finally, a local search method based on the "best/1" mutation operator of the differential evolution algorithm (DE) is employed to exploit the neighborhood area with sparse high-quality solutions, since the base vector always approaches promising areas. To demonstrate the effectiveness of MOPSOHA, it is tested on 41 cancer datasets, including thirty-five cancer gene expression datasets and six independent disease datasets. In comparison with other state-of-the-art algorithms, MOPSOHA performs better on most of the benchmark datasets.
5
Yousef M, Ülgen E, Uğur Sezerman O. CogNet: classification of gene expression data based on ranked active-subnetwork-oriented KEGG pathway enrichment analysis. PeerJ Comput Sci 2021; 7:e336. [PMID: 33816987 PMCID: PMC7959595 DOI: 10.7717/peerj-cs.336] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 09/17/2020] [Accepted: 11/23/2020] [Indexed: 05/04/2023]
Abstract
Most traditional gene selection approaches are borrowed from other fields such as statistics and computer science. However, they do not prioritize biologically relevant genes, since their ultimate goal is to determine features that optimize model performance metrics, not to build a biologically meaningful model. Therefore, there is an imminent need for new computational tools that integrate biological knowledge about the data into the process of gene selection and machine learning. Integrative gene selection enables the incorporation of biological domain knowledge from external biological resources. In this study, we propose a new computational approach named CogNet, an integrative gene selection tool that exploits biological knowledge to group genes for the computational modeling tasks of ranking and classification. In CogNet, pathfindR serves as the biological grouping tool that allows the main algorithm to rank active-subnetwork-oriented KEGG pathway enrichment analysis results in order to build a biologically relevant model. CogNet provides a list of significant KEGG pathways that can classify the data with very high accuracy. The list also provides the differentially expressed genes belonging to these pathways, which are used as features in the classification problem. The list facilitates deep analysis and better interpretability of the role of KEGG pathways in the classification of the data, thus better establishing the biological relevance of these differentially expressed genes. Even though the main aim of our study is not to improve on the accuracy of any existing tool, CogNet outperforms a similar approach called maTE while performing comparably to other similar tools, including SVM-RCE. CogNet was tested on 13 gene expression datasets concerning a variety of diseases.
Affiliation(s)
- Malik Yousef
  - Galilee Digital Health Research Center (GDH), Zefat Academic College, Zefat, Israel
  - Department of Information Systems, Zefat Academic College, Zefat, Israel
- Ege Ülgen
  - Department of Biostatistics and Medical Informatics, School of Medicine, Acibadem Mehmet Ali Aydinlar University, Istanbul, Turkey
- Osman Uğur Sezerman
  - Department of Biostatistics and Medical Informatics, School of Medicine, Acibadem Mehmet Ali Aydinlar University, Istanbul, Turkey

6
Acharya S, Cui L, Pan Y. Multi-view feature selection for identifying gene markers: a diversified biological data driven approach. BMC Bioinformatics 2020; 21:483. [PMID: 33375940 PMCID: PMC7772934 DOI: 10.1186/s12859-020-03810-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 10/01/2020] [Accepted: 10/13/2020] [Indexed: 12/02/2022]
Abstract
BACKGROUND: In recent years, the use of multiple genomic and proteomic sources to investigate challenging bioinformatics problems has become immensely popular among researchers. One such problem is feature (gene) selection: identifying relevant and non-redundant marker genes from high-dimensional gene expression data sets. In that context, designing an efficient feature selection algorithm that exploits knowledge from multiple potential biological resources may be an effective way to understand the spectrum of cancer or other diseases, with applications in specific epidemiology for a particular population.
RESULTS: In the current article, we formulate feature selection and marker gene detection as a multi-view multi-objective clustering problem. To that end, we propose an Unsupervised Multi-View Multi-Objective clustering-based gene selection approach called UMVMO-select. Three important resources of biological data (gene ontology, protein interaction data, protein sequence), along with gene expression values, are collectively utilized to design two different views. UMVMO-select aims to reduce the gene space with no or minimal compromise to sample classification efficiency and determines relevant and non-redundant gene markers from three cancer gene expression benchmark data sets.
CONCLUSION: A thorough comparative analysis has been performed with five clustering and nine existing feature selection methods with respect to several internal and external validity metrics. The obtained results reveal the superiority of the proposed method. Reported results are also validated through a proper biological significance test and heatmap plotting.
Affiliation(s)
- Sudipta Acharya
  - College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, People’s Republic of China
- Laizhong Cui
  - College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, People’s Republic of China
- Yi Pan
  - Department of Computer Science, Georgia State University, Atlanta, USA

7
Yousef M, Kumar A, Bakir-Gungor B. Application of Biological Domain Knowledge Based Feature Selection on Gene Expression Data. Entropy (Basel, Switzerland) 2020; 23:E2. [PMID: 33374969 PMCID: PMC7821996 DOI: 10.3390/e23010002] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 11/27/2020] [Revised: 12/14/2020] [Accepted: 12/16/2020] [Indexed: 12/19/2022]
Abstract
In the last two decades, there have been massive advancements in high-throughput technologies, which have resulted in the exponential growth of public repositories of gene expression datasets for various phenotypes. It is possible to unravel biomarkers by comparing gene expression levels under different conditions, such as disease vs. control, treated vs. not treated, drug A vs. drug B, etc. This task corresponds to a well-studied problem in the machine learning domain, i.e., the feature selection problem. In biological data analysis, most computational feature selection methodologies were taken from other fields without considering the nature of the biological data. Thus, integrative approaches that utilize biological knowledge while performing feature selection are necessary for this kind of data. The main idea behind the integrative gene selection process is to generate a ranked list of genes considering both the statistical metrics that are applied to the gene expression data and the biological background information that is provided as external datasets. One of the main goals of this review is to explore the existing methods that integrate different types of information in order to improve the identification of the biomolecular signatures of diseases and the discovery of new potential targets for treatment. These integrative approaches are expected to aid the prediction, diagnosis, and treatment of diseases, as well as to shed light on disease state dynamics and the mechanisms of disease onset and progression. The integration of various types of biological information will necessitate the development of novel techniques for integration and data analysis. Another aim of this review is to encourage the bioinformatics community to develop new approaches for searching and determining significant groups/clusters of features based on one or more biological grouping functions.
Affiliation(s)
- Malik Yousef
  - Department of Information Systems, Zefat Academic College, Zefat 13206, Israel
  - Galilee Digital Health Research Center (GDH), Zefat Academic College, Zefat 13206, Israel
- Abhishek Kumar
  - Institute of Bioinformatics, International Technology Park, Bangalore 560066, India
  - Manipal Academy of Higher Education (MAHE), Manipal 576104, India
- Burcu Bakir-Gungor
  - Department of Computer Engineering, Faculty of Engineering, Abdullah Gul University, Kayseri 38080, Turkey

8
Mahendran N, Durai Raj Vincent PM, Srinivasan K, Chang CY. Machine Learning Based Computational Gene Selection Models: A Survey, Performance Evaluation, Open Issues, and Future Research Directions. Front Genet 2020; 11:603808. [PMID: 33362861 PMCID: PMC7758324 DOI: 10.3389/fgene.2020.603808] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 09/08/2020] [Accepted: 10/29/2020] [Indexed: 12/20/2022]
Abstract
Gene expression is the process by which the physical characteristics of living beings are determined through the generation of the necessary proteins. Gene expression takes place in two steps, transcription and translation. It is the flow of information from DNA to RNA with the help of enzymes, and the end products are proteins and other biochemical molecules. Many technologies can capture gene expression from DNA or RNA. One such technique is the DNA microarray. Besides being expensive, the main issue with DNA microarrays is that they generate high-dimensional data with a minimal sample size. The difficulty in handling such datasets is that the learning model tends to be over-fitted. This problem should be addressed by considerably reducing the dimension of the data source. In recent years, machine learning has gained popularity in the field of genomic studies. In the literature, many machine learning-based gene selection approaches have been discussed, which were proposed to improve dimensionality reduction precision. This paper provides an extensive review of the various works on machine learning-based gene selection in recent years, along with an analysis of their performance. The study categorizes various feature selection algorithms under supervised, unsupervised, and semi-supervised learning. Recent work on reducing the features for diagnosing tumors is discussed in detail. Furthermore, the performance of several methods discussed in the literature is analyzed. This study also lists and briefly discusses the open issues in handling high-dimensional data with small sample sizes.
Affiliation(s)
- Nivedhitha Mahendran
  - School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
- P. M. Durai Raj Vincent
  - School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
- Kathiravan Srinivasan
  - School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
- Chuan-Yu Chang
  - Department of Computer Science and Information Engineering, National Yunlin University of Science and Technology, Douliu, Taiwan

9
Acharya S, Cui L, Pan Y. A consensus multi-view multi-objective gene selection approach for improved sample classification. BMC Bioinformatics 2020; 21:386. [PMID: 32938388 PMCID: PMC7495900 DOI: 10.1186/s12859-020-03681-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022]
Abstract
BACKGROUND: In the field of computational biology, analyzing complex data helps to extract relevant biological information. Sample classification of gene expression data is one such popular bio-data analysis technique. However, the presence of a large number of irrelevant or redundant genes in expression data makes sample classification algorithms work inefficiently. Feature selection is a dimensionality reduction technique that helps to maximize the effectiveness of any sample classification algorithm. Recent advances in biotechnology have enriched biological data with multiple modalities or views. Different 'omics' resources capture various equally important biological properties of entities. However, most existing feature selection methodologies are biased towards considering only one of the multiple biological resources. Consequently, some crucial aspects of available biological knowledge that could further improve feature selection efficiency may be ignored.
RESULTS: In the present work, we propose a Consensus Multi-View Multi-objective Clustering-based feature selection algorithm called CMVMC. Three controlled genomic and proteomic resources, namely gene expression, Gene Ontology (GO), and protein-protein interaction network (PPIN), are utilized to build two independent views. The concept of multi-objective consensus clustering has been applied within our proposed gene selection method to satisfy both incorporated views. Gene expression data sets of Multiple tissues and Yeast from two different organisms (Homo sapiens and Saccharomyces cerevisiae, respectively) are chosen for experimental purposes. As the end product of CMVMC, a reduced set of relevant and non-redundant genes is found for each chosen data set. These genes finally participate in an effective sample classification.
CONCLUSIONS: The experimental study on the chosen data sets shows that our proposed feature selection method improves the sample classification accuracy and reduces the gene space to a significant degree. For the Multiple Tissues data set, CMVMC reduces the number of genes (features) from 5565 to 41, with 92.73% sample classification accuracy. For the Yeast data set, the number of genes is reduced from 2884 to 10, with 95.84% sample classification accuracy. Two internal cluster validity indices, Silhouette and Davies-Bouldin (DB), and one external validity index, Classification Accuracy (CA), are chosen for the comparative study. Reported results are further validated through a well-known biological significance test and a visualization tool.
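The gene counts reported in this abstract can be turned into relative reductions with a short calculation. The helper below is only a worked arithmetic check on the quoted figures (5565 to 41 genes and 2884 to 10 genes); the function name is chosen here for illustration.

```python
def reduction_percent(original, reduced):
    """Percentage of features removed when going from `original` to `reduced`."""
    return 100.0 * (1 - reduced / original)


multiple_tissues = reduction_percent(5565, 41)  # about 99.26% of genes removed
yeast = reduction_percent(2884, 10)             # about 99.65% of genes removed
```

Both data sets therefore retain well under 1% of their original genes.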
Affiliation(s)
- Sudipta Acharya
  - Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, PR China
- Laizhong Cui
  - Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, PR China
- Yi Pan
  - Department of Computer Science, Georgia State University, Atlanta, USA

10
Perscheid C. Integrative biomarker detection on high-dimensional gene expression data sets: a survey on prior knowledge approaches. Brief Bioinform 2020; 22:5881664. [PMID: 32761115 DOI: 10.1093/bib/bbaa151] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 03/17/2020] [Revised: 06/15/2020] [Accepted: 06/16/2020] [Indexed: 02/06/2023]
Abstract
Gene expression data provide the expression levels of tens of thousands of genes from several hundred samples. These data are analyzed to detect biomarkers that can be of prognostic or diagnostic use. Traditionally, biomarker detection for gene expression data is the task of gene selection. The vast number of genes is reduced to a few relevant ones that achieve the best performance for the respective use case. Traditional approaches select genes based on their statistical significance in the data set. This results in issues of robustness, redundancy and true biological relevance of the selected genes. Integrative analyses typically address these shortcomings by integrating multiple data artifacts from the same objects, e.g. gene expression and methylation data. When only gene expression data are available, integrative analyses instead use curated information on biological processes from public knowledge bases. With knowledge bases providing an ever-increasing amount of curated biological knowledge, such prior knowledge approaches become more powerful. This paper provides a thorough overview of the status quo of biomarker detection on gene expression data with prior biological knowledge. We discuss current shortcomings of traditional approaches, review recent external knowledge bases, provide a classification and qualitative comparison of existing prior knowledge approaches and discuss open challenges for this kind of gene selection.
Affiliation(s)
- Cindy Perscheid
  - Hasso Plattner Institute, University of Potsdam, Potsdam, 14482, Germany

11
Lippmann C, Ultsch A, Lötsch J. Computational functional genomics-based reduction of disease-related gene sets to their key components. Bioinformatics 2020; 35:2362-2370. [PMID: 30500872 DOI: 10.1093/bioinformatics/bty986] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 05/30/2018] [Revised: 09/05/2018] [Accepted: 11/29/2018] [Indexed: 01/21/2023]
Abstract
MOTIVATION: The genetic architecture of diseases is becoming increasingly well known. This raises difficulties in picking suitable targets for further research among an increasing number of candidates. Although expression-based methods of gene set reduction are applied to laboratory-derived genetic data, the analysis of topical sets of genes gathered from knowledge bases requires a modified approach, as no quantitative information about gene expression is available.
RESULTS: We propose a computational functional genomics-based approach to reducing sets of genes to the most relevant items, based on the importance of each gene within the polyhierarchy of biological processes characterizing the disease. Knowledge bases about the biological roles of genes can provide a valid description of traits or diseases represented as a directed acyclic graph (DAG) picturing the polyhierarchy of disease-relevant biological processes. The proposed method uses a gene importance score derived from the location of the gene-related biological processes in the DAG. It attempts to recreate the DAG, and thereby the roles of the original gene set, with the smallest number of genes, taken in descending order of importance. This obtained precision and recall of over 70% in recreating the components of the DAG characterizing the biological functions of n=540 genes relevant to pain with a subset of only the k=29 best-scoring genes.
CONCLUSIONS: A new method for the reduction of gene sets is shown that is able to reproduce over 70% of the biological processes in which the full gene set is involved while using only ∼5% of the original genes.
AVAILABILITY AND IMPLEMENTATION: The necessary numerical parameters for the calculation of gene importance are implemented in the R package dbtORA at https://github.com/IME-TMP-FFM/dbtORA.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
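The evaluation described in this abstract compares the biological processes recovered by the reduced gene set with those of the full gene set. A minimal sketch of how precision and recall could be computed over two sets of DAG terms is shown below. The term identifiers are hypothetical placeholders, and treating each DAG as a plain set of terms is a simplifying assumption; this does not reproduce the dbtORA implementation. The last line only checks the quoted subset size (k=29 of n=540 genes).

```python
def precision_recall(reference_terms, recovered_terms):
    """Precision and recall of recovered terms against a reference term set."""
    reference, recovered = set(reference_terms), set(recovered_terms)
    true_positives = len(reference & recovered)
    precision = true_positives / len(recovered) if recovered else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical term identifiers standing in for nodes of the two DAGs.
full_set_terms = {"T1", "T2", "T3", "T4", "T5"}
reduced_set_terms = {"T1", "T2", "T3", "T9"}
precision, recall = precision_recall(full_set_terms, reduced_set_terms)

# Share of the original genes retained, from the figures quoted in the abstract.
subset_percent = 100 * 29 / 540  # roughly 5.4%
```

In the toy example three of the four recovered terms are correct (precision 0.75) and three of the five reference terms are recovered (recall 0.6).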
Affiliation(s)
- Catharina Lippmann
  - Fraunhofer Institute of Molecular Biology and Applied Ecology - Project Group Translational Medicine and Pharmacology (IME-TMP), Frankfurt am Main, Germany
- Alfred Ultsch
  - DataBionics Research Group, University of Marburg, Marburg, Germany
- Jörn Lötsch
  - Fraunhofer Institute of Molecular Biology and Applied Ecology - Project Group Translational Medicine and Pharmacology (IME-TMP), Frankfurt am Main, Germany
  - Goethe-University, Institute of Clinical Pharmacology, Frankfurt am Main, Germany

12
Atchou K, Ongus J, Machuka E, Juma J, Tiambo C, Djikeng A, Silva JC, Pelle R. Comparative Transcriptomics of the Bovine Apicomplexan Parasite Theileria parva Developmental Stages Reveals Massive Gene Expression Variation and Potential Vaccine Antigens. Front Vet Sci 2020; 7:287. [PMID: 32582776 PMCID: PMC7296165 DOI: 10.3389/fvets.2020.00287] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 02/29/2020] [Accepted: 04/28/2020] [Indexed: 01/10/2023]
Abstract
Theileria parva is a protozoan parasite that causes East Coast fever (ECF), an economically important disease of cattle in Africa. It is transmitted mainly by the tick Rhipicephalus appendiculatus. Research efforts to develop a subunit vaccine based on parasite neutralizing antibodies and cytotoxic T-lymphocytes have met with limited success. The molecular mechanisms underlying T. parva life cycle stages in the tick vector and bovine host are poorly understood, thus limiting progress toward an effective and efficient control of ECF. Transcriptomics has been used to identify candidate vaccine antigens or markers associated with virulence and disease pathology. Therefore, characterization of gene expression throughout the parasite's life cycle should shed light on host-pathogen interactions in ECF and identify genes underlying differences in parasite stages as well as potential, novel therapeutic targets. Recently, the first gene expression profiling of T. parva was conducted for the sporoblast, sporozoite, and schizont stages. The sporozoite is infective to cattle, whereas the schizont is the major pathogenic form of the parasite. The schizont can differentiate into piroplasm, which is infective to the tick vector. The present study was designed to extend the T. parva gene expression profiling to the piroplasm stage with reference to the schizont. Pairwise comparison revealed that 3,279 of a possible 4,084 protein coding genes were differentially expressed, with 1,623 (49%) genes upregulated and 1,656 (51%) downregulated in the piroplasm relative to the schizont. In addition, over 200 genes were stage-specific. In general, there were more molecular functions, biological processes, subcellular localizations, and pathways significantly enriched in the piroplasm than in the schizont. Using known antigens as benchmarks, we identified several new potential vaccine antigens, including TP04_0076 and TP04_0640, which were highly immunogenic in naturally T. parva-infected cattle. All the candidate vaccine antigens identified have yet to be investigated for their capacity to induce protective immune response against ECF.
Affiliation(s)
- Kodzo Atchou
- Institute for Basic Sciences, Technology and Innovation, Pan African University, Nairobi, Kenya; Biosciences eastern and central Africa-International Livestock Research Institute (BecA-ILRI), Nairobi, Kenya
- Juliette Ongus
- Institute for Basic Sciences, Technology and Innovation, Pan African University, Nairobi, Kenya
- Eunice Machuka
- Institute for Basic Sciences, Technology and Innovation, Pan African University, Nairobi, Kenya; Biosciences eastern and central Africa-International Livestock Research Institute (BecA-ILRI), Nairobi, Kenya
- John Juma
- Biosciences eastern and central Africa-International Livestock Research Institute (BecA-ILRI), Nairobi, Kenya
- Christian Tiambo
- Biosciences eastern and central Africa-International Livestock Research Institute (BecA-ILRI), Nairobi, Kenya
- Appolinaire Djikeng
- Centre for Tropical Livestock Genetics and Health, The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Scotland, United Kingdom
- Joana C Silva
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, United States; Department of Microbiology and Immunology, University of Maryland School of Medicine, Baltimore, MD, United States
- Roger Pelle
- Biosciences eastern and central Africa-International Livestock Research Institute (BecA-ILRI), Nairobi, Kenya
13
Dutta P, Saha S, Pai S, Kumar A. A Protein Interaction Information-based Generative Model for Enhancing Gene Clustering. Sci Rep 2020; 10:665. [PMID: 31959782 PMCID: PMC6971242 DOI: 10.1038/s41598-020-57437-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 11/02/2019] [Accepted: 12/20/2019] [Indexed: 11/18/2022] Open
Abstract
In computational bioinformatics, identifying the set of genes responsible for a particular cellular mechanism is essential for tasks such as medical diagnosis or disease gene identification. Accurately grouping (clustering) genes is an important step in understanding the functions of disease genes. In this regard, ensemble clustering is a promising approach for combining different clustering solutions into a nearly accurate gene partitioning. Recently, researchers have used generative models as an ensemble method to produce a consensus solution. In the current paper, we develop a protein-protein interaction-based generative model that can efficiently perform gene clustering. Utilizing protein interaction information as the generative model's latent variable enhances the model's efficiency in inferring final probabilistic labels. The proposed generative model utilizes different weak supervision sources rather than any ground truth information. For weak supervision sources, we use a multi-objective optimization based clustering technique together with the world's largest gene ontology based knowledge-base, the Gene Ontology Consortium (GOC). These weakly supervised labels are supplied to a generative model that eventually assigns all genes to probabilistic labels. The comparative study with respect to silhouette score, Biological Homogeneity Index (BHI) and Biological Stability Index (BSI) shows that the proposed generative model outperforms other state-of-the-art techniques.
Affiliation(s)
- Pratik Dutta
- Department of Computer Science and Engineering, Indian Institute of Technology Patna, Bihta, 801103, India
- Sriparna Saha
- Department of Computer Science and Engineering, Indian Institute of Technology Patna, Bihta, 801103, India
- Sanket Pai
- Department of Chemical Science and Technology, Indian Institute of Technology Patna, Bihta, 801103, India
- Aviral Kumar
- Department of Chemical Science and Technology, Indian Institute of Technology Patna, Bihta, 801103, India
14
Group Based Unsupervised Feature Selection. ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING 2020. [PMCID: PMC7206179 DOI: 10.1007/978-3-030-47426-3_62] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
Abstract
Unsupervised feature selection is an important task in machine learning applications, yet challenging due to the unavailability of class labels. Although a few unsupervised methods take advantage of external sources of correlations within feature groups in feature selection, they are limited to genomic data, and suffer poor accuracy because they ignore input data or encourage features from the same group. We propose a framework which enables unsupervised filter feature selection methods to exploit input data and feature group information simultaneously, encouraging features from different groups. We use this framework to incorporate feature group information into the Laplace Score algorithm. Our method achieves high accuracy compared to other popular unsupervised feature selection methods (~30% maximum improvement of Normalized Mutual Information (NMI)) with low computational costs (~50 times lower than embedded methods on average). It has many real-world applications, particularly those that use image, text and genomic data, whose features demonstrate strong group structures.
15
A Framework for Feature Selection to Exploit Feature Group Structures. ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING 2020. [PMCID: PMC7206161 DOI: 10.1007/978-3-030-47426-3_61] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
Abstract
Filter feature selection methods play an important role in machine learning tasks when low computational costs, classifier independence or simplicity is important. Existing filter methods predominantly focus only on the input data and do not take advantage of the external sources of correlations within feature groups to improve the classification accuracy. We propose a framework which enables supervised filter feature selection methods to exploit feature group information from external sources of knowledge, and use this framework to incorporate feature group information into the minimum Redundancy Maximum Relevance (mRMR) algorithm, resulting in the GroupMRMR algorithm. We show that GroupMRMR achieves high accuracy gains over mRMR (up to ~35%) and other popular filter methods (up to ~50%). GroupMRMR has the same computational complexity as mRMR and therefore does not incur additional computational costs. The proposed method has many real-world applications, particularly those that use genomic, text and image data whose features demonstrate strong group structures.
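For readers unfamiliar with the baseline named in this abstract, the following is a minimal sketch of plain greedy mRMR in its "relevance minus mean redundancy" form on discrete features. It is not the GroupMRMR method of the cited work and uses no feature group information; the function names (`mutual_information`, `mrmr`), the toy data, and the tie-breaking by feature name are all assumptions made for illustration.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum over observed (x, y) pairs of p(x, y) * log(p(x, y) / (p(x) * p(y))).
    return sum((c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def mrmr(features, labels, k):
    """Greedily select up to k feature names.

    features: dict mapping a feature name to a list of discrete values,
              one value per sample (hypothetical input layout).
    Each step picks the candidate with the highest relevance to the labels
    minus its mean mutual information with the already selected features.
    """
    relevance = {name: mutual_information(vals, labels) for name, vals in features.items()}
    selected, candidates = [], set(features)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while candidates and len(selected) < k:
        # Sorting makes ties resolve deterministically by feature name.
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data (illustrative only): g1_dup duplicates g1, g2 is weaker but less redundant.
labels = [0] * 6 + [1] * 6
features = {
    "g1": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
    "g1_dup": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
    "g2": [0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1],
}
top_two = mrmr(features, labels, 2)
```

On this toy input the duplicate feature is penalized by the redundancy term, so the second pick goes to the less relevant but more complementary feature, which is the behavior that group-aware variants build on.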
16
Dutta P, Saha S, Gulati S. Graph-Based Hub Gene Selection Technique Using Protein Interaction Information: Application to Sample Classification. IEEE J Biomed Health Inform 2019; 23:2670-2676. [PMID: 30676987 DOI: 10.1109/jbhi.2019.2894374] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 11/10/2022]
Abstract
Classification of samples from gene expression profiles plays a significant role in the prediction and diagnosis of diseases. In sample classification, a robust feature selection algorithm is essential for identifying the important genes in high-dimensional gene expression data. This paper combines protein-protein interaction information with a graph mining technique to find a proper subset of features (genes), which is then used in sample classification. Our contribution to feature selection is three-fold. First, all genes are grouped into clusters based on the integrated information of gene expression values and protein interactions, using a multi-objective optimization based clustering approach. Second, the confidence scores of the protein interactions are incorporated into a popular graph mining algorithm, the Goldberg algorithm, to find the relevant features. These features are the topologically and functionally significant genes, named hub genes. Finally, these hub genes are identified by varying the degrees of the nodes and are used for the sample classification task. Different machine learning classifiers are employed for this purpose, and classification performance is measured with respect to accuracy, sensitivity, specificity, precision, F-measure, and the Matthews correlation coefficient. Comparative analysis with respect to two baselines and several existing approaches demonstrates the efficiency of the proposed approach. Furthermore, the robustness of the identified hub-gene modules is supported by biological significance analysis.
17
Perscheid C, Grasnick B, Uflacker M. Integrative Gene Selection on Gene Expression Data: Providing Biological Context to Traditional Approaches. J Integr Bioinform 2018; 16:/j/jib.ahead-of-print/jib-2018-0064/jib-2018-0064.xml. [PMID: 30785707 PMCID: PMC6798862 DOI: 10.1515/jib-2018-0064] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 08/21/2018] [Accepted: 11/12/2018] [Indexed: 12/30/2022] Open
Abstract
The advance of high-throughput RNA-Sequencing techniques enables researchers to analyze the complete gene activity in particular cells. From the insights of such analyses, researchers can identify disease-specific expression profiles, thus understand complex diseases like cancer, and eventually develop effective measures for diagnosis and treatment. The high dimensionality of gene expression data poses challenges to its computational analysis, which is addressed with measures of gene selection. Traditional gene selection approaches base their findings on statistical analyses of the actual expression levels, which implies several drawbacks when it comes to accurately identifying the underlying biological processes. In turn, integrative approaches include curated information on biological processes from external knowledge bases during gene selection, which promises to lead to better interpretability and improved predictive performance. Our work compares the performance of traditional and integrative gene selection approaches. Moreover, we propose a straightforward approach to integrate external knowledge with traditional gene selection approaches. We introduce a framework enabling the automatic external knowledge integration, gene selection, and evaluation. Evaluation results prove our framework to be a useful tool for evaluation and show that integration of external knowledge improves overall analysis results.
Affiliation(s)
- Cindy Perscheid
- Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Potsdam, Germany
- Bastien Grasnick
- Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Potsdam, Germany
- Matthias Uflacker
- Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Potsdam, Germany
18