1. Chen SL, Chin SC, Chan KC, Ho CY. A Machine Learning Approach to Assess Patients with Deep Neck Infection Progression to Descending Mediastinitis: Preliminary Results. Diagnostics (Basel) 2023; 13:2736. [PMID: 37685275 PMCID: PMC10486957 DOI: 10.3390/diagnostics13172736] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2023] [Revised: 07/25/2023] [Accepted: 08/22/2023] [Indexed: 09/10/2023]
Abstract
BACKGROUND Deep neck infection (DNI) is a serious infectious disease, and descending mediastinitis is a fatal infection of the mediastinum. However, no study has applied artificial intelligence to assess progression to descending mediastinitis in DNI patients. Thus, we developed a model to assess the possible progression of DNI to descending mediastinitis. METHODS Between August 2017 and December 2022, 380 patients with DNI were enrolled; 75% of patients (n = 285) were assigned to the training group for validation, whereas the remaining 25% (n = 95) were assigned to the test group to determine the accuracy. The patients' clinical and computed tomography (CT) parameters were analyzed via the k-nearest neighbor method. The predicted and actual progression of DNI patients to descending mediastinitis were compared. RESULTS There were no statistically significant differences between the training and test groups (all p > 0.05) in clinical variables (age, gender, chief complaint period, white blood cells, C-reactive protein, diabetes mellitus, and blood sugar), deep neck space involvement (parapharyngeal, submandibular, retropharyngeal, and multiple spaces involved, ≥3), tracheostomy performance, imaging parameters (maximum diameter of abscess and nearest distance from abscess to the level of the sternal notch), or progression to mediastinitis. The model had a predictive accuracy of 82.11% (78/95 patients), with sensitivity and specificity of 41.67% and 87.95%, respectively. CONCLUSIONS Our model can assess the progression of DNI to descending mediastinitis based on clinical and imaging parameters. It can be used to identify DNI patients who will benefit from prompt treatment.
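The three reported test-set figures can be cross-checked with a little arithmetic. The abstract gives only the percentages and the 78/95 accuracy count; the four confusion-matrix counts in the sketch below are an assumption, chosen as a whole-number split of the 95 test patients that is consistent with all three published values.

```python
# ASSUMPTION: these four counts are not reported in the abstract. They are a
# whole-number split of the 95 test patients consistent with the published
# accuracy (78/95), sensitivity (41.67%), and specificity (87.95%).
tp, fn = 5, 7     # progressed to mediastinitis: predicted correctly / missed
tn, fp = 73, 10   # did not progress: predicted correctly / false alarm


def rates(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) as percentages."""
    total = tp + fn + tn + fp
    accuracy = 100 * (tp + tn) / total
    sensitivity = 100 * tp / (tp + fn)   # true-positive rate
    specificity = 100 * tn / (tn + fp)   # true-negative rate
    return accuracy, sensitivity, specificity


accuracy, sensitivity, specificity = rates(tp, fn, tn, fp)
print(f"accuracy={accuracy:.2f}% sensitivity={sensitivity:.2f}% "
      f"specificity={specificity:.2f}%")
```

Under this assumed split, only 12 of the 95 test patients progressed to mediastinitis, which would explain how a high overall accuracy can coexist with a low sensitivity.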
Affiliation(s)
- Shih-Lung Chen
  - Department of Otorhinolaryngology & Head and Neck Surgery, Chang Gung Memorial Hospital, New Taipei City 333, Taiwan
  - School of Medicine, Chang Gung University, Taoyuan 333, Taiwan
- Shy-Chyi Chin
  - School of Medicine, Chang Gung University, Taoyuan 333, Taiwan
  - Department of Medical Imaging and Intervention, Chang Gung Memorial Hospital, New Taipei City 333, Taiwan
- Kai-Chieh Chan
  - Department of Otorhinolaryngology & Head and Neck Surgery, Chang Gung Memorial Hospital, New Taipei City 333, Taiwan
  - School of Medicine, Chang Gung University, Taoyuan 333, Taiwan
- Chia-Ying Ho
  - School of Medicine, Chang Gung University, Taoyuan 333, Taiwan
  - Division of Chinese Internal Medicine, Center for Traditional Chinese Medicine, Chang Gung Memorial Hospital, Taoyuan 333, Taiwan
2. Szabo PM, Vajdi A, Kumar N, Tolstorukov MY, Chen BJ, Edwards R, Ligon KL, Chasalow SD, Chow KH, Shetty A, Bolisetty M, Holloway JL, Golhar R, Kidd BA, Hull PA, Houser J, Vlach L, Siemers NO, Saha S. Cancer-associated fibroblasts are the main contributors to epithelial-to-mesenchymal signatures in the tumor microenvironment. Sci Rep 2023; 13:3051. [PMID: 36810872 PMCID: PMC9944255 DOI: 10.1038/s41598-023-28480-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 04/05/2022] [Accepted: 01/19/2023] [Indexed: 02/24/2023]
Abstract
Epithelial-to-mesenchymal transition (EMT) is associated with tumor initiation, metastasis, and drug resistance. However, the mechanisms underlying these associations are largely unknown. We studied several tumor types to identify the source of EMT gene expression signals and a potential mechanism of resistance to immuno-oncology treatment. Across tumor types, EMT-related gene expression was strongly associated with expression of stroma-related genes. Based on RNA sequencing of multiple patient-derived xenograft models, EMT-related gene expression was enriched in the stroma versus parenchyma. EMT-related markers were predominantly expressed by cancer-associated fibroblasts (CAFs), cells of mesenchymal origin which produce a variety of matrix proteins and growth factors. Scores derived from a 3-gene CAF transcriptional signature (COL1A1, COL1A2, COL3A1) were sufficient to reproduce association between EMT-related markers and disease prognosis. Our results suggest that CAFs are the primary source of EMT signaling and have potential roles as biomarkers and targets for immuno-oncology therapies.
Affiliation(s)
- Peter M. Szabo
  - Bristol Myers Squibb, Princeton, NJ, USA; Present Address: Fate Therapeutics, San Diego, CA, USA
- Amir Vajdi
  - Dana-Farber Cancer Institute, Boston, MA, USA; Present Address: Merck & Co., Inc., Kenilworth, NJ, USA
- Benjamin J. Chen
  - Bristol Myers Squibb, Cambridge, MA, USA
- Robin Edwards
  - Bristol Myers Squibb, Princeton, NJ, USA; Present Address: Daiichi Sankyo, Inc., Princeton, NJ, USA
- Keith L. Ligon
  - Dana-Farber Cancer Institute, Boston, MA, USA
- Scott D. Chasalow
  - Bristol Myers Squibb, Princeton, NJ, USA
- Kin-Hoe Chow
  - Dana-Farber Cancer Institute, Boston, MA, USA
- Aniket Shetty
  - Dana-Farber Cancer Institute, Boston, MA, USA
- Mohan Bolisetty
  - Bristol Myers Squibb, Princeton, NJ, USA
- James L. Holloway
  - Bristol Myers Squibb, Seattle, WA, USA
- Ryan Golhar
  - Bristol Myers Squibb, Princeton, NJ, USA
- Brian A. Kidd
  - Bristol Myers Squibb, Redwood City, CA, USA
- Jeff Houser
  - Bristol Myers Squibb, Redwood City, CA, USA
- Logan Vlach
  - Bristol Myers Squibb, Redwood City, CA, USA; Present Address: Vanderbilt University, Nashville, TN, USA
- Nathan O. Siemers
  - Bristol Myers Squibb, Princeton, NJ, USA; Present Address: Fiveprime Group, Monterey, CA, USA
- Saurabh Saha
  - Bristol Myers Squibb, Princeton, NJ, USA; Present Address: Centessa Pharmaceuticals, Cambridge, MA, USA
3. Lee AJ, Reiter T, Doing G, Oh J, Hogan DA, Greene CS. Using genome-wide expression compendia to study microorganisms. Comput Struct Biotechnol J 2022; 20:4315-4324. [PMID: 36016717 PMCID: PMC9396250 DOI: 10.1016/j.csbj.2022.08.012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2022] [Revised: 08/07/2022] [Accepted: 08/07/2022] [Indexed: 11/30/2022]
Abstract
A gene expression compendium is a heterogeneous collection of gene expression experiments assembled from data collected for diverse purposes. The widely varied experimental conditions and genetic backgrounds across samples create a tremendous opportunity for gaining a systems-level understanding of the transcriptional responses that influence phenotypes. Variety in experimental design is particularly important for studying microbes, where the transcriptional responses integrate many signals and demonstrate plasticity across strains, including responses to which nutrients are available and which microbes are present. Advances in high-throughput measurement technology have made it feasible to construct compendia for many microbes. In this review we discuss how these compendia are constructed and analyzed to reveal transcriptional patterns.
Affiliation(s)
- Alexandra J. Lee
  - Genomics and Computational Biology Graduate Program, University of Pennsylvania, Philadelphia, PA, USA
- Taylor Reiter
  - Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Denver, CO, USA
- Georgia Doing
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- Julia Oh
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- Deborah A. Hogan
  - Department of Microbiology and Immunology, Geisel School of Medicine, Dartmouth, Hanover, NH, USA
- Casey S. Greene
  - Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Denver, CO, USA
4. Park Y, Heider D, Hauschild AC. Integrative Analysis of Next-Generation Sequencing for Next-Generation Cancer Research toward Artificial Intelligence. Cancers (Basel) 2021; 13:3148. [PMID: 34202427 PMCID: PMC8269018 DOI: 10.3390/cancers13133148] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 05/20/2021] [Revised: 06/16/2021] [Accepted: 06/21/2021] [Indexed: 12/18/2022]
Abstract
The rapid improvement of next-generation sequencing (NGS) technologies and their application in large-scale cohorts in cancer research led to common challenges of big data. It opened a new research area incorporating systems biology and machine learning. As large-scale NGS data accumulated, sophisticated data analysis methods became indispensable. In addition, NGS data have been integrated with systems biology to build better predictive models to determine the characteristics of tumors and tumor subtypes. Therefore, various machine learning algorithms were introduced to identify underlying biological mechanisms. In this work, we review novel technologies developed for NGS data analysis, and we describe how these computational methodologies integrate systems biology and omics data. Subsequently, we discuss how deep neural networks outperform other approaches, the potential of graph neural networks (GNN) in systems biology, and the limitations in NGS biomedical research. To reflect on the various challenges and corresponding computational solutions, we will discuss the following three topics: (i) molecular characteristics, (ii) tumor heterogeneity, and (iii) drug discovery. We conclude that machine learning and network-based approaches can add valuable insights and build highly accurate models. However, a well-informed choice of learning algorithm and biological network information is crucial for the success of each specific research question.
Affiliation(s)
- Youngjun Park
  - Department of Mathematics and Computer Science, Philipps-University of Marburg, 35032 Marburg, Germany
- Dominik Heider
  - Department of Mathematics and Computer Science, Philipps-University of Marburg, 35032 Marburg, Germany
- Anne-Christin Hauschild
  - Department of Mathematics and Computer Science, Philipps-University of Marburg, 35032 Marburg, Germany
  - Department of Medical Informatics, University Medical Center Göttingen, 37075 Göttingen, Germany
5. Liu X, Shang H, Li B, Zhao L, Hua Y, Wu K, Hu M, Fan T. Exploration and validation of hub genes and pathways in the progression of hypoplastic left heart syndrome via weighted gene co-expression network analysis. BMC Cardiovasc Disord 2021; 21:300. [PMID: 34130651 PMCID: PMC8204459 DOI: 10.1186/s12872-021-02108-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/12/2020] [Accepted: 06/08/2021] [Indexed: 12/18/2022]
Abstract
Background Despite significant progress in surgical treatment of hypoplastic left heart syndrome (HLHS), its mortality and morbidity are still high. Little is known about the molecular abnormalities of the syndrome. In this study, we aimed to identify hub genes and key pathways in the progression of the syndrome. Methods Differentially expressed genes (DEGs) were identified in left ventricle (LV) or right ventricle (RV) tissues between HLHS and controls using the GSE77798 dataset. Then, weighted gene co-expression network analysis (WGCNA) was performed and key modules were constructed for HLHS. Based on the genes in the key modules, protein–protein interaction networks were constructed, and hub genes and key pathways were screened. Finally, the GSE23959 dataset was used to validate hub genes between HLHS and controls. Results We identified 88 and 41 DEGs in LV and RV tissues between HLHS and controls, respectively. DEGs in LV tissues of HLHS were distinctly involved in heart development, apoptotic signaling pathway and ECM receptor interaction. DEGs in RV tissues of HLHS were mainly enriched in BMP signaling pathway, regulation of cell development and regulation of blood pressure. A total of 16 co-expression networks were constructed. Among them, the black module (r = 0.79 and p value = 2e−04) and the pink module (r = 0.84 and p value = 4e−05) had the most significant correlation with HLHS, indicating that the two modules could be the most relevant for HLHS progression. We identified five hub genes in the black module (including Fbn1, Itga8, Itga11, Itgb5 and Thbs2), and five hub genes (including Cblb, Ccl2, Edn1, Itgb3 and Map2k1) in the pink module for HLHS. Their abnormal expression was verified in the GSE23959 dataset. Conclusions Our findings revealed hub genes and key pathways for HLHS through WGCNA, which could play key roles in the molecular mechanism of HLHS.
Affiliation(s)
- Xuelan Liu
  - Department of Children's Heart Center, Henan Provincial People's Hospital, Department of Children's Heart Center of Fuwai Central China Cardiovascular Hospital, Central China Fuwai Hospital of Zhengzhou University, Zhengzhou, 450003, Henan, China
- Honglei Shang
  - Department of Radiology, The Third Affiliated Hospital of Zhengzhou University, Zhengzhou, 450052, China
- Bin Li
  - Department of Children's Heart Center, Henan Provincial People's Hospital, Department of Children's Heart Center of Fuwai Central China Cardiovascular Hospital, Central China Fuwai Hospital of Zhengzhou University, Zhengzhou, 450003, Henan, China
- Liyun Zhao
  - Department of Children's Heart Center, Henan Provincial People's Hospital, Department of Children's Heart Center of Fuwai Central China Cardiovascular Hospital, Central China Fuwai Hospital of Zhengzhou University, Zhengzhou, 450003, Henan, China
- Ying Hua
  - Department of Children's Heart Center, Henan Provincial People's Hospital, Department of Children's Heart Center of Fuwai Central China Cardiovascular Hospital, Central China Fuwai Hospital of Zhengzhou University, Zhengzhou, 450003, Henan, China
- Kaiyuan Wu
  - Department of Children's Heart Center, Henan Provincial People's Hospital, Department of Children's Heart Center of Fuwai Central China Cardiovascular Hospital, Central China Fuwai Hospital of Zhengzhou University, Zhengzhou, 450003, Henan, China
- Manman Hu
  - Department of Children's Heart Center, Henan Provincial People's Hospital, Department of Children's Heart Center of Fuwai Central China Cardiovascular Hospital, Central China Fuwai Hospital of Zhengzhou University, Zhengzhou, 450003, Henan, China
- Taibing Fan
  - Department of Children's Heart Center, Henan Provincial People's Hospital, Department of Children's Heart Center of Fuwai Central China Cardiovascular Hospital, Central China Fuwai Hospital of Zhengzhou University, Zhengzhou, 450003, Henan, China
6. Lu Y, Phillips CA, Langston MA. A robustness metric for biological data clustering algorithms. BMC Bioinformatics 2019; 20:503. [PMID: 31874625 PMCID: PMC6929270 DOI: 10.1186/s12859-019-3089-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 08/23/2019] [Accepted: 09/10/2019] [Indexed: 02/05/2023]
Abstract
BACKGROUND Cluster analysis is a core task in modern data-centric computation. Algorithmic choice is driven by factors such as data size and heterogeneity, the similarity measures employed, and the type of clusters sought. Familiarity and mere preference often play a significant role as well. Comparisons between clustering algorithms tend to focus on cluster quality. Such comparisons are complicated by the fact that algorithms often have multiple settings that can affect the clusters produced. Such a setting may represent, for example, a preset variable, a parameter of interest, or various sorts of initial assignments. A question of interest then is this: to what degree do the clusters produced vary as setting values change? RESULTS This work introduces a new metric, termed simply "robustness", designed to answer that question. Robustness is an easily-interpretable measure of the propensity of a clustering algorithm to maintain output coherence over a range of settings. The robustness of eleven popular clustering algorithms is evaluated over some two dozen publicly available mRNA expression microarray datasets. Given their straightforwardness and predictability, hierarchical methods generally exhibited the highest robustness on most datasets. Of the more complex strategies, the paraclique algorithm yielded consistently higher robustness than other algorithms tested, approaching and even surpassing hierarchical methods on several datasets. Other techniques exhibited mixed robustness, with no clear distinction between them. CONCLUSIONS Robustness provides a simple and intuitive measure of the stability and predictability of a clustering algorithm. It can be a useful tool to aid both in algorithm selection and in deciding how much effort to devote to parameter tuning.
Affiliation(s)
- Yuping Lu
  - Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN 37996, USA
- Charles A. Phillips
  - Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN 37996, USA
- Michael A. Langston
  - Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN 37996, USA
7. Chen LP, Yi GY, Zhang Q, He W. Multiclass analysis and prediction with network structured covariates. Journal of Statistical Distributions and Applications 2019. [DOI: 10.1186/s40488-019-0094-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/10/2022]
8.
Abstract
Information systems support and ensure the practical running of the most critical business processes. There exists (or can be reconstructed) a record (log) of the process running in the information system. Data mining methods can be used to analyze process data, supported by techniques from machine learning and complex network analysis. The analysis is usually based on quantitative parameters of the process running in the information system. It is less common to analyze the behavior of the participants in the running process from the process log. Here, we show how data and process mining methods can be used for analyzing the running process and how participants' behavior can be analyzed from the process log using network (community or cluster) analyses of a complex network constructed from the SAP business process log. This approach constructs a complex network from the process log in a given context and then finds communities or patterns in this network. The communities or patterns found are analyzed using knowledge of the business process and the environment in which the process operates. The results demonstrate that it is possible to uncover not only the quantitative but also the qualitative relations (e.g., hidden behavior of participants) using the process log and specific knowledge of the business case.
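The pipeline described in this abstract (process log, then participant network, then groups) can be illustrated with a small sketch. Everything in it is hypothetical: the log layout `(case_id, participant, activity)`, the sample rows, and the function names are not from the paper, and connected components are used only as a crude stand-in for the community detection the authors apply.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical event-log rows: (case_id, participant, activity).
log = [
    ("c1", "anna", "create order"),
    ("c1", "ben", "approve order"),
    ("c2", "ben", "create order"),
    ("c2", "anna", "approve order"),
    ("c3", "carl", "create invoice"),
    ("c3", "dana", "post invoice"),
]


def participant_network(rows):
    """Undirected weighted graph: weight = number of cases two participants share."""
    by_case = defaultdict(set)
    for case_id, participant, _activity in rows:
        by_case[case_id].add(participant)
    weights = defaultdict(int)
    for members in by_case.values():
        for u, v in combinations(sorted(members), 2):
            weights[(u, v)] += 1
    return dict(weights)


def components(weights):
    """Connected components (a simple substitute for community detection)."""
    adj = defaultdict(set)
    for u, v in weights:
        adj[u].add(v)
        adj[v].add(u)
    seen, groups = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adj[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


network = participant_network(log)
groups = components(network)
print(network)
print(groups)
```

In a real setting the edge definition (shared case, hand-over of work, shared document) is a modelling choice that depends on the business context, which is the point the abstract makes about interpreting communities with domain knowledge.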
9. Catanese HN, Brayton KA, Gebremedhin AH. A nearest-neighbors network model for sequence data reveals new insight into genotype distribution of a pathogen. BMC Bioinformatics 2018; 19:475. [PMID: 30541438 PMCID: PMC6291930 DOI: 10.1186/s12859-018-2453-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 07/06/2018] [Accepted: 10/31/2018] [Indexed: 11/10/2022]
Abstract
BACKGROUND Sequence similarity networks are useful for classifying and characterizing biologically important proteins. Threshold-based approaches to similarity network construction using exact distance measures are prohibitively slow to compute and rely on the difficult task of selecting an appropriate threshold, while similarity networks based on approximate distance calculations compromise useful structural information. RESULTS We present an alternative network representation for a set of sequence data that overcomes these drawbacks. In our model, called the Directed Weighted All Nearest Neighbors (DiWANN) network, each sequence is represented by a node and is connected via a directed edge to only the closest sequence, or sequences in the case of ties, in the dataset. Our contributions span several aspects. Specifically, we: (i) Apply an all nearest neighbors network model to protein sequence data from three different applications and examine the structural properties of the networks; (ii) Compare the model against threshold-based networks to validate their semantic equivalence, and demonstrate the relative advantages the model offers; (iii) Demonstrate the model's resilience to missing sequences; and (iv) Develop an efficient algorithm for constructing a DiWANN network from a set of sequences. We find that the DiWANN network representation attains similar semantic properties to threshold-based graphs, while avoiding weaknesses of both high and low threshold graphs. Additionally, we find that approximate distance networks, using BLAST bitscores in place of exact edit distances, can cause significant loss of structural information. We show that the proposed DiWANN network construction algorithm provides a fourfold speedup over a standard threshold based approach to network construction. 
We also identify a relationship between the centrality of a sequence in a similarity network of an Anaplasma marginale short sequence repeat dataset and how broadly that sequence is dispersed geographically. CONCLUSION We demonstrate that using approximate distance measures to rapidly construct similarity networks may lead to significant deficiencies in the structure of that network in terms of centrality and clustering analyses. We present a new network representation that maintains the structural semantics of threshold-based networks while increasing connectedness, and an algorithm for constructing the network using exact distance measures in a fraction of the time it would take to build a threshold-based equivalent.
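The network definition in this abstract (each sequence points to its closest sequence, or to all closest sequences on a tie) is simple enough to sketch. The code below is a brute-force illustration only: it compares every pair with a textbook edit-distance routine and does not reproduce the efficient construction algorithm the authors report. The function names and the toy sequences are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def nearest_neighbor_edges(sequences):
    """Directed weighted edges (source, target, distance) to each node's nearest
    neighbor; all tied nearest neighbors receive an edge."""
    edges = []
    for i, s in enumerate(sequences):
        dists = [(edit_distance(s, t), j)
                 for j, t in enumerate(sequences) if j != i]
        if not dists:
            continue
        best = min(d for d, _ in dists)
        edges.extend((i, j, d) for d, j in dists if d == best)
    return edges


toy = ["ACGT", "ACGA", "TCGT", "GGGG"]  # hypothetical toy sequences
edges = nearest_neighbor_edges(toy)
for edge in edges:
    print(edge)
```

No distance threshold appears anywhere in the sketch, which is the property the abstract contrasts with threshold-based similarity networks; the outlier `GGGG` still gets outgoing edges, so no node is left isolated.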
Affiliation(s)
- Helen N. Catanese
  - School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, USA
- Kelly A. Brayton
  - School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, USA
  - Department of Veterinary Microbiology and Pathology, Washington State University, Pullman, WA, USA
- Assefaw H. Gebremedhin
  - School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, USA
10. Li Z, Nie F, Chang X, Nie L, Zhang H, Yang Y. Rank-Constrained Spectral Clustering With Flexible Embedding. IEEE Transactions on Neural Networks and Learning Systems 2018; 29:6073-6082. [PMID: 29993916 DOI: 10.1109/tnnls.2018.2817538] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Indexed: 06/08/2023]
Abstract
Spectral clustering (SC) has been proven to be effective in various applications. However, the learning scheme of SC is suboptimal in that it learns the cluster indicator from a fixed graph structure, which usually requires a rounding procedure to further partition the data. Also, the obtained cluster number cannot reflect the ground truth number of connected components in the graph. To alleviate these drawbacks, we propose a rank-constrained SC with flexible embedding framework. Specifically, an adaptive probabilistic neighborhood learning process is employed to recover the block-diagonal affinity matrix of an ideal graph. Meanwhile, a flexible embedding scheme is learned to unravel the intrinsic cluster structure in a low-dimensional subspace, where the irrelevant information and noise in high-dimensional data have been effectively suppressed. The proposed method is superior to previous SC methods in that: 1) the block-diagonal affinity matrix, learned simultaneously with the adaptive graph construction process, more explicitly induces the cluster membership without further discretization; 2) the number of clusters is guaranteed to converge to the ground truth via a rank constraint on the Laplacian matrix; and 3) the mismatch between the embedded feature and the projected feature allows more freedom for finding the proper cluster structure in the low-dimensional subspace as well as learning the corresponding projection matrix. Experimental results on both synthetic and real-world data sets demonstrate the promising performance of the proposed algorithm.
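The rank constraint mentioned in point 2) rests on a standard fact from spectral graph theory, stated here for orientation. The symbols are assumptions introduced for this note and are not taken from the abstract: $S$ is a nonnegative $n \times n$ affinity matrix, $D_S$ is the diagonal degree matrix of the symmetrized affinity $(S + S^{\top})/2$, and $c$ is the target number of clusters.

```latex
L_S = D_S - \frac{S + S^{\top}}{2},
\qquad
\operatorname{rank}(L_S) = n - c
\;\Longleftrightarrow\;
\text{the graph with affinity } S \text{ has exactly } c \text{ connected components}.
```

The equivalence holds because the multiplicity of the zero eigenvalue of a graph Laplacian equals the number of connected components. A learned graph that satisfies the constraint is therefore already partitioned into $c$ blocks, which is why no separate rounding or discretization step is needed.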
11. Genome-wide association analysis identifies genetic correlates of immune infiltrates in solid tumors. PLoS One 2017; 12:e0179726. [PMID: 28749946 PMCID: PMC5531551 DOI: 10.1371/journal.pone.0179726] [Citation(s) in RCA: 153] [Impact Index Per Article: 21.9] [Received: 09/18/2016] [Accepted: 06/02/2017] [Indexed: 12/27/2022]
Abstract
Therapeutic options for the treatment of an increasing variety of cancers have been expanded by the introduction of a new class of drugs, commonly referred to as checkpoint blocking agents, that target the host immune system to positively modulate anti-tumor immune response. Although efficacy of these agents has been linked to a pre-existing level of tumor immune infiltrate, it remains unclear why some patients exhibit deep and durable responses to these agents while others do not benefit. To examine the influence of tumor genetics on tumor immune state, we interrogated the relationship between somatic mutation and copy number alteration with infiltration levels of 7 immune cell types across 40 tumor cohorts in The Cancer Genome Atlas. Levels of cytotoxic T, regulatory T, total T, natural killer, and B cells, as well as monocytes and M2 macrophages, were estimated using a novel set of transcriptional signatures that were designed to resist interference from the cellular heterogeneity of tumors. Tumor mutational load and estimates of tumor purity were included in our association models to adjust for biases in multi-modal genomic data. Copy number alterations, mutations summarized at the gene level, and position-specific mutations were evaluated for association with tumor immune infiltration. We observed a strong relationship between copy number loss of a large region of chromosome 9p and decreased lymphocyte estimates in melanoma, pancreatic, and head/neck cancers. Mutations in the oncogenes PIK3CA, FGFR3, and RAS/RAF family members, as well as the tumor suppressor TP53, were linked to changes in immune infiltration, usually in restricted tumor types. Associations of specific WNT/beta-catenin pathway genetic changes with immune state were limited, but we noted a link between 9p loss and the expression of the WNT receptor FZD3, suggesting that there are interactions between 9p alteration and WNT pathways. 
Finally, two different cell death regulators, CASP8 and DIDO1, were often mutated in head/neck tumors that had higher lymphocyte infiltrates. In summary, our study supports the relevance of tumor genetics to questions of efficacy and resistance in checkpoint blockade therapies. It also highlights the need to assess genome-wide influences during exploration of any specific tumor pathway hypothesized to be relevant to therapeutic response. Some of the observed genetic links to immune state, like 9p loss, may influence response to cancer immune therapies. Others, like mutations in cell death pathways, may help guide combination therapeutic approaches.
12. Yalcin D, Hakguder ZM, Otu HH. Bioinformatics approaches to single-cell analysis in developmental biology. Mol Hum Reprod 2015; 22:182-92. [PMID: 26358759 DOI: 10.1093/molehr/gav050] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 05/12/2015] [Accepted: 09/04/2015] [Indexed: 12/17/2022]
Abstract
Individual cells within the same population show various degrees of heterogeneity, which may be better handled with single-cell analysis to address biological and clinical questions. Single-cell analysis is especially important in developmental biology as subtle spatial and temporal differences in cells have significant associations with cell fate decisions during differentiation and with the description of a particular state of a cell exhibiting an aberrant phenotype. Biotechnological advances, especially in the area of microfluidics, have led to a robust, massively parallel and multi-dimensional capturing, sorting, and lysis of single-cells and amplification of related macromolecules, which have enabled the use of imaging and omics techniques on single cells. There have been improvements in computational single-cell image analysis in developmental biology regarding feature extraction, segmentation, image enhancement and machine learning, handling limitations of optical resolution to gain new perspectives from the raw microscopy images. Omics approaches, such as transcriptomics, genomics and epigenomics, targeting gene and small RNA expression, single nucleotide and structural variations and methylation and histone modifications, rely heavily on high-throughput sequencing technologies. Although there are well-established bioinformatics methods for analysis of sequence data, there are limited bioinformatics approaches which address experimental design, sample size considerations, amplification bias, normalization, differential expression, coverage, clustering and classification issues, specifically applied at the single-cell level. In this review, we summarize biological and technological advancements, discuss challenges faced in the aforementioned data acquisition and analysis issues and present future prospects for application of single-cell analyses to developmental biology.
Affiliation(s)
- Dicle Yalcin
  - Department of Electrical and Computer Engineering, University of Nebraska-Lincoln, Lincoln, NE 68588-0511, USA
- Zeynep M Hakguder
  - Department of Electrical and Computer Engineering, University of Nebraska-Lincoln, Lincoln, NE 68588-0511, USA
- Hasan H Otu
  - Department of Electrical and Computer Engineering, University of Nebraska-Lincoln, Lincoln, NE 68588-0511, USA
13
|
Imangaliyev S, Keijser B, Crielaard W, Tsivtsivadze E. Personalized microbial network inference via co-regularized spectral clustering. Methods 2015; 83:28-35. [PMID: 25842007 DOI: 10.1016/j.ymeth.2015.03.017] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 01/29/2015] [Revised: 03/19/2015] [Accepted: 03/24/2015] [Indexed: 01/23/2023] Open
Abstract
We use the Human Microbiome Project (HMP) cohort (Peterson et al., 2009) to infer personalized oral microbial networks of healthy individuals. To cluster individuals with similar microbial profiles, a co-regularized spectral clustering algorithm is applied to the dataset. For each cluster discovered, we compute co-occurrence relationships among the microbial species, which determine the microbial network for that cluster of individuals. The results of our study suggest that there are several differences in microbial interactions at the personalized network level in healthy oral samples acquired from various niches. Based on the results of co-regularized spectral clustering, we discover two groups of individuals whose microbial interaction networks differ in topology. The results of microbial network inference suggest that niche-wise interactions are different in these two groups. Our study shows that healthy individuals fall into different microbial clusters according to their oral microbiota. Such personalized microbial networks offer a better understanding of the microbial ecology of healthy oral cavities and new possibilities for future targeted medication. The scripts written in scientific Python and in Matlab, which were used for network visualization, are provided for download on the website http://learning-machines.com/.
Affiliation(s)
- Sultan Imangaliyev
- Top Institute Food and Nutrition, Wageningen, The Netherlands; Research Group Microbiology and Systems Biology, TNO Earth, Environmental and Life Sciences, Zeist, The Netherlands; Department of Preventive Dentistry, Academic Centre for Dentistry Amsterdam, Amsterdam, The Netherlands
- Bart Keijser
- Top Institute Food and Nutrition, Wageningen, The Netherlands; Research Group Microbiology and Systems Biology, TNO Earth, Environmental and Life Sciences, Zeist, The Netherlands
- Wim Crielaard
- Top Institute Food and Nutrition, Wageningen, The Netherlands; Department of Preventive Dentistry, Academic Centre for Dentistry Amsterdam, Amsterdam, The Netherlands
- Evgeni Tsivtsivadze
- Top Institute Food and Nutrition, Wageningen, The Netherlands; Research Group Microbiology and Systems Biology, TNO Earth, Environmental and Life Sciences, Zeist, The Netherlands
|
14
|
Park H, Niida A, Miyano S, Imoto S. Sparse Overlapping Group Lasso for Integrative Multi-Omics Analysis. J Comput Biol 2015; 22:73-84. [PMID: 25629319 DOI: 10.1089/cmb.2014.0197] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Indexed: 01/08/2023] Open
Affiliation(s)
- Heewon Park
- Human Genome Center, the Institute of Medical Science, the University of Tokyo, Tokyo, Japan
- Atushi Niida
- Human Genome Center, the Institute of Medical Science, the University of Tokyo, Tokyo, Japan
- Satoru Miyano
- Human Genome Center, the Institute of Medical Science, the University of Tokyo, Tokyo, Japan
- Seiya Imoto
- Human Genome Center, the Institute of Medical Science, the University of Tokyo, Tokyo, Japan
|
15
|
Abstract
Whole-genome sequencing, particularly in fungi, has progressed at a tremendous rate. More difficult, however, is experimental testing of the inferences about gene function that can be drawn from comparative sequence analysis alone. We present a genome-wide functional characterization of a sequenced but experimentally understudied budding yeast, Saccharomyces bayanus var. uvarum (henceforth referred to as S. bayanus), allowing us to map changes over the 20 million years that separate this organism from S. cerevisiae. We first created a suite of genetic tools to facilitate work in S. bayanus. Next, we measured the gene-expression response of S. bayanus to a diverse set of perturbations optimized using a computational approach to cover a diverse array of functionally relevant biological responses. The resulting data set reveals that gene-expression patterns are largely conserved, but significant changes may exist in regulatory networks such as carbohydrate utilization and meiosis. In addition to regulatory changes, our approach identified gene functions that have diverged. The functions of genes in core pathways are highly conserved, but we observed many changes in which genes are involved in osmotic stress, peroxisome biogenesis, and autophagy. A surprising number of genes specific to S. bayanus respond to oxidative stress, suggesting the organism may have evolved under different selection pressures than S. cerevisiae. This work expands the scope of genome-scale evolutionary studies from sequence-based analysis to rapid experimental characterization and could be adopted for functional mapping in any lineage of interest. Furthermore, our detailed characterization of S. bayanus provides a valuable resource for comparative functional genomics studies in yeast.
|
16
|
Pirim H, Ekşioğlu B, Perkins A, Yüceer Ç. Clustering of High Throughput Gene Expression Data. Computers & Operations Research 2012; 39:3046-3061. [PMID: 23144527 PMCID: PMC3491664 DOI: 10.1016/j.cor.2012.03.008] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.3] [Indexed: 05/06/2023]
Abstract
High throughput biological data need to be processed, analyzed, and interpreted to address problems in life sciences. Bioinformatics, computational biology, and systems biology deal with biological problems using computational methods. Clustering is one of the methods used to gain insight into biological processes, particularly at the genomics level. Clearly, clustering can be used in many areas of biological data analysis. However, this paper presents a review of the current clustering algorithms designed especially for analyzing gene expression data. It is also intended to introduce one of the main problems in bioinformatics - clustering gene expression data - to the operations research community.
Affiliation(s)
- Harun Pirim
- Department of Industrial and Systems Engineering, Mississippi State University, P.O. Box 9542, Mississippi State, MS 39762
- Corresponding author. Tel.: +1-662-325-4226
- Burak Ekşioğlu
- Department of Industrial and Systems Engineering, Mississippi State University, P.O. Box 9542, Mississippi State, MS 39762
- Andy Perkins
- Department of Computer Science and Engineering, Mississippi State University
- Çetin Yüceer
- Department of Forestry, Mississippi State University
|
17
|
Bello-Orgaz G, Menéndez HD, Camacho D. Adaptive K-Means Algorithm for Overlapped Graph Clustering. Int J Neural Syst 2012; 22:1250018. [DOI: 10.1142/s0129065712500189] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.4] [Indexed: 11/18/2022]
Abstract
The graph clustering problem has become highly relevant due to the growing interest of several research communities in social networks and their possible applications. Overlapped graph clustering algorithms try to find subsets of nodes that can belong to different clusters. In social network-based applications it is quite usual for a node of the network to belong to different groups, or communities, in the graph. Therefore, algorithms that try to discover or analyze the behavior of these networks need to handle this feature by detecting and identifying the overlapped nodes. This paper shows a soft clustering approach based on a genetic algorithm where a new encoding is designed to achieve two main goals: first, the automatic adaptation of the number of communities that can be detected and second, the definition of several fitness functions that guide the search process using some measures extracted from graph theory. Finally, our approach has been experimentally tested using the Eurovision contest dataset, a well-known social-based data network, to show how overlapped communities can be found using our method.
Affiliation(s)
- Gema Bello-Orgaz
- Computer Science Department, Escuela Politecnica Superior, Universidad Autónoma de Madrid, 28049, Madrid, Spain
- Héctor D. Menéndez
- Computer Science Department, Escuela Politecnica Superior, Universidad Autónoma de Madrid, 28049, Madrid, Spain
- David Camacho
- Computer Science Department, Escuela Politecnica Superior, Universidad Autónoma de Madrid, 28049, Madrid, Spain
|
18
|
Arefin AS, Riveros C, Berretta R, Moscato P. GPU-FS-kNN: a software tool for fast and scalable kNN computation using GPUs. PLoS One 2012; 7:e44000. [PMID: 22937144 PMCID: PMC3429408 DOI: 10.1371/journal.pone.0044000] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.8] [Received: 01/30/2012] [Accepted: 07/27/2012] [Indexed: 12/05/2022] Open
Abstract
Background The analysis of biological networks has become a major challenge due to the recent development of high-throughput techniques that are rapidly producing very large data sets. The exploding volumes of biological data demand extreme computational power and special computing facilities (i.e. supercomputers). An inexpensive solution, such as General Purpose computation based on Graphics Processing Units (GPGPU), can be adapted to tackle this challenge, but the limited internal memory of the device can pose a new problem of scalability. Efficient data and computational parallelism with partitioning is required to provide a fast and scalable solution to this problem. Results We propose an efficient parallel formulation of the k-Nearest Neighbour (kNN) search problem, which is a popular method for classifying objects in several fields of research, such as pattern recognition, machine learning and bioinformatics. Although the kNN search is simple and straightforward, its performance degrades dramatically for large data sets, since the task is computationally intensive. The proposed approach is not only fast but also scalable to large-scale instances. Based on our approach, we implemented a software tool, GPU-FS-kNN (GPU-based Fast and Scalable k-Nearest Neighbour), for CUDA-enabled GPUs. The basic approach is simple and adaptable to other available GPU architectures. We observed speed-ups of 50–60 times compared with a CPU implementation on a well-known breast microarray study and its associated data sets. Conclusion Our GPU-based Fast and Scalable k-Nearest Neighbour search technique (GPU-FS-kNN) provides a significant performance improvement for nearest neighbour computation in large-scale networks. Source code and the software tool are available under the GNU General Public License (GPL) at https://sourceforge.net/p/gpufsknn/.
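For readers unfamiliar with the kNN search that the abstract above accelerates, the following is a minimal CPU-only sketch of the brute-force baseline. It is not the GPU-FS-kNN implementation: the function names `knn_search` and `knn_graph`, the Euclidean distance, and the toy points are all assumptions made for illustration. It does show why the cost grows quickly, since every query is compared against every point.

```python
import heapq
import math


def knn_search(points, query, k):
    """Return the indices of the k points closest to `query` (Euclidean distance)."""
    # Pair each distance with its index so ties are broken by the lower index.
    distances = ((math.dist(p, query), i) for i, p in enumerate(points))
    return [i for _, i in heapq.nsmallest(k, distances)]


def knn_graph(points, k):
    """For every point, list its k nearest neighbours, excluding the point itself."""
    graph = {}
    for i, p in enumerate(points):
        # Ask for k + 1 neighbours because the point itself sits at distance 0.
        neighbours = knn_search(points, p, k + 1)
        graph[i] = [j for j in neighbours if j != i][:k]
    return graph


# Hypothetical toy data: four points in the plane.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
graph = knn_graph(points, 2)
```

Building the full graph this way takes a number of distance evaluations that is quadratic in the number of points, which is the scaling problem the abstract describes for large data sets.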
Affiliation(s)
- Ahmed Shamsul Arefin
- Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine, The University of Newcastle, Callaghan, New South Wales, Australia
- Carlos Riveros
- Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine, The University of Newcastle, Callaghan, New South Wales, Australia
- Hunter Medical Research Institute, Information Based Medicine Program, John Hunter Hospital, New Lambton Heights, New South Wales, Australia
- Regina Berretta
- Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine, The University of Newcastle, Callaghan, New South Wales, Australia
- Hunter Medical Research Institute, Information Based Medicine Program, John Hunter Hospital, New Lambton Heights, New South Wales, Australia
- Pablo Moscato
- Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine, The University of Newcastle, Callaghan, New South Wales, Australia
- Hunter Medical Research Institute, Information Based Medicine Program, John Hunter Hospital, New Lambton Heights, New South Wales, Australia
- Australian Research Council Centre of Excellence in Bioinformatics, Callaghan, New South Wales, Australia
|
19
|
Abstract
Background A wealth of clustering algorithms has been applied to gene co-expression experiments. These algorithms cover a broad range of approaches, from conventional techniques such as k-means and hierarchical clustering, to graphical approaches such as k-clique communities, weighted gene co-expression networks (WGCNA) and paraclique. Comparison of these methods to evaluate their relative effectiveness provides guidance to algorithm selection, development and implementation. Most prior work on comparative clustering evaluation has focused on parametric methods. Graph theoretical methods are recent additions to the tool set for the global analysis and decomposition of microarray co-expression matrices that have not generally been included in earlier methodological comparisons. In the present study, a variety of parametric and graph theoretical clustering algorithms are compared using well-characterized transcriptomic data at a genome scale from Saccharomyces cerevisiae. Methods For each clustering method under study, a variety of parameters were tested. Jaccard similarity was used to measure each cluster's agreement with every GO and KEGG annotation set, and the highest Jaccard score was assigned to the cluster. Clusters were grouped into small, medium, and large bins, and the Jaccard scores of the top five scoring clusters in each bin were averaged and reported as the best average top 5 (BAT5) score for the particular method. Results Clusters produced by each method were evaluated based upon the positive match to known pathways. This produces a readily interpretable ranking of the relative effectiveness of clustering on the genes. Methods were also tested to determine whether they were able to identify clusters consistent with those identified by other clustering methods. Conclusions Validation of clusters against known gene classifications demonstrates that for these data, graph-based techniques outperform conventional clustering approaches, suggesting that further development and application of combinatorial strategies is warranted.
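The scoring step described in the Methods of the abstract above can be sketched roughly as follows. This is an illustrative reading, not the authors' code: the helper names, the gene identifiers, and the two annotation sets are hypothetical, and the small/medium/large binning is left out because the abstract does not give the size thresholds.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of gene identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_annotation_score(cluster, annotation_sets):
    """Highest Jaccard score of a cluster against any annotation set."""
    return max(
        (jaccard(cluster, genes) for genes in annotation_sets.values()),
        default=0.0,
    )


def mean_of_top(scores, n=5):
    """Average of the n highest scores (the 'top 5' average when n is 5)."""
    top = sorted(scores, reverse=True)[:n]
    return sum(top) / len(top) if top else 0.0


# Hypothetical annotation sets and clusters, for illustration only.
annotations = {
    "GO:example_pathway": {"g1", "g2", "g3"},
    "KEGG:example_complex": {"g4", "g5"},
}
clusters = [{"g1", "g2"}, {"g4", "g5", "g6"}, {"g7"}]

scores = [best_annotation_score(c, annotations) for c in clusters]
summary = mean_of_top(scores, n=2)
```

In a fuller version, `mean_of_top` would be applied separately to the clusters in each size bin, once the bin boundaries are known.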
|
20
|
Zhou F, Ma Q, Li G, Xu Y. QServer: a biclustering server for prediction and assessment of co-expressed gene clusters. PLoS One 2012; 7:e32660. [PMID: 22403692 PMCID: PMC3293860 DOI: 10.1371/journal.pone.0032660] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 08/31/2011] [Accepted: 01/30/2012] [Indexed: 01/31/2023] Open
Abstract
BACKGROUND Biclustering is a powerful technique for identification of co-expressed gene groups under any (unspecified) substantial subset of given experimental conditions, which can be used for elucidation of transcriptionally co-regulated genes. RESULTS We have previously developed a biclustering algorithm, QUBIC, which can solve more general biclustering problems than previous biclustering algorithms. To fully utilize the analysis power the algorithm provides, we have developed a web server, QServer, for prediction, computational validation and analyses of co-expressed gene clusters. Specifically, the QServer has the following capabilities in addition to biclustering by QUBIC: (i) prediction and assessment of conserved cis regulatory motifs in promoter sequences of the predicted co-expressed genes; (ii) functional enrichment analyses of the predicted co-expressed gene clusters using Gene Ontology (GO) terms, and (iii) visualization capabilities in support of interactive biclustering analyses. QServer supports the biclustering and functional analysis for a wide range of organisms, including human, mouse, Arabidopsis, bacteria and archaea, whose underlying genome database will be continuously updated. CONCLUSION We believe that QServer provides an easy-to-use and highly effective platform useful for hypothesis formulation and testing related to transcription co-regulation.
Affiliation(s)
- Fengfeng Zhou
- Research Center for Biomedical Information Technology, Institute of Biomedical and Health Engineering, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, People's Republic of China
- Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, Institute of Bioinformatics, BioEnergy Science Center (BESC), University of Georgia, Athens, Georgia, United States of America
- Qin Ma
- Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, Institute of Bioinformatics, BioEnergy Science Center (BESC), University of Georgia, Athens, Georgia, United States of America
- Guojun Li
- Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, Institute of Bioinformatics, BioEnergy Science Center (BESC), University of Georgia, Athens, Georgia, United States of America
- School of Mathematics, Shandong University, Jinan, China
- Ying Xu
- Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, Institute of Bioinformatics, BioEnergy Science Center (BESC), University of Georgia, Athens, Georgia, United States of America
- College of Computer Science and Technology, Jilin University, Changchun, China
|
21
|
Judson RS, Mortensen HM, Shah I, Knudsen TB, Elloumi F. Using pathway modules as targets for assay development in xenobiotic screening. Mol Biosyst 2012; 8:531-42. [DOI: 10.1039/c1mb05303e] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 12/20/2022]
|
22
|
Comparative microbial modules resource: generation and visualization of multi-species biclusters. PLoS Comput Biol 2011; 7:e1002228. [PMID: 22144874 PMCID: PMC3228777 DOI: 10.1371/journal.pcbi.1002228] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 05/05/2011] [Accepted: 08/29/2011] [Indexed: 11/24/2022] Open
Abstract
The increasing abundance of large-scale, high-throughput datasets for many closely related organisms provides opportunities for comparative analysis via the simultaneous biclustering of datasets from multiple species. These analyses require a reformulation of how to organize multi-species datasets and visualize comparative genomics data analyses results. Recently, we developed a method, multi-species cMonkey, which integrates heterogeneous high-throughput datatypes from multiple species to identify conserved regulatory modules. Here we present an integrated data visualization system, built upon the Gaggle, enabling exploration of our method's results (available at http://meatwad.bio.nyu.edu/cmmr.html). The system can also be used to explore other comparative genomics datasets and outputs from other data analysis procedures – results from other multiple-species clustering programs or from independent clustering of different single-species datasets. We provide an example use of our system for two bacteria, Escherichia coli and Salmonella Typhimurium. We illustrate the use of our system by exploring conserved biclusters involved in nitrogen metabolism, uncovering a putative function for yjjI, a currently uncharacterized gene that we predict to be involved in nitrogen assimilation. Advancing high-throughput experimental technologies are providing access to genome-wide measurements for multiple related species on multiple information levels (e.g. mRNA, protein, interactions, functional assays, etc.). We present a biclustering algorithm and an associated visualization system for generating and exploring regulatory modules derived from analysis of integrated multi-species genomics datasets. 
We use multi-species-cMonkey, an algorithm of our own construction that can integrate diverse systems-biology datatypes from multiple species to form biclusters, or condition-dependent regulatory modules, that are conserved across both the multiple species analyzed and biclusters that are specific to subsets of the processed species. Our resource is an integrated web and java based system that allows biologists to explore both conserved and species-specific biclusters in the context of the data, associated networks for both species, and existing annotations for both species. Our focus in this work is on the use of the integrated system with examples drawn from exploring modules associated with nitrogen metabolism in two Gram-negative bacteria, E. coli and S. Typhimurium.
|
23
|
Dost B, Wu C, Su A, Bafna V. TCLUST: a fast method for clustering genome-scale expression data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2011; 8:808-818. [PMID: 20479508 DOI: 10.1109/tcbb.2010.34] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 05/29/2023]
Abstract
Genes with a common function are often hypothesized to have correlated expression levels in mRNA expression data, motivating the development of clustering algorithms for gene expression data sets. We observe that existing approaches do not scale well for large data sets, and indeed did not converge for the data set considered here. We present a novel clustering method TCLUST that exploits coconnectedness to efficiently cluster large, sparse expression data. We compare our approach with two existing clustering methods CAST and K-means which have been previously applied to clustering of gene-expression data with good performance results. Using a number of metrics, TCLUST is shown to be superior to or at least competitive with the other methods, while being much faster. We have applied this clustering algorithm to a genome-scale gene-expression data set and used gene set enrichment analysis to discover highly significant biological clusters. (Source code for TCLUST is downloadable at http://www.cse.ucsd.edu/~bdost/tclust.)
Affiliation(s)
- Banu Dost
- Department of Computer Science and Engineering, University of California, San Diego, CA 92093, USA
|
24
|
A graph clustering algorithm based on a clustering coefficient for weighted graphs. Journal of the Brazilian Computer Society 2010. [DOI: 10.1007/s13173-010-0027-x] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 10/18/2022]
Abstract
Graph clustering is an important issue for several applications associated with data analysis in graphs. However, the discovery of groups of highly connected nodes that can represent clusters is not an easy task. Many assumptions, such as the number of clusters and whether the clusters are balanced, may need to be made before a clustering algorithm is applied. Moreover, without prior information regarding data labels, there is no guarantee that the partition found by a clustering algorithm automatically extracts the relevant information present in the data. This paper proposes a new graph clustering algorithm that automatically defines the number of clusters based on a clustering tendency connectivity-based validation measure, also proposed in the paper. According to the computational results, the new algorithm is able to efficiently find graph clustering partitions for complete graphs.
|
25
|
Gene expression profiling: classification of mice with left ventricle systolic dysfunction using microarray analysis. Crit Care Med 2010; 38:25-31. [PMID: 19770745 DOI: 10.1097/ccm.0b013e3181b427e8] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.2] [Indexed: 11/26/2022]
Abstract
OBJECTIVE We tested the hypothesis that a set of differentially expressed genes could be used to classify mice according to cardiovascular phenotype after prolonged catecholamine stress. DESIGN Prospective, randomized study. SETTING University-based research laboratory. SUBJECTS One hundred seventy-three male mice were studied: wild-type (WT) C57, WT FVB, WT B6129SF2/J, and beta2 adrenergic receptor knockout. INTERVENTIONS Mice of each genotype were randomly assigned to 14-day infusions of isoproterenol (120 microg/g/day) or no treatment. Approximately half of the animals underwent left ventricle pressure volume loop analysis. The remaining animals were killed for extraction of messenger RNA from whole heart preparations for microarray analysis. MEASUREMENTS AND MAIN RESULTS We observed that WT FVB and beta2 adrenergic receptor knockout mice developed systolic dysfunction in response to continuous catecholamine infusion, whereas WT C57 mice developed diastolic dysfunction. Using these mice as the derivation cohort, we identified a set of 83 genes whose differential expression correlated with left ventricle systolic dysfunction. The gene set was then used to accurately classify mice from a separate group (WT B6129SF2/J) into the cohort that developed left ventricle systolic dysfunction after catecholamine stress. CONCLUSIONS The differential expression pattern of 83 genes can be used to accurately classify mice according to physiological phenotype after catecholamine stress.
|
26
|
Celton M, Malpertuy A, Lelandais G, de Brevern AG. Comparative analysis of missing value imputation methods to improve clustering and interpretation of microarray experiments. BMC Genomics 2010; 11:15. [PMID: 20056002 PMCID: PMC2827407 DOI: 10.1186/1471-2164-11-15] [Citation(s) in RCA: 66] [Impact Index Per Article: 4.7] [Received: 09/02/2009] [Accepted: 01/07/2010] [Indexed: 11/17/2022] Open
Abstract
Background Microarray technologies produce large amounts of data. In a previous study, we showed the value of the k-Nearest Neighbour approach for restoring missing gene expression values, and its positive impact on gene clustering by a hierarchical algorithm. Since then, numerous replacement methods have been proposed to impute missing values (MVs) for microarray data. In this study, we have evaluated twelve different usable methods and their influence on the quality of gene clustering. We used several datasets, from both kinetic and non-kinetic experiments in yeast and human. Results We underline the excellent efficiency of the approaches proposed and implemented by Bo and co-workers, especially one based on expectation maximization (EM_array). These improvements were also observed for the imputation of extreme values, which are the most difficult to predict. We showed that the imputed MVs still have important effects on the stability of the gene clusters. The improvement in the clustering obtained by hierarchical clustering remains limited and is not sufficient to completely restore the correct gene associations. However, a common tendency can be found between the quality of the imputation method and the gene cluster stability. Even if the comparison between clustering algorithms is a complex task, we observed that the k-means approach conserves gene associations more efficiently. Conclusions More than 6,000,000 independent simulations have assessed the quality of 12 imputation methods on five very different biological datasets. Important improvements have thus been made since our last study. The EM_array approach is an efficient method for restoring missing gene expression values, with a lower estimation error level. Nonetheless, the presence of MVs, even at a low rate, is a major factor of gene cluster instability. Our study highlights the need for a systematic assessment of imputation methods and thus for dedicated benchmarks. A noticeable point is the specific influence of some biological datasets.
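As a rough illustration of the k-Nearest Neighbour imputation mentioned in the abstract above, the sketch below fills each missing value with the mean of that column in the k most similar rows. It is a simplified assumption of how such a method can work, not one of the twelve implementations the study evaluated: the function name `knn_impute`, the use of `None` for missing values, the distance over shared observed columns, and the toy matrix are all chosen for illustration.

```python
import math


def knn_impute(matrix, k=2):
    """Fill None entries with the mean of that column in the k most similar rows."""

    def distance(row_a, row_b):
        # Root-mean-square difference over columns observed in both rows.
        shared = [
            (a, b) for a, b in zip(row_a, row_b) if a is not None and b is not None
        ]
        if not shared:
            return math.inf
        return math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))

    imputed = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours must have an observed value in column j.
            candidates = sorted(
                (distance(row, other), other[j])
                for n, other in enumerate(matrix)
                if n != i and other[j] is not None
            )
            nearest = [v for d, v in candidates[:k] if d < math.inf]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


# Hypothetical expression matrix: rows are genes, columns are conditions.
expression = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.2, 2.2, 3.4],
    [9.0, 9.0, 9.0],
]
filled = knn_impute(expression, k=2)
```

Here the missing entry in the first row is estimated from the second and third rows, which are its closest neighbours, giving a value of approximately 3.2. The distant fourth row does not contribute.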
Affiliation(s)
- Magalie Celton
- INSERM UMR-S 726, Equipe de Bioinformatique Génomique et Moléculaire, DSIMB, Université Paris Diderot-Paris 7, 2 place Jussieu, Paris, France
|
27
|
Mutwil M, Usadel B, Schütte M, Loraine A, Ebenhöh O, Persson S. Assembly of an interactive correlation network for the Arabidopsis genome using a novel heuristic clustering algorithm. Plant Physiology 2010; 152:29-43. [PMID: 19889879 PMCID: PMC2799344 DOI: 10.1104/pp.109.145318] [Citation(s) in RCA: 122] [Impact Index Per Article: 8.7] [Indexed: 05/17/2023]
Abstract
A vital quest in biology is comprehensible visualization and interpretation of correlation relationships on a genome scale. Such relationships may be represented in the form of networks, which usually require disassembly into smaller manageable units, or clusters, to facilitate interpretation. Several graph-clustering algorithms that may be used to visualize biological networks are available. However, only some of these support weighted edges, and none provides good control of cluster sizes, which is crucial for comprehensible visualization of large networks. We constructed an interactive coexpression network for the Arabidopsis (Arabidopsis thaliana) genome using a novel Heuristic Cluster Chiseling Algorithm (HCCA) that supports weighted edges and that may control average cluster sizes. Comparative clustering analyses demonstrated that the HCCA performed as well as, or better than, the commonly used Markov, MCODE, and k-means clustering algorithms. We mapped MapMan ontology terms onto coexpressed node vicinities of the network, which revealed transcriptional organization of previously unrelated cellular processes. We further explored the predictive power of this network through mutant analyses and identified six new genes that are essential to plant growth. We show that the HCCA-partitioned network constitutes an ideal "cartographic" platform for visualization of correlation networks. This approach rapidly provides network partitions with relatively uniform cluster sizes on a genome-scale level and may thus also be used for correlation network layouts in other species.
|
28
|
Zhang KX, Ouellette BFF. Pandora, a pathway and network discovery approach based on common biological evidence. Bioinformatics 2009; 26:529-35. [PMID: 20031970 PMCID: PMC2820679 DOI: 10.1093/bioinformatics/btp701] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 01/23/2023]
Abstract
Motivation: Many biological phenomena involve extensive interactions between many of the biological pathways present in cells. However, extraction of all the inherent biological pathways remains a major challenge in systems biology. With the advent of high-throughput functional genomic techniques, it is now possible to infer biological pathways and pathway organization in a systematic way by integrating disparate biological information. Results: Here, we propose a novel integrated approach that uses network topology to predict biological pathways. We integrated four types of biological evidence (protein–protein interaction, genetic interaction, domain–domain interaction and semantic similarity of Gene Ontology terms) to generate a functionally associated network. This network was then used to develop a new pathway finding algorithm to predict biological pathways in yeast. Our approach discovered 195 biological pathways and 31 functionally redundant pathway pairs in yeast. By comparing our identified pathways to three public pathway databases (KEGG, BioCyc and Reactome), we observed that our approach achieves a maximum positive predictive value of 12.8% and improves on other predictive approaches. This study allows us to reconstruct biological pathways and delineates cellular machinery in a systematic view. Availability: The method has been implemented in Perl and is available for downloading from http://www.oicr.on.ca/research/ouellette/pandora. It is distributed under the terms of GPL (http://opensource.org/licenses/gpl-2.0.php). Contact: francis@oicr.on.ca Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Kelvin Xi Zhang
- Graduate Program in Bioinformatics, University of British Columbia, Vancouver, British Columbia, V6T 1Z4, Canada
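The positive predictive value quoted in the Pandora abstract above is the fraction of predicted pathways that are confirmed by a reference database. A minimal sketch of that arithmetic follows. The function name, the exact-identifier matching rule, and the toy pathway identifiers are assumptions for illustration only; they are not taken from the Pandora implementation, which the abstract says is written in Perl.

```python
def positive_predictive_value(predicted, reference):
    """Return TP / (TP + FP): the share of predictions found in the reference set."""
    predicted = set(predicted)
    if not predicted:
        raise ValueError("no predictions to evaluate")
    # A prediction counts as a true positive only on an exact identifier match
    # (an assumed rule; real pathway comparisons usually allow partial overlap).
    true_positives = len(predicted & set(reference))
    return true_positives / len(predicted)


# Hypothetical identifiers, not real pathway names.
predicted_pathways = {"P1", "P2", "P3", "P4"}
reference_pathways = {"P2", "P9"}

ppv = positive_predictive_value(predicted_pathways, reference_pathways)
print(f"PPV = {ppv:.1%}")  # prints "PPV = 25.0%"
```

Because the denominator is the number of predictions, a method that predicts many novel pathways absent from the reference databases will show a low value even if those predictions are biologically sound.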
|
29
|
Selga E, Oleaga C, Ramírez S, de Almagro MC, Noé V, Ciudad CJ. Networking of differentially expressed genes in human cancer cells resistant to methotrexate. Genome Med 2009; 1:83. [PMID: 19732436 PMCID: PMC2768990 DOI: 10.1186/gm83] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.0] [Received: 05/22/2009] [Revised: 07/31/2009] [Accepted: 09/04/2009] [Indexed: 12/14/2022] Open
Abstract
Background The need for an integrated view of data obtained from high-throughput technologies gave rise to network analyses. These are especially useful to rationalize how external perturbations propagate through the expression of genes. To address this issue in the case of drug resistance, we constructed biological association networks of genes differentially expressed in cell lines resistant to methotrexate (MTX). Methods Seven cell lines representative of different types of cancer, including colon cancer (HT29 and Caco2), breast cancer (MCF-7 and MDA-MB-468), pancreatic cancer (MIA PaCa-2), erythroblastic leukemia (K562) and osteosarcoma (Saos-2), were used. The differential expression pattern between sensitive and MTX-resistant cells was determined by whole human genome microarrays and analyzed with the GeneSpring GX software package. Genes deregulated in common between the different cancer cell lines served to generate biological association networks using the Pathway Architect software. Results Dickkopf homolog-1 (DKK1) is a highly interconnected node in the network generated with genes in common between the two colon cancer cell lines, and functional validations of this target using small interfering RNAs (siRNAs) showed a chemosensitization toward MTX. Members of the UDP-glucuronosyltransferase 1A (UGT1A) family formed a network of genes differentially expressed in the two breast cancer cell lines. siRNA treatment against UGT1A also showed an increase in MTX sensitivity. Eukaryotic translation elongation factor 1 alpha 1 (EEF1A1) was overexpressed among the pancreatic cancer, leukemia and osteosarcoma cell lines, and siRNA treatment against EEF1A1 produced a chemosensitization toward MTX. Conclusions Biological association networks identified DKK1, UGT1As and EEF1A1 as important gene nodes in MTX resistance. Treatments using siRNA technology against these three genes showed chemosensitization toward MTX.
Affiliation(s)
- Elisabet Selga
- Department of Biochemistry and Molecular Biology, School of Pharmacy, University of Barcelona, Diagonal Avenue, E-08028 Barcelona, Spain.
|
30
|
Li G, Ma Q, Tang H, Paterson AH, Xu Y. QUBIC: a qualitative biclustering algorithm for analyses of gene expression data. Nucleic Acids Res 2009; 37:e101. [PMID: 19509312 PMCID: PMC2731891 DOI: 10.1093/nar/gkp491] [Citation(s) in RCA: 116] [Impact Index Per Article: 7.7] [Indexed: 01/04/2023] Open
Abstract
Biclustering extends the traditional clustering techniques by attempting to find (all) subgroups of genes with similar expression patterns under to-be-identified subsets of experimental conditions when applied to gene expression data. Still, the real power of this clustering strategy is yet to be fully realized due to the lack of effective and efficient algorithms for reliably solving the general biclustering problem. We report a QUalitative BIClustering algorithm (QUBIC) that can solve the biclustering problem in a more general form, compared to existing algorithms, through employing a combination of qualitative (or semi-quantitative) measures of gene expression data and a combinatorial optimization technique. One key unique feature of the QUBIC algorithm is that it can identify all statistically significant biclusters including biclusters with the so-called 'scaling patterns', a problem considered to be rather challenging; another key unique feature is that the algorithm solves such general biclustering problems very efficiently, capable of solving biclustering problems with tens of thousands of genes under up to thousands of conditions in a few minutes of CPU time on a desktop computer. We have demonstrated a considerably improved biclustering performance by our algorithm compared to the existing algorithms on various benchmark sets and data sets of our own. QUBIC was written in ANSI C and tested using GCC (version 4.1.2) on Linux. Its source code is available at: http://csbl.bmb.uga.edu/~maqin/bicluster. A server version of QUBIC is also available upon request.
Affiliation(s)
- Guojun Li
- Department of Biochemistry and Molecular Biology, Computational Systems Biology Laboratory, Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA
|
31
|
Abstract
The genetic variation that occurs naturally in a population is a powerful resource for studying how genotype affects phenotype. Each allele is a perturbation of the biological system, and genetic crosses, through the processes of recombination and segregation, randomize the distribution of these alleles among the progeny of a cross. The randomized genetic perturbations affect traits directly and indirectly, and the similarities and differences between traits in their responses to common perturbations allow inferences about whether variation in a trait is a cause of a phenotype (such as disease) or whether the trait variation is, instead, an effect of that phenotype. It is then possible to use this information about causes and effects to build models of probabilistic 'causal networks'. These networks are beginning to define the outlines of the 'genotype-phenotype map'.
|
32
|
Zhu Y, Li H, Miller DJ, Wang Z, Xuan J, Clarke R, Hoffman EP, Wang Y. caBIG VISDA: modeling, visualization, and discovery for cluster analysis of genomic data. BMC Bioinformatics 2008; 9:383. [PMID: 18801195 PMCID: PMC2566986 DOI: 10.1186/1471-2105-9-383] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Received: 05/16/2008] [Accepted: 09/18/2008] [Indexed: 12/31/2022] Open
Abstract
Background The main limitations of most existing clustering methods used in genomic data analysis include heuristic or random algorithm initialization, the potential of finding poor local optima, the lack of cluster number detection, an inability to incorporate prior/expert knowledge, black-box and non-adaptive designs, in addition to the curse of dimensionality and the discernment of uninformative, uninteresting cluster structure associated with confounding variables. Results In an effort to partially address these limitations, we develop the VIsual Statistical Data Analyzer (VISDA) for cluster modeling, visualization, and discovery in genomic data. VISDA performs progressive, coarse-to-fine (divisive) hierarchical clustering and visualization, supported by hierarchical mixture modeling, supervised/unsupervised informative gene selection, supervised/unsupervised data visualization, and user/prior knowledge guidance, to discover hidden clusters within complex, high-dimensional genomic data. The hierarchical visualization and clustering scheme of VISDA uses multiple local visualization subspaces (one at each node of the hierarchy) and consequent subspace data modeling to reveal both global and local cluster structures in a "divide and conquer" scenario. Multiple projection methods, each sensitive to a distinct type of clustering tendency, are used for data visualization, which increases the likelihood that cluster structures of interest are revealed. Initialization of the full dimensional model is based on first learning models with user/prior knowledge guidance on data projected into the low-dimensional visualization spaces. Model order selection for the high dimensional data is accomplished by Bayesian theoretic criteria and user justification applied via the hierarchy of low-dimensional visualization subspaces. 
Based on its complementary building blocks and flexible functionality, VISDA is generally applicable for gene clustering, sample clustering, and phenotype clustering (wherein phenotype labels for samples are known), albeit with minor algorithm modifications customized to each of these tasks. Conclusion VISDA achieved robust and superior clustering accuracy, compared with several benchmark clustering schemes. The model order selection scheme in VISDA was shown to be effective for high dimensional genomic data clustering. On muscular dystrophy data and muscle regeneration data, VISDA identified biologically relevant co-expressed gene clusters. VISDA also captured the pathological relationships among different phenotypes revealed at the molecular level, through phenotype clustering on muscular dystrophy data and multi-category cancer data.
Affiliation(s)
- Yitan Zhu
- Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA.
|
33
|
Chen X, Liang S, Zheng W, Liao Z, Shang T, Ma W. Meta-analysis of nasopharyngeal carcinoma microarray data explores mechanism of EBV-regulated neoplastic transformation. BMC Genomics 2008; 9:322. [PMID: 18605998 PMCID: PMC2491640 DOI: 10.1186/1471-2164-9-322] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.0] [Received: 02/16/2008] [Accepted: 07/07/2008] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Epstein-Barr virus (EBV) presumably plays an important role in the pathogenesis of nasopharyngeal carcinoma (NPC), but the molecular mechanism of EBV-dependent neoplastic transformation is not well understood. Combining bioinformatics with evidence from biological experiments has paved a new way to gain more insight into the molecular mechanism of cancer. RESULTS We profiled gene expression using a meta-analysis approach. Two sets of meta-genes were obtained. Meta-A genes were identified by finding those commonly activated/deactivated upon EBV infection/reactivation. These genes could be key players for pathways de-regulated by EBV during latent infection and lytic proliferation. Meta-B genes were obtained from differential genes commonly expressed in NPC and PEL (primary effusion lymphoma). We then integrated meta-A, meta-B and associated factors into an interaction network using acquired information. Our analysis suggests that NPC transformation depends on timely regulation of DEK, CDK inhibitor(s), p53, RB and several transcriptional cascades, interconnected by E2F, AP-1, NF-kappaB and STAT3, among others, during latent and lytic cycles. CONCLUSION Our meta-analysis strategy re-analyzed EBV-related tumor data sets and identified sets of meta-genes possibly involved in maintaining latent or switching to lytic cycles of EBV in NPC. The results of this analysis may shed new light on EBV-led neoplastic transformation.
Affiliation(s)
- Xia Chen
- Institute of Genetic Engineering, Southern Medical University, Guangzhou, PR China.
|
34
|
Swindell WR. Genes regulated by caloric restriction have unique roles within transcriptional networks. Mech Ageing Dev 2008; 129:580-92. [PMID: 18634819 DOI: 10.1016/j.mad.2008.06.001] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.5] [Received: 03/12/2008] [Revised: 06/09/2008] [Accepted: 06/15/2008] [Indexed: 02/06/2023]
Abstract
Caloric restriction (CR) has received much interest as an intervention that delays age-related disease and increases lifespan. Whole-genome microarrays have been used to identify specific genes underlying these effects, and in mice, this has led to the identification of genes with expression responses to CR that are shared across multiple tissue types. Such CR-regulated genes represent strong candidates for future investigation, but have been understood only as a list, without regard to their broader role within transcriptional networks. In this study, co-expression and network properties of CR-regulated genes were investigated using data generated by more than 600 Affymetrix microarrays. This analysis identified groups of co-expressed genes and regulatory factors associated with the mammalian CR response, and uncovered surprising network properties of CR-regulated genes. Genes downregulated by CR were highly connected and located in dense network regions. In contrast, CR-upregulated genes were weakly connected and positioned in sparse network regions. Some network properties were mirrored by CR-regulated genes from invertebrate models, suggesting an evolutionary basis for the observed patterns. These findings contribute to a systems-level picture of how CR influences transcription within mammalian cells, and point towards a comprehensive understanding of CR in terms of its influence on biological networks.
Affiliation(s)
- William R Swindell
- Department of Pathology, University of Michigan, Ann Arbor, MI 48109-2200, USA.
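The abstract above contrasts "highly connected" and "weakly connected" genes in a co-expression network. One common way to quantify connectivity is to count, for each gene, how many other genes have a strongly correlated expression profile. The sketch below illustrates that idea only. The use of Pearson correlation, the absolute-value cutoff of 0.8, the function name, and the toy matrix are all assumptions; the abstract does not state how the network in the study was built.

```python
import numpy as np


def coexpression_degrees(expression, threshold=0.8):
    """Count co-expression neighbours per gene.

    expression: 2-D array, one row per gene and one column per sample.
    threshold:  assumed cutoff on |Pearson r| for drawing an edge.
    """
    corr = np.corrcoef(expression)           # gene-by-gene Pearson correlation
    adjacency = np.abs(corr) >= threshold    # edge if strongly (anti-)correlated
    np.fill_diagonal(adjacency, False)       # a gene is not its own neighbour
    return adjacency.sum(axis=1)


# Toy data: four hypothetical genes measured in four samples.
expression = np.array([
    [1.0, 2.0, 3.0, 4.0],   # gene A
    [2.0, 4.0, 6.0, 8.0],   # gene B, perfectly correlated with A
    [4.0, 3.0, 2.0, 1.0],   # gene C, perfectly anti-correlated with A
    [1.0, 3.0, 2.0, 1.0],   # gene D, only weakly correlated with the others
])

print(coexpression_degrees(expression))  # prints [2 2 2 0]
```

In this toy network genes A, B and C sit in a dense region (degree 2 each) while gene D is isolated, which is the kind of contrast the abstract describes between CR-downregulated and CR-upregulated genes.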
|
35
|
Chen G, Larsen P, Almasri E, Dai Y. Rank-based edge reconstruction for scale-free genetic regulatory networks. BMC Bioinformatics 2008; 9:75. [PMID: 18237422 PMCID: PMC2275249 DOI: 10.1186/1471-2105-9-75] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Received: 03/19/2007] [Accepted: 01/31/2008] [Indexed: 11/12/2022] Open
Abstract
Background The reconstruction of genetic regulatory networks from microarray gene expression data has been a challenging task in bioinformatics. Various approaches to this problem have been proposed; however, they do not take into account the topological characteristics of the targeted networks while reconstructing them. Results In this study, an algorithm that exploits the scale-free topology of networks was proposed based on the modification of a rank-based algorithm for network reconstruction. The new algorithm was evaluated with the use of both simulated and microarray gene expression data. The results demonstrated that the proposed algorithm outperforms the original rank-based algorithm. In addition, in comparison with the Bayesian Network approach, the results show that the proposed algorithm gives much better recovery of the underlying network when the sample size is much smaller than the number of genes. Conclusion The proposed algorithm is expected to be useful in the reconstruction of biological networks whose degree distributions follow the scale-free topology.
Affiliation(s)
- Guanrao Chen
- Department of Computer Science (MC152), University of Illinois at Chicago, 851 South Morgan Street, Chicago, IL 60607, USA.
|