1
Smith ML, Barrett ME. Development and validation of the Upstream Social Interaction Risk Scale (U-SIRS-13): a scale to assess threats to social connectedness among older adults. Front Public Health 2024; 12:1454847. [PMID: 39351036 PMCID: PMC11439676 DOI: 10.3389/fpubh.2024.1454847] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/25/2024] [Accepted: 09/02/2024] [Indexed: 10/04/2024] Open
Abstract
Background: Social interactions are essential to social connectedness among older adults. While many scales have been developed to measure various aspects of social connectedness, most are narrow in scope and may not be optimally encompassing, practical, or relevant for use with older adults across clinical and community settings. Efforts are needed to create more sensitive scales that can identify "upstream risk," which may facilitate timely referral and/or intervention. Objective: The purposes of this study were to: (1) develop and validate a brief scale to measure threats to social connectedness among older adults in the context of their social interactions; and (2) offer practical scoring and implementation recommendations for utilization in research and practice contexts. Methods: A sequential process was used to develop the initial instrument used in this study, which was then methodologically reduced to create a brief 13-item scale. Relevant, existing scales and measures were identified and compiled, which were then critically assessed by a combination of research and practice experts to optimize the pool of relevant items that assess threats to social connectedness while reducing potential redundancies. Then, a national sample of 4,082 older adults aged 60 years and older completed a web-based questionnaire containing the initial 36 items about social connection. Several data analysis methods were applied to assess the underlying dimensionality of the data and construct measures of different factors related to risk, including item response theory (IRT) modeling, clustering techniques, and structural equation modeling (SEM). Results: IRT modeling reduced the initial 36 items to create the 13-item Upstream Social Interaction Risk Scale (U-SIRS-13) with strong model fit. The dimensionality assessment using different clustering algorithms supported a 2-factor solution to classify risk. The SEM predicting highest risk items fit exceptionally well (RMSEA = 0.048; CFI = 0.954). For the 13-item scale, theta scores generated from IRT were strongly correlated with the summed count of items binarily identifying risk (r = 0.896, p < 0.001), thus supporting the use of practical scoring techniques for research and practice (Cronbach's alpha = 0.80). Conclusion: The U-SIRS-13 is a multidimensional scale with strong face, content, and construct validity. Findings support its practical utility to identify threats to social connectedness among older adults posed by limited physical opportunities for social interactions and lacking emotional fulfillment from social interactions.
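The practical scoring described in this abstract (a summed count of binary risk items, with internal consistency reported as Cronbach's alpha) can be illustrated with a minimal sketch. The response matrix below is invented for illustration and does not reproduce any U-SIRS-13 item or result; the function names are hypothetical.

```python
from statistics import pvariance


def summed_risk_score(row):
    """Summed count of items flagged as risk (each item coded 0 or 1)."""
    return sum(row)


def cronbach_alpha(rows):
    """Cronbach's alpha for a respondents-by-items matrix of numeric scores.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
    Population variances are used for both terms, so the ratio is unaffected
    by the choice between population and sample variance.
    """
    k = len(rows[0])
    item_vars = [pvariance([r[j] for r in rows]) for j in range(k)]
    total_var = pvariance([summed_risk_score(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical responses: 4 respondents, 3 binary items (not real study data).
responses = [
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
]
scores = [summed_risk_score(r) for r in responses]
alpha = cronbach_alpha(responses)
```

In this toy matrix every item agrees perfectly with the others, so alpha reaches its upper bound of 1; real questionnaire data would give a lower value.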
Affiliation(s)
- Matthew Lee Smith
  - Center for Community Health and Aging, Texas A&M University, College Station, TX, United States
  - Department of Health Behavior, School of Public Health, Texas A&M University, College Station, TX, United States
- Matthew E Barrett
  - Center for Community Health and Aging, Texas A&M University, College Station, TX, United States

2
Venn B, Leifeld T, Zhang P, Mühlhaus T. Temporal classification of short time series data. BMC Bioinformatics 2024; 25:30. [PMID: 38233793 PMCID: PMC10792935 DOI: 10.1186/s12859-024-05636-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2023] [Accepted: 01/03/2024] [Indexed: 01/19/2024] Open
Abstract
MOTIVATION: Within the frame of their genetic capacity, organisms are able to modify their molecular state to cope with changing environmental conditions or induced genetic disposition. As high throughput methods are becoming increasingly affordable, time series analysis techniques are applied frequently to study the complex dynamic interplay between genes, proteins, and metabolites at the physiological and molecular level. Common analysis approaches fail to simultaneously include (i) information about the replicate variance and (ii) the limited number of responses/shapes that a biological system is typically able to take. RESULTS: We present a novel approach to model and classify short time series signals, conceptually based on a classical time series analysis, where the dependency of the consecutive time points is exploited. Constrained spline regression with automated model selection separates noise from signal under the assumption that highly frequent changes are less likely to occur, while preserving information about the detected variance. This enables a more precise representation of the measured information and improves temporal classification in order to identify biologically interpretable correlations among the data. AVAILABILITY AND IMPLEMENTATION: An open source F# implementation of the presented method and documentation of its usage are freely available in the TempClass repository, https://github.com/CSBiology/TempClass [58].
Affiliation(s)
- Benedikt Venn
  - Computational Systems Biology, RPTU Kaiserslautern, 67663, Kaiserslautern, Germany
- Thomas Leifeld
  - Institute of Automatic Control, RPTU Kaiserslautern, 67663, Kaiserslautern, Germany
- Ping Zhang
  - Institute of Automatic Control, RPTU Kaiserslautern, 67663, Kaiserslautern, Germany
- Timo Mühlhaus
  - Computational Systems Biology, RPTU Kaiserslautern, 67663, Kaiserslautern, Germany

3
Till SE, Lu Y, Reinholz AK, Boos AM, Krych AJ, Okoroha KR, Camp CL. Artificial Intelligence Can Define and Predict the "Optimal Observed Outcome" After Anterior Shoulder Instability Surgery: An Analysis of 200 Patients With 11-Year Mean Follow-Up. Arthrosc Sports Med Rehabil 2023; 5:100773. [PMID: 37520500 PMCID: PMC10382895 DOI: 10.1016/j.asmr.2023.100773] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 03/20/2023] [Accepted: 06/14/2023] [Indexed: 08/01/2023] Open
Abstract
Purpose: The purpose of this study was to use unsupervised machine learning clustering to define the "optimal observed outcome" after surgery for anterior shoulder instability (ASI) and to identify predictors for achieving it. Methods: Medical records, images, and operative reports were reviewed for patients <40 years old undergoing surgery for ASI. Four unsupervised machine learning clustering algorithms partitioned subjects into "optimal observed outcome" or "suboptimal outcome" based on combinations of actually observed outcomes. Demographic, clinical, and treatment variables were compared between groups using descriptive statistics and Kaplan-Meier survival curves. Variables were assessed for prognostic value through multivariate stepwise logistic regression. Results: Two hundred patients with a mean follow-up of 11 years were included. Of these, 146 (64%) obtained the "optimal observed outcome," characterized by decreased: postoperative pain (23% vs 52%; P < 0.001), recurrent instability (12% vs 41%; P < 0.001), revision surgery (10% vs 24%; P = 0.015), osteoarthritis (OA) (5% vs 19%; P = 0.005), and restricted motion (161° vs 168°; P = 0.001). Forty-one percent of patients had a "perfect outcome," defined as ideal performance across all outcomes. Time from initial instability to presentation (odds ratio [OR] = 0.96; 95% confidence interval [CI], 0.92-0.98; P = 0.006) and habitual/voluntary instability (OR = 0.17; 95% CI, 0.04-0.77; P = 0.020) were negative predictors of achieving the "optimal observed outcome." A predilection toward subluxations rather than dislocations before surgery (OR = 1.30; 95% CI, 1.02-1.65; P = 0.030) was a positive predictor. Type of surgery performed was not a significant predictor. Conclusion: After surgery for ASI, 64% of patients achieved the "optimal observed outcome" defined as minimal postoperative pain, no recurrent instability or OA, low revision surgery rates, and increased range of motion, of whom only 41% achieved a "perfect outcome." Positive predictors were shorter time to presentation and predilection toward preoperative subluxations over dislocations. Level of Evidence: Retrospective cohort, level IV.
Affiliation(s)
- Christopher L. Camp
  - Address correspondence to Christopher L. Camp, M.D., Mayo Clinic, Department of Orthopedic Surgery, 200 First St. SW, Rochester, MN 55905, U.S.A.

4
Demir Karaman E, Işık Z. Multi-Omics Data Analysis Identifies Prognostic Biomarkers across Cancers. Med Sci (Basel) 2023; 11:44. [PMID: 37489460 PMCID: PMC10366886 DOI: 10.3390/medsci11030044] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2023] [Revised: 06/18/2023] [Accepted: 06/20/2023] [Indexed: 07/26/2023] Open
Abstract
Combining omics data from different layers using integrative methods provides a better understanding of the biology of a complex disease such as cancer. The discovery of biomarkers related to cancer development or prognosis helps to find more effective treatment options. This study integrates multi-omics data of different cancer types with a network-based approach to explore common gene modules among different tumors by running community detection methods on the integrated network. The common modules were evaluated by several biological metrics adapted to cancer. Then, a new prognostic scoring method was developed by weighting mRNA expression, methylation, and mutation status of genes. The survival analysis yielded statistically significant results for the GNG11, CBX2, CDKN3, ARHGEF10, CLN8, SEC61G and PTDSS1 genes. The literature search reveals that the identified biomarkers are associated with the same or different types of cancers. Our method not only identifies known cancer-specific biomarker genes, but also proposes new potential biomarkers. Thus, this study provides a rationale for identifying new gene targets and expanding treatment options across cancer types.
Affiliation(s)
- Ezgi Demir Karaman
  - Department of Computer Engineering, Institute of Natural and Applied Sciences, Dokuz Eylul University, Izmir 35390, Turkey
- Zerrin Işık
  - Department of Computer Engineering, Faculty of Engineering, Dokuz Eylul University, Izmir 35390, Turkey

5
Smith RN, Rosales IA, Tomaszewski KT, Mahowald GT, Araujo-Medina M, Acheampong E, Bruce A, Rios A, Otsuka T, Tsuji T, Hotta K, Colvin R. Utility of Banff Human Organ Transplant Gene Panel in Human Kidney Transplant Biopsies. Transplantation 2023; 107:1188-1199. [PMID: 36525551 PMCID: PMC10132999 DOI: 10.1097/tp.0000000000004389] [Citation(s) in RCA: 11] [Impact Index Per Article: 11.0] [Indexed: 12/23/2022]
Abstract
BACKGROUND: Microarray transcript analysis of human renal transplantation biopsies has successfully identified the many patterns of graft rejection. To evaluate an alternative, this report tests whether gene expression from the Banff Human Organ Transplant (B-HOT) probe set panel, derived from validated microarrays, can identify the relevant allograft diagnoses directly from archival human renal transplant formalin-fixed paraffin-embedded biopsies. To test this hypothesis, principal components (PCs) of gene expression were used to identify allograft diagnoses, to classify diagnoses, and to determine whether the PC data were rich enough to identify diagnostic subtypes by clustering, all of which are needed if the B-HOT panel is to substitute for microarrays. METHODS: RNA was isolated from routine, archival formalin-fixed paraffin-embedded tissue renal biopsy cores with both rejection and nonrejection diagnoses. The B-HOT panel expression of 770 genes was analyzed by PCs, which were then tested to determine their ability to identify diagnoses. RESULTS: PCs of microarray gene sets identified the Banff categories of renal allograft diagnoses and modeled the aggregate diagnoses well, showing a correspondence with the pathologic diagnoses similar to that of microarrays. Clustering of the PCs identified diagnostic subtypes including non-chronic antibody-mediated rejection with high endothelial expression. PCs of cell types and pathways identified new mechanistic patterns including differential expression of B and plasma cells. CONCLUSIONS: Using PCs of gene expression from the B-HOT panel confirms the utility of the B-HOT panel to identify allograft diagnoses, with performance similar to microarrays. The B-HOT panel will accelerate and expand transcript analysis and will be useful for longitudinal and outcome studies.
Affiliation(s)
- Rex N Smith
  - Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA
  - Center for Transplantation Sciences, Massachusetts General Hospital, Boston, MA
- Ivy A Rosales
  - Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA
  - Center for Transplantation Sciences, Massachusetts General Hospital, Boston, MA
- Kristen T Tomaszewski
  - Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA
  - Center for Transplantation Sciences, Massachusetts General Hospital, Boston, MA
- Grace T Mahowald
  - Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA
- Milagros Araujo-Medina
  - Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA
- Ellen Acheampong
  - Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA
- Amy Bruce
  - Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA
- Andrea Rios
  - Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA
- Takuya Otsuka
  - Department of Surgical Pathology, Hokkaido University Hospital, Sapporo, Japan
- Takahiro Tsuji
  - Department of Pathology, Sapporo City General Hospital, Sapporo, Japan
- Kiyohiko Hotta
  - Department of Urology, Hokkaido University Hospital, Sapporo, Japan
- Robert Colvin
  - Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA
  - Center for Transplantation Sciences, Massachusetts General Hospital, Boston, MA

6
Dovrou A, Bei E, Sfakianakis S, Marias K, Papanikolaou N, Zervakis M. Synergies of Radiomics and Transcriptomics in Lung Cancer Diagnosis: A Pilot Study. Diagnostics (Basel) 2023; 13:738. [PMID: 36832225 PMCID: PMC9955510 DOI: 10.3390/diagnostics13040738] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 01/30/2023] [Revised: 02/10/2023] [Accepted: 02/10/2023] [Indexed: 02/17/2023] Open
Abstract
Radiotranscriptomics is an emerging field that aims to investigate the relationships between the radiomic features extracted from medical images and gene expression profiles that contribute to the diagnosis, treatment planning, and prognosis of cancer. This study proposes a methodological framework for the investigation of these associations with application to non-small-cell lung cancer (NSCLC). Six publicly available NSCLC datasets with transcriptomics data were used to derive and validate a transcriptomic signature for its ability to differentiate between cancer and non-malignant lung tissue. A publicly available dataset of 24 NSCLC-diagnosed patients, with both transcriptomic and imaging data, was used for the joint radiotranscriptomic analysis. For each patient, 749 Computed Tomography (CT) radiomic features were extracted and the corresponding transcriptomics data were provided through DNA microarrays. The radiomic features were clustered using the iterative K-means algorithm resulting in 77 homogeneous clusters, represented by meta-radiomic features. The most significant differentially expressed genes (DEGs) were selected by performing Significance Analysis of Microarrays (SAM) and 2-fold change. The interactions among the CT imaging features and the selected DEGs were investigated using SAM and a Spearman rank correlation test with a False Discovery Rate (FDR) of 5%, leading to the extraction of 73 DEGs significantly correlated with radiomic features. These genes were used to produce predictive models of the meta-radiomics features, defined as p-metaomics features, by performing Lasso regression. Of the 77 meta-radiomic features, 51 can be modeled in terms of the transcriptomic signature. These significant radiotranscriptomics relationships form a reliable basis to biologically justify the radiomics features extracted from anatomic imaging modalities. Thus, the biological value of these radiomic features was justified via enrichment analysis on their transcriptomics-based regression models, revealing closely associated biological processes and pathways. Overall, the proposed methodological framework provides joint radiotranscriptomics markers and models to support the connection and complementarities between the transcriptome and the phenotype in cancer, as demonstrated in the case of NSCLC.
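The abstract reports correlation tests controlled at a False Discovery Rate of 5% but does not name the correction procedure. As an illustration only, the sketch below assumes the widely used Benjamini-Hochberg step-up procedure; the p-values are invented and the function name is hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return one boolean per p-value: True if rejected at the given FDR.

    Assumed procedure (not confirmed by the source): Benjamini-Hochberg.
    Sort the m p-values, find the largest rank k with p_(k) <= (k / m) * fdr,
    and reject every hypothesis ranked 1..k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values from feature-gene correlation tests (not study data).
example = benjamini_hochberg([0.001, 0.04, 0.03, 0.5], fdr=0.05)
```

Here only the smallest p-value survives: 0.03 and 0.04 are below 0.05 individually but exceed their rank-scaled thresholds of 0.025 and 0.0375.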
Affiliation(s)
- Aikaterini Dovrou
  - Digital Image and Signal Processing Laboratory, School of Electrical and Computer Engineering (ECE), Technical University of Crete, GR-73100 Chania, Greece
- Ekaterini Bei
  - Digital Image and Signal Processing Laboratory, School of Electrical and Computer Engineering (ECE), Technical University of Crete, GR-73100 Chania, Greece
- Stelios Sfakianakis
  - Computational BioMedicine Laboratory, Institute of Computer Science, Foundation for Research and Technology-Hellas, GR-70013 Heraklion, Greece
- Kostas Marias
  - Computational BioMedicine Laboratory, Institute of Computer Science, Foundation for Research and Technology-Hellas, GR-70013 Heraklion, Greece
  - Department of Electrical and Computer Engineering, Hellenic Mediterranean University, GR-71410 Heraklion, Greece
- Nickolas Papanikolaou
  - Computational Clinical Imaging Group, Champalimaud Clinical Centre, Champalimaud Foundation, Avenida Brasilia, 1400-038 Lisbon, Portugal
- Michalis Zervakis
  - Digital Image and Signal Processing Laboratory, School of Electrical and Computer Engineering (ECE), Technical University of Crete, GR-73100 Chania, Greece

7
Esnault C, Rollot M, Guilmin P, Zucker JD. Qluster: An easy-to-implement generic workflow for robust clustering of health data. Front Artif Intell 2023; 5:1055294. [PMID: 36814808 PMCID: PMC9939832 DOI: 10.3389/frai.2022.1055294] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2022] [Accepted: 12/22/2022] [Indexed: 02/08/2023] Open
Abstract
The exploration of health data by clustering algorithms allows populations of interest to be better described by seeking the sub-profiles that compose them. This in turn reinforces medical knowledge, whether it is about a disease or a targeted population in real life. Nevertheless, contrary to the so-called conventional biostatistical methods where numerous guidelines exist, the standardization of data science approaches in clinical research remains a little-discussed subject. This results in significant variability in the execution of data science projects, whether in terms of the algorithms used or the reliability and credibility of the designed approach. Taking the path of parsimonious and judicious choice of both algorithms and implementations at each stage, this article proposes Qluster, a practical workflow for performing clustering tasks. This workflow makes a compromise between (1) genericity of applications (e.g. usable on small or big data, on continuous, categorical or mixed variables, on databases of high dimensionality or not), (2) ease of implementation (need for few packages, few algorithms, few parameters, ...), and (3) robustness (e.g. use of proven algorithms and robust packages, evaluation of the stability of clusters, management of noise and multicollinearity). This workflow can be easily automated and/or routinely applied on a wide range of clustering projects. It can be useful both for data scientists with little experience in the field to make data clustering easier and more robust, and for more experienced data scientists who are looking for a straightforward and reliable solution to routinely perform preliminary data mining. A synthesis of the literature on data clustering as well as the scientific rationale supporting the proposed workflow is also provided. Finally, a detailed application of the workflow on a concrete use case is provided, along with a practical discussion for data scientists. An implementation on the Dataiku platform is available upon request from the authors.
Affiliation(s)
- Jean-Daniel Zucker
  - Sorbonne University, IRD, UMMISCO, Bondy, France
  - Sorbonne University, INSERM, NUTRIOMICS, Paris, France

8
Isik Z, Leblebici A, Demir Karaman E, Karaca C, Ellidokuz H, Koc A, Ellidokuz EB, Basbinar Y. In silico identification of novel biomarkers for key players in transition from normal colon tissue to adenomatous polyps. PLoS One 2022; 17:e0267973. [PMID: 35486660 PMCID: PMC9053805 DOI: 10.1371/journal.pone.0267973] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/04/2021] [Accepted: 04/19/2022] [Indexed: 11/18/2022] Open
Abstract
Adenomatous polyps of the colon are the most common neoplastic polyps. Although most adenomatous polyps do not undergo malignant transformation, the majority of colorectal carcinomas originate from neoplastic polyps. Therefore, understanding this transformation process would help in both preventive therapies and the evaluation of malignancy risk. This study uncovers alterations in gene expression as potential biomarkers, revealed by integrating several network-based approaches. In silico analysis was performed on a unified microarray cohort covering 150 normal colon and adenomatous polyp samples. Significant gene modules were obtained by weighted gene co-expression network analysis. Gene modules with similar profiles were mapped to a colon tissue-specific functional interaction network. Several clustering algorithms were run on the colon-specific network, and the most significant sub-modules between the clusters were identified. The biomarkers were selected by filtering differentially expressed genes that are also involved in significant biological processes and pathways. Biomarkers were also validated on two independent datasets based on their differential gene expression. To the best of our knowledge, such a cascaded network analysis pipeline was implemented for the first time on a large collection of normal colon and polyp samples. We identified significant increases in TLR4 and MSX1 expression as well as a decrease in chemokine profiles with mostly pro-tumoral activities. These biomarkers might serve as both preventive targets and biomarkers for risk evaluation. As a result, this research proposes novel molecular markers that might be an alternative to endoscopic approaches for the diagnosis of adenomatous polyps.
Affiliation(s)
- Zerrin Isik
  - Faculty of Engineering, Department of Computer Engineering, Dokuz Eylul University, Izmir, Turkey
- Asım Leblebici
  - Department of Translational Oncology, Institute of Health Sciences, Dokuz Eylul University, Izmir, Turkey
- Ezgi Demir Karaman
  - Department of Computer Engineering, Institute of Natural and Applied Sciences, Dokuz Eylul University, Izmir, Turkey
- Caner Karaca
  - Department of Translational Oncology, Institute of Health Sciences, Dokuz Eylul University, Izmir, Turkey
- Hulya Ellidokuz
  - Department of Preventive Oncology, Institute of Oncology, Dokuz Eylul University, Izmir, Turkey
- Altug Koc
  - Gentan Genetic Medical Genetics Diagnosis Center, Izmir, Turkey
- Ender Berat Ellidokuz
  - Faculty of Medicine, Department of Gastroenterology, Dokuz Eylul University, Izmir, Turkey
- Yasemin Basbinar
  - Department of Translational Oncology, Institute of Oncology, Dokuz Eylul University, Izmir, Turkey

9
Identifying large scale interaction atlases using probabilistic graphs and external knowledge. J Clin Transl Sci 2022; 6:e27. [PMID: 35321220 PMCID: PMC8922291 DOI: 10.1017/cts.2022.18] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2021] [Revised: 12/29/2021] [Accepted: 02/07/2022] [Indexed: 11/17/2022] Open
Abstract
Introduction: Reconstruction of gene interaction networks from experimental data provides a deep understanding of the underlying biological mechanisms. The noisy nature of the data and the large size of the network make this a very challenging task. Complex approaches handle the stochastic nature of the data but can only do this for small networks; simpler, linear models generate large networks but with less reliability. Methods: We propose a divide-and-conquer approach using probabilistic graph representations and external knowledge. We cluster the experimental data and learn an interaction network for each cluster, which are merged using the interaction network for the representative genes selected for each cluster. Results: We generated an interaction atlas for 337 human pathways yielding a network of 11,454 genes with 17,777 edges. Simulated gene expression data from this atlas formed the basis for reconstruction. Based on the area under the curve of the precision-recall curve, the proposed approach outperformed the baseline (random classifier) by ∼15-fold and conventional methods by ∼5–17-fold. The performance of the proposed workflow is significantly linked to the accuracy of the clustering step that tries to identify the modularity of the underlying biological mechanisms. Conclusions: We provide an interaction atlas generation workflow optimizing the algorithm/parameter selection. The proposed approach integrates external knowledge in the reconstruction of the interactome using probabilistic graphs. Network characterization and understanding long-range effects in interaction atlases provide means for comparative analysis with implications in biomarker discovery and therapeutic approaches. The proposed workflow is freely available at http://otulab.unl.edu/atlas.
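The evaluation described in this abstract compares the area under the precision-recall curve against a random-classifier baseline. A minimal sketch of that comparison follows, using average precision as one common estimator of that area (an assumption, since the abstract does not state the estimator); the scores, labels, and function names are invented for illustration.

```python
def average_precision(scores, labels):
    """Average precision over ranked predictions (an AUPRC estimator).

    scores: predicted confidence per candidate edge (higher = more likely).
    labels: 1 if the edge exists in the reference network, else 0.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    total_pos = sum(labels)
    hits = 0
    ap = 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y:
            hits += 1
            ap += hits / k  # precision at each recovered true edge
    return ap / total_pos


def random_baseline(labels):
    """Expected precision of a random classifier: the fraction of true edges."""
    return sum(labels) / len(labels)


# Hypothetical candidate edges (not data from the study).
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
ap = average_precision(scores, labels)
fold_over_random = ap / random_baseline(labels)
```

Because true edges are rare in large networks, the random baseline is typically small, which is why improvements are often reported as fold changes over it.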
10
Using Simulated Pest Models and Biological Clustering Validation to Improve Zoning Methods in Site-Specific Pest Management. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12041900] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/27/2023]
Abstract
Site-specific pest management (SSPM) is a component of precision agriculture that relies on spatially enabled agronomic data to facilitate pest control practices within management zones rather than whole fields. Recent integration of high-resolution environmental data, multivariate clustering algorithms, and species distribution modeling has facilitated the development of a novel approach to SSPM that bases zone delineation on environmentally independent subfield units with individual potential to host pest populations (eSSPM). Although the potential benefits of eSSPM are clear, methods currently described for its implementation still demand further evaluation. To offer clear insight into this matter, we used field-level environmental data from a Tahiti lime orchard and realistic simulations of six citrus pests to: (1) generate a series of virtual (i.e., controlled) infestation scenarios suitable for methodological testing purposes, (2) evaluate the utility of nested (i.e., within-cluster) partitioning essays to improve the accuracy of current eSSPM methods, and (3) implement two biological clustering validators to evaluate the performance of 10 clustering algorithms and choose appropriate numbers of management zones during field partitioning essays. Our results demonstrate that: (1) nested partitioning essays outperform zoning methods previously described in eSSPM, (2) more than one clustering algorithm tends to be necessary to generate field partition models that optimize site-specific pest control practices within crop fields, and (3) biological clustering validation is an essential addition to eSSPM zoning methods. Finally, the generated evidence was integrated into an improved workflow for within-field zone delineation for pest control purposes.
11
Tenekeci S, Isik Z. Integrative Biological Network Analysis to Identify Shared Genes in Metabolic Disorders. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:522-530. [PMID: 32396100 DOI: 10.1109/tcbb.2020.2993301] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
Identification of common molecular mechanisms in interrelated diseases is essential for better prognoses and targeted therapies. However, the complexity of metabolic pathways makes it difficult to discover common disease genes underlying metabolic disorders, and doing so requires more sophisticated bioinformatics models that combine different types of biological data and computational methods. Accordingly, we built an integrative network analysis model to identify shared disease genes in metabolic syndrome (MS), type 2 diabetes (T2D), and coronary artery disease (CAD). We constructed weighted gene co-expression networks by combining gene expression, protein-protein interaction, and gene ontology data from multiple sources. For 90 different configurations of disease networks, we detected the significant modules by using MCL, SPICi, and Linkcomm graph clustering algorithms. We also performed a comparative evaluation on disease modules to determine the best method providing the highest biological validity. By overlapping the disease modules, we identified 22 shared genes for MS-CAD and T2D-CAD. Moreover, 19 of these genes were directly or indirectly associated with relevant diseases in previous medical studies. This study not only demonstrates the performance of different biological data sources and computational methods in disease-gene discovery, but also offers potential insights into common genetic mechanisms of the metabolic disorders.
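The module-overlap step mentioned in this abstract can be pictured as a set intersection over the genes contained in each disease's modules. The sketch below is a simplified reading of that step, not the authors' pipeline; the gene identifiers are placeholders and the function name is hypothetical.

```python
def shared_genes(modules_a, modules_b):
    """Genes that appear in at least one module of each disease network.

    modules_a, modules_b: iterables of sets of gene identifiers, one set per
    detected module.
    """
    genes_a = set().union(*modules_a)
    genes_b = set().union(*modules_b)
    return genes_a & genes_b


# Placeholder module contents (hypothetical identifiers, not study results).
disease_1_modules = [{"GENE_A", "GENE_B"}, {"GENE_C"}]
disease_2_modules = [{"GENE_B", "GENE_D"}, {"GENE_C", "GENE_E"}]
overlap = shared_genes(disease_1_modules, disease_2_modules)
```

A stricter variant could require the shared genes to co-occur within a single pair of modules; the abstract does not say which definition was used.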
12
Nowak J, Eng RC, Matz T, Waack M, Persson S, Sampathkumar A, Nikoloski Z. A network-based framework for shape analysis enables accurate characterization of leaf epidermal cells. Nat Commun 2021; 12:458. [PMID: 33469016 PMCID: PMC7815848 DOI: 10.1038/s41467-020-20730-y] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 03/27/2020] [Accepted: 12/17/2020] [Indexed: 01/29/2023] Open
Abstract
Cell shape is crucial for the function and development of organisms. Yet, versatile frameworks for cell shape quantification, comparison, and classification remain underdeveloped. Here, we introduce a visibility graph representation of shapes that facilitates network-driven characterization and analyses across shapes encountered in different domains. Using the example of complex shape of leaf pavement cells, we show that our framework accurately quantifies cell protrusions and invaginations and provides additional functionality in comparison to the contending approaches. We further show that structural properties of the visibility graphs can be used to quantify pavement cell shape complexity and allow for classification of plants into their respective phylogenetic clades. Therefore, the visibility graphs provide a robust and unique framework to accurately quantify and classify the shape of different objects.
Affiliation(s)
- Jacqueline Nowak
- School of Biosciences, University of Melbourne, Parkville, VIC, 3010, Australia
- Bioinformatics, Institute of Biochemistry and Biology, University of Potsdam, 14476, Potsdam, Germany
- Systems Biology and Mathematical Modelling, Max Planck Institute of Molecular Plant Physiology, 14476, Potsdam, Germany
| | - Ryan Christopher Eng
- Plant Cell Biology and Microscopy, Max Planck Institute of Molecular Plant Physiology, 14476, Potsdam, Germany
| | - Timon Matz
- Bioinformatics, Institute of Biochemistry and Biology, University of Potsdam, 14476, Potsdam, Germany
- Systems Biology and Mathematical Modelling, Max Planck Institute of Molecular Plant Physiology, 14476, Potsdam, Germany
| | - Matti Waack
- Bioinformatics, Institute of Biochemistry and Biology, University of Potsdam, 14476, Potsdam, Germany
- Systems Biology and Mathematical Modelling, Max Planck Institute of Molecular Plant Physiology, 14476, Potsdam, Germany
| | - Staffan Persson
- School of Biosciences, University of Melbourne, Parkville, VIC, 3010, Australia
- Joint International Research Laboratory of Metabolic & Developmental Sciences, State Key Laboratory of Hybrid Rice, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
- Department for Plant and Environmental Sciences, University of Copenhagen, 1871, Frederiksberg C, Denmark
- Copenhagen Plant Science Center, University of Copenhagen, 1871, Frederiksberg C, Denmark
| | - Arun Sampathkumar
- Plant Cell Biology and Microscopy, Max Planck Institute of Molecular Plant Physiology, 14476, Potsdam, Germany
| | - Zoran Nikoloski
- Bioinformatics, Institute of Biochemistry and Biology, University of Potsdam, 14476, Potsdam, Germany.
- Systems Biology and Mathematical Modelling, Max Planck Institute of Molecular Plant Physiology, 14476, Potsdam, Germany.
|
13
|
Parraga-Alava J, Inostroza-Ponta M. Influence of the GO-based semantic similarity measures in multi-objective gene clustering algorithm performance. J Bioinform Comput Biol 2020; 18:2050038. [PMID: 33148094 DOI: 10.1142/s0219720020500389] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Using prior biological knowledge of gene relationships and functions, drawn from repositories such as the Gene Ontology (GO), to measure gene similarity has shown good results in multi-objective gene clustering algorithms. In this scenario, and to obtain useful clustering results, it would be helpful to know which measure of biological similarity between genes should be employed to yield meaningful clusters that have both similar expression patterns (co-expression) and biological homogeneity. In this paper, we studied the influence of the four most used GO-based semantic similarity measures on the performance of a multi-objective gene clustering algorithm. We used four publicly available datasets and carried out comparative studies based on performance metrics from the multi-objective optimization field and on clustering performance indexes. In most cases, the Jiang-Conrath and Wang similarities stand out in terms of multi-objective metrics. In terms of clustering properties, the Resnik similarity achieves the best values of compactness and separation, and therefore of co-expression of groups of genes. Meanwhile, in terms of biological homogeneity, the Wang similarity yields a greater number of significant GO terms. However, statistical, visual, and biological significance tests showed that none of the GO-based semantic similarity measures stands out above the rest in significantly improving the performance of the multi-objective gene clustering algorithm.
Affiliation(s)
- Jorge Parraga-Alava
- Facultad de Ciencias Informáticas, Universidad Técnica de Manabí, Avenida José María Urbina, Portoviejo 130105, Ecuador
| | - Mario Inostroza-Ponta
- Departamento de Ingeniería Informática, Universidad de Santiago de Chile, Avenida Libertador General Bernardo O'Higgins, Santiago 9170020, Chile
|
14
|
Kang K, Kim HH, Choi Y. Tiotropium is Predicted to be a Promising Drug for COVID-19 Through Transcriptome-Based Comprehensive Molecular Pathway Analysis. Viruses 2020; 12:E776. [PMID: 32698440 PMCID: PMC7412475 DOI: 10.3390/v12070776] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 06/22/2020] [Revised: 07/10/2020] [Accepted: 07/17/2020] [Indexed: 12/12/2022] Open
Abstract
The coronavirus disease 2019 (COVID-19) outbreak caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) affects almost everyone in the world in many ways. We previously predicted antivirals (atazanavir, remdesivir and lopinavir/ritonavir) and non-antiviral drugs (tiotropium and rapamycin) that may inhibit the replication complex of SARS-CoV-2 using our molecular transformer-drug target interaction (MT-DTI) deep-learning-based drug-target affinity prediction model. In this study, we dissected molecular pathways upregulated in SARS-CoV-2-infected normal human bronchial epithelial (NHBE) cells by analyzing an RNA-seq data set with various bioinformatics approaches, such as gene ontology, protein-protein interaction-based network and gene set enrichment analyses. The results indicated that the SARS-CoV-2 infection strongly activates TNF and NFκB-signaling pathways through significant upregulation of the TNF, IL1B, IL6, IL8, NFKB1, NFKB2 and RELB genes. In addition to these pathways, lung fibrosis, keratinization/cornification, rheumatoid arthritis, and negative regulation of interferon-gamma production pathways were also significantly upregulated. We observed that these pathologic features of SARS-CoV-2 are similar to those observed in patients with chronic obstructive pulmonary disease (COPD). Intriguingly, tiotropium, as predicted by MT-DTI, is currently used as a therapeutic intervention in COPD patients. Treatment with tiotropium has been shown to improve pulmonary function by alleviating airway inflammation. Accordingly, a literature search summarized that tiotropium reduced expressions of IL1B, IL6, IL8, RELA, NFKB1 and TNF in vitro or in vivo, and many of them have been known to be deregulated in COPD patients. These results suggest that COVID-19 is similar to an acute mode of COPD caused by the SARS-CoV-2 infection, and therefore tiotropium may be effective for COVID-19 patients.
Affiliation(s)
- Keunsoo Kang
- Department of Microbiology, College of Science & Technology, Dankook University, Cheonan 31116, Korea;
| | - Hoo Hyun Kim
- Department of Microbiology, College of Science & Technology, Dankook University, Cheonan 31116, Korea;
| | - Yoonjung Choi
- Deargen Inc., Daejeon, Yuseong-gu, Munji-dong 103-6, Korea
|
15
|
Dutta P, Saha S, Pai S, Kumar A. A Protein Interaction Information-based Generative Model for Enhancing Gene Clustering. Sci Rep 2020; 10:665. [PMID: 31959782 PMCID: PMC6971242 DOI: 10.1038/s41598-020-57437-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 11/02/2019] [Accepted: 12/20/2019] [Indexed: 11/18/2022] Open
Abstract
In computational bioinformatics, identifying the set of genes responsible for a particular cellular mechanism is essential for tasks such as medical diagnosis and disease gene identification. Accurately grouping (clustering) genes is an important step in understanding the functions of disease genes. In this regard, ensemble clustering is a promising approach that combines different clustering solutions to produce a nearly accurate gene partitioning. Recently, researchers have used generative models as a smart ensemble method to produce a consensus solution. In the current paper, we develop a protein-protein interaction-based generative model that can efficiently perform gene clustering. Using protein interaction information as the generative model's latent variable enhances the model's efficiency in inferring the final probabilistic labels. The proposed generative model uses different weak supervision sources rather than any ground truth information. As weak supervision sources, we use a multi-objective optimization based clustering technique together with the Gene Ontology Consortium (GOC) knowledge base, the world's largest gene ontology based knowledge base. These weakly supervised labels are supplied to a generative model that eventually assigns probabilistic labels to all genes. A comparative study with respect to silhouette score, Biological Homogeneity Index (BHI), and Biological Stability Index (BSI) shows that the proposed generative model outperforms other state-of-the-art techniques.
Affiliation(s)
- Pratik Dutta
- Department of Computer Science and Engineering, Indian Institute of Technology Patna, Bihta, 801103, India.
| | - Sriparna Saha
- Department of Computer Science and Engineering, Indian Institute of Technology Patna, Bihta, 801103, India
| | - Sanket Pai
- Department of Chemical Science and Technology, Indian Institute of Technology Patna, Bihta, 801103, India
| | - Aviral Kumar
- Department of Chemical Science and Technology, Indian Institute of Technology Patna, Bihta, 801103, India
|
16
|
Lu Y, Phillips CA, Langston MA. A robustness metric for biological data clustering algorithms. BMC Bioinformatics 2019; 20:503. [PMID: 31874625 PMCID: PMC6929270 DOI: 10.1186/s12859-019-3089-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 08/23/2019] [Accepted: 09/10/2019] [Indexed: 02/05/2023] Open
Abstract
BACKGROUND Cluster analysis is a core task in modern data-centric computation. Algorithmic choice is driven by factors such as data size and heterogeneity, the similarity measures employed, and the type of clusters sought. Familiarity and mere preference often play a significant role as well. Comparisons between clustering algorithms tend to focus on cluster quality. Such comparisons are complicated by the fact that algorithms often have multiple settings that can affect the clusters produced. Such a setting may represent, for example, a preset variable, a parameter of interest, or various sorts of initial assignments. A question of interest then is this: to what degree do the clusters produced vary as setting values change? RESULTS This work introduces a new metric, termed simply "robustness", designed to answer that question. Robustness is an easily-interpretable measure of the propensity of a clustering algorithm to maintain output coherence over a range of settings. The robustness of eleven popular clustering algorithms is evaluated over some two dozen publicly available mRNA expression microarray datasets. Given their straightforwardness and predictability, hierarchical methods generally exhibited the highest robustness on most datasets. Of the more complex strategies, the paraclique algorithm yielded consistently higher robustness than other algorithms tested, approaching and even surpassing hierarchical methods on several datasets. Other techniques exhibited mixed robustness, with no clear distinction between them. CONCLUSIONS Robustness provides a simple and intuitive measure of the stability and predictability of a clustering algorithm. It can be a useful tool to aid both in algorithm selection and in deciding how much effort to devote to parameter tuning.
Affiliation(s)
- Yuping Lu
- Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, 37996 TN USA
| | - Charles A. Phillips
- Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, 37996 TN USA
| | - Michael A. Langston
- Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, 37996 TN USA
|
17
|
Alzheimer's disease clinical variants show distinct regional patterns of neurofibrillary tangle accumulation. Acta Neuropathol 2019; 138:597-612. [PMID: 31250152 DOI: 10.1007/s00401-019-02036-6] [Citation(s) in RCA: 64] [Impact Index Per Article: 12.8] [Received: 03/27/2019] [Revised: 06/16/2019] [Accepted: 06/16/2019] [Indexed: 10/26/2022]
Abstract
The clinical spectrum of Alzheimer's disease (AD) extends well beyond the classic amnestic-predominant syndrome. The previous studies have suggested differential neurofibrillary tangle (NFT) burden between amnestic and logopenic primary progressive aphasia presentations of AD. In this study, we explored the regional distribution of NFT pathology and its relationship to AD presentation across five different clinical syndromes. We assessed NFT density throughout six selected neocortical and hippocampal regions using thioflavin-S fluorescent microscopy in a well-characterized clinicopathological cohort of pure AD cases enriched for atypical clinical presentations. Subjects underwent apolipoprotein E genotyping and neuropsychological testing. Main cognitive domains (executive, visuospatial, language, and memory function) were assessed using an established composite z score. Our results showed that NFT regional burden aligns with the clinical presentation and region-specific cognitive scores. Cortical, but not hippocampal, NFT burden was higher among atypical clinical variants relative to the amnestic syndrome. In analyses of specific clinical variants, logopenic primary progressive aphasia showed higher NFT density in the superior temporal gyrus (p = 0.0091), and corticobasal syndrome showed higher NFT density in the primary motor cortex (p = 0.0205) relative to the amnestic syndrome. Higher NFT burden in the angular gyrus and CA1 sector of the hippocampus were independently associated with worsening visuospatial dysfunction. In addition, unbiased hierarchical clustering based on regional NFT densities identified three groups characterized by a low overall NFT burden, high overall burden, and cortical-predominant burden, respectively, which were found to differ in sex ratio, age, disease duration, and clinical presentation. 
In comparison, the typical, hippocampal sparing, and limbic-predominant subtypes derived from a previously proposed algorithm did not reproduce the same degree of clinical relevance in this sample. Overall, our results suggest domain-specific functional consequences of regional NFT accumulation. Mapping these consequences presents an opportunity to increase understanding of the neuropathological framework underlying atypical clinical manifestations.
|
18
|
Barido-Sottani J, Chapman SD, Kosman E, Mushegian AR. Measuring similarity between gene interaction profiles. BMC Bioinformatics 2019; 20:435. [PMID: 31438841 PMCID: PMC6704681 DOI: 10.1186/s12859-019-3024-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 10/29/2018] [Accepted: 08/09/2019] [Indexed: 11/14/2022] Open
Abstract
Background Gene and protein interaction data are often represented as interaction networks, where nodes stand for genes or gene products and each edge stands for a relationship between a pair of gene nodes. Commonly, that relationship within a pair is specified by high similarity between profiles (vectors) of experimentally defined interactions of each of the two genes with all other genes in the genome; only gene pairs that interact with similar sets of genes are linked by an edge in the network. The tight groups of genes/gene products that work together in a cell can be discovered by the analysis of those complex networks. Results We show that the choice of the similarity measure between pairs of gene vectors impacts the properties of networks and of gene modules detected within them. We re-analyzed well-studied data on yeast genetic interactions, constructed four genetic networks using four different similarity measures, and detected gene modules in each network using the same algorithm. The four networks induced different numbers of putative functional gene modules, and each similarity measure induced some unique modules. In an example of a putative functional connection suggested by comparing genetic interaction vectors, we predict a link between SUN-domain proteins and protein glycosylation in the endoplasmic reticulum. Conclusions The discovery of molecular modules in genetic networks is sensitive to the way of measuring similarity between profiles of gene interactions in a cell. In the absence of a formal way to choose the “best” measure, it is advisable to explore the measures with different mathematical properties, which may identify different sets of connections between genes. Electronic supplementary material The online version of this article (10.1186/s12859-019-3024-x) contains supplementary material, which is available to authorized users.
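The central point of this abstract, that the similarity measure chosen for interaction profiles changes which gene pairs are linked, can be illustrated with a minimal sketch. The abstract does not name the four measures used in the paper, so Pearson correlation and cosine similarity are assumed here purely for illustration; the gene names, the toy profiles, the 0.6 threshold, and the `build_edges` helper are likewise hypothetical.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equal-length profiles."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su and sv else 0.0


def cosine(u, v):
    """Cosine similarity between two equal-length profiles."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def build_edges(profiles, measure, threshold):
    """Link every gene pair whose profile similarity reaches the threshold."""
    genes = sorted(profiles)
    return {
        (g, h)
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
        if measure(profiles[g], profiles[h]) >= threshold
    }


# Toy interaction profiles (hypothetical values, not data from the paper).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 3.0, 4.0, 5.0],  # geneA shifted by a constant
    "geneC": [4.0, 3.0, 2.0, 1.0],  # geneA reversed
}

pearson_edges = build_edges(profiles, pearson, 0.6)
cosine_edges = build_edges(profiles, cosine, 0.6)
print("Pearson network:", sorted(pearson_edges))
print("Cosine network: ", sorted(cosine_edges))
```

On these toy profiles, Pearson correlation links only geneA and geneB, because geneC is anti-correlated with both, while cosine similarity links all three pairs because every profile is strictly positive. The same data therefore yield two different networks at the same threshold.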
Affiliation(s)
- Joëlle Barido-Sottani
- Stowers Institute for Medical Research, Kansas City, MO, USA
- École Polytechnique, Route de Saclay, Palaiseau, France
- Present Address: Department of Ecology, Evolution and Organismal Biology, Iowa State University, Ames, Iowa, USA
| | - Samuel D Chapman
- Stowers Institute for Medical Research, Kansas City, MO, USA
- Present Address: Booz Allen Hamilton, McLean, Virginia, USA
| | - Evsey Kosman
- Institute for Cereal Crops Improvement, School of Plant Sciences and Food Security, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel
| | - Arcady R Mushegian
- Stowers Institute for Medical Research, Kansas City, MO, USA
- Department of Microbiology, Molecular Genetics and Immunology, Kansas University Medical Center, Kansas City, Kansas, USA
- Present Address: Division of Molecular and Cellular Biosciences, National Science Foundation, Alexandria, Virginia, USA
|
19
|
A Review of Computational Methods for Clustering Genes with Similar Biological Functions. Processes (Basel) 2019. [DOI: 10.3390/pr7090550] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 12/24/2022] Open
Abstract
Clustering techniques can group genes based on similarity in biological functions. However, the drawback of using clustering techniques is the inability to identify an optimal number of potential clusters beforehand. Several existing optimization techniques can address the issue. Besides, clustering validation can predict the possible number of potential clusters and hence increase the chances of identifying biologically informative genes. This paper reviews and provides examples of existing methods for clustering genes, optimization of the objective function, and clustering validation. Clustering techniques can be categorized into partitioning, hierarchical, grid-based, and density-based techniques. We also highlight the advantages and the disadvantages of each category. To optimize the objective function, here we introduce the swarm intelligence technique and compare the performances of other methods. Moreover, we discuss the differences of measurements between internal and external criteria to validate a cluster quality. We also investigate the performance of several clustering techniques by applying them on a leukemia dataset. The results show that grid-based clustering techniques provide better classification accuracy; however, partitioning clustering techniques are superior in identifying prognostic markers of leukemia. Therefore, this review suggests combining clustering techniques such as CLIQUE and k-means to yield high-quality gene clusters.
|
20
|
Kim J, Stanescu DE, Won KJ. CellBIC: bimodality-based top-down clustering of single-cell RNA sequencing data reveals hierarchical structure of the cell type. Nucleic Acids Res 2019; 46:e124. [PMID: 30102368 PMCID: PMC6265269 DOI: 10.1093/nar/gky698] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 03/07/2018] [Accepted: 07/23/2018] [Indexed: 01/08/2023] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) is a powerful tool to study heterogeneity and dynamic changes in cell populations. Clustering scRNA-seq is essential in identifying new cell types and studying their characteristics. We develop CellBIC (single Cell BImodal Clustering) to cluster scRNA-seq data based on modality in the gene expression distribution. Compared with classical bottom-up approaches that rely on a distance metric, CellBIC performs hierarchical clustering in a top-down manner. CellBIC outperformed the bottom-up hierarchical clustering approach and other recently developed clustering algorithms while maintaining the hierarchical structure of cells. Importantly, CellBIC identifies type 2 diabetes and age specific β cell signatures characterized by SIX3 and CDH2, respectively.
Affiliation(s)
- Junil Kim
- Institute for Diabetes, Obesity and Metabolism, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Department of Genetics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Biotech Research and Innovation Centre (BRIC), University of Copenhagen, 2200 Copenhagen, Denmark
| | - Diana E Stanescu
- Institute for Diabetes, Obesity and Metabolism, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Division of Endocrinology and Diabetes, The Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
| | - Kyoung Jae Won
- Institute for Diabetes, Obesity and Metabolism, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Department of Genetics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Biotech Research and Innovation Centre (BRIC), University of Copenhagen, 2200 Copenhagen, Denmark
|
21
|
Yang M, Chen J, Xu L, Shi X, Zhou X, An R, Wang X. A Network Pharmacology Approach to Uncover the Molecular Mechanisms of Herbal Formula Ban-Xia-Xie-Xin-Tang. EVIDENCE-BASED COMPLEMENTARY AND ALTERNATIVE MEDICINE : ECAM 2018; 2018:4050714. [PMID: 30410554 PMCID: PMC6206573 DOI: 10.1155/2018/4050714] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Received: 07/30/2018] [Accepted: 10/03/2018] [Indexed: 02/07/2023]
Abstract
Ban-Xia-Xie-Xin-Tang (BXXXT) is a classical formula from Shang-Han-Lun which is one of the earliest books of TCM clinical practice. In this work, we investigated the therapeutic mechanisms of BXXXT for the treatment of multiple diseases using a network pharmacology approach. Here three BXXXT representative diseases (colitis, diabetes mellitus, and gastric cancer) were discussed, and we focus on in silico methods that integrate drug-likeness screening, target prioritizing, and multilayer network extending. A total of 140 core targets and 72 representative compounds were finally identified to elucidate the pharmacology of BXXXT formula. After constructing multilayer networks, a good overlap between BXXXT nodes and disease nodes was observed at each level, and the network-based proximity analysis shows that the relevance between the formula targets and disease genes was significant according to the shortest path distance (SPD) and a random walk with restart (RWR) based scores for each disease. We found that there were 22 key pathways significantly associated with BXXXT, and the therapeutic effects of BXXXT were likely addressed by regulating a combination of targets in a modular pattern. Furthermore, the synergistic effects among BXXXT herbs were highlighted by elucidating the molecular mechanisms of individual herbs, and the traditional theory of "Jun-Chen-Zuo-Shi" of TCM formula was effectively interpreted from a network perspective. The proposed approach provides an effective strategy to uncover the mechanisms of action and combinatorial rules of BXXXT formula in a holistic manner.
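The two proximity notions named in this abstract, shortest path distance (SPD) and random walk with restart (RWR), can be sketched on a toy network. This is a generic illustration under stated assumptions, not the authors' pipeline: the node names, the path-shaped graph, and the restart probability of 0.3 are hypothetical, and the paper's actual scoring and significance testing are not reproduced.

```python
from collections import deque


def shortest_path_distance(adjacency, source, target):
    """Breadth-first search distance between two nodes (None if unreachable)."""
    seen = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return seen[node]
        for nbr in adjacency[node]:
            if nbr not in seen:
                seen[nbr] = seen[node] + 1
                queue.append(nbr)
    return None


def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Iterate p <- restart * p0 + (1 - restart) * W p until convergence.

    W spreads each node's probability equally over its neighbours.
    Assumes every node has at least one neighbour, so mass is conserved.
    """
    nodes = sorted(adjacency)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for n in nodes:
            nbrs = adjacency[n]
            if not nbrs:
                continue
            share = (1.0 - restart) * p[n] / len(nbrs)
            for m in nbrs:
                nxt[m] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Toy undirected path network: formula target T1 ... disease gene D2 ... far node F1.
graph = {
    "T1": ["P1"],
    "P1": ["T1", "D1"],
    "D1": ["P1", "D2"],
    "D2": ["D1", "F1"],
    "F1": ["D2"],
}

spd = shortest_path_distance(graph, "T1", "D2")
scores = random_walk_with_restart(graph, seeds={"T1"})
print("SPD(T1, D2) =", spd)
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 4))
```

SPD looks only at the single shortest route between two nodes, whereas the RWR score accumulates probability over all routes from the seed set, which is presumably why the two are reported side by side as complementary measures.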
Affiliation(s)
- Ming Yang
- Department of Pharmacy, Longhua Hospital Affiliated to Shanghai University of TCM, Shanghai, China
- Department of Chemistry, College of Pharmacy, Shanghai University of Traditional Chinese Medicine, Shanghai, China
| | - Jialei Chen
- Department of Pharmacy, Longhua Hospital Affiliated to Shanghai University of TCM, Shanghai, China
| | - Liwen Xu
- Department of Pharmacy, Longhua Hospital Affiliated to Shanghai University of TCM, Shanghai, China
| | - Xiufeng Shi
- Department of Pharmacy, Longhua Hospital Affiliated to Shanghai University of TCM, Shanghai, China
| | - Xin Zhou
- Department of Pharmacy, Longhua Hospital Affiliated to Shanghai University of TCM, Shanghai, China
| | - Rui An
- Department of Chemistry, College of Pharmacy, Shanghai University of Traditional Chinese Medicine, Shanghai, China
| | - Xinhong Wang
- Department of Chemistry, College of Pharmacy, Shanghai University of Traditional Chinese Medicine, Shanghai, China
|
22
|
Biological networks integration based on dense module identification for gene prioritization from microarray data. GENE REPORTS 2018. [DOI: 10.1016/j.genrep.2018.07.008] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 02/02/2023]
|
23
|
Endovascular Biopsy and Endothelial Cell Gene Expression Analysis of Dialysis Arteriovenous Fistulas: A Feasibility Study. J Vasc Interv Radiol 2018; 29:1403-1409.e2. [PMID: 30174159 DOI: 10.1016/j.jvir.2018.04.034] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 01/02/2018] [Revised: 04/10/2018] [Accepted: 04/22/2018] [Indexed: 02/07/2023] Open
Abstract
PURPOSE To demonstrate feasibility of endothelial cell (EC) biopsy from dialysis arteriovenous fistulas (AVFs) with the use of guidewires and to characterize gene expression differences between ECs from stenotic and nonstenotic outflow vein segments. MATERIALS AND METHODS Nine consecutive patients undergoing fistulography for AVF dysfunction from June to August 2016 were enrolled. ECs were biopsied with the use of guidewires from venous outflow stenoses and control outflow veins central to the stenoses. ECs were sorted with the use of flow cytometry, and the Fluidigm Biomark HD system was used for single-cell quantitative polymerase chain reaction (qPCR) analysis of gene expression. Forty-eight genes were assessed and were selected based on different cellular functions and previous literature. Linear mixed models (LMMs) were used to identify differential gene expression between the groups, and self-organizing maps (SOMs) were used to identify cell clusters based on gene coexpression profiles. RESULTS A total of 219 and 213 ECs were sampled from venous outflow stenoses and control vein segments, respectively. There were no immediate biopsy-related complications. Forty-eight cells per patient were sorted for qPCR analysis. LMM identified 7 genes with different levels of expression at stenotic segments (P < .05), including AGTR-2, HMOX-2, MTHFR, SERPINC-1, SERPINE-1, SMAD-4, and VWF. SOM analysis identified 4 cell clusters with unique gene expression profiles, each containing stenotic and control ECs. CONCLUSIONS EC biopsy from dialysis AVFs with the use of guidewires is feasible. Gene expression data suggest that genes involved in multiple cellular functions are dysregulated in stenotic areas. SOMs identified 4 unique clusters of cells, indicating EC phenotypic heterogeneity in outflow veins.
|
24
|
Saelens W, Cannoodt R, Saeys Y. A comprehensive evaluation of module detection methods for gene expression data. Nat Commun 2018; 9:1090. [PMID: 29545622 PMCID: PMC5854612 DOI: 10.1038/s41467-018-03424-4] [Citation(s) in RCA: 148] [Impact Index Per Article: 24.7] [Received: 07/26/2017] [Accepted: 02/12/2018] [Indexed: 12/19/2022] Open
Abstract
A critical step in the analysis of large genome-wide gene expression datasets is the use of module detection methods to group genes into co-expression modules. Because of limitations of classical clustering methods, numerous alternative module detection methods have been proposed, which improve upon clustering by handling co-expression in only a subset of samples, modelling the regulatory network, and/or allowing overlap between modules. In this study we use known regulatory networks to do a comprehensive and robust evaluation of these different methods. Overall, decomposition methods outperform all other strategies, while we do not find a clear advantage of biclustering and network inference-based approaches on large gene expression datasets. Using our evaluation workflow, we also investigate several practical aspects of module detection, such as parameter estimation and the use of alternative similarity measures, and conclude with recommendations for the further development of these methods. Modules composed of groups of genes with similar expression profiles tend to be functionally related and co-regulated. Here, Saelens et al evaluate the performance of 42 computational methods and provide practical guidelines for module detection in gene expression data.
Affiliation(s)
- Wouter Saelens
- Data Mining and Modelling for Biomedicine, VIB Center for Inflammation Research, 9052, Ghent, Belgium
- Department of Applied Mathematics, Computer Science and Statistics, Ghent University, 9000, Ghent, Belgium
| | - Robrecht Cannoodt
- Data Mining and Modelling for Biomedicine, VIB Center for Inflammation Research, 9052, Ghent, Belgium
- Center for Medical Genetics, Ghent University Hospital, 9000, Ghent, Belgium
| | - Yvan Saeys
- Data Mining and Modelling for Biomedicine, VIB Center for Inflammation Research, 9052, Ghent, Belgium
- Department of Applied Mathematics, Computer Science and Statistics, Ghent University, 9000, Ghent, Belgium
|
25
|
Leale G, Baya AE, Milone DH, Granitto PM, Stegmayer G. Inferring Unknown Biological Function by Integration of GO Annotations and Gene Expression Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:168-180. [PMID: 27723603 DOI: 10.1109/tcbb.2016.2615960] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 06/06/2023]
Abstract
Characterizing genes with semantic information is an important step in describing gene products. Although the complete genomes of many organisms have already been sequenced, the biological functions of many of their genes are still unknown. Since experimentally studying the functions of those genes one by one would be infeasible, new computational methods for inferring gene function are needed. We present here a novel computational approach for inferring biological function for a set of genes with previously unknown function, given a set of genes with well-known information. This approach is based on the premise that genes with similar behaviour should be grouped together, which is known as the guilt-by-association principle. Thus, it is possible to take advantage of clustering techniques to obtain groups of unknown genes that are co-clustered with genes that have well-known semantic information (GO annotations). Meaningful knowledge to infer unknown semantic information can therefore be provided by these well-known genes. We provide a method to explore the potential function of new genes according to those currently annotated. The results obtained indicate that the proposed approach could be a useful and effective tool when used by biologists to guide the inference of biological functions for recently discovered genes. Our work sets an important landmark in the field of identifying unknown gene functions through clustering, using an external source of biological input. A simple web interface to this proposal can be found at http://fich.unl.edu.ar/sinc/webdemo/gamma-am/.
|
26
|
Dutta P, Saha S. Fusion of expression values and protein interaction information using multi-objective optimization for improving gene clustering. Comput Biol Med 2017; 89:31-43. [PMID: 28783536 DOI: 10.1016/j.compbiomed.2017.07.015] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 05/05/2017] [Revised: 07/28/2017] [Accepted: 07/28/2017] [Indexed: 11/29/2022]
Abstract
One of the crucial problems in the field of functional genomics is to identify a set of genes which are responsible for a particular cellular mechanism. The current work explores the usage of a multi-objective optimization based genetic clustering technique to classify genes into groups with respect to their functional similarities and biological relevance. Our contribution is two-fold. First, a new quality measure of the goodness of gene clusters, namely the protein-protein interaction confidence score, is developed. This utilizes the confidence scores of the protein-protein interaction networks to measure the similarity between genes of a particular cluster with respect to their biochemical protein products. Second, a multi-objective clustering approach is developed which uses integrated information from the expression values of microarray datasets and protein-protein interaction confidence scores to select both statistically and biologically relevant genes. For that purpose, two biological cluster validity indices, viz. biological homogeneity index and protein-protein interaction confidence score, along with two traditional internal cluster validity indices, viz. fuzzy partition coefficient and Pakhira-Bandyopadhyay-Maulik index, are simultaneously optimized during the clustering process. Experimental results on three real-life gene expression datasets show that adding a new objective capturing protein-protein interaction information improves gene clustering compared with existing techniques. The observations are further supported by biological and statistical significance tests.
Affiliation(s)
- Pratik Dutta
- Department of Computer Science and Engineering, Indian Institute of Technology Patna, Bihar, India.
| | - Sriparna Saha
- Department of Computer Science and Engineering, Indian Institute of Technology Patna, Bihar, India.
| |
|
27
|
Endovascular Biopsy: In Vivo Cerebral Aneurysm Endothelial Cell Sampling and Gene Expression Analysis. Transl Stroke Res 2017; 9:20-33. [PMID: 28900857 DOI: 10.1007/s12975-017-0560-4] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 04/18/2017] [Revised: 07/31/2017] [Accepted: 08/01/2017] [Indexed: 10/18/2022]
Abstract
There are limited data describing endothelial cell (EC) gene expression between aneurysms and arteries partly because of risks associated with surgical tissue collection. Endovascular biopsy (EB) is a lower risk alternative to conventional surgical methods, though no such efforts have been attempted for aneurysms. We sought (1) to establish the feasibility of EB to isolate viable ECs by fluorescence-activated cell sorting (FACS), (2) to characterize the differences in gene expression by anatomic location and rupture status using single-cell qPCR, and (3) to demonstrate the utility of unsupervised clustering algorithms to identify cell subpopulations. EB was performed in 10 patients (5 ruptured, 5 non-ruptured). FACS was used to isolate the ECs and single-cell qPCR was used to quantify the expression of 48 genes. Linear mixed models and exploratory multilevel component analysis (MCA) and self-organizing maps (SOMs) were performed to identify possible subpopulations of cells. ECs were collected from all aneurysms and there were no adverse events. A total of 437 ECs were collected, 94 (22%) of which were aneurysmal cells and 319 (73%) demonstrated EC-specific gene expression. Ruptured aneurysm cells, relative to controls, yielded a median p value of 0.40 with five genes (10%) with p values < 0.05. The five genes (TIE1, ENG, VEGFA, MMP2, and VWF) demonstrated uniformly reduced expression relative to the remaining ECs. MCA and SOM analyses identified a population of outlying cells characterized by cell marker gene expression profiles different from endothelial cells. After removal of these cells, no cell clustering based on genetic co-expressivity was found to differentiate aneurysm cells from control cells. Endovascular sampling is a reliable method for cell collection for brain aneurysm gene analysis and may serve as a technique to further vascular molecular research.
There is utility in combining mixed and clustering methods, despite no specific subpopulation identified in this trial.
|
28
|
Ji G, Lin Q, Long Y, Ye C, Ye W, Wu X. PAcluster: Clustering polyadenylation site data using canonical correlation analysis. J Bioinform Comput Biol 2017; 15:1750018. [PMID: 28874086 DOI: 10.1142/s0219720017500184] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 12/13/2022]
Abstract
Alternative polyadenylation (APA) is a pervasive mechanism that contributes to gene regulation. The increasing number of sequenced poly(A) sites is placing new demands on the development of computational methods to investigate APA regulation. Cluster analysis is important to identify groups of co-expressed genes. However, clustering of poly(A) sites has not been extensively studied in APA, and most APA studies have failed to consider the distribution, abundance, and variation of APA sites in each gene. Here we constructed a two-layer model based on canonical correlation analysis (CCA) to explore the underlying biological mechanisms in APA regulation. The first layer quantifies the general correlation of APA sites across various conditions between each pair of genes, and the second layer identifies genes with statistically significant correlation in their APA patterns to infer APA-specific gene clusters. Using hierarchical clustering, we comprehensively compared our method with four other widely used distance measures based on three performance indexes. Results showed that our method significantly enhanced the clustering performance for both synthetic and real poly(A) site data and could generate clusters with more biological meaning. We have implemented the CCA-based method as a publicly available R package called PAcluster, which provides an efficient solution to the clustering of large APA-specific biological datasets.
Affiliation(s)
- Guoli Ji
- * Department of Automation, Xiamen University, Xiamen, Fujian, P. R. China
| | - Qianmin Lin
- * Department of Automation, Xiamen University, Xiamen, Fujian, P. R. China
| | - Yuqi Long
- * Department of Automation, Xiamen University, Xiamen, Fujian, P. R. China
| | - Congting Ye
- † College of the Environment and Ecology, Xiamen University, Xiamen, Fujian, P. R. China
| | - Wenbin Ye
- * Department of Automation, Xiamen University, Xiamen, Fujian, P. R. China
| | - Xiaohui Wu
- * Department of Automation, Xiamen University, Xiamen, Fujian, P. R. China
| |
|
29
|
Khan A, Katanic D, Thakar J. Meta-analysis of cell-specific transcriptomic data using fuzzy c-means clustering discovers versatile viral responsive genes. BMC Bioinformatics 2017; 18:295. [PMID: 28587632 PMCID: PMC5461682 DOI: 10.1186/s12859-017-1669-x] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 12/12/2016] [Accepted: 05/03/2017] [Indexed: 01/06/2023] Open
Abstract
BACKGROUND Despite advances in gene-set enrichment analysis methods, inadequate definitions of gene-sets are a major limitation in the discovery of novel biological processes from transcriptomic datasets. Typically, gene-sets are obtained from publicly available pathway databases, which contain generalized definitions frequently derived by manual curation. Recently, unsupervised clustering algorithms have been proposed to identify gene-sets from transcriptomics datasets deposited in the public domain. These data-driven definitions of the gene-sets can be context-specific, revealing novel biological mechanisms. However, the previously proposed algorithms for identification of data-driven gene-sets are based on hard clustering, which does not allow overlap across clusters, a characteristic that is predominantly observed across biological pathways. RESULTS We developed a pipeline using the fuzzy-C-means (FCM) soft clustering approach to identify gene-sets which recapitulate topological characteristics of biological pathways. Specifically, we apply our pipeline to derive gene-sets from transcriptomic data measuring the response of monocyte derived dendritic cells and A549 epithelial cells to influenza infections. Our approach applies Ward's method for the selection of initial conditions, optimizes parameters of the FCM algorithm for human cell-specific transcriptomic data and identifies robust gene-sets along with versatile viral responsive genes. CONCLUSION We validate our gene-sets and demonstrate that, by identifying genes associated with multiple gene-sets, the FCM clustering algorithm significantly improves interpretation of transcriptomic data, facilitating investigation of novel biological processes by leveraging transcriptomic data available in the public domain. We develop an interactive 'Fuzzy Inference of Gene-sets (FIGS)' package (GitHub: https://github.com/Thakar-Lab/FIGS ) to facilitate use of the pipeline.
Future extension of FIGS across different immune cell-types will improve mechanistic investigation followed by high-throughput omics studies.
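The soft-clustering idea in the abstract above, where a gene may belong to several gene-sets at once, can be illustrated with a minimal fuzzy c-means sketch. This is not the FIGS pipeline: it uses random initial memberships instead of Ward's method, a fixed fuzzifier `m = 2`, and toy two-dimensional points standing in for expression profiles. All names and data here are hypothetical.

```python
import math
import random

def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, seed=0):
    """Return (centers, memberships) for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships; each row is normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = []
    for _ in range(n_iter):
        # Centers are membership-weighted means, with weights u ** m.
        centers = []
        for k in range(n_clusters):
            w = [u[i][k] ** m for i in range(n)]
            wsum = sum(w)
            centers.append(tuple(
                sum(w[i] * points[i][d] for i in range(n)) / wsum
                for d in range(dim)))
        # Memberships are proportional to distance ** (-2 / (m - 1)).
        for i in range(n):
            dists = [math.dist(points[i], c) for c in centers]
            if min(dists) == 0.0:
                zeros = [k for k, d in enumerate(dists) if d == 0.0]
                u[i] = [1.0 / len(zeros) if k in zeros else 0.0
                        for k in range(n_clusters)]
            else:
                inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
                total = sum(inv)
                u[i] = [v / total for v in inv]
    return centers, u

# Hypothetical toy data: two tight groups plus one point halfway between them.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5.33, 5.33)]
toy_centers, toy_memberships = fuzzy_c_means(toy_points, n_clusters=2)
```

In this toy setting the last point receives a substantial membership in both clusters, which is the kind of overlap a hard clustering cannot represent.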
Affiliation(s)
- Atif Khan
- Department of Microbiology and Immunology, University of Rochester, Rochester, NY, 14642, USA
| | - Dejan Katanic
- Department of Microbiology and Immunology, University of Rochester, Rochester, NY, 14642, USA
| | - Juilee Thakar
- Department of Microbiology and Immunology, University of Rochester, Rochester, NY, 14642, USA.
- Department of Biostatistics and Computational Biology, University of Rochester, Rochester, NY, 14642, USA.
- , 601 Elmwood Avenue, Rochester, NY, 14618, USA.
| |
|
30
|
Exploratory analysis of local gene groups in breast cancer guided by biological networks. HEALTH AND TECHNOLOGY 2017. [DOI: 10.1007/s12553-016-0155-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
|
31
|
|
32
|
Cifola I, Lionetti M, Pinatel E, Todoerti K, Mangano E, Pietrelli A, Fabris S, Mosca L, Simeon V, Petrucci MT, Morabito F, Offidani M, Di Raimondo F, Falcone A, Caravita T, Battaglia C, De Bellis G, Palumbo A, Musto P, Neri A. Whole-exome sequencing of primary plasma cell leukemia discloses heterogeneous mutational patterns. Oncotarget 2016; 6:17543-58. [PMID: 26046463 PMCID: PMC4627327 DOI: 10.18632/oncotarget.4028] [Citation(s) in RCA: 51] [Impact Index Per Article: 6.4] [Received: 04/29/2015] [Accepted: 05/11/2015] [Indexed: 02/04/2023] Open
Abstract
Primary plasma cell leukemia (pPCL) is a rare and aggressive form of plasma cell dyscrasia and may represent a valid model for high-risk multiple myeloma (MM). To provide novel information concerning the mutational profile of this disease, we performed whole-exome sequencing of a prospective series of 12 pPCL cases included in a Phase II multicenter clinical trial and previously characterized at clinical and molecular levels. We identified 1,928 coding somatic non-silent variants in 1,643 genes, with a mean of 166 variants per sample, and only a few variants and genes recurrent in two or more samples. An excess of C > T transitions and the presence of two main mutational signatures (related to APOBEC over-activity and aging) occurring in different translocation groups were observed. We identified 14 candidate cancer driver genes, mainly involved in cell-matrix adhesion, cell cycle, genome stability, RNA metabolism and protein folding. Furthermore, integration of mutation data with copy number alteration profiles revealed biallelically disrupted genes with potential tumor suppressor functions. Globally, cadherin/Wnt signaling, extracellular matrix and cell cycle checkpoint were the most affected functional pathways. Sequencing results were finally combined with gene expression data to better elucidate the biological relevance of mutated genes. This study represents the first whole-exome sequencing screen of pPCL and revealed a remarkable genetic heterogeneity of mutational patterns. This may contribute to the understanding of the pathogenetic mechanisms associated with this aggressive form of PC dyscrasia and potentially with high-risk MM.
Affiliation(s)
- Ingrid Cifola
- Institute for Biomedical Technologies, National Research Council, Milan, Italy
| | - Marta Lionetti
- Department of Clinical Sciences and Community Health, University of Milan, Milan, Italy.,Hematology, Foundation IRCCS Ca' Granda Ospedale Maggiore Policlinico, Milan, Italy
| | - Eva Pinatel
- Institute for Biomedical Technologies, National Research Council, Milan, Italy
| | - Katia Todoerti
- Laboratory of Pre-Clinical and Translational Research, IRCCS-CROB, Referral Cancer Center of Basilicata, Rionero in Vulture (PZ), Italy
| | - Eleonora Mangano
- Institute for Biomedical Technologies, National Research Council, Milan, Italy
| | | | - Sonia Fabris
- Department of Clinical Sciences and Community Health, University of Milan, Milan, Italy.,Hematology, Foundation IRCCS Ca' Granda Ospedale Maggiore Policlinico, Milan, Italy
| | - Laura Mosca
- Department of Clinical Sciences and Community Health, University of Milan, Milan, Italy.,Hematology, Foundation IRCCS Ca' Granda Ospedale Maggiore Policlinico, Milan, Italy
| | - Vittorio Simeon
- Laboratory of Pre-Clinical and Translational Research, IRCCS-CROB, Referral Cancer Center of Basilicata, Rionero in Vulture (PZ), Italy
| | - Maria Teresa Petrucci
- Hematology, Department of Cellular Biotechnologies and Hematology, La Sapienza University, Rome, Italy
| | | | - Massimo Offidani
- Hematologic Clinic, Azienda Ospedaliero-Universitaria Ospedali Riuniti di Ancona, Ancona, Italy
| | - Francesco Di Raimondo
- Department of Biomedical Sciences, Division of Hematology, Ospedale Ferrarotto, University of Catania, Catania, Italy
| | - Antonietta Falcone
- Hematology Unit, IRCCS "Casa Sollievo della Sofferenza" Hospital, San Giovanni Rotondo, Italy
| | - Tommaso Caravita
- Department of Hematology, Ospedale S. Eugenio, Tor Vergata University, Rome, Italy
| | - Cristina Battaglia
- Institute for Biomedical Technologies, National Research Council, Milan, Italy.,Department of Medical Biotechnology and Translational Medicine, University of Milan, Milan, Italy
| | - Gianluca De Bellis
- Institute for Biomedical Technologies, National Research Council, Milan, Italy
| | - Antonio Palumbo
- Division of Hematology, University of Torino, A.O.U. San Giovanni Battista, Torino, Italy
| | - Pellegrino Musto
- Scientific Direction, IRCCS-CROB, Referral Cancer Center of Basilicata, Rionero in Vulture (PZ), Italy
| | - Antonino Neri
- Department of Clinical Sciences and Community Health, University of Milan, Milan, Italy.,Hematology, Foundation IRCCS Ca' Granda Ospedale Maggiore Policlinico, Milan, Italy
| |
|
33
|
Li H, Li C, Hu J, Fan X. A Resampling Based Clustering Algorithm for Replicated Gene Expression Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:1295-1303. [PMID: 26671802 DOI: 10.1109/tcbb.2015.2403320] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 06/05/2023]
Abstract
In gene expression data analysis, clustering is a fruitful exploratory technique to reveal the underlying molecular mechanism by identifying groups of co-expressed genes. To reduce the noise, usually multiple experimental replicates are performed. An integrative analysis of the full replicate data, instead of reducing the data to the mean profile, carries the promise of yielding more precise and robust clusters. In this paper, we propose a novel resampling based clustering algorithm for genes with replicated expression measurements. Assuming those replicates are exchangeable, we formulate the problem in the bootstrap framework, and aim to infer the consensus clustering based on the bootstrap samples of replicates. In our approach, we adopt the mixed effect model to accommodate the heterogeneous variances and implement a quasi-MCMC algorithm to conduct statistical inference. Experiments demonstrate that by taking advantage of the full replicate data, our algorithm produces more reliable clusters and has robust performance in diverse scenarios, especially when the data is subject to multiple sources of variance.
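The general idea of resampling replicates to obtain a consensus clustering can be sketched as follows. This is a deliberately simplified illustration, not the authors' algorithm: it omits the mixed effect model and the quasi-MCMC inference described above, and substitutes a plain k-means with farthest-point initialisation as the base clusterer. Function names and data are hypothetical.

```python
import math
import random

def kmeans(profiles, k, n_iter=20):
    """Plain k-means with farthest-point initialisation; returns one label per profile."""
    centers = [profiles[0]]
    while len(centers) < k:
        # Add the profile that is farthest from its nearest existing center.
        centers.append(max(
            profiles, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(profiles)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in profiles]
        for j in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == j]
            if members:  # keep the previous center if a cluster empties
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def bootstrap_consensus(replicates, k, n_boot=200, seed=0):
    """replicates[g] is a list of replicate profiles (tuples) for gene g.

    Returns an n x n matrix whose entry (i, j) is the fraction of bootstrap
    rounds in which genes i and j fell in the same cluster.
    """
    rng = random.Random(seed)
    n = len(replicates)
    together = [[0] * n for _ in range(n)]
    for _ in range(n_boot):
        profiles = []
        for reps in replicates:
            # Resample this gene's replicates with replacement, then average.
            sample = [rng.choice(reps) for _ in reps]
            profiles.append(tuple(sum(col) / len(sample) for col in zip(*sample)))
        labels = kmeans(profiles, k)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    together[i][j] += 1
    return [[count / n_boot for count in row] for row in together]
```

Entries of the returned matrix close to 1 indicate gene pairs that cluster together regardless of which replicates were drawn, and a final partition could be read off by clustering this matrix.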
|
34
|
Knowledge-Based Analysis for Detecting Key Signaling Events from Time-Series Phosphoproteomics Data. PLoS Comput Biol 2015; 11:e1004403. [PMID: 26252020 PMCID: PMC4529189 DOI: 10.1371/journal.pcbi.1004403] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.2] [Received: 02/24/2015] [Accepted: 06/11/2015] [Indexed: 12/24/2022] Open
Abstract
Cell signaling underlies transcription/epigenetic control of a vast majority of cell-fate decisions. A key goal in cell signaling studies is to identify the set of kinases that underlie key signaling events. In a typical phosphoproteomics study, phosphorylation sites (substrates) of active kinases are quantified proteome-wide. By analyzing the activities of phosphorylation sites over a time-course, the temporal dynamics of signaling cascades can be elucidated. Since many substrates of a given kinase have similar temporal kinetics, clustering phosphorylation sites into distinctive clusters can facilitate identification of their respective kinases. Here we present a knowledge-based CLUster Evaluation (CLUE) approach for identifying the most informative partitioning of a given temporal phosphoproteomics dataset. Our approach utilizes prior knowledge, annotated kinase-substrate relationships mined from literature and curated databases, to first generate a biologically meaningful partitioning of the phosphorylation sites and then determine key kinases associated with each cluster. We demonstrate the utility of the proposed approach on two time-series phosphoproteomics datasets and identify key kinases associated with human embryonic stem cell differentiation and the insulin signaling pathway. The proposed approach will be a valuable resource in the identification and characterization of signaling networks from phosphoproteomics data. A key goal in cell signaling studies is to identify the set of kinases that underlie key signaling events. Mass spectrometry-based technologies have emerged as a powerful tool to profile proteome-wide phosphorylation events in vivo at a single amino acid resolution with high precision. However, development of algorithms to analyze and identify signaling events from high-throughput phosphoproteomics data is still in its infancy.
Here we propose a knowledge-based CLUster Evaluation (CLUE) approach for identifying key signaling cascades from time-series phosphoproteomics data. Our approach utilizes known kinase-substrate annotations from curated phosphoproteomics databases to first determine the optimal clustering of the phosphorylation sites and then identify enriched kinase(s). We apply CLUE on time-series phosphoproteomics datasets and identify key kinases associated with human embryonic stem cell differentiation and insulin signaling pathway.
|
35
|
Ye N, Yin H, Liu J, Dai X, Yin T. GESearch: An Interactive GUI Tool for Identifying Gene Expression Signature. BIOMED RESEARCH INTERNATIONAL 2015; 2015:853734. [PMID: 26199946 PMCID: PMC4496643 DOI: 10.1155/2015/853734] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/30/2015] [Revised: 05/20/2015] [Accepted: 06/11/2015] [Indexed: 12/21/2022]
Abstract
The huge amount of gene expression data generated by microarray and next-generation sequencing technologies presents challenges in extracting its biological meaning. When searching for coexpressed genes, the data mining process is largely affected by the selection of algorithms. Thus, it is highly desirable to provide multiple algorithm options in a user-friendly analytical toolkit for exploring gene expression signatures. For this purpose, we developed GESearch, an interactive graphical user interface (GUI) toolkit, which is written in MATLAB and supports a variety of gene expression data files. This analytical toolkit provides four models, including the mean, the regression, the delegate, and the ensemble models, to identify coexpressed genes, and enables users to filter data and to select gene expression patterns by browsing the display window or by importing knowledge-based genes. Subsequently, the utility of this analytical toolkit is demonstrated by analyzing two real-life microarray datasets from cell-cycle experiments. Overall, we have developed an interactive GUI toolkit that allows for choosing among multiple algorithms for analyzing gene expression signatures.
Affiliation(s)
- Ning Ye
- The Southern Modern Forestry Collaborative Innovation Center, Nanjing Forestry University, Nanjing 210037, China
- College of Information Science and Technology, Nanjing Forestry University, Nanjing 210037, China
| | - Hengfu Yin
- Research Institute of Subtropical Forestry, Chinese Academy of Forestry, Fuyang, Zhejiang 311400, China
- Key Laboratory of Forest genetics and breeding, Chinese Academy of Forestry, Fuyang, Zhejiang 311400, China
| | - Jingjing Liu
- The Southern Modern Forestry Collaborative Innovation Center, Nanjing Forestry University, Nanjing 210037, China
- College of Forest Resources and Environment, Nanjing Forestry University, Nanjing 210037, China
| | - Xiaogang Dai
- The Southern Modern Forestry Collaborative Innovation Center, Nanjing Forestry University, Nanjing 210037, China
- College of Forest Resources and Environment, Nanjing Forestry University, Nanjing 210037, China
| | - Tongming Yin
- The Southern Modern Forestry Collaborative Innovation Center, Nanjing Forestry University, Nanjing 210037, China
- College of Forest Resources and Environment, Nanjing Forestry University, Nanjing 210037, China
| |
|
36
|
Berenstein AJ, Piñero J, Furlong LI, Chernomoretz A. Mining the modular structure of protein interaction networks. PLoS One 2015; 10:e0122477. [PMID: 25856434 PMCID: PMC4391834 DOI: 10.1371/journal.pone.0122477] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 10/02/2014] [Accepted: 02/11/2015] [Indexed: 02/07/2023] Open
Abstract
BACKGROUND Cluster-based descriptions of biological networks have received much attention in recent years, fostered by accumulated evidence of meaningful correlations between topological network clusters and biological functional modules. Several well-performing clustering algorithms exist to infer topological network partitions. However, due to their respective technical idiosyncrasies they might produce dissimilar modular decompositions of a given network. In this contribution, we aimed to analyze how alternative modular descriptions could condition the outcome of follow-up network biology analysis. METHODOLOGY We considered a human protein interaction network and two paradigmatic cluster recognition algorithms, namely the Clauset-Newman-Moore and the infomap procedures. We analyzed to what extent both methodologies yielded different results in terms of granularity and biological congruency. In addition, taking into account Guimera's cartographic role characterization of network nodes, we explored how the adoption of a given clustering methodology impinged on the ability to highlight relevant network meso-scale connectivity patterns. RESULTS As a case study we considered a set of aging related proteins and showed that only the high-resolution modular description provided by infomap could unveil statistically significant associations between them and inter/intra modular cartographic features. Besides reporting novel biological insights that could be gained from the discovered associations, our contribution warns against possible technical concerns that might affect the tools used to mine for interaction patterns in network biology studies. In particular, our results suggested that partitions that are sub-optimal from the strict point of view of their modularity levels might still be worth analyzing when meso-scale features are to be explored in connection with external sources of biological knowledge.
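Since the abstract above ranks partitions by their modularity levels, a short sketch of how the Newman-Girvan modularity Q of a given partition can be computed may help. It assumes an undirected, unweighted graph given as an edge list and does not implement either the Clauset-Newman-Moore or the infomap procedure; the function name and toy graph are hypothetical.

```python
def modularity(edges, community_of):
    """Newman-Girvan modularity Q for an undirected, unweighted graph.

    edges: list of (node, node) pairs, each edge listed once.
    community_of: dict mapping every node to a community label.
    Q = sum over communities of (internal_edges / m) - (degree_sum / (2 m)) ** 2
    """
    m = len(edges)
    internal = {}    # number of edges with both ends inside the community
    degree_sum = {}  # sum of node degrees per community
    for a, b in edges:
        ca, cb = community_of[a], community_of[b]
        degree_sum[ca] = degree_sum.get(ca, 0) + 1
        degree_sum[cb] = degree_sum.get(cb, 0) + 1
        if ca == cb:
            internal[ca] = internal.get(ca, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())

# Toy graph: two triangles joined by a single bridge edge (2, 3).
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_modules = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_module = {node: "all" for node in range(6)}
```

For the toy graph, splitting into the two triangles gives Q = 5/14, while placing every node in a single module gives Q = 0, so the two-module description is preferred on modularity alone.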
Affiliation(s)
- Ariel José Berenstein
- Departamento de Física, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires and Instituto de Física de Buenos Aires, Consejo Nacional de Investigaciones Científicas y Técnicas, Pabellón 1, Ciudad Universitaria, Buenos Aires, Argentina
| | - Janet Piñero
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Universitat Pompeu Fabra (UPF), Carrer del Dr. Aiguader, 88, 08003—Barcelona, Spain
| | - Laura Inés Furlong
- Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Universitat Pompeu Fabra (UPF), Carrer del Dr. Aiguader, 88, 08003—Barcelona, Spain
| | - Ariel Chernomoretz
- Departamento de Física, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires and Instituto de Física de Buenos Aires, Consejo Nacional de Investigaciones Científicas y Técnicas, Pabellón 1, Ciudad Universitaria, Buenos Aires, Argentina
- Laboratorio de Biología de Sistemas Integrativa, Fundación Instituto Leloir, Buenos Aires, Argentina
| |
|
37
|
Chang JS, Kim Y, Kim SH, Hwang S, Kim J, Chung IW, Kim YS, Jung HY. Differences in the internal structure of hallucinatory experiences between clinical and nonclinical populations. Psychiatry Res 2015; 226:204-10. [PMID: 25619435 DOI: 10.1016/j.psychres.2014.12.051] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 07/02/2014] [Revised: 12/30/2014] [Accepted: 12/31/2014] [Indexed: 10/24/2022]
Abstract
We investigated differential patterns of hallucinatory experiences between nonclinical and clinical samples. A total of 223 nonclinical individuals (108 females) and 111 subjects with schizophrenia (54 females) completed the Launay-Slade Hallucination Scale-Revised (LSHS-R) and Perceptual Aberration Scale (PAS). The Minnesota Multiphasic Personality Inventory-2 (MMPI-2) was used for the nonclinical group, and the Positive and Negative Syndrome Scale (PANSS) hallucination item was used for the clinical group. Cronbach's alpha values showed good internal consistency for the LSHS-R. In the two groups, significant associations were found between LSHS-R and PAS scores. Two factors were extracted through a principal component analysis (PCA) in the nonclinical group, and three factors were identified in the clinical group. The results of a hierarchical cluster analysis (HCA) revealed that a perception-cognition dimension was a clear cluster-discriminating element for the nonclinical group, whereas alterations in the perception-cognition dimension were characteristic of the cluster structure of the clinical group. Our findings suggest that the nature of hallucinatory experiences may differ qualitatively between a nonclinical population and subjects with schizophrenia. Perceptual or cognitive aberrations may add a psychopathologic dimension to hallucinatory experiences. Exploring the internal structure of hallucinatory experiences may provide explanatory insight into these experiences in the general population.
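The internal-consistency statistic mentioned above, Cronbach's alpha, follows a standard formula: for k items, alpha = k / (k - 1) * (1 - (sum of item variances) / (variance of total scores)). A minimal sketch is shown here; the function name and the response data are hypothetical and are not taken from the study.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of per-respondent score lists (one score per item)."""
    k = len(responses[0])
    # Sample variance of each item across respondents.
    item_vars = [variance([r[i] for r in responses]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(r) for r in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical responses: five respondents answering a three-item scale.
example_responses = [[1, 2, 1], [2, 2, 3], [4, 3, 4], [3, 4, 3], [5, 5, 4]]
example_alpha = cronbach_alpha(example_responses)
```

Values close to 1 indicate that the items vary together across respondents; if all items were identical for every respondent, the formula would return exactly 1.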
Affiliation(s)
- Jae Seung Chang
- Department of Psychiatry and Institute of Clinical Psychopharmacology, Dongguk University Ilsan Hospital, Goyang, Gyeonggi, Republic of Korea
| | - Yeni Kim
- Department of Adolescent Psychiatry, Seoul National Hospital, Seoul, Republic of Korea
| | - Se Hyun Kim
- Department of Psychiatry and Institute of Clinical Psychopharmacology, Dongguk University Ilsan Hospital, Goyang, Gyeonggi, Republic of Korea
| | - Samuel Hwang
- Department of Psychology, Chonnam University, Gwangju, Republic of Korea
| | - Jayoun Kim
- Biomedical Research Institute, Seoul National University Bundang Hospital, Seongnam, Gyeonggi, Republic of Korea
| | - In-Won Chung
- Department of Psychiatry and Institute of Clinical Psychopharmacology, Dongguk University Ilsan Hospital, Goyang, Gyeonggi, Republic of Korea
| | - Yong Sik Kim
- Department of Psychiatry and Institute of Clinical Psychopharmacology, Dongguk University Ilsan Hospital, Goyang, Gyeonggi, Republic of Korea
| | - Hee-Yeon Jung
- Department of Psychiatry, SMG-SNU Boramae Medical Center, Seoul, Republic of Korea; Department of Psychiatry and Behavioral Science and Institute of Human Behavioral Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea.
| |
|
38
|
Omranian N, Mueller-Roeber B, Nikoloski Z. Segmentation of biological multivariate time-series data. Sci Rep 2015; 5:8937. [PMID: 25758050 PMCID: PMC5390911 DOI: 10.1038/srep08937] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Received: 09/30/2014] [Accepted: 02/06/2015] [Indexed: 11/15/2022] Open
Abstract
Time-series data from multicomponent systems capture the dynamics of the ongoing processes and reflect the interactions between the components. The progression of processes in such systems usually involves check-points and events at which the relationships between the components are altered in response to stimuli. Detecting these events together with the implicated components can help understand the temporal aspects of complex biological systems. Here we propose a regularized regression-based approach for identifying breakpoints and corresponding segments from multivariate time-series data. In combination with techniques from clustering, the approach also allows estimating the significance of the determined breakpoints as well as the key components implicated in the emergence of the breakpoints. Comparative analysis with the existing alternatives demonstrates the power of the approach to identify biologically meaningful breakpoints in diverse time-resolved transcriptomics data sets from the yeast Saccharomyces cerevisiae and the diatom Thalassiosira pseudonana.
Affiliation(s)
- Nooshin Omranian
- Department of Molecular Biology, University of Potsdam, Karl-Liebknecht-Str. 24-25, Haus 20, 14476 Potsdam, Germany
- Systems Biology and Mathematical Modelling Group, Max Planck Institute for Molecular Plant Physiology, Am Muehlenberg 1, 14476 Potsdam, Germany
- Bernd Mueller-Roeber
- Department of Molecular Biology, University of Potsdam, Karl-Liebknecht-Str. 24-25, Haus 20, 14476 Potsdam, Germany
- Zoran Nikoloski
- Systems Biology and Mathematical Modelling Group, Max Planck Institute for Molecular Plant Physiology, Am Muehlenberg 1, 14476 Potsdam, Germany
|
39
|
Vavoulis DV, Francescatto M, Heutink P, Gough J. DGEclust: differential expression analysis of clustered count data. Genome Biol 2015; 16:39. [PMID: 25853652 PMCID: PMC4365804 DOI: 10.1186/s13059-015-0604-6] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 12/11/2014] [Accepted: 02/03/2015] [Indexed: 11/10/2022] Open
Abstract
We present a statistical methodology, DGEclust, for differential expression analysis of digital expression data. Our method treats differential expression as a form of clustering, thus unifying these two concepts. Furthermore, it simultaneously addresses the problem of how many clusters are supported by the data and uncertainty in parameter estimation. DGEclust successfully identifies differentially expressed genes under a number of different scenarios, maintaining a low error rate and an excellent control of its false discovery rate with reasonable computational requirements. It is formulated to perform particularly well on low-replicated data and be applicable to multi-group data. DGEclust is available at http://dvav.github.io/dgeclust/.
Affiliation(s)
- Margherita Francescatto
- Genome Biology of Neurodegenerative Diseases, Deutsches Zentrum für Neurodegenerative Erkrankungen, Tübingen, Germany
- Peter Heutink
- Genome Biology of Neurodegenerative Diseases, Deutsches Zentrum für Neurodegenerative Erkrankungen, Tübingen, Germany
- Julian Gough
- Department of Computer Science, University of Bristol, Bristol, UK
|
40
|
Milone DH, Stegmayer G, López M, Kamenetzky L, Carrari F. Improving clustering with metabolic pathway data. BMC Bioinformatics 2014; 15:101. [PMID: 24717120 PMCID: PMC4002909 DOI: 10.1186/1471-2105-15-101] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 04/30/2013] [Accepted: 03/25/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND It is common practice in bioinformatics to validate each group returned by a clustering algorithm through manual analysis, according to a priori biological knowledge. This procedure helps to find functionally related patterns and to propose hypotheses for their behavior and the biological processes involved. This knowledge is therefore used only as a second step, after the data have been clustered according to their expression patterns alone. It could thus be very useful to improve the clustering of biological data by incorporating prior knowledge into the cluster formation itself, in order to enhance the biological value of the clusters. RESULTS A novel training algorithm for clustering is presented, which evaluates the biological internal connections of the data points while the clusters are being formed. Within this training algorithm, the calculation of distances between data points and neuron centroids includes a new term based on information from well-known metabolic pathways. Standard self-organizing map (SOM) training and biologically-inspired SOM (bSOM) training were tested with two real data sets of transcripts and metabolites from Solanum lycopersicum and Arabidopsis thaliana species. Classical data mining validation measures were used to evaluate the clustering solutions obtained by both algorithms. Moreover, a new measure that takes into account the biological connectivity of the clusters was applied. The results of bSOM show important improvements in convergence and performance for the proposed clustering method in comparison to standard SOM training, in particular from the application point of view. CONCLUSIONS Analyses of the clusters obtained with bSOM indicate that including biological information during training can certainly increase the biological value of the clusters found with the proposed method.
It is worth highlighting that this has effectively improved the results, which can simplify their further analysis. The algorithm is available as a web-demo at http://fich.unl.edu.ar/sinc/web-demo/bsom-lite/. The source code and the data sets supporting the results of this article are available at http://sourceforge.net/projects/sourcesinc/files/bsom.
Affiliation(s)
- Diego H Milone
- Research Center for Signals, Systems and Computational Intelligence, sinc(i), FICH-UNL, CONICET, Ciudad Universitaria UNL, (3000) Santa Fe, Argentina.
|
41
|
Sirinukunwattana K, Savage RS, Bari MF, Snead DRJ, Rajpoot NM. Bayesian hierarchical clustering for studying cancer gene expression data with unknown statistics. PLoS One 2013; 8:e75748. [PMID: 24194826 PMCID: PMC3806770 DOI: 10.1371/journal.pone.0075748] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Received: 03/01/2013] [Accepted: 08/19/2013] [Indexed: 11/29/2022] Open
Abstract
Clustering analysis is an important tool in studying gene expression data. The Bayesian hierarchical clustering (BHC) algorithm can automatically infer the number of clusters and uses Bayesian model selection to improve clustering quality. In this paper, we present an extension of the BHC algorithm. Our Gaussian BHC (GBHC) algorithm represents data as a mixture of Gaussian distributions. It uses a normal-gamma distribution as a conjugate prior on the mean and precision of each of the Gaussian components. We tested GBHC over 11 cancer and 3 synthetic datasets. The results on cancer datasets show that in sample clustering, GBHC on average produces a clustering partition that is more concordant with the ground truth than those obtained from other commonly used algorithms. Furthermore, GBHC frequently infers a number of clusters that is close to the ground truth. In gene clustering, GBHC also produces a clustering partition that is more biologically plausible than several other state-of-the-art methods. This suggests GBHC as an alternative tool for studying gene expression data. The implementation of GBHC is available at https://sites.google.com/site/gaussianbhc/
Affiliation(s)
- Richard S. Savage
- Warwick Systems Biology Centre, The University of Warwick, Coventry, United Kingdom
- Muhammad F. Bari
- Department of Pathology, University Hospitals Coventry & Warwickshire, Coventry, United Kingdom
- Divisions of Reproduction and Metabolic & Vascular Health, Warwick Medical School, Coventry, United Kingdom
- David R. J. Snead
- Department of Pathology, University Hospitals Coventry & Warwickshire, Coventry, United Kingdom
- Divisions of Reproduction and Metabolic & Vascular Health, Warwick Medical School, Coventry, United Kingdom
- Nasir M. Rajpoot
- Department of Computer Science, The University of Warwick, Coventry, United Kingdom
- Department of Computer Science and Engineering, Qatar University, Doha, Qatar
|
42
|
Marx H, Lemeer S, Klaeger S, Rattei T, Kuster B. MScDB: a mass spectrometry-centric protein sequence database for proteomics. J Proteome Res 2013; 12:2386-98. [PMID: 23627461 DOI: 10.1021/pr400215r] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 11/29/2022]
Abstract
Protein sequence databases are indispensable tools for life science research including mass spectrometry (MS)-based proteomics. In current database construction processes, sequence similarity clustering is used to reduce redundancies in the source data. Albeit powerful, it ignores the peptide-centric nature of proteomic data and the fact that MS is able to distinguish similar sequences. Therefore, we introduce an approach that structures the protein sequence space at the peptide level using theoretical and empirical information from large-scale proteomic data to generate a mass spectrometry-centric protein sequence database (MScDB). The core modules of MScDB are an in-silico proteolytic digest and a peptide-centric clustering algorithm that groups protein sequences that are indistinguishable by mass spectrometry. Analysis of various MScDB use cases against five complex human proteomes resulted in 69 peptide identifications not present in UniProtKB as well as 79 putative single amino acid polymorphisms. MScDB retains ~99% of the identifications in comparison to common databases despite a 3-48% increase in the theoretical peptide search space (but comparable protein sequence space). In addition, MScDB enables cross-species applications such as human/mouse graft models, and our results suggest that the uncertainty in protein assignments to one species can be smaller than 20%.
Affiliation(s)
- Harald Marx
- Technische Universität München, Emil-Erlenmeyer-Forum 5, 85354 Freising, Germany
|
43
|
Darkins R, Cooke EJ, Ghahramani Z, Kirk PDW, Wild DL, Savage RS. Accelerating Bayesian hierarchical clustering of time series data with a randomised algorithm. PLoS One 2013; 8:e59795. [PMID: 23565168 PMCID: PMC3614914 DOI: 10.1371/journal.pone.0059795] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Received: 09/25/2012] [Accepted: 02/19/2013] [Indexed: 11/19/2022] Open
Abstract
We live in an era of abundant data. This has necessitated the development of new and innovative statistical algorithms to get the most from experimental data. For example, faster algorithms make practical the analysis of larger genomic data sets, allowing us to extend the utility of cutting-edge statistical methods. We present a randomised algorithm that accelerates the clustering of time series data using the Bayesian Hierarchical Clustering (BHC) statistical method. BHC is a general method for clustering any discretely sampled time series data. In this paper we focus on a particular application to microarray gene expression data. We define and analyse the randomised algorithm, before presenting results on both synthetic and real biological data sets. We show that the randomised algorithm leads to substantial gains in speed with minimal loss in clustering quality. The randomised time series BHC algorithm is available as part of the R package BHC, which is available for download from Bioconductor (version 2.10 and above) via http://bioconductor.org/packages/2.10/bioc/html/BHC.html. We have also made available a set of R scripts which can be used to reproduce the analyses carried out in this paper. These are available from https://sites.google.com/site/randomisedbhc/.
Affiliation(s)
- Robert Darkins
- Systems Biology Centre, University of Warwick, Coventry, United Kingdom
- Emma J. Cooke
- Department of Chemistry, University of Warwick, Coventry, United Kingdom
- Zoubin Ghahramani
- Department of Engineering, University of Cambridge, Cambridge, United Kingdom
- Paul D. W. Kirk
- Systems Biology Centre, University of Warwick, Coventry, United Kingdom
- David L. Wild
- Systems Biology Centre, University of Warwick, Coventry, United Kingdom
- Richard S. Savage
- Systems Biology Centre, University of Warwick, Coventry, United Kingdom
|
44
|
The blind men and the elephant: on meeting the problem of multiple truths in data from clustering and pattern mining perspectives. Mach Learn 2013. [DOI: 10.1007/s10994-013-5334-y] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Indexed: 10/27/2022]
|
45
|
Pestian J, Matykiewicz P, Holland-Bouley K, Standridge S, Spencer M, Glauser T. Selecting anti-epileptic drugs: a pediatric epileptologist's view, a computer's view. Acta Neurol Scand 2013; 127:208-15. [PMID: 22998126 PMCID: PMC3574228 DOI: 10.1111/ane.12002] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Accepted: 07/25/2012] [Indexed: 01/13/2023]
Abstract
OBJECTIVE To identify which clinical characteristics are important to include in clinical decision support systems developed for antiepileptic drug (AED) selection. METHODS Twenty-three epileptologists from the Childhood Absence Epilepsy network completed a survey related to AED selection. Using cluster analysis, their responses were classified into subject matter groups and weighted for importance. RESULTS Five distinct subject matter groups were identified and their relative weightings for importance were determined: disease characteristics (weight 4.8 ± 0.049), drug toxicities (3.82 ± 0.098), medical history (3.12 ± 0.102), systemic characteristics (2.57 ± 0.048) and genetic characteristics (1.08 ± 0.046). CONCLUSION Research about prescribing patterns exists, but research on how such data can be used to train advanced technology is novel. As machine learning algorithms become more and more prevalent in clinical decision support systems, developing methods for determining which data should be part of those algorithms is equally important.
Affiliation(s)
- J Pestian
- Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA.
|
46
|
Verbanck M, Lê S, Pagès J. A new unsupervised gene clustering algorithm based on the integration of biological knowledge into expression data. BMC Bioinformatics 2013; 14:42. [PMID: 23387364 PMCID: PMC3635920 DOI: 10.1186/1471-2105-14-42] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Received: 04/04/2012] [Accepted: 01/18/2013] [Indexed: 12/03/2022] Open
Abstract
Background Gene clustering algorithms are widely used by biologists when analysing omics data. Classical gene clustering strategies are based on the use of expression data only, directly as in Heatmaps, or indirectly as in clustering based on coexpression networks for instance. However, the classical strategies may not be sufficient to bring out all potential relationships amongst genes. Results We propose a new unsupervised gene clustering algorithm based on the integration of external biological knowledge, such as Gene Ontology annotations, into expression data. We introduce a new distance between genes which consists in integrating biological knowledge into the analysis of expression data. Therefore, two genes are close if they have both similar expression profiles and similar functional profiles at once. Then a classical algorithm (e.g. K-means) is used to obtain gene clusters. In addition, we propose an automatic evaluation procedure of gene clusters. This procedure is based on two indicators which measure the global coexpression and biological homogeneity of gene clusters. They are associated with hypothesis testing, which allows each indicator to be complemented with a p-value. Our clustering algorithm is compared to the Heatmap clustering and the clustering based on gene coexpression network, both on simulated and real data. In both cases, it outperforms the other methodologies as it provides the highest proportion of significantly coexpressed and biologically homogeneous gene clusters, which are good candidates for interpretation. Conclusion Our new clustering algorithm provides a higher proportion of good candidates for interpretation. Therefore, we expect the interpretation of these clusters to help biologists to formulate new hypotheses on the relationships amongst genes.
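Note: the idea that two genes are close only when both their expression profiles and their functional profiles are similar can be illustrated with a small sketch. This is not the distance defined in the article, which the abstract does not specify. The weighted sum, the Jaccard distance on annotation sets, the function names, and the `weight` parameter are all assumptions made for illustration, and expression profiles are assumed to be standardized so the two terms are on comparable scales.

```python
import math

def expression_distance(x, y):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def functional_distance(terms_a, terms_b):
    """Jaccard distance between two sets of annotation terms (0 = identical sets)."""
    union = terms_a | terms_b
    if not union:
        return 0.0  # two unannotated genes are treated as functionally identical
    return 1.0 - len(terms_a & terms_b) / len(union)

def combined_distance(x, y, terms_a, terms_b, weight=0.5):
    """Hypothetical combined distance: a weighted sum of the two components."""
    d_expr = expression_distance(x, y)
    d_func = functional_distance(terms_a, terms_b)
    return weight * d_expr + (1.0 - weight) * d_func

# Usage with made-up profiles and placeholder annotation labels.
d = combined_distance((0.0, 0.0), (3.0, 4.0), {"term_a"}, {"term_a"})
print(d)
```

A pairwise matrix of such distances could then be passed to any distance-based clustering routine; K-means itself needs coordinates, so in practice the combined information would have to be embedded first.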
Affiliation(s)
- Marie Verbanck
- Applied Mathematics Department, Agrocampus Ouest, 65, rue de Saint-Brieuc, Rennes, France.
|
47
|
Mukhopadhyay A, Maulik U, Bandyopadhyay S. An Interactive Approach to Multiobjective Clustering of Gene Expression Patterns. IEEE Trans Biomed Eng 2013; 60:35-41. [DOI: 10.1109/tbme.2012.2220765] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.9] [Indexed: 11/09/2022]
|
48
|
Sîrbu A, Kerr G, Crane M, Ruskin HJ. RNA-Seq vs dual- and single-channel microarray data: sensitivity analysis for differential expression and clustering. PLoS One 2012; 7:e50986. [PMID: 23251411 PMCID: PMC3518479 DOI: 10.1371/journal.pone.0050986] [Citation(s) in RCA: 60] [Impact Index Per Article: 5.0] [Received: 10/05/2011] [Accepted: 10/30/2012] [Indexed: 01/13/2023] Open
Abstract
With the fast development of high-throughput sequencing technologies, a new generation of genome-wide gene expression measurements is under way. This is based on mRNA sequencing (RNA-seq), which complements the already mature technology of microarrays, and is expected to overcome some of the latter's disadvantages. These RNA-seq data pose new challenges, however, as strengths and weaknesses have yet to be fully identified. Ideally, Next (or Second) Generation Sequencing measures can be integrated for more comprehensive gene expression investigation to facilitate analysis of whole regulatory networks. At present, however, the nature of these data is not very well understood. In this paper we study three alternative gene expression time series datasets for Drosophila melanogaster embryo development, in order to compare three measurement techniques: RNA-seq, single-channel and dual-channel microarrays. The aim is to study the state of the art for the three technologies, with a view to assessing overlapping features, data compatibility and integration potential, in the context of time series measurements. This involves using established tools for each of the three different technologies, and technical and biological replicates (for RNA-seq and microarrays, respectively), due to the limited availability of biological RNA-seq replicates for time series data. The approach consists of a sensitivity analysis for differential expression and clustering. In general, the RNA-seq dataset displayed the highest sensitivity to differential expression. The single-channel data performed similarly for the differentially expressed genes common to the gene sets considered. Cluster analysis was used to identify different features of the gene space for the three datasets, with higher similarities found for the RNA-seq and single-channel microarray datasets.
Affiliation(s)
- Alina Sîrbu
- Centre for Scientific Computing and Complex Systems Modelling, Dublin City University, Dublin, Ireland.
|
49
|
Kirk P, Griffin JE, Savage RS, Ghahramani Z, Wild DL. Bayesian correlated clustering to integrate multiple datasets. Bioinformatics 2012; 28:3290-7. [PMID: 23047558 PMCID: PMC3519452 DOI: 10.1093/bioinformatics/bts595] [Citation(s) in RCA: 158] [Impact Index Per Article: 13.2] [Indexed: 12/03/2022]
Abstract
Motivation: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct but often complementary information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured through parameters that describe the agreement among the datasets. Results: Using a set of six artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real Saccharomyces cerevisiae datasets. In the two-dataset case, we show that MDI’s performance is comparable with the present state-of-the-art. We then move beyond the capabilities of current approaches and integrate gene expression, chromatin immunoprecipitation–chip and protein–protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques, as well as to non-integrative approaches, demonstrate that MDI is competitive, while also providing information that would be difficult or impossible to extract using other methods. Availability: A Matlab implementation of MDI is available from http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/. Contact: D.L.Wild@warwick.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Paul Kirk
- Systems Biology Centre, University of Warwick, Coventry, CV4 7AL, UK
|
50
|
Gao C, Weisman D, Gou N, Ilyin V, Gu AZ. Analyzing high dimensional toxicogenomic data using consensus clustering. Environ Sci Technol 2012; 46:8413-8421. [PMID: 22703334 DOI: 10.1021/es3000454] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 06/01/2023]
Abstract
Rapid development of high-throughput toxicogenomics technologies has created new approaches to screen environmental samples for mechanistic toxicity assessment. However, challenges remain in the analysis, especially clustering of the resulting high-dimensional data. Because of the lack of commonly accepted validation methods, it is difficult to compare clustering results between studies or to identify the key experimental or data features that impact the clustering results. We applied consensus clustering (CC), an approach that clusters the input data repeatedly through iterative resampling, and identifies frequently occurring high-confidence clusters. We used CC to analyze a set of high dimensional transcriptomics data with temporal resolution, which were generated using our E. coli whole-cell array system for a diverse variety of toxicants at different dose concentrations. The CC analysis allowed us to evaluate the clustering results' robustness and sensitivity against a number of conditions that represent the common variations in high-throughput experiments, including noisy data, subsets of treatments, subsets of reporter genes, and subsets of time points. We demonstrated the value of utilizing rich time-series data and underscored the importance of careful selection of sampling times for a given experimental system. The results also indicated that temporal data compression using our proposed Transcriptional Effect Level Index (TELI) concept followed by CC largely conserved the cluster resolution. We also found that for our cellular stress response ensemble-based high-throughput transcriptomics assay platform, the size and composition of the reporter gene set are critical factors that affect the resulting coherency of clusters. Taken together, these results demonstrated that more robust clustering approaches such as CC may be valuable in analyzing high-dimensional toxicogenomic data sets.
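Note: the resampling idea behind consensus clustering can be sketched in a few lines. The code below is a minimal illustration, not the authors' pipeline. It assumes plain k-means as the base clusterer, an 80% subsampling fraction, and toy one-dimensional data; the function names `kmeans` and `consensus_matrix` are hypothetical. Entry (i, j) of the result is the fraction of resamples containing both items in which they were assigned to the same cluster.

```python
import random

def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means on coordinate tuples; returns one label per point."""
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Move each centroid to the mean of its members; empty clusters keep theirs.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def consensus_matrix(points, k, n_resamples=50, frac=0.8, seed=0):
    """Co-clustering frequency of every pair of items over random subsamples."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]  # times i and j shared a cluster
    sampled = [[0] * n for _ in range(n)]   # times i and j were drawn together
    m = max(k, int(frac * n))
    for _ in range(n_resamples):
        idx = rng.sample(range(n), m)
        labels = kmeans([points[i] for i in idx], k, rng)
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                if labels[a] == labels[b]:
                    together[i][j] += 1
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]

# Usage on two well-separated toy groups.
toy = [(0.0,), (0.5,), (1.0,), (10.0,), (10.5,), (11.0,)]
matrix = consensus_matrix(toy, k=2)
print([round(v, 2) for v in matrix[0]])
```

Pairs whose consensus value stays near 1 across resamples correspond to the "frequently occurring high-confidence clusters" described above, while intermediate values flag items whose assignment is sensitive to the sample drawn.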
Affiliation(s)
- Ce Gao
- Department of Civil and Environmental Engineering, Northeastern University, Boston, Massachusetts 02115, USA
|