1
Moutinho D, Mendes VM, Caula A, Madeira SC, Baldeiras I, Guerreiro M, Cardoso S, Gobom J, Zetterberg H, Santana I, De Mendonça A, Aidos H, Manadas B. Pathophysiological subtypes of mild cognitive impairment due to Alzheimer's disease identified by CSF proteomics. Transl Neurodegener 2024; 13:19. [PMID: 38594717 PMCID: PMC11003166 DOI: 10.1186/s40035-024-00412-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/29/2023] [Accepted: 03/22/2024] [Indexed: 04/11/2024] Open
Affiliation(s)
- Daniela Moutinho
- Faculty of Medicine, University of Lisbon, 1649-028, Lisbon, Portugal
| | - Vera M Mendes
- CNC - Center for Neuroscience and Cell Biology, University of Coimbra, 3004-504, Coimbra, Portugal
- CIBB - Centre for Innovative Biomedicine and Biotechnology, University of Coimbra, Coimbra, Portugal
| | - Alessandro Caula
- LASIGE, Faculty of Sciences, University of Lisbon, 1649-028, Lisbon, Portugal
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
| | - Sara C Madeira
- LASIGE, Faculty of Sciences, University of Lisbon, 1649-028, Lisbon, Portugal
| | - Inês Baldeiras
- CNC - Center for Neuroscience and Cell Biology, University of Coimbra, 3004-504, Coimbra, Portugal
- CIBB - Centre for Innovative Biomedicine and Biotechnology, University of Coimbra, Coimbra, Portugal
- Faculty of Medicine, University of Coimbra, Coimbra, Portugal
| | - Manuela Guerreiro
- Faculty of Medicine, University of Lisbon, 1649-028, Lisbon, Portugal
| | - Sandra Cardoso
- Faculty of Medicine, University of Lisbon, 1649-028, Lisbon, Portugal
| | - Johan Gobom
- Department of Psychiatry and Neurochemistry, Institute of Neuroscience and Physiology, The Sahlgrenska Academy at the University of Gothenburg, S-431 80, Mölndal, Sweden
- Clinical Neurochemistry Laboratory, Sahlgrenska University Hospital, S-431 80, Mölndal, Sweden
| | - Henrik Zetterberg
- Department of Psychiatry and Neurochemistry, Institute of Neuroscience and Physiology, The Sahlgrenska Academy at the University of Gothenburg, S-431 80, Mölndal, Sweden
- Clinical Neurochemistry Laboratory, Sahlgrenska University Hospital, S-431 80, Mölndal, Sweden
- Department of Neurodegenerative Disease, UCL Institute of Neurology, Queen Square, London, WC1N 3BG, UK
- Hong Kong Center for Neurodegenerative Diseases, Clear Water Bay, Hong Kong, China
- Wisconsin Alzheimer's Disease Research Center, School of Medicine and Public Health, University of Wisconsin, University of Wisconsin-Madison, Madison, WI, 53792, USA
- UK Dementia Research Institute at UCL, London, WC1N 3BG, UK
| | - Isabel Santana
- CNC - Center for Neuroscience and Cell Biology, University of Coimbra, 3004-504, Coimbra, Portugal
- CIBB - Centre for Innovative Biomedicine and Biotechnology, University of Coimbra, Coimbra, Portugal
- Faculty of Medicine, University of Coimbra, Coimbra, Portugal
- Department of Neurology, Hospital and University Centre of Coimbra, Coimbra, Portugal
| | | | - Helena Aidos
- LASIGE, Faculty of Sciences, University of Lisbon, 1649-028, Lisbon, Portugal
| | - Bruno Manadas
- CNC - Center for Neuroscience and Cell Biology, University of Coimbra, 3004-504, Coimbra, Portugal.
- CIBB - Centre for Innovative Biomedicine and Biotechnology, University of Coimbra, Coimbra, Portugal.
2
Castanho EN, Lobo JP, Henriques R, Madeira SC. Correction: G-bic: generating synthetic benchmarks for biclustering. BMC Bioinformatics 2024; 25:16. [PMID: 38212689 PMCID: PMC10782781 DOI: 10.1186/s12859-023-05628-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/13/2024] Open
Affiliation(s)
- Eduardo N Castanho
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 016, 1749-016, Lisboa, Portugal.
| | - João P Lobo
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 016, 1749-016, Lisboa, Portugal
| | - Rui Henriques
- INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1900-001, Lisboa, Portugal
| | - Sara C Madeira
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 016, 1749-016, Lisboa, Portugal
3
Abstract
BACKGROUND Biclustering is increasingly used in biomedical data analysis, recommendation tasks, and text mining domains, with hundreds of biclustering algorithms proposed. When assessing the performance of these algorithms, real datasets alone are not enough, as they do not offer a solid ground truth. Synthetic data overcome this limitation by producing reference solutions to be compared with the found patterns. However, generating synthetic datasets is challenging since the generated data must ensure reproducibility, pattern representativity, and real data resemblance. RESULTS We propose G-Bic, a dataset generator conceived to produce synthetic benchmarks for the normative assessment of biclustering algorithms. Beyond expanding on aspects of pattern coherence, data quality, and positioning properties, it further handles specificities related to mixed-type datasets and time-series data. G-Bic has the flexibility to replicate real data regularities from diverse domains. We provide the default configurations to generate reproducible benchmarks to evaluate and compare diverse aspects of biclustering algorithms. Additionally, we discuss empirical strategies to simulate the properties of real data. CONCLUSION G-Bic is a parametrizable generator for biclustering analysis, offering a solid means to robustly assess biclustering solutions according to internal and external metrics.
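The idea of a planted ground truth described in this abstract can be illustrated with a minimal Python sketch. It is not G-Bic's implementation or API: the function names, the constant-value pattern, the uniform background, and the cell-level Jaccard score are all assumptions chosen for illustration.

```python
import random


def plant_constant_bicluster(n_rows, n_cols, bic_rows, bic_cols, value=5.0, seed=0):
    """Return (matrix, planted), where planted is the ground-truth (rows, cols).

    Hypothetical helper: background cells are uniform noise in [0, 1) and the
    planted bicluster holds one constant value.
    """
    rng = random.Random(seed)  # a fixed seed keeps the benchmark reproducible
    matrix = [[rng.random() for _ in range(n_cols)] for _ in range(n_rows)]
    rows = sorted(rng.sample(range(n_rows), bic_rows))
    cols = sorted(rng.sample(range(n_cols), bic_cols))
    for r in rows:
        for c in cols:
            matrix[r][c] = value
    return matrix, (rows, cols)


def jaccard_cells(found, truth):
    """External metric (an assumed choice): Jaccard index over covered cells."""
    cells_a = {(r, c) for r in found[0] for c in found[1]}
    cells_b = {(r, c) for r in truth[0] for c in truth[1]}
    union = cells_a | cells_b
    return len(cells_a & cells_b) / len(union) if union else 1.0


matrix, truth = plant_constant_bicluster(20, 10, bic_rows=4, bic_cols=3)
# A biclustering algorithm's output would be scored against `truth` here;
# scoring the ground truth against itself gives the maximum of 1.0.
print(jaccard_cells(truth, truth))
```

Because the planted rows and columns are returned alongside the matrix, any algorithm's output can be compared with a known reference, which real datasets cannot provide.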
Affiliation(s)
- Eduardo N Castanho
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 016, 1749-016, Lisbon, Portugal.
| | - João P Lobo
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 016, 1749-016, Lisbon, Portugal
| | - Rui Henriques
- INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1900-001, Lisbon, Portugal
| | - Sara C Madeira
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 016, 1749-016, Lisbon, Portugal
4
Tavazzi E, Longato E, Vettoretti M, Aidos H, Trescato I, Roversi C, Martins AS, Castanho EN, Branco R, Soares DF, Guazzo A, Birolo G, Pala D, Bosoni P, Chiò A, Manera U, de Carvalho M, Miranda B, Gromicho M, Alves I, Bellazzi R, Dagliati A, Fariselli P, Madeira SC, Di Camillo B. Artificial intelligence and statistical methods for stratification and prediction of progression in amyotrophic lateral sclerosis: A systematic review. Artif Intell Med 2023; 142:102588. [PMID: 37316101 DOI: 10.1016/j.artmed.2023.102588] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/22/2022] [Revised: 04/14/2023] [Accepted: 05/16/2023] [Indexed: 06/16/2023]
Abstract
BACKGROUND Amyotrophic Lateral Sclerosis (ALS) is a fatal neurodegenerative disorder characterised by the progressive loss of motor neurons in the brain and spinal cord. The fact that ALS's disease course is highly heterogeneous, and its determinants not fully known, combined with ALS's relatively low prevalence, renders the successful application of artificial intelligence (AI) techniques particularly arduous. OBJECTIVE This systematic review aims at identifying areas of agreement and unanswered questions regarding two notable applications of AI in ALS, namely the automatic, data-driven stratification of patients according to their phenotype, and the prediction of ALS progression. Differently from previous works, this review is focused on the methodological landscape of AI in ALS. METHODS We conducted a systematic search of the Scopus and PubMed databases, looking for studies on data-driven stratification methods based on unsupervised techniques resulting in (A) automatic group discovery or (B) a transformation of the feature space allowing patient subgroups to be identified; and for studies on internally or externally validated methods for the prediction of ALS progression. We described the selected studies according to the following characteristics, when applicable: variables used, methodology, splitting criteria and number of groups, prediction outcomes, validation schemes, and metrics. RESULTS Of the starting 1604 unique reports (2837 combined hits between Scopus and PubMed), 239 were selected for thorough screening, leading to the inclusion of 15 studies on patient stratification, 28 on prediction of ALS progression, and 6 on both stratification and prediction. In terms of variables used, most stratification and prediction studies included demographics and features derived from the ALSFRS or ALSFRS-R scores, which were also the main prediction targets. 
The most represented stratification methods were K-means, and hierarchical and expectation-maximisation clustering; while random forests, logistic regression, the Cox proportional hazard model, and various flavours of deep learning were the most widely used prediction methods. Predictive model validation was, albeit unexpectedly, quite rarely performed in absolute terms (leading to the exclusion of 78 eligible studies), with the overwhelming majority of included studies resorting to internal validation only. CONCLUSION This systematic review highlighted a general agreement in terms of input variable selection for both stratification and prediction of ALS progression, and in terms of prediction targets. A striking lack of validated models emerged, as well as a general difficulty in reproducing many published studies, mainly due to the absence of the corresponding parameter lists. While deep learning seems promising for prediction applications, its superiority with respect to traditional methods has not been established; there is, instead, ample room for its application in the subfield of patient stratification. Finally, an open question remains on the role of new environmental and behavioural variables collected via novel, real-time sensors.
Affiliation(s)
- Erica Tavazzi
- Department of Information Engineering, University of Padova, Via Gradenigo 6/b, Padua, 35131, Italy
| | - Enrico Longato
- Department of Information Engineering, University of Padova, Via Gradenigo 6/b, Padua, 35131, Italy
| | - Martina Vettoretti
- Department of Information Engineering, University of Padova, Via Gradenigo 6/b, Padua, 35131, Italy
| | - Helena Aidos
- LASIGE and Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, Campo Grande, Lisbon, 1749-016, Portugal
| | - Isotta Trescato
- Department of Information Engineering, University of Padova, Via Gradenigo 6/b, Padua, 35131, Italy
| | - Chiara Roversi
- Department of Information Engineering, University of Padova, Via Gradenigo 6/b, Padua, 35131, Italy
| | - Andreia S Martins
- LASIGE and Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, Campo Grande, Lisbon, 1749-016, Portugal
| | - Eduardo N Castanho
- LASIGE and Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, Campo Grande, Lisbon, 1749-016, Portugal
| | - Ruben Branco
- LASIGE and Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, Campo Grande, Lisbon, 1749-016, Portugal
| | - Diogo F Soares
- LASIGE and Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, Campo Grande, Lisbon, 1749-016, Portugal
| | - Alessandro Guazzo
- Department of Information Engineering, University of Padova, Via Gradenigo 6/b, Padua, 35131, Italy
| | - Giovanni Birolo
- Department of Medical Sciences, University of Torino, Corso Dogliotti 14, Turin, 10126, Italy
| | - Daniele Pala
- Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Via Ferrata 5, Pavia, 27100, Italy
| | - Pietro Bosoni
- Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Via Ferrata 5, Pavia, 27100, Italy
| | - Adriano Chiò
- Department of Neurosciences "Rita Levi Montalcini", University of Turin, Via Cherasco 15, Turin, 10126, Italy
| | - Umberto Manera
- Department of Neurosciences "Rita Levi Montalcini", University of Turin, Via Cherasco 15, Turin, 10126, Italy
| | - Mamede de Carvalho
- Faculdade de Medicina, Instituto de Medicina Molecular João Lobo Antunes, Universidade de Lisboa, Av. Prof. Egas Moniz, Lisbon, 1649-028, Portugal
| | - Bruno Miranda
- Faculdade de Medicina, Instituto de Medicina Molecular João Lobo Antunes, Universidade de Lisboa, Av. Prof. Egas Moniz, Lisbon, 1649-028, Portugal
| | - Marta Gromicho
- Faculdade de Medicina, Instituto de Medicina Molecular João Lobo Antunes, Universidade de Lisboa, Av. Prof. Egas Moniz, Lisbon, 1649-028, Portugal
| | - Inês Alves
- Faculdade de Medicina, Instituto de Medicina Molecular João Lobo Antunes, Universidade de Lisboa, Av. Prof. Egas Moniz, Lisbon, 1649-028, Portugal
| | - Riccardo Bellazzi
- Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Via Ferrata 5, Pavia, 27100, Italy
| | - Arianna Dagliati
- Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Via Ferrata 5, Pavia, 27100, Italy
| | - Piero Fariselli
- Department of Medical Sciences, University of Torino, Corso Dogliotti 14, Turin, 10126, Italy
| | - Sara C Madeira
- LASIGE and Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, Campo Grande, Lisbon, 1749-016, Portugal
| | - Barbara Di Camillo
- Department of Information Engineering, University of Padova, Via Gradenigo 6/b, Padua, 35131, Italy; Department of Comparative Biomedicine and Food Science, University of Padova, Agripolis, Viale dell'Università, 16, Legnaro (PD), 35020, Italy.
5
Soares DF, Henriques R, Gromicho M, de Carvalho M, Madeira SC. Triclustering-based classification of longitudinal data for prognostic prediction: targeting relevant clinical endpoints in amyotrophic lateral sclerosis. Sci Rep 2023; 13:6182. [PMID: 37061549 PMCID: PMC10105751 DOI: 10.1038/s41598-023-33223-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2022] [Accepted: 04/10/2023] [Indexed: 04/17/2023] Open
Abstract
This work proposes a new class of explainable prognostic models for longitudinal data classification using triclusters. A new temporally constrained triclustering algorithm, termed TCtriCluster, is proposed to comprehensively find informative temporal patterns common to a subset of patients in a subset of features (triclusters), and use them as discriminative features within a state-of-the-art classifier with guarantees of interpretability. The proposed approach further enhances prediction with the potentialities of model explainability by revealing clinically relevant disease progression patterns underlying prognostics, describing features used for classification. The proposed methodology is used in the Amyotrophic Lateral Sclerosis (ALS) Portuguese cohort (N = 1321), providing the first comprehensive assessment of the prognostic limits of five notable clinical endpoints: need for non-invasive ventilation (NIV); need for an auxiliary communication device; need for percutaneous endoscopic gastrostomy (PEG); need for a caregiver; and need for a wheelchair. Triclustering-based predictors outperform state-of-the-art alternatives, being able to predict the need for auxiliary communication device (within 180 days) and the need for PEG (within 90 days) with an AUC above 90%. The approach was validated in clinical practice, supporting healthcare professionals in understanding the link between the highly heterogeneous patterns of ALS disease progression and the prognosis.
Affiliation(s)
- Diogo F Soares
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal.
| | - Rui Henriques
- INESC-ID and Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal
| | - Marta Gromicho
- Instituto de Medicina Molecular and Instituto de Fisiologia, Faculdade de Medicina, Universidade de Lisboa, Lisbon, Portugal
| | - Mamede de Carvalho
- Instituto de Medicina Molecular and Instituto de Fisiologia, Faculdade de Medicina, Universidade de Lisboa, Lisbon, Portugal
| | - Sara C Madeira
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal
6
Martins AS, Gromicho M, Pinto S, de Carvalho M, Madeira SC. Learning Prognostic Models Using Disease Progression Patterns: Predicting the Need for Non-Invasive Ventilation in Amyotrophic Lateral Sclerosis. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:2572-2583. [PMID: 33961562 DOI: 10.1109/tcbb.2021.3078362] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 06/12/2023]
Abstract
Amyotrophic Lateral Sclerosis is a devastating neurodegenerative disease causing rapid degeneration of motor neurons and usually leading to death by respiratory failure. Since there is no cure, the goal of treatment is to improve symptoms and prolong survival. Non-invasive Ventilation (NIV) is an effective treatment, leading to extended life expectancy and improved quality of life. In this scenario, it is paramount to predict its need in order to allow preventive or timely administration. In this work, we propose to use itemset mining together with sequential pattern mining to unravel disease presentation patterns together with disease progression patterns by analysing, respectively, static data collected at diagnosis and longitudinal data from patient follow-up. The goal is to use these static and temporal patterns as features in prognostic models, making it possible to take disease progression into account in predictions and promoting model interpretability. As a case study, we predict the need for NIV within 90, 180 and 365 days (short, mid and long-term predictions). The learnt prognostic models are promising. Pattern evaluation through growth rate suggests that bulbar function and phrenic nerve response amplitude, in addition to respiratory function, are significant features for determining patient evolution. This confirms clinical knowledge regarding relevant biomarkers of disease progression towards respiratory insufficiency.
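Pattern evaluation through growth rate, as mentioned in this abstract, can be sketched as follows. This is a minimal illustration of the standard growth-rate definition (support in the target class divided by support in the other class), not the authors' code. The item names (`bulbar_low`, `fvc_low`, `fvc_ok`) and the toy patient sets are hypothetical.

```python
def support(pattern, transactions):
    """Fraction of transactions (sets of items) that contain every item of the pattern."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if pattern <= t)
    return hits / len(transactions)


def growth_rate(pattern, target, others):
    """Support in the target class divided by support in the remaining class.

    Returns infinity when the pattern occurs only in the target class.
    """
    s_target = support(pattern, target)
    s_others = support(pattern, others)
    if s_others == 0:
        return float("inf") if s_target > 0 else 0.0
    return s_target / s_others


# Hypothetical discretised observations for two outcome groups.
needs_niv = [{"bulbar_low", "fvc_low"}, {"bulbar_low"}, {"fvc_low", "bulbar_low"}, {"fvc_ok"}]
no_niv = [{"fvc_ok"}, {"bulbar_low", "fvc_ok"}, {"fvc_ok"}, {"fvc_ok"}]

print(growth_rate({"bulbar_low"}, needs_niv, no_niv))             # 3.0
print(growth_rate({"bulbar_low", "fvc_low"}, needs_niv, no_niv))  # inf
```

A growth rate well above 1 marks a pattern as characteristic of the target class, which is what makes it a candidate discriminative feature.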
7
Soares DF, Henriques R, Gromicho M, de Carvalho M, Madeira SC. Learning prognostic models using a mixture of biclustering and triclustering: Predicting the need for non-invasive ventilation in Amyotrophic Lateral Sclerosis. J Biomed Inform 2022; 134:104172. [PMID: 36055638 DOI: 10.1016/j.jbi.2022.104172] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/22/2021] [Revised: 03/31/2022] [Accepted: 08/15/2022] [Indexed: 11/26/2022]
Abstract
Longitudinal cohort studies of disease progression generally combine temporal features produced under periodic assessments (clinical follow-up) with static features associated with single-time assessments, genetic, psychophysiological, and demographic profiles. Subspace clustering, including biclustering and triclustering stances, enables the discovery of local and discriminative patterns from such multidimensional cohort data. These highly interpretable patterns are relevant to identifying groups of patients with similar traits or progression patterns. Despite their potential, their use for improving predictive tasks in clinical domains remains unexplored. In this work, we propose to learn predictive models from static and temporal data using discriminative patterns, obtained via biclustering and triclustering, as features within a state-of-the-art classifier, thus enhancing model interpretation. triCluster is extended to find time-contiguous triclusters in temporal data (temporal patterns), and a biclustering algorithm is used to discover coherent patterns in static data. The transformed data space, composed of bicluster and tricluster features, captures local and cross-variable associations with discriminative power, yielding unique statistical properties of interest. As a case study, we applied our methodology to follow-up data from Portuguese patients with Amyotrophic Lateral Sclerosis (ALS) to predict the need for non-invasive ventilation (NIV) since the last appointment. The results showed that, in general, our methodology outperformed baseline results using the original features. Furthermore, the bicluster/tricluster-based patterns used by the classifier can be used by clinicians to understand the models by highlighting relevant prognostic patterns.
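The notion of a transformed data space built from pattern features can be sketched in Python. This is an illustrative simplification, not the published method: it assumes each discovered pattern is a set of feature-value conditions and encodes a patient by exact 0/1 matching, whereas the paper may use a different matching criterion. All feature names and values are hypothetical.

```python
def matches(patient, pattern):
    """True when the patient satisfies every feature-value condition of the pattern."""
    return all(patient.get(feature) == value for feature, value in pattern.items())


def to_pattern_space(patients, patterns):
    """Encode each patient as a 0/1 vector with one entry per pattern."""
    return [[int(matches(p, pat)) for pat in patterns] for p in patients]


# Hypothetical static records and two hypothetical bicluster-like patterns.
patients = [
    {"onset": "bulbar", "sex": "F", "fvc": "low"},
    {"onset": "spinal", "sex": "M", "fvc": "normal"},
    {"onset": "bulbar", "sex": "M", "fvc": "low"},
]
patterns = [
    {"onset": "bulbar", "fvc": "low"},
    {"sex": "M"},
]

encoded = to_pattern_space(patients, patterns)
print(encoded)  # [[1, 0], [0, 1], [1, 1]]
```

The encoded rows would then be passed to any classifier. Each column corresponds to a readable pattern, so the features that drive a prediction stay interpretable.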
Affiliation(s)
- Diogo F Soares
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal.
| | - Rui Henriques
- INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal
| | - Marta Gromicho
- Instituto de Medicina Molecular, Instituto de Fisiologia, Faculdade de Medicina, Universidade de Lisboa, Lisbon, Portugal
| | - Mamede de Carvalho
- Instituto de Medicina Molecular, Instituto de Fisiologia, Faculdade de Medicina, Universidade de Lisboa, Lisbon, Portugal
| | - Sara C Madeira
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal.
8
Abstract
BACKGROUND The effectiveness of biclustering, the simultaneous clustering of rows and columns in a data matrix, was shown in gene expression data analysis. Several researchers recognize its potential in other research areas. Nevertheless, the last two decades have witnessed the development of a significant number of biclustering algorithms targeting gene expression data analysis and a lack of consistent studies exploring the capacities of biclustering outside this traditional application domain. RESULTS This work evaluates the potential use of biclustering in fMRI time series data, targeting the Region × Time dimensions by comparing seven state-of-the-art biclustering and three traditional clustering algorithms on artificial and real data. It further proposes a methodology for biclustering evaluation beyond gene expression data analysis. The results, which discuss the use of different search strategies in both artificial and real fMRI time series, showed the superiority of exhaustive biclustering approaches, obtaining the most homogeneous biclusters. However, their high computational costs are a challenge, and further work is needed for the efficient use of biclustering in fMRI data analysis. CONCLUSIONS This work pinpoints avenues for the use of biclustering in spatio-temporal data analysis, in particular neuroscience applications. The proposed evaluation methodology showed evidence of the effectiveness of biclustering in finding local patterns in fMRI time series data. Further work is needed regarding scalability to promote the application in real scenarios.
Affiliation(s)
| | - Helena Aidos
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal
| | - Sara C Madeira
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal.
9
Gromicho M, Leão T, Oliveira Santos M, Pinto S, Carvalho AM, Madeira SC, de Carvalho M. Dynamic Bayesian Networks for stratification of disease progression in Amyotrophic Lateral Sclerosis. Eur J Neurol 2022; 29:2201-2210. [DOI: 10.1111/ene.15357] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/10/2022] [Accepted: 03/31/2022] [Indexed: 11/27/2022]
Affiliation(s)
- Marta Gromicho
- Instituto de Medicina Molecular Faculdade de Medicina Universidade de Lisboa Lisbon Portugal
| | - Tiago Leão
- Instituto Superior Técnico Universidade de Lisboa Lisbon Portugal
| | - Miguel Oliveira Santos
- Instituto de Medicina Molecular Faculdade de Medicina Universidade de Lisboa Lisbon Portugal
- Department of Neurosciences and Mental Health Centro Hospitalar Universitário de Lisboa‐Norte Lisbon Portugal
| | - Susana Pinto
- Instituto de Medicina Molecular Faculdade de Medicina Universidade de Lisboa Lisbon Portugal
| | - Alexandra M. Carvalho
- Instituto de Telecomunicações and Lisbon ELLIS Unit (LUMLIS) Instituto Superior Técnico Universidade de Lisboa Lisbon Portugal
| | - Sara C. Madeira
- LASIGE Faculdade de Ciências Universidade de Lisboa Lisbon Portugal
| | - Mamede de Carvalho
- Instituto de Medicina Molecular Faculdade de Medicina Universidade de Lisboa Lisbon Portugal
- Department of Neurosciences and Mental Health Centro Hospitalar Universitário de Lisboa‐Norte Lisbon Portugal
10
Leão T, Madeira SC, Gromicho M, de Carvalho M, Carvalho AM. Learning dynamic Bayesian networks from time-dependent and time-independent data: Unraveling disease progression in Amyotrophic Lateral Sclerosis. J Biomed Inform 2021; 117:103730. [PMID: 33737206 DOI: 10.1016/j.jbi.2021.103730] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/04/2020] [Revised: 01/17/2021] [Accepted: 02/25/2021] [Indexed: 10/21/2022]
Abstract
Amyotrophic lateral sclerosis (ALS) is a neurodegenerative disease causing patients to quickly lose motor neurons. The disease is characterized by a fast functional impairment and ventilatory decline, leading most patients to die from respiratory failure. To estimate when patients should get ventilatory support, it is helpful to adequately profile the disease progression. For this purpose, we use dynamic Bayesian networks (DBNs), a machine learning model, that graphically represents the conditional dependencies among variables. However, the standard DBN framework only includes dynamic (time-dependent) variables, while most ALS datasets have dynamic and static (time-independent) observations. Therefore, we propose the sdtDBN framework, which learns optimal DBNs with static and dynamic variables. Besides learning DBNs from data, with polynomial-time complexity in the number of variables, the proposed framework enables the user to insert prior knowledge and to make inference in the learned DBNs. We use sdtDBNs to study the progression of 1214 patients from a Portuguese ALS dataset. First, we predict the values of every functional indicator in the patients' consultations, achieving results competitive with state-of-the-art studies. Then, we determine the influence of each variable in patients' decline before and after getting ventilatory support. This insightful information can lead clinicians to pay particular attention to specific variables when evaluating the patients, thus improving prognosis. The case study with ALS shows that sdtDBNs are a promising predictive and descriptive tool, which can also be applied to assess the progression of other diseases, given time-dependent and time-independent clinical observations.
Affiliation(s)
- Tiago Leão
- Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal.
| | - Sara C Madeira
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal
| | - Marta Gromicho
- Instituto de Medicina Molecular, Faculdade de Medicina, Universidade de Lisboa, Lisbon, Portugal
| | - Mamede de Carvalho
- Instituto de Medicina Molecular, Faculdade de Medicina, Universidade de Lisboa, Lisbon, Portugal; Department of Neurosciences and Mental Health, Centro Hospitalar Universitário de Lisboa-Norte, Lisbon, Portugal
| | - Alexandra M Carvalho
- Instituto de Telecomunicações, Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal; Lisbon ELLIS Unit (Lisbon Unit for Learning and Intelligent Systems), Portugal.
11
Lobo J, Henriques R, Madeira SC. G-Tric: generating three-way synthetic datasets with triclustering solutions. BMC Bioinformatics 2021; 22:16. [PMID: 33413095 PMCID: PMC7789692 DOI: 10.1186/s12859-020-03925-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/16/2020] [Accepted: 12/07/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Three-way data started to gain popularity due to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions along time, urban dynamics, or complex geophysical phenomena. Triclustering, subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations × features × contexts). With an increasing number of algorithms being proposed, effectively comparing them with state-of-the-art algorithms is paramount. These comparisons are usually performed using real data, without a known ground truth, thus limiting the assessments. In this context, we propose a synthetic data generator, G-Tric, allowing the creation of synthetic datasets with configurable properties and the possibility to plant triclusters. The generator is prepared to create datasets resembling real three-way data from biomedical and social data domains, with the additional advantage of further providing the ground truth (triclustering solution) as output. RESULTS G-Tric can replicate real-world datasets and create new ones that match researchers' needs across several properties, including data type (numeric or symbolic), dimensions, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlapping). Data quality can also be controlled by defining the amount of missing values, noise, or errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters. CONCLUSIONS Triclustering evaluation using G-Tric provides the possibility to combine both intrinsic and extrinsic metrics to compare solutions that produce more reliable analyses.
A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties was generated and made available, highlighting G-Tric's potential to advance triclustering state-of-the-art by easing the process of evaluating the quality of new triclustering approaches.
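The generation scheme described in this abstract (a background distribution, a planted tricluster, and a controlled amount of missing data) can be sketched in Python. This is not G-Tric's implementation or interface: the function name, the Gaussian background, the constant pattern, and the use of `None` for missing cells are assumptions chosen for illustration.

```python
import random
from itertools import product


def generate_three_way(shape, tric_shape, value=1.0, missing=0.0, seed=0):
    """Return (data, tricluster) for an observations x features x contexts cube.

    Hypothetical helper: the background is standard Gaussian noise, the planted
    tricluster holds one constant value, and a `missing` fraction of all cells
    is replaced by None afterwards.
    """
    rng = random.Random(seed)  # a fixed seed makes the dataset reproducible
    n_obs, n_feat, n_ctx = shape
    data = [[[rng.gauss(0.0, 1.0) for _ in range(n_ctx)]
             for _ in range(n_feat)]
            for _ in range(n_obs)]
    # Ground truth: one sorted index list per dimension.
    tric = tuple(sorted(rng.sample(range(n), k)) for n, k in zip(shape, tric_shape))
    for i, j, k in product(*tric):
        data[i][j][k] = value
    # Data-quality control: blank out a fraction of cells, planted or not.
    cells = list(product(range(n_obs), range(n_feat), range(n_ctx)))
    for i, j, k in rng.sample(cells, round(missing * len(cells))):
        data[i][j][k] = None
    return data, tric


data, tric = generate_three_way((6, 5, 4), (2, 2, 2), missing=0.1)
print(tric)  # the planted (observations, features, contexts) index lists
```

Returning the planted indices next to the data is what allows extrinsic metrics to be computed, which is the benefit the abstract highlights over real datasets.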
Affiliation(s)
- João Lobo
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 016, 1749-016, Lisbon, Portugal
| | - Rui Henriques
- INESC-ID and Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1900-001, Lisbon, Portugal
| | - Sara C Madeira
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 016, 1749-016, Lisbon, Portugal.
12
Pereira T, Cardoso S, Guerreiro M, Mendonça A, Madeira SC. Targeting the uncertainty of predictions at patient-level using an ensemble of classifiers coupled with calibration methods, Venn-ABERS, and Conformal Predictors: A case study in AD. J Biomed Inform 2020; 101:103350. [DOI: 10.1016/j.jbi.2019.103350] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 04/26/2019] [Revised: 11/25/2019] [Accepted: 12/01/2019] [Indexed: 10/25/2022]
13
|
Pereira T, Ferreira FL, Cardoso S, Silva D, de Mendonça A, Guerreiro M, Madeira SC. Neuropsychological predictors of conversion from mild cognitive impairment to Alzheimer's disease: a feature selection ensemble combining stability and predictability. BMC Med Inform Decis Mak 2018; 18:137. [PMID: 30567554 PMCID: PMC6299964 DOI: 10.1186/s12911-018-0710-y] [Citation(s) in RCA: 26] [Impact Index Per Article: 4.3] [Received: 05/18/2018] [Accepted: 11/21/2018] [Indexed: 02/02/2023] Open
Abstract
BACKGROUND Predicting progression from Mild Cognitive Impairment (MCI) to Alzheimer's Disease (AD) is a major open issue in AD-related research. Neuropsychological assessment has proven useful in identifying MCI patients who are likely to convert to dementia. However, the large battery of neuropsychological tests (NPTs) performed in clinical practice and the limited number of training examples are a challenge to machine learning when learning prognostic models. In this context, it is paramount to pursue approaches that effectively seek reduced sets of relevant features. Subsets of NPTs from which prognostic models can be learnt should not only be good predictors, but also stable, promoting generalizable and explainable models. METHODS We propose a feature selection (FS) ensemble combining stability and predictability to choose the most relevant NPTs for prognostic prediction in AD. First, we combine the outcome of multiple (filter and embedded) FS methods. Then, we use a wrapper-based approach optimizing both stability and predictability to compute the number of selected features. We use two large prospective studies (ADNI and the Portuguese Cognitive Complaints Cohort, CCC) to evaluate the approach and assess the predictive value of a large number of NPTs. RESULTS The best subsets of features include approximately 30 and 20 (from the original 79 and 40) features, for ADNI and CCC data, respectively, yielding stability above 0.89 and 0.95, and AUC above 0.87 and 0.82. Most NPTs learnt using the proposed feature selection ensemble have been identified in the literature as strong predictors of conversion from MCI to AD. CONCLUSIONS The FS ensemble approach was able to 1) identify subsets of stable and relevant predictors from a consensus of multiple FS methods using baseline NPTs and 2) learn reliable prognostic models of conversion from MCI to AD using these subsets of features.
The machine learning models learnt from these features outperformed the models trained without FS and achieved competitive results when compared to commonly used FS algorithms. Furthermore, the selected features are derived from a consensus of methods thus being more robust, while releasing users from choosing the most appropriate FS method to be used in their classification task.
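The two ingredients named in the abstract, a consensus over several FS methods and a stability score for the selected subsets, can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: mean-rank aggregation and average pairwise Jaccard similarity are common choices that stand in for whatever the authors used, and the feature names `f1` to `f4` are placeholders.

```python
from itertools import combinations


def consensus_ranking(rankings):
    """Aggregate several feature rankings (best first) by mean position.

    Assumed aggregation rule for illustration; a feature missing from a
    ranking is placed after that ranking's last position.
    """
    features = sorted(set().union(*rankings))

    def mean_position(feature):
        total = sum(r.index(feature) if feature in r else len(r) for r in rankings)
        return total / len(rankings)

    return sorted(features, key=mean_position)


def jaccard(a, b):
    """Jaccard similarity between two feature subsets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def subset_stability(subsets):
    """Average pairwise Jaccard similarity across selected subsets,
    e.g. one subset per cross-validation fold (assumed stability measure)."""
    pairs = list(combinations(subsets, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


rankings = [["f1", "f2", "f3"], ["f2", "f1", "f3"], ["f1", "f3", "f2"]]
consensus = consensus_ranking(rankings)

fold_subsets = [{"f1", "f2", "f3"}, {"f1", "f2", "f4"}, {"f1", "f2", "f3"}]
stability = subset_stability(fold_subsets)
```

A wrapper step in this spirit would evaluate the top-k consensus features for several values of k and keep the k that balances `stability` against a predictive score such as AUC.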
Affiliation(s)
- Telma Pereira
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal
  - Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal
- Sandra Cardoso
  - Laboratório de Neurociências, Instituto de Medicina Molecular, Faculdade de Medicina, Universidade de Lisboa, Lisbon, Portugal
- Dina Silva
  - Cognitive Neuroscience Research Group, Department of Psychology and Educational Sciences and Centre for Biomedical Research (CBMR), University of Algarve, Faro, Portugal
- Alexandre de Mendonça
  - Laboratório de Neurociências, Instituto de Medicina Molecular, Faculdade de Medicina, Universidade de Lisboa, Lisbon, Portugal
- Manuela Guerreiro
  - Laboratório de Neurociências, Instituto de Medicina Molecular, Faculdade de Medicina, Universidade de Lisboa, Lisbon, Portugal
- Sara C. Madeira
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal
- for the Alzheimer’s Disease Neuroimaging Initiative
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal
  - Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal
  - Laboratório de Neurociências, Instituto de Medicina Molecular, Faculdade de Medicina, Universidade de Lisboa, Lisbon, Portugal
  - Cognitive Neuroscience Research Group, Department of Psychology and Educational Sciences and Centre for Biomedical Research (CBMR), University of Algarve, Faro, Portugal
|
14
|
Henriques R, Ferreira FL, Madeira SC. Erratum to: BicPAMS: software for biological data analysis with pattern-based biclustering. BMC Bioinformatics 2017; 18:162. [PMID: 28279148 PMCID: PMC5345246 DOI: 10.1186/s12859-017-1573-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2017] [Accepted: 02/27/2017] [Indexed: 11/10/2022] Open
|
15
|
Pereira T, Lemos L, Cardoso S, Silva D, Rodrigues A, Santana I, de Mendonça A, Guerreiro M, Madeira SC. Predicting progression of mild cognitive impairment to dementia using neuropsychological data: a supervised learning approach using time windows. BMC Med Inform Decis Mak 2017; 17:110. [PMID: 28724366 PMCID: PMC5517828 DOI: 10.1186/s12911-017-0497-2] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Received: 02/19/2017] [Accepted: 06/28/2017] [Indexed: 11/24/2022] Open
Abstract
BACKGROUND Predicting progression from a stage of Mild Cognitive Impairment (MCI) to dementia is a major pursuit in current research. It is broadly accepted that cognition declines along a continuum between MCI and dementia. As such, cohorts of MCI patients are usually heterogeneous, containing patients at different stages of the neurodegenerative process. This hampers the prognostic task. Nevertheless, when learning prognostic models, most studies use the entire cohort of MCI patients regardless of their disease stages. In this paper, we propose a Time Windows approach to predict conversion to dementia, learning with patients stratified using time windows, thus fine-tuning the prognosis regarding the time to conversion. METHODS In the proposed Time Windows approach, we grouped patients based on the clinical information of whether they converted (converter MCI) or remained MCI (stable MCI) within a specific time window. We tested time windows of 2, 3, 4 and 5 years. We developed a prognostic model for each time window using clinical and neuropsychological data and compared this approach with the one commonly used in the literature, where all patients are used to learn the models, named the First Last approach. This makes it possible to move from the traditional question "Will an MCI patient convert to dementia somewhere in the future" to the question "Will an MCI patient convert to dementia in a specific time window". RESULTS The proposed Time Windows approach outperformed the First Last approach. The results showed that we can predict conversion to dementia as early as 5 years before the event with an AUC of 0.88 in the cross-validation set and 0.76 in an independent validation set. CONCLUSIONS Prognostic models using time windows have higher performance when predicting progression from MCI to dementia, when compared to the prognostic approach commonly used in the literature.
Furthermore, the proposed Time Windows approach is more relevant from a clinical point of view, predicting conversion within a temporal interval rather than sometime in the future and allowing clinicians to timely adjust treatments and clinical appointments.
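The grouping step of a time-windows design can be sketched as below. The labelling rule is an assumption made for illustration and may differ from the paper's exact criteria: a patient counts as converter MCI (`cMCI`) if conversion is observed within the window, as stable MCI (`sMCI`) if follow-up covers the whole window without conversion inside it, and is excluded otherwise. The function names and the toy patients are hypothetical.

```python
def label_for_window(followup_years, conversion_years, window_years):
    """Label one patient for a given time window.

    conversion_years is None when no conversion was observed.
    Returns "cMCI", "sMCI", or None (insufficient follow-up; excluded).
    """
    if conversion_years is not None and conversion_years <= window_years:
        return "cMCI"
    if followup_years >= window_years:
        # Still MCI at the end of the window (may convert later).
        return "sMCI"
    return None


def build_window_datasets(patients, windows=(2, 3, 4, 5)):
    """Build one labelled patient set per time window (in years)."""
    datasets = {}
    for window in windows:
        labelled = {}
        for patient_id, followup, conversion in patients:
            label = label_for_window(followup, conversion, window)
            if label is not None:
                labelled[patient_id] = label
        datasets[window] = labelled
    return datasets


# Toy records: (id, years of follow-up, years until conversion or None).
patients = [("p1", 6.0, None), ("p2", 3.5, 3.5), ("p3", 1.5, None)]
datasets = build_window_datasets(patients)
```

Under this rule the same patient can carry different labels in different windows (here `p2` is stable for the 2- and 3-year windows and a converter for the 4- and 5-year windows), which is what lets a separate prognostic model be trained per window.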
Affiliation(s)
- Telma Pereira
  - Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal
  - INESC-ID, R. Alves Redol 9, 1000–029 Lisbon, Portugal
- Luís Lemos
  - Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal
  - INESC-ID, R. Alves Redol 9, 1000–029 Lisbon, Portugal
- Sandra Cardoso
  - Cognitive Neuroscience Research Group, Department of Psychology and Educational Sciences and Centre for Biomedical Research (CBMR), University of Algarve, Algarve, Portugal
- Dina Silva
  - Cognitive Neuroscience Research Group, Department of Psychology and Educational Sciences and Centre for Biomedical Research (CBMR), University of Algarve, Algarve, Portugal
- Ana Rodrigues
  - Faculdade de Medicina, Universidade de Coimbra, Coimbra, Portugal
- Isabel Santana
  - Faculdade de Medicina, Universidade de Coimbra, Coimbra, Portugal
  - Departamento de Neurologia, Centro Hospitalar e Universitário de Coimbra, Coimbra, Portugal
- Alexandre de Mendonça
  - Cognitive Neuroscience Research Group, Department of Psychology and Educational Sciences and Centre for Biomedical Research (CBMR), University of Algarve, Algarve, Portugal
- Manuela Guerreiro
  - Cognitive Neuroscience Research Group, Department of Psychology and Educational Sciences and Centre for Biomedical Research (CBMR), University of Algarve, Algarve, Portugal
- Sara C. Madeira
  - INESC-ID, R. Alves Redol 9, 1000–029 Lisbon, Portugal
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, R. Ernesto de Vasconcelos, 1749–016 Lisbon, Portugal
|
16
|
|
17
|
Henriques R, Ferreira FL, Madeira SC. BicPAMS: software for biological data analysis with pattern-based biclustering. BMC Bioinformatics 2017; 18:82. [PMID: 28153040 PMCID: PMC5290636 DOI: 10.1186/s12859-017-1493-3] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 03/22/2016] [Accepted: 01/21/2017] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND Biclustering has been largely applied for the unsupervised analysis of biological data, being recognised today as a key technique to discover putative modules in both expression data (subsets of genes correlated in subsets of conditions) and network data (groups of coherently interconnected biological entities). However, given its computational complexity, only recent breakthroughs on pattern-based biclustering enabled efficient searches without the restrictions that state-of-the-art biclustering algorithms place on the structure and homogeneity of biclusters. As a result, pattern-based biclustering provides the unprecedented opportunity to discover non-trivial yet meaningful biological modules with putative functions, whose coherency and tolerance to noise can be tuned and made problem-specific. METHODS To enable the effective use of pattern-based biclustering by the scientific community, we developed BicPAMS (Biclustering based on PAttern Mining Software), a software that: 1) makes available state-of-the-art pattern-based biclustering algorithms (BicPAM (Henriques and Madeira, Alg Mol Biol 9:27, 2014), BicNET (Henriques and Madeira, Alg Mol Biol 11:23, 2016), BicSPAM (Henriques and Madeira, BMC Bioinforma 15:130, 2014), BiC2PAM (Henriques and Madeira, Alg Mol Biol 11:1-30, 2016), BiP (Henriques and Madeira, IEEE/ACM Trans Comput Biol Bioinforma, 2015), DeBi (Serin and Vingron, AMB 6:1-12, 2011) and BiModule (Okada et al., IPSJ Trans Bioinf 48(SIG5):39-48, 2007)); 2) consistently integrates their dispersed contributions; 3) further explores additional accuracy and efficiency gains; and 4) makes available graphical and application programming interfaces. RESULTS Results on both synthetic and real data confirm the relevance of BicPAMS for biological data analysis, highlighting its essential role for the discovery of putative modules with non-trivial yet biologically significant functions from expression and network data. 
CONCLUSIONS BicPAMS is the first biclustering tool offering the possibility to: 1) parametrically customize the structure, coherency and quality of biclusters; 2) analyze large-scale biological networks; and 3) tackle the restrictive assumptions placed by state-of-the-art biclustering algorithms. These contributions are shown to be key for an adequate, complete and user-assisted unsupervised analysis of biological data. SOFTWARE BicPAMS and its tutorial are available at http://www.bicpams.com.
Affiliation(s)
- Rui Henriques
  - INESC-ID and Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
- Sara C. Madeira
  - INESC-ID and Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
|
18
|
Henriques R, Madeira SC. BiC2PAM: constraint-guided biclustering for biological data analysis with domain knowledge. Algorithms Mol Biol 2016; 11:23. [PMID: 27651825 PMCID: PMC5024481 DOI: 10.1186/s13015-016-0085-5] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 02/02/2016] [Accepted: 08/16/2016] [Indexed: 11/10/2022] Open
Abstract
Background Biclustering has been largely used in biological data analysis, enabling the discovery of putative functional modules from omic and network data. Despite the recognized importance of incorporating domain knowledge to guide biclustering and guarantee a focus on relevant and non-trivial biclusters, this possibility has not yet been comprehensively addressed. This results from the fact that the majority of existing algorithms are only able to deliver sub-optimal solutions with restrictive assumptions on the structure, coherency and quality of biclustering solutions, thus preventing the up-front satisfaction of knowledge-driven constraints. Interestingly, in recent years, a clearer understanding of the synergies between pattern mining and biclustering gave rise to a new class of algorithms, termed pattern-based biclustering algorithms. These algorithms, able to efficiently discover flexible biclustering solutions with optimality guarantees, are thus positioned as good candidates for knowledge incorporation. In this context, this work aims to bridge the current lack of solid views on the use of background knowledge to guide (pattern-based) biclustering tasks. Methods This work extends (pattern-based) biclustering algorithms to guarantee the satisfiability of constraints derived from background knowledge and to effectively explore efficiency gains from their incorporation. In this context, we first show the relevance of constraints with succinct, (anti-)monotone and convertible properties for the analysis of expression data and biological networks. We further show how pattern-based biclustering algorithms can be adapted to effectively prune the search space in the presence of such constraints, as well as be guided in the presence of biological annotations. Relying on these contributions, we propose BiClustering with Constraints using PAttern Mining (BiC2PAM), an extension of the BicPAM and BicNET biclustering algorithms.
Results Experimental results on biological data demonstrate the importance of incorporating knowledge within biclustering to foster efficiency and enable the discovery of non-trivial biclusters with heightened biological relevance. Conclusions This work provides the first comprehensive view and sound algorithm for biclustering biological data with constraints derived from user expectations, knowledge repositories and/or literature.
|
19
|
Henriques R, Madeira SC. BicNET: Flexible module discovery in large-scale biological networks using biclustering. Algorithms Mol Biol 2016; 11:14. [PMID: 27213009 PMCID: PMC4875761 DOI: 10.1186/s13015-016-0074-8] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 12/11/2015] [Accepted: 04/22/2016] [Indexed: 02/08/2023] Open
Abstract
BACKGROUND Despite the recognized importance of module discovery in biological networks to enhance our understanding of complex biological systems, existing methods generally suffer from two major drawbacks. First, there is a focus on modules where biological entities are strongly connected, leading to the discovery of trivial/well-known modules and to the inaccurate exclusion of biological entities with subtler yet relevant roles. Second, there is a generalized intolerance towards different forms of noise, including uncertainty associated with less-studied biological entities (in the context of literature-driven networks) and experimental noise (in the context of data-driven networks). Although state-of-the-art biclustering algorithms are able to discover modules with varying coherency and robustness to noise, their application for the discovery of non-dense modules in biological networks has been poorly explored and it is further challenged by efficiency bottlenecks. METHODS This work proposes Biclustering NETworks (BicNET), a biclustering algorithm to discover non-trivial yet coherent modules in weighted biological networks with heightened efficiency. Three major contributions are provided. First, we motivate the relevance of discovering network modules given by constant, symmetric, plaid and order-preserving biclustering models. Second, we propose an algorithm to discover these modules and to robustly handle noisy and missing interactions. Finally, we provide new searches to tackle time and memory bottlenecks by effectively exploring the inherent structural sparsity of network data. RESULTS Results in synthetic network data confirm the soundness, efficiency and superiority of BicNET. The application of BicNET on protein interaction and gene interaction networks from yeast, E. coli and Human reveals new modules with heightened biological significance. 
CONCLUSIONS BicNET is, to our knowledge, the first method enabling the efficient unsupervised analysis of large-scale network data for the discovery of coherent modules with parameterizable homogeneity.
Affiliation(s)
- Rui Henriques
  - INESC-ID and Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
- Sara C. Madeira
  - INESC-ID and Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
|
20
|
Carreiro AV, Mendonça A, de Carvalho M, Madeira SC. Integrative biomarker discovery in neurodegenerative diseases. Wiley Interdiscip Rev Syst Biol Med 2016; 8:268. [PMID: 27103503 DOI: 10.1002/wsbm.1339] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/05/2023]
Affiliation(s)
- André V Carreiro
  - INESC-ID Lisbon and Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
- Alexandre Mendonça
  - Dementia Clinics, Institute of Molecular Medicine and Faculty of Medicine, Universidade de Lisboa, Lisboa, Portugal
- Mamede de Carvalho
  - Translational Clinical Physiology Unit, Institute of Molecular Medicine and Faculty of Medicine, Universidade de Lisboa, Lisboa, Portugal
- Sara C Madeira
  - INESC-ID Lisbon and Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
|
21
|
Carreiro AV, Amaral PMT, Pinto S, Tomás P, de Carvalho M, Madeira SC. Prognostic models based on patient snapshots and time windows: Predicting disease progression to assisted ventilation in Amyotrophic Lateral Sclerosis. J Biomed Inform 2015; 58:133-144. [PMID: 26455265 DOI: 10.1016/j.jbi.2015.09.021] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Received: 02/19/2015] [Revised: 09/18/2015] [Accepted: 09/23/2015] [Indexed: 12/12/2022]
Abstract
Amyotrophic Lateral Sclerosis (ALS) is a devastating disease and the most common neurodegenerative disorder of young adults. ALS patients present a rapidly progressive motor weakness, which usually leads to death within a few years by respiratory failure. The correct prediction of respiratory insufficiency is thus key for patient management. In this context, we propose an innovative approach for prognostic prediction based on patient snapshots and time windows. We first cluster temporally related tests to obtain snapshots of the patient's condition at a given time (patient snapshots). Then we use the snapshots to predict the probability that an ALS patient will require assisted ventilation after k days from the time of clinical evaluation (time window). This probability is based on the patient's current condition, evaluated using clinical features, including functional impairment assessments and a complete set of respiratory tests. The prognostic models include three temporal windows, allowing short-, medium- and long-term prognosis regarding progression to assisted ventilation. Experimental results show an area under the receiver operating characteristic curve (AUC) in the test set of approximately 79% for time windows of 90, 180 and 365 days. Creating patient snapshots using hierarchical clustering with constraints outperforms the state of the art, and the proposed prognostic model becomes the first non-population-based approach for prognostic prediction in ALS. The results are promising and should enhance the current clinical practice, largely supported by non-standardized tests and clinicians' experience.
Affiliation(s)
- André V Carreiro
  - INESC-ID Lisbon and Instituto Superior Técnico, Universidade de Lisboa, Portugal
- Pedro M T Amaral
  - INESC-ID Lisbon and Instituto Superior Técnico, Universidade de Lisboa, Portugal
- Susana Pinto
  - Translational Clinical Physiology Unit, Institute of Molecular Medicine, Faculty of Medicine, Universidade de Lisboa, Portugal
- Pedro Tomás
  - INESC-ID Lisbon and Instituto Superior Técnico, Universidade de Lisboa, Portugal
- Mamede de Carvalho
  - Translational Clinical Physiology Unit, Institute of Molecular Medicine, Faculty of Medicine, Universidade de Lisboa, Portugal
- Sara C Madeira
  - INESC-ID Lisbon and Instituto Superior Técnico, Universidade de Lisboa, Portugal
|
22
|
Carreiro AV, Mendonça A, de Carvalho M, Madeira SC. Integrative biomarker discovery in neurodegenerative diseases. Wiley Interdiscip Rev Syst Biol Med 2015; 7:357-79. [PMID: 26136395 DOI: 10.1002/wsbm.1310] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 02/26/2015] [Revised: 05/22/2015] [Accepted: 05/27/2015] [Indexed: 12/12/2022]
Abstract
Data mining has been widely applied in biomarker discovery resulting in significant findings of different clinical and biological biomarkers. With developments in technology, from genomics to proteomics analysis, a deluge of data has become available, as well as standardized data repositories. Nonetheless, researchers are still facing important challenges in analyzing the data, especially when considering the complexity of pathways involved in biological processes and diseases. Data from single sources appear unable to explain complex processes, such as those involved in brain-related disorders, including Alzheimer's disease, Parkinson's disease and amyotrophic lateral sclerosis, thus raising the need for a more comprehensive perspective. A possible solution relies on data and model integration, where several data types are combined to provide complementary views. This in turn can result in the discovery of previously unknown biomarkers by unraveling otherwise hidden relationships between data from different sources, and/or validate such composite biomarkers in more powerful predictive models.
Affiliation(s)
- André V Carreiro
  - INESC-ID Lisbon and Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
- Alexandre Mendonça
  - Dementia Clinics, Institute of Molecular Medicine and Faculty of Medicine, Universidade de Lisboa, Lisboa, Portugal
- Mamede de Carvalho
  - Translational Clinical Physiology Unit, Institute of Molecular Medicine and Faculty of Medicine, Universidade de Lisboa, Lisboa, Portugal
- Sara C Madeira
  - INESC-ID Lisbon and Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
|
23
|
Henriques R, Madeira SC. Biclustering with Flexible Plaid Models to Unravel Interactions between Biological Processes. IEEE/ACM Trans Comput Biol Bioinform 2015; 12:738-752. [PMID: 26357312 DOI: 10.1109/tcbb.2014.2388206] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 06/05/2023]
Abstract
Genes can participate in multiple biological processes at a time and thus their expression can be seen as a composition of the contributions from the active processes. Biclustering under a plaid assumption allows the modeling of interactions between transcriptional modules or biclusters (subsets of genes with coherence across subsets of conditions) by assuming an additive composition of contributions in their overlapping areas. Despite the biological interest of plaid models, few biclustering algorithms consider plaid effects and, when they do, they place restrictions on the allowed types and structures of biclusters, and suffer from robustness problems by seizing exact additive matchings. We propose BiP (Biclustering using Plaid models), a biclustering algorithm with relaxations to allow expression levels to change in overlapping areas according to biologically meaningful assumptions (weighted and noise-tolerant composition of contributions). BiP can be used over existing biclustering solutions (seizing their benefits) as it is able to recover excluded areas due to unaccounted plaid effects and detect noisy areas non-explained by a plaid assumption, thus producing an explanatory model of overlapping transcriptional activity. Experiments on synthetic data support BiP's efficiency and effectiveness. The learned models from expression data unravel meaningful and non-trivial functional interactions between biological processes associated with putative regulatory modules.
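The plaid assumption described above, where contributions add up in overlapping areas, can be shown with a small sketch. This is the classical exact additive composition only, not BiP itself: the weighted and noise-tolerant relaxations that the abstract attributes to BiP are not modelled, and the helper name `plaid_expected` and the toy biclusters are assumptions.

```python
def plaid_expected(n_rows, n_cols, biclusters, background=0.0):
    """Expected matrix under an exact additive plaid model.

    biclusters is a list of (rows, cols, contribution) triples; in cells
    covered by several biclusters the contributions are summed.
    Hypothetical helper for illustration only.
    """
    matrix = [[background] * n_cols for _ in range(n_rows)]
    for rows, cols, contribution in biclusters:
        for i in rows:
            for j in cols:
                matrix[i][j] += contribution
    return matrix


# Two overlapping biclusters sharing the single cell (1, 1).
biclusters = [
    ({0, 1}, {0, 1}, 2.0),
    ({1, 2}, {1, 2}, 3.0),
]
expected = plaid_expected(3, 3, biclusters)
```

In the shared cell the two contributions sum, while cells covered by one bicluster keep that bicluster's contribution and uncovered cells stay at the background level.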
|
24
|
Maruta C, Pereira T, Madeira SC, De Mendonça A, Guerreiro M. Classification of primary progressive aphasia: Do unsupervised data mining methods support a logopenic variant? Amyotroph Lateral Scler Frontotemporal Degener 2015; 16:147-59. [PMID: 25871701 DOI: 10.3109/21678421.2015.1026266] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Indexed: 11/13/2022]
Abstract
Our objective was to test whether data mining techniques, through an unsupervised learning approach, support the three-group diagnostic model of primary progressive aphasia (PPA) versus the existence of two main/classic groups. A series of 155 PPA patients observed in a clinical setting and subjected to at least one neuropsychological/language assessment was studied. Several demographic, clinical and neuropsychological attributes, grouped in distinct sets, were introduced in unsupervised learning methods (Expectation Maximization, K-Means, X-Means, Hierarchical Clustering and Consensus Clustering). The unsupervised learning methods revealed two main groups that were obtained consistently throughout all the analyses (with different algorithms and different sets of attributes). One group included most of the agrammatic/non-fluent and some logopenic cases, while the other was mainly composed of semantic and logopenic cases. Clustering the patients in a larger number of groups (k > 2) revealed some clusters composed mostly of non-fluent or of semantic cases. However, we could not identify any group chiefly composed of logopenic cases. In conclusion, unsupervised data mining approaches do not support a clear distinction of logopenic PPA as a separate variant.
Affiliation(s)
- Carolina Maruta
  - Laboratory of Language Research, Institute of Molecular Medicine, Faculty of Medicine, University of Lisbon, Portugal
|
25
|
Henriques R, Madeira SC. Pattern-Based Biclustering with Constraints for Gene Expression Data Analysis. Progress in Artificial Intelligence 2015. [DOI: 10.1007/978-3-319-23485-4_34] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
|
26
|
Abstract
BACKGROUND Biclustering, the discovery of sets of objects with a coherent pattern across a subset of conditions, is a critical task to study a wide set of biomedical problems, where molecular units or patients are meaningfully related with a set of properties. The challenging combinatorial nature of this task led to the development of approaches with restrictions on the allowed type, number and quality of biclusters. In contrast, recent biclustering approaches relying on pattern mining methods can exhaustively discover flexible structures of robust biclusters. However, these approaches are only prepared to discover constant biclusters and their underlying contributions remain dispersed. METHODS The proposed BicPAM biclustering approach integrates existing principles made available by state-of-the-art pattern-based approaches with two new contributions. First, BicPAM is the first efficient attempt to exhaustively mine non-constant types of biclusters, including additive and multiplicative coherencies in the presence or absence of symmetries. Second, BicPAM provides strategies to effectively compose different biclustering structures and to handle arbitrary levels of noise inherent to the data and to discretization procedures. RESULTS Results show BicPAM's superiority against its peers and its ability to retrieve unique types of biclusters of interest, to efficiently deliver exhaustive solutions and to successfully recover planted biclusters in datasets with varying levels of missing values and noise. Its application over gene expression data leads to unique solutions with heightened biological relevance. CONCLUSIONS BicPAM integrates existing dispersed efforts towards pattern-based biclustering and provides the first critical strategies to efficiently discover exhaustive solutions of biclusters with shifting, scaling and symmetric assumptions with varying quality and underlying structures.
Additionally, BicPAM dynamically adapts its behavior to mine data with different levels of missing values and noise.
Affiliation(s)
- Rui Henriques
  - INESC-ID and Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal
- Sara C Madeira
  - INESC-ID and Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal
|
27
|
Gonçalves JP, Madeira SC. LateBiclustering: Efficient Heuristic Algorithm for Time-Lagged Bicluster Identification. IEEE/ACM Trans Comput Biol Bioinform 2014; 11:801-813. [PMID: 26356854 DOI: 10.1109/tcbb.2014.2312007] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Indexed: 06/05/2023]
Abstract
Identifying patterns in temporal data is key to uncover meaningful relationships in diverse domains, from stock trading to social interactions. Also of great interest are clinical and biological applications, namely monitoring patient response to treatment or characterizing activity at the molecular level. In biology, researchers seek to gain insight into gene functions and dynamics of biological processes, as well as potential perturbations of these leading to disease, through the study of patterns emerging from gene expression time series. Clustering can group genes exhibiting similar expression profiles, but focuses on global patterns denoting rather broad, unspecific responses. Biclustering reveals local patterns, which more naturally capture the intricate collaboration between biological players, particularly under a temporal setting. Despite the general biclustering formulation being NP-hard, considering specific properties of time series has led to efficient solutions for the discovery of temporally aligned patterns. Notably, the identification of biclusters with time-lagged patterns, suggestive of transcriptional cascades, remains a challenge due to the combinatorial explosion of delayed occurrences. Herein, we propose LateBiclustering, a sensible heuristic algorithm enabling a polynomial rather than exponential time solution for the problem. We show that it identifies meaningful time-lagged biclusters relevant to the response of Saccharomyces cerevisiae to heat stress.
|
28
|
Abstract
Background Biclustering is a critical task for biomedical applications. Order-preserving biclusters, submatrices where the values of rows induce the same linear ordering across columns, capture local regularities with constant, shifting, scaling and sequential assumptions. Additionally, biclustering approaches relying on pattern mining output deliver exhaustive solutions with an arbitrary number and positioning of biclusters. However, existing order-preserving approaches suffer from robustness, scalability and/or flexibility issues. Additionally, they are not able to discover biclusters with symmetries and parameterizable levels of noise. Results We propose new biclustering algorithms to perform flexible, exhaustive and noise-tolerant biclustering based on sequential patterns (BicSPAM). Strategies are proposed to allow for symmetries and to seize efficiency gains from item-indexable properties and/or from partitioning methods with conservative distance guarantees. Results show BicSPAM's ability to capture symmetries, handle planted noise, and scale in terms of memory and time. BicSPAM also achieves the best match-scores for the recovery of hidden biclusters in synthetic datasets with varying noise distributions and levels of missing values. Finally, results on gene expression data lead to complete solutions, delivering new biclusters corresponding to putative modules with heightened biological relevance. Conclusions BicSPAM provides an exhaustive way to discover flexible structures of order-preserving biclusters. To the best of our knowledge, BicSPAM is the first attempt to deal with order-preserving biclusters that allow for symmetries and that are robust to varying levels of noise.
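The order-preserving definition in this abstract (rows inducing the same linear ordering across columns) can be illustrated with a short Python sketch. This is only a membership check on a candidate submatrix, not the BicSPAM algorithm. The `allow_symmetry` flag is a hypothetical parameter that treats a fully reversed ordering as coherent, and the check assumes rows without tied values.

```python
def column_order(row):
    """Column indices sorted by increasing value (ties broken by index)."""
    return tuple(sorted(range(len(row)), key=lambda j: (row[j], j)))


def is_order_preserving(rows, allow_symmetry=False):
    """True if all rows induce the same column ordering as the first row.

    With allow_symmetry=True, a row whose ordering is the exact reverse
    (e.g. a sign-flipped profile) is also accepted. Assumes no ties.
    """
    reference = column_order(rows[0])
    for row in rows[1:]:
        order = column_order(row)
        if order == reference:
            continue
        if allow_symmetry and order == reference[::-1]:
            continue
        return False
    return True


# Toy rows (made-up values) with different magnitudes but the same ordering.
same_order = [[1, 5, 3], [10, 40, 20], [0.1, 0.9, 0.2]]
# The second row is an inverted profile of the first.
mirrored = [[1, 5, 3], [-2, -9, -4]]

print(is_order_preserving(same_order))
print(is_order_preserving(mirrored), is_order_preserving(mirrored, allow_symmetry=True))
```

Because only the ordering is compared, the check accepts constant, shifted and scaled rows alike, which matches the abstract's point that order preservation subsumes those assumptions.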
Affiliation(s)
- Rui Henriques
- Knowledge Discovery and BIOInformatics group (KDBIO), INESC-ID, and Computer Science and Engineering (CSE) Department, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais, 1, 1049-001 Lisboa, Portugal.
|
29
|
Henriques R, Antunes C, Madeira SC. Methods for the Efficient Discovery of Large Item-Indexable Sequential Patterns. New Frontiers in Mining Complex Patterns 2014. [DOI: 10.1007/978-3-319-08407-7_7] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Indexed: 01/31/2023]
|
30
|
Carreiro AV, Ferreira AJ, Figueiredo MAT, Madeira SC. Towards a Classification Approach using Meta-Biclustering: Impact of Discretization in the Analysis of Expression Time Series. J Integr Bioinform 2012. [DOI: 10.1515/jib-2012-207] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/15/2022]
Abstract
Summary Biclustering has been recognized as a remarkably effective method for discovering local temporal expression patterns and unraveling potential regulatory mechanisms, essential to understanding complex biomedical processes, such as disease progression and drug response. In this work, we propose a classification approach based on meta-biclusters (a set of similar biclusters) applied to prognostic prediction. We use real clinical expression time series to predict the response of patients with multiple sclerosis to treatment with Interferon-β. As compared to previous approaches, the main advantages of this strategy are the interpretability of the results and the reduction of data dimensionality, due to biclustering. This would allow the identification of the genes and time points which are most promising for explaining different types of response profiles, according to clinical knowledge. We assess the impact of different unsupervised and supervised discretization techniques on the classification accuracy. The experimental results show that, in many cases, the use of these discretization methods improves the classification accuracy, as compared to the use of the original features.
Affiliation(s)
- André V. Carreiro
- KDBIO group, INESC-ID, Lisbon, Portugal
- Instituto Superior Técnico, Technical University of Lisbon, Portugal
| | - Artur J. Ferreira
- Instituto de Telecomunicações, Lisbon, Portugal
- Instituto Superior de Engenharia de Lisboa, Lisbon, Portugal
| | - Mário A. T. Figueiredo
- Instituto Superior Técnico, Technical University of Lisbon, Portugal
- Instituto de Telecomunicações, Lisbon, Portugal
| | - Sara C. Madeira
- KDBIO group, INESC-ID, Lisbon, Portugal
- Instituto Superior Técnico, Technical University of Lisbon, Portugal
|
31
|
Abstract
Disease gene prioritization aims to suggest potential implications of genes in disease susceptibility. Often accomplished in a guilt-by-association scheme, promising candidates are sorted according to their relatedness to known disease genes. Network-based methods have been successfully exploiting this concept by capturing the interaction of genes or proteins into a score. Nonetheless, most current approaches yield at least some of the following limitations: (1) networks comprise only curated physical interactions leading to poor genome coverage and density, and bias toward a particular source; (2) scores focus on adjacencies (direct links) or the most direct paths (shortest paths) within a constrained neighborhood around the disease genes, ignoring potentially informative indirect paths; (3) global clustering is widely applied to partition the network in an unsupervised manner, attributing little importance to prior knowledge; (4) confidence weights and their contribution to edge differentiation and ranking reliability are often disregarded. We hypothesize that network-based prioritization related to local clustering on graphs and considering full topology of weighted gene association networks integrating heterogeneous sources should overcome the above challenges. We term such a strategy Interactogeneous. We conducted cross-validation tests to assess the impact of network sources, alternative path inclusion and confidence weights on the prioritization of putative genes for 29 diseases. Heat diffusion ranking proved the best prioritization method overall, increasing the gap to neighborhood and shortest paths scores mostly on single source networks. Heterogeneous associations consistently delivered superior performance over single source data across the majority of methods. Results on the contribution of confidence weights were inconclusive. 
Finally, the best Interactogeneous strategy, heat diffusion ranking and associations from the STRING database, was used to prioritize genes for Parkinson’s disease. This method effectively recovered known genes and uncovered interesting candidates which could be linked to pathogenic mechanisms of the disease.
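As a rough illustration of heat diffusion ranking on a weighted gene association network, the sketch below spreads heat from known disease genes and ranks the remaining genes by the heat they receive. It is an assumption-laden toy, not the paper's implementation: it approximates the heat kernel exp(-tL) of the weighted graph Laplacian L with explicit Euler steps, and the graph, gene labels, diffusion time `t` and step count are all made up.

```python
def heat_diffusion_scores(weights, seeds, t=0.5, steps=100):
    """Approximate exp(-t * L) applied to the seed vector with Euler steps.

    weights: symmetric dict of dicts, weights[u][v] = edge confidence.
    seeds: known disease genes, each starting with one unit of heat.
    L is the weighted graph Laplacian (degree minus adjacency).
    """
    nodes = sorted(weights)
    heat = {n: (1.0 if n in seeds else 0.0) for n in nodes}
    dt = t / steps
    for _ in range(steps):
        # (L h)[u] = sum over neighbors v of w(u, v) * (h[u] - h[v])
        flow = {
            u: sum(w * (heat[u] - heat[v]) for v, w in weights[u].items())
            for u in nodes
        }
        heat = {u: heat[u] - dt * flow[u] for u in nodes}
    return heat


def rank_candidates(weights, seeds, **kwargs):
    """Non-seed genes sorted by decreasing diffused heat."""
    scores = heat_diffusion_scores(weights, seeds, **kwargs)
    candidates = [n for n in scores if n not in seeds]
    return sorted(candidates, key=lambda n: scores[n], reverse=True)


# Hypothetical network: A is a known disease gene; B, C and D are candidates.
network = {
    "A": {"B": 0.9, "D": 0.1},
    "B": {"A": 0.9, "C": 0.5},
    "C": {"B": 0.5},
    "D": {"A": 0.1},
}
scores = heat_diffusion_scores(network, {"A"})
ranking = rank_candidates(network, {"A"})
print(ranking)
```

Unlike a direct-neighbor or shortest-path score, gene C receives heat even though it has no edge to the seed, which is the indirect-path effect the abstract highlights.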
Affiliation(s)
- Joana P. Gonçalves
- Knowledge Discovery and Bioinformatics Group, INESC-ID, Lisbon, Portugal
- Computer Science and Engineering Department, Instituto Superior Técnico, Technical University of Lisbon, Lisbon, Portugal
| | - Alexandre P. Francisco
- Knowledge Discovery and Bioinformatics Group, INESC-ID, Lisbon, Portugal
- Computer Science and Engineering Department, Instituto Superior Técnico, Technical University of Lisbon, Lisbon, Portugal
| | - Yves Moreau
- Electrical Engineering Department, Katholieke Universiteit Leuven, Leuven, Belgium
| | - Sara C. Madeira
- Knowledge Discovery and Bioinformatics Group, INESC-ID, Lisbon, Portugal
- Computer Science and Engineering Department, Instituto Superior Técnico, Technical University of Lisbon, Lisbon, Portugal
|
32
|
Carreiro AV, Ferreira AJ, Figueiredo MAT, Madeira SC. Towards a classification approach using meta-biclustering: impact of discretization in the analysis of expression time series. J Integr Bioinform 2012; 9:207. [PMID: 22829578 DOI: 10.2390/biecoll-jib-2012-207] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2012] [Revised: 07/07/2012] [Accepted: 07/24/2012] [Indexed: 06/01/2023]
Abstract
Biclustering has been recognized as a remarkably effective method for discovering local temporal expression patterns and unraveling potential regulatory mechanisms, essential to understanding complex biomedical processes, such as disease progression and drug response. In this work, we propose a classification approach based on meta-biclusters (a set of similar biclusters) applied to prognostic prediction. We use real clinical expression time series to predict the response of patients with multiple sclerosis to treatment with Interferon-β. As compared to previous approaches, the main advantages of this strategy are the interpretability of the results and the reduction of data dimensionality, due to biclustering. This would allow the identification of the genes and time points which are most promising for explaining different types of response profiles, according to clinical knowledge. We assess the impact of different unsupervised and supervised discretization techniques on the classification accuracy. The experimental results show that, in many cases, the use of these discretization methods improves the classification accuracy, as compared to the use of the original features.
|
33
|
Gonçalves JP, Moreau Y, Madeira SC. AliBiMotif: Integrating alignment and biclustering to unravel transcription factor binding sites in DNA sequences. Int J Data Min Bioinform 2012; 6:196-215. [DOI: 10.1504/ijdmb.2012.048198] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/21/2022]
|
34
|
Carreiro AV, Anunciação O, Carriço JA, Madeira SC. Prognostic Prediction through Biclustering-Based Classification of Clinical Gene Expression Time Series. J Integr Bioinform 2011. [DOI: 10.1515/jib-2011-175] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 11/15/2022]
Abstract
Summary The constant drive towards a more personalized medicine has led to an increasing interest in temporal gene expression analyses. It is now broadly accepted that considering a temporal perspective represents a great advantage for better understanding disease progression and treatment results at a molecular level. In this context, biclustering algorithms emerged as an important tool to discover local expression patterns in biomedical applications, and CCC-Biclustering arose as an efficient algorithm relying on the temporal nature of data to identify all maximal temporal patterns in gene expression time series. In this work, CCC-Biclustering was integrated in new biclustering-based classifiers for prognostic prediction. As a case study, we analyzed multiple gene expression time series in order to classify the response of Multiple Sclerosis patients to the standard treatment with Interferon-β, to which nearly half of the patients reveal a negative response. In this scenario, using an effective predictive model of a patient’s response would avoid useless and possibly harmful therapies for the non-responder group. The results revealed interesting potentialities to be further explored in classification problems involving other (clinical) time series.
Affiliation(s)
- André V. Carreiro
- Instituto Superior Técnico, Technical University of Lisbon, and Knowledge Discovery and Bioinformatics (KDBIO) group, INESC-ID, Lisbon, Portugal
| | - Orlando Anunciação
- Instituto Superior Técnico, Technical University of Lisbon, and Knowledge Discovery and Bioinformatics (KDBIO) group, INESC-ID, Lisbon, Portugal
| | - João A. Carriço
- Molecular Microbiology and Infection Unit, IMM and Faculty of Medicine, University of Lisbon, Portugal
| | - Sara C. Madeira
- Instituto Superior Técnico, Technical University of Lisbon, Portugal, and Knowledge Discovery and Bioinformatics (KDBIO) group, INESC-ID, Lisbon, Portugal
|
35
|
Gonçalves JP, Francisco AP, Mira NP, Teixeira MC, Sá-Correia I, Oliveira AL, Madeira SC. TFRank: network-based prioritization of regulatory associations underlying transcriptional responses. Bioinformatics 2011; 27:3149-57. [PMID: 21965816 DOI: 10.1093/bioinformatics/btr546] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Indexed: 11/15/2022]
Abstract
MOTIVATION Uncovering mechanisms underlying gene expression control is crucial to understand complex cellular responses. Studies in gene regulation often aim to identify regulatory players involved in a biological process of interest, either transcription factors coregulating a set of target genes or genes eventually controlled by a set of regulators. These are frequently prioritized with respect to a context-specific relevance score. Current approaches rely on relevance measures accounting exclusively for direct transcription factor-target interactions, namely overrepresentation of binding sites or target ratios. Gene regulation has, however, intricate behavior with overlapping, indirect effects that should not be neglected. In addition, the rapid accumulation of regulatory data already enables the prediction of large-scale networks suitable for higher level exploration by methods based on graph theory. A paradigm shift is thus emerging, where isolated and constrained analyses will likely be replaced by whole-network, systemic-aware strategies. RESULTS We present TFRank, a graph-based framework to prioritize regulatory players involved in transcriptional responses within the regulatory network of an organism, whereby every regulatory path containing genes of interest is explored and incorporated into the analysis. TFRank selected important regulators of yeast adaptation to stress induced by quinine and acetic acid, which were missed by a direct effect approach. Notably, they reportedly confer resistance toward the chemicals. In a preliminary study in human, TFRank unveiled regulators involved in breast tumor growth and metastasis when applied to genes whose expression signatures correlated with short interval to metastasis.
|
36
|
Nitsch D, Tranchevent LC, Gonçalves JP, Vogt JK, Madeira SC, Moreau Y. PINTA: a web server for network-based gene prioritization from expression data. Nucleic Acids Res 2011; 39:W334-8. [PMID: 21602267 PMCID: PMC3125740 DOI: 10.1093/nar/gkr289] [Citation(s) in RCA: 58] [Impact Index Per Article: 4.5] [Indexed: 01/06/2023]
Abstract
PINTA (available at http://www.esat.kuleuven.be/pinta/; this web site is free and open to all users and there is no login requirement) is a web resource for the prioritization of candidate genes based on the differential expression of their neighborhood in a genome-wide protein–protein interaction network. Our strategy is meant for biological and medical researchers aiming at identifying novel disease genes using disease specific expression data. PINTA supports both candidate gene prioritization (starting from a user defined set of candidate genes) as well as genome-wide gene prioritization and is available for five species (human, mouse, rat, worm and yeast). As input data, PINTA only requires disease specific expression data, whereas various platforms (e.g. Affymetrix) are supported. As a result, PINTA computes a gene ranking and presents the results as a table that can easily be browsed and downloaded by the user.
Affiliation(s)
- Daniela Nitsch
- Department of Electrical Engineering (ESAT-SCD), Katholieke Universiteit Leuven, 3001 Leuven, Belgium
|
37
|
Carreiro AV, Anunciação O, Carriço JA, Madeira SC. Biclustering-Based Classification of Clinical Expression Time Series: A Case Study in Patients with Multiple Sclerosis. 5th International Conference on Practical Applications of Computational Biology & Bioinformatics (PACBB 2011) 2011. [DOI: 10.1007/978-3-642-19914-1_31] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 12/13/2022]
|
38
|
Medina I, Carbonell J, Pulido L, Madeira SC, Goetz S, Conesa A, Tárraga J, Pascual-Montano A, Nogales-Cadenas R, Santoyo J, García F, Marbà M, Montaner D, Dopazo J. Babelomics: an integrative platform for the analysis of transcriptomics, proteomics and genomic data with advanced functional profiling. Nucleic Acids Res 2010; 38:W210-3. [PMID: 20478823 PMCID: PMC2896184 DOI: 10.1093/nar/gkq388] [Citation(s) in RCA: 265] [Impact Index Per Article: 18.9] [Indexed: 11/23/2022]
Abstract
Babelomics is a response to the growing necessity of integrating and analyzing different types of genomic data in an environment that allows an easy functional interpretation of the results. Babelomics includes a complete suite of methods for the analysis of gene expression data that include normalization (covering most commercial platforms), pre-processing, differential gene expression (case-controls, multiclass, survival or continuous values), predictors, clustering; large-scale genotyping assays (case controls and TDTs, and allows population stratification analysis and correction). All these genomic data analysis facilities are integrated and connected to multiple options for the functional interpretation of the experiments. Different methods of functional enrichment or gene set enrichment can be used to understand the functional basis of the experiment analyzed. Many sources of biological information, which include functional (GO, KEGG, Biocarta, Reactome, etc.), regulatory (Transfac, Jaspar, ORegAnno, miRNAs, etc.), text-mining or protein–protein interaction modules can be used for this purpose. Finally a tool for the de novo functional annotation of sequences has been included in the system. This provides support for the functional analysis of non-model species. Mirrors of Babelomics or command line execution of their individual components are now possible. Babelomics is available at http://www.babelomics.org.
Affiliation(s)
- Ignacio Medina
- Bioinformatics Department, Centro de Investigación Príncipe Felipe, Autopista del Saler 16, Valencia, Spain
|
39
|
Madeira SC, Teixeira MC, Sá-Correia I, Oliveira AL. Identification of regulatory modules in time series gene expression data using a linear time biclustering algorithm. IEEE/ACM Trans Comput Biol Bioinform 2010; 7:153-165. [PMID: 20150677 DOI: 10.1109/tcbb.2008.34] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.5] [Indexed: 05/28/2023]
Abstract
Although most biclustering formulations are NP-hard, in time series expression data analysis, it is reasonable to restrict the problem to the identification of maximal biclusters with contiguous columns, which correspond to coherent expression patterns shared by a group of genes in consecutive time points. This restriction leads to a tractable problem. We propose an algorithm that finds and reports all maximal contiguous column coherent biclusters in time linear in the size of the expression matrix. The linear time complexity of CCC-Biclustering relies on the use of a discretized matrix and efficient string processing techniques based on suffix trees. We also propose a method for ranking biclusters based on their statistical significance and a methodology for filtering highly overlapping and, therefore, redundant biclusters. We report results in synthetic and real data showing the effectiveness of the approach and its relevance in the discovery of regulatory modules. Results obtained using the transcriptomic expression patterns occurring in Saccharomyces cerevisiae in response to heat stress show not only the ability of the proposed methodology to extract relevant information compatible with documented biological knowledge but also the utility of using this algorithm in the study of other environmental stresses and of regulatory modules in general.
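The notion of a maximal contiguous column coherent bicluster in a discretized matrix can be sketched by brute force in Python. This is deliberately not the linear-time, suffix-tree-based CCC-Biclustering algorithm described above: it enumerates every contiguous column range, so it is far slower. The discretization alphabet (U, D, N for up, down, no change), the `min_rows` parameter and the toy matrix are assumptions for illustration.

```python
from collections import defaultdict


def ccc_biclusters(matrix, min_rows=2):
    """Enumerate maximal contiguous column coherent biclusters by brute force.

    matrix: list of equal-length strings, one discretized row per gene.
    Returns a set of (rows, first_col, last_col, pattern) tuples.
    """
    n_cols = len(matrix[0])
    found = set()
    for start in range(n_cols):
        for end in range(start, n_cols):
            # Group all rows sharing the same pattern in columns start..end,
            # so each group is already maximal with respect to rows.
            groups = defaultdict(list)
            for i, row in enumerate(matrix):
                groups[row[start:end + 1]].append(i)
            for pattern, rows in groups.items():
                if len(rows) < min_rows:
                    continue
                # Column-maximal: no one-column extension keeps all rows coherent.
                left = start > 0 and len({matrix[i][start - 1] for i in rows}) == 1
                right = end < n_cols - 1 and len({matrix[i][end + 1] for i in rows}) == 1
                if not left and not right:
                    found.add((tuple(rows), start, end, pattern))
    return found


# Hypothetical discretized expression matrix: 4 genes, 4 time points.
expression = ["UDUN", "UDUD", "NDUN", "DNUU"]
result = ccc_biclusters(expression)
for rows, first, last, pattern in sorted(result):
    print(rows, first, last, pattern)
```

The quadratic number of column ranges is what the suffix tree avoids in the published method. The sketch is meant only to pin down the definition of maximality.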
Affiliation(s)
- Sara C Madeira
- Universidade da Beira Interior, Covilhã, KDBIO Group, INESC-ID, Lisbon, Portugal.
|
40
|
Gonçalves JP, Madeira SC, Oliveira AL. BiGGEsTS: integrated environment for biclustering analysis of time series gene expression data. BMC Res Notes 2009; 2:124. [PMID: 19583847 PMCID: PMC2720980 DOI: 10.1186/1756-0500-2-124] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.3] [Received: 03/24/2009] [Accepted: 07/07/2009] [Indexed: 11/10/2022]
Abstract
Background The ability to monitor changes in expression patterns over time, and to observe the emergence of coherent temporal responses using expression time series, is critical to advance our understanding of complex biological processes. Biclustering has been recognized as an effective method for discovering local temporal expression patterns and unraveling potential regulatory mechanisms. The general biclustering problem is NP-hard. In the case of time series this problem is tractable, and efficient algorithms can be used. However, there is still a need for specialized applications able to take advantage of the temporal properties inherent to expression time series, both from a computational and a biological perspective. Findings BiGGEsTS makes available state-of-the-art biclustering algorithms for analyzing expression time series. Gene Ontology (GO) annotations are used to assess the biological relevance of the biclusters. Methods for preprocessing expression time series and post-processing results are also included. The analysis is additionally supported by a visualization module capable of displaying informative representations of the data, including heatmaps, dendrograms, expression charts and graphs of enriched GO terms. Conclusion BiGGEsTS is a free open source graphical software tool for revealing local coexpression of genes in specific intervals of time, while integrating meaningful information on gene annotations. It is freely available at: . We present a case study on the discovery of transcriptional regulatory modules in the response of Saccharomyces cerevisiae to heat stress.
Affiliation(s)
- Joana P Gonçalves
- Knowledge Discovery and Bioinformatics (KDBIO) group, INESC-ID, Rua Alves Redol, Apartado 13069, 1000-029 Lisboa, Portugal.
|
41
|
Madeira SC, Oliveira AL. A polynomial time biclustering algorithm for finding approximate expression patterns in gene expression time series. Algorithms Mol Biol 2009; 4:8. [PMID: 19497096 PMCID: PMC2709627 DOI: 10.1186/1748-7188-4-8] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.5] [Received: 07/14/2008] [Accepted: 06/04/2009] [Indexed: 11/10/2022]
Abstract
BACKGROUND The ability to monitor the change in expression patterns over time, and to observe the emergence of coherent temporal responses using gene expression time series, obtained from microarray experiments, is critical to advance our understanding of complex biological processes. In this context, biclustering algorithms have been recognized as an important tool for the discovery of local expression patterns, which are crucial to unravel potential regulatory mechanisms. Although most formulations of the biclustering problem are NP-hard, when working with time series expression data the interesting biclusters can be restricted to those with contiguous columns. This restriction leads to a tractable problem and enables the design of efficient biclustering algorithms able to identify all maximal contiguous column coherent biclusters. METHODS In this work, we propose e-CCC-Biclustering, a biclustering algorithm that finds and reports all maximal contiguous column coherent biclusters with approximate expression patterns in time polynomial in the size of the time series gene expression matrix. This polynomial time complexity is achieved by manipulating a discretized version of the original matrix using efficient string processing techniques. We also propose extensions to deal with missing values, discover anticorrelated and scaled expression patterns, and different ways to compute the errors allowed in the expression patterns. We propose a scoring criterion combining the statistical significance of expression patterns with a similarity measure between overlapping biclusters. RESULTS We present results in real data showing the effectiveness of e-CCC-Biclustering and its relevance in the discovery of regulatory modules describing the transcriptomic expression patterns occurring in Saccharomyces cerevisiae in response to heat stress. 
In particular, the results show the advantage of considering approximate patterns when compared to state of the art methods that require exact matching of gene expression time series. DISCUSSION The identification of co-regulated genes, involved in specific biological processes, remains one of the main avenues open to researchers studying gene regulatory networks. The ability of the proposed methodology to efficiently identify sets of genes with similar expression patterns is shown to be instrumental in the discovery of relevant biological phenomena, leading to more convincing evidence of specific regulatory mechanisms. AVAILABILITY A prototype implementation of the algorithm coded in Java together with the dataset and examples used in the paper is available in http://kdbio.inesc-id.pt/software/e-ccc-biclustering.
Affiliation(s)
- Sara C Madeira
- Knowledge Discovery and Bioinformatics (KDBIO) group, INESC-ID, Lisbon, Portugal
- Instituto Superior Técnico, Technical University of Lisbon, Lisbon, Portugal
- University of Beira Interior, Covilhã, Portugal
| | - Arlindo L Oliveira
- Knowledge Discovery and Bioinformatics (KDBIO) group, INESC-ID, Lisbon, Portugal
- Instituto Superior Técnico, Technical University of Lisbon, Lisbon, Portugal
|
42
|
Madeira SC, Oliveira AL. A Linear Time Biclustering Algorithm for Time Series Gene Expression Data. Lecture Notes in Computer Science 2005. [DOI: 10.1007/11557067_4] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.1] [Indexed: 12/12/2022]
|
43
|
Abstract
A large number of clustering approaches have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the results from the application of standard clustering methods to genes are limited. This limitation is imposed by the existence of a number of experimental conditions where the activity of genes is uncorrelated. A similar limitation exists when clustering of conditions is performed. For this reason, a number of algorithms that perform simultaneous clustering on the row and column dimensions of the data matrix have been proposed. The goal is to find submatrices, that is, subgroups of genes and subgroups of conditions, where the genes exhibit highly correlated activities for every condition. In this paper, we refer to this class of algorithms as biclustering. Biclustering is also referred to in the literature as coclustering and direct clustering, among other names, and has also been used in fields such as information retrieval and data mining. In this comprehensive survey, we analyze a large number of existing approaches to biclustering, and classify them in accordance with the type of biclusters they can find, the patterns of biclusters that are discovered, the methods used to perform the search, the approaches used to evaluate the solution, and the target applications.
Affiliation(s)
- Sara C Madeira
- University of Beira Interior, Rua Marquês D'Avila e Bolama, Covilhã, Portugal.
|
44
|
Madeira SC, Oliveira AL, Conceição CS. A Data Mining Approach to Credit Risk Evaluation and Behaviour Scoring. Progress in Artificial Intelligence 2003. [DOI: 10.1007/978-3-540-24580-3_25] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
|