1
Murmu S, Sinha D, Chaurasia H, Sharma S, Das R, Jha GK, Archak S. A review of artificial intelligence-assisted omics techniques in plant defense: current trends and future directions. FRONTIERS IN PLANT SCIENCE 2024; 15:1292054. [PMID: 38504888 PMCID: PMC10948452 DOI: 10.3389/fpls.2024.1292054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/10/2023] [Accepted: 01/24/2024] [Indexed: 03/21/2024]
Abstract
Plants intricately deploy defense systems to counter diverse biotic and abiotic stresses. Omics technologies, spanning genomics, transcriptomics, proteomics, and metabolomics, have revolutionized the exploration of plant defense mechanisms, unraveling molecular intricacies in response to various stressors. However, the complexity and scale of omics data necessitate sophisticated analytical tools for meaningful insights. This review delves into the application of artificial intelligence algorithms, particularly machine learning and deep learning, as promising approaches for deciphering complex omics data in plant defense research. The overview encompasses key omics techniques and addresses the challenges and limitations inherent in current AI-assisted omics approaches. Moreover, it contemplates potential future directions in this dynamic field. In summary, AI-assisted omics techniques present a robust toolkit, enabling a profound understanding of the molecular foundations of plant defense and paving the way for more effective crop protection strategies amidst climate change and emerging diseases.
Affiliation(s)
- Sneha Murmu
  - Indian Agricultural Statistics Research Institute, Indian Council of Agricultural Research (ICAR), New Delhi, India
- Dipro Sinha
  - Indian Agricultural Statistics Research Institute, Indian Council of Agricultural Research (ICAR), New Delhi, India
- Himanshushekhar Chaurasia
  - Central Institute for Research on Cotton Technology, Indian Council of Agricultural Research (ICAR), Mumbai, India
- Soumya Sharma
  - Indian Agricultural Statistics Research Institute, Indian Council of Agricultural Research (ICAR), New Delhi, India
- Ritwika Das
  - Indian Agricultural Statistics Research Institute, Indian Council of Agricultural Research (ICAR), New Delhi, India
- Girish Kumar Jha
  - Indian Agricultural Statistics Research Institute, Indian Council of Agricultural Research (ICAR), New Delhi, India
- Sunil Archak
  - National Bureau of Plant Genetic Resources, Indian Council of Agricultural Research (ICAR), New Delhi, India
2
Olavarria K, Becker MV, Sousa DZ, van Loosdrecht MC, Wahl SA. Design and thermodynamic analysis of a pathway enabling anaerobic production of poly-3-hydroxybutyrate in Escherichia coli. Synth Syst Biotechnol 2023; 8:629-639. [PMID: 37823039 PMCID: PMC10562921 DOI: 10.1016/j.synbio.2023.09.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2023] [Revised: 09/14/2023] [Accepted: 09/19/2023] [Indexed: 10/13/2023] Open
Abstract
Utilizing anaerobic metabolisms for the production of biotechnologically relevant products presents potential advantages, such as increased yields and reduced energy dissipation. However, lower energy dissipation may indicate that certain reactions are operating closer to their thermodynamic equilibrium. While stoichiometric analyses and genetic modifications are frequently employed in metabolic engineering, the use of thermodynamic tools to evaluate the feasibility of planned interventions is less documented. In this study, we propose a novel metabolic engineering strategy to achieve efficient anaerobic production of poly-(R)-3-hydroxybutyrate (PHB) in the model organism Escherichia coli. Our approach involves re-routing two-thirds of the glycolytic flux through non-oxidative glycolysis and coupling PHB synthesis with NADH re-oxidation. We complemented our stoichiometric analysis with various thermodynamic approaches to assess the feasibility and the bottlenecks in the proposed engineered pathway. According to our calculations, the main thermodynamic bottlenecks are the reactions catalyzed by the acetoacetyl-CoA β-ketothiolase (EC 2.3.1.9) and the acetoacetyl-CoA reductase (EC 1.1.1.36). Furthermore, we calculated thermodynamically consistent sets of kinetic parameters to determine the enzyme amounts required for sustaining the conversion fluxes. In the case of the engineered conversion route, the protein pool necessary to sustain the desired fluxes could account for 20% of the whole cell dry weight.
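As a rough illustration of what a thermodynamic feasibility check involves, the sketch below evaluates the textbook relation ΔG' = ΔG'° + RT ln Q for a single reaction. It is not the authors' workflow: the function name, the standard Gibbs energy, and the concentrations are all placeholder assumptions chosen for illustration, not values taken from the paper. The reaction stoichiometry assumed is the thiolase condensation, 2 acetyl-CoA → acetoacetyl-CoA + CoA.

```python
from math import log

R_KJ = 8.314462618e-3  # gas constant in kJ/(mol*K)


def reaction_gibbs_energy(dg0_prime, substrates, products, temperature=298.15):
    """Return dG' = dG'0 + R*T*ln(Q) in kJ/mol.

    substrates and products are lists of concentrations in mol/L; a species
    with stoichiometric coefficient 2 is simply listed twice.
    """
    quotient = 1.0
    for conc in products:
        quotient *= conc
    for conc in substrates:
        quotient /= conc
    return dg0_prime + R_KJ * temperature * log(quotient)


# Placeholder inputs (assumptions, NOT values from the paper).
DG0_PLACEHOLDER = 25.0          # kJ/mol, hypothetical standard Gibbs energy
ACETYL_COA = 1e-3               # mol/L, hypothetical
ACETOACETYL_COA = 1e-5          # mol/L, hypothetical
COA = 1e-4                      # mol/L, hypothetical

dg = reaction_gibbs_energy(
    DG0_PLACEHOLDER,
    substrates=[ACETYL_COA, ACETYL_COA],   # 2 acetyl-CoA
    products=[ACETOACETYL_COA, COA],       # acetoacetyl-CoA + CoA
)
feasible = dg < 0  # a reaction can carry net forward flux only if dG' < 0
print(f"dG' = {dg:.2f} kJ/mol, feasible in forward direction: {feasible}")
```

With these placeholder numbers the reaction remains unfavourable, which shows how a step with a positive standard Gibbs energy depends on a low product-to-substrate ratio to proceed.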
Affiliation(s)
- Karel Olavarria
  - Laboratory of Microbiology, Wageningen University and Research, Stippenenweg 4, 6708 WE, Wageningen, The Netherlands
  - Centre for Living Technologies, Eindhoven-Wageningen-Utrecht Alliance, Princetonlaan 6, 3584 CB, Utrecht, The Netherlands
- Marco V. Becker
  - Department of Biotechnology, Applied Sciences Faculty, Delft University of Technology, van der Maasweg 9, 2629 HZ, Delft, The Netherlands
- Diana Z. Sousa
  - Laboratory of Microbiology, Wageningen University and Research, Stippenenweg 4, 6708 WE, Wageningen, The Netherlands
  - Centre for Living Technologies, Eindhoven-Wageningen-Utrecht Alliance, Princetonlaan 6, 3584 CB, Utrecht, The Netherlands
- Mark C.M. van Loosdrecht
  - Department of Biotechnology, Applied Sciences Faculty, Delft University of Technology, van der Maasweg 9, 2629 HZ, Delft, The Netherlands
- S. Aljoscha Wahl
  - Lehrstuhl für Bioverfahrenstechnik, Friedrich-Alexander-Universität, Paul-Gordan-Strasse 3, 91052, Erlangen, Germany
3
Romero M, Nakano FK, Finke J, Rocha C, Vens C. Leveraging class hierarchy for detecting missing annotations on hierarchical multi-label classification. Comput Biol Med 2023; 152:106423. [PMID: 36529023 DOI: 10.1016/j.compbiomed.2022.106423] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2022] [Revised: 11/09/2022] [Accepted: 12/11/2022] [Indexed: 12/15/2022]
Abstract
With the development of new sequencing technologies, the availability of genomic data has grown exponentially. Over the past decade, numerous studies have used genomic data to identify associations between genes and biological functions. While these studies have shown success in annotating genes with functions, they often assume that genes are completely annotated and fail to take into account that datasets are sparse and noisy. This work proposes a method to detect missing annotations in the context of hierarchical multi-label classification. More precisely, our method exploits the relations of functions, represented as a hierarchy, by computing probabilities based on the paths of functions in the hierarchy. By performing several experiments on a variety of rice (Oryza sativa Japonica), we showcase that the proposed method accurately detects missing annotations and yields superior results when compared to state-of-the-art methods from the literature.
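To make the idea of path-based probabilities concrete, here is a minimal sketch, assuming a tree-shaped hierarchy and per-term conditional probabilities P(term | parent) supplied by some classifier. It is an illustration of the general idea only, not the authors' method: the toy hierarchy, the probability values, the threshold, and all function names are hypothetical.

```python
from math import prod

# Hypothetical toy hierarchy: child -> parent (None marks the root).
PARENT = {
    "root": None,
    "metabolism": "root",
    "transport": "root",
    "lipid_metabolism": "metabolism",
}

# Hypothetical conditional probabilities P(term | parent) for one gene.
LOCAL_PROB = {
    "root": 1.0,
    "metabolism": 0.8,
    "transport": 0.3,
    "lipid_metabolism": 0.5,
}


def path_to_root(term, parent=PARENT):
    """Return the list of terms from the root down to `term`."""
    path = []
    while term is not None:
        path.append(term)
        term = parent[term]
    return path[::-1]


def path_probability(term, local_prob, parent=PARENT):
    """Multiply the conditional probabilities along the root-to-term path."""
    return prod(local_prob[t] for t in path_to_root(term, parent))


def candidate_missing_annotations(annotated, local_prob, threshold, parent=PARENT):
    """Flag unannotated terms whose path probability reaches the threshold."""
    return sorted(
        term
        for term in parent
        if term not in annotated
        and path_probability(term, local_prob, parent) >= threshold
    )


missing = candidate_missing_annotations(
    annotated={"root", "metabolism"}, local_prob=LOCAL_PROB, threshold=0.35
)
print(missing)
```

Multiplying along the path keeps the scores consistent with the hierarchy: a term can never score higher than any of its ancestors.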
Affiliation(s)
- Miguel Romero
  - Department of Electronics and Computer Science, Pontificia Universidad Javeriana, Calle 18 N 118-250, Cali, 760031, Colombia.
- Felipe Kenji Nakano
  - Department of Public Health and Primary Care, KU Leuven Campus KULAK, Etienne Sabbelaan 53, Kortrijk, 8500, Belgium; Itec, imec research group at KU Leuven, Etienne Sabbelaan 53, Kortrijk, 8500, Belgium.
- Jorge Finke
  - Department of Electronics and Computer Science, Pontificia Universidad Javeriana, Calle 18 N 118-250, Cali, 760031, Colombia.
- Camilo Rocha
  - Department of Electronics and Computer Science, Pontificia Universidad Javeriana, Calle 18 N 118-250, Cali, 760031, Colombia.
- Celine Vens
  - Department of Public Health and Primary Care, KU Leuven Campus KULAK, Etienne Sabbelaan 53, Kortrijk, 8500, Belgium; Itec, imec research group at KU Leuven, Etienne Sabbelaan 53, Kortrijk, 8500, Belgium.
4
Artificial intelligence and machine-learning approaches in structure and ligand-based discovery of drugs affecting central nervous system. Mol Divers 2022; 27:959-985. [PMID: 35819579 DOI: 10.1007/s11030-022-10489-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 04/27/2022] [Accepted: 06/21/2022] [Indexed: 12/11/2022]
Abstract
CNS disorders are indications with very high unmet medical need, a relatively small number of available drugs, and subpar satisfaction among patients and caregivers. Discovery of CNS drugs is an extremely expensive affair with its own unique challenges, leading to extremely high attrition rates and low efficiency. With the explosion of data in the information age, there is hardly any aspect of life that has not been touched by data-driven technologies such as artificial intelligence (AI) and machine learning (ML). Drug discovery is no exception: the emergence of big data via genomic, proteomic, biological, and chemical technologies has driven pharmaceutical giants to collaborate with AI-oriented companies to revolutionise drug discovery, with the goal of increasing the efficiency of the process. In recent years, many examples of innovative applications of AI and ML techniques in CNS drug discovery have been reported. Research on therapeutics for diseases such as schizophrenia, Alzheimer's and Parkinsonism has been given a new direction and thrust by these developments. AI and ML have been applied to both ligand-based and structure-based drug discovery and design of CNS therapeutics. In this review, we summarise the general aspects of AI and ML from the perspective of drug discovery, followed by comprehensive coverage of recent developments in the applications of AI/ML techniques in CNS drug discovery.
5
Hierarchical classification of pollinating flying insects under changing environments. ECOL INFORM 2022. [DOI: 10.1016/j.ecoinf.2022.101751] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
6
Liu W, Yuan J, Lyu G, Feng S. Label driven latent subspace learning for multi-view multi-label classification. APPL INTELL 2022. [DOI: 10.1007/s10489-022-03600-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
7
Petkov S, Chiodi F. Impaired CD4+ T cell differentiation in HIV-1 infected patients receiving early anti-retroviral therapy. Genomics 2022; 114:110367. [PMID: 35429609 DOI: 10.1016/j.ygeno.2022.110367] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/09/2022] [Revised: 04/01/2022] [Accepted: 04/09/2022] [Indexed: 01/14/2023]
Abstract
Differentiation of CD4+ T naïve (TN) into central memory (TCM) cells involves extensive molecular processes. We compared the transcriptomes of CD4+ TN and TCM cells from HIV-1 infected patients receiving early anti-retroviral therapy (ART; EA; n = 13) and controls (n = 15). Comparison of protein coding genes between TCM and TN revealed 533 and 82 differentially expressed genes (DEGs) in controls and EA, respectively. A high degree of transcriptional complexity was detected during transition of CD4+ TN to TCM cells in controls involving 70 TFs, 20 master regulators of T cell differentiation (TBX21, GATA3, RARA, FOXP3, RORC); in EA only 7 TFs were modulated with expression of several master regulators remaining unchanged during differentiation. Analysis of interactions between modulated TFs and target genes revealed important regulatory interactions missing in EA group. We conclude that T cell differentiation in EA patients is impaired due to reduced modulation of genes involved in transition from CD4+ TN to TCM cells.
Affiliation(s)
- Stefan Petkov
  - Department of Microbiology, Tumor and Cell Biology, Biomedicum, Karolinska Institutet, Solna, Sweden
- Francesca Chiodi
  - Department of Microbiology, Tumor and Cell Biology, Biomedicum, Karolinska Institutet, Solna, Sweden
8
Zhu R, Zhang Q, Tang L, Zhao Y, Li J, Li F. Redescription of Bakuella (Bakuella) marina Agamaliev and Alekperov, 1976 (Protozoa, Hypotrichia), With Notes on Its Morphology, Morphogenesis, and Molecular Phylogeny. Front Microbiol 2022; 12:774226. [PMID: 35222294 PMCID: PMC8867040 DOI: 10.3389/fmicb.2021.774226] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/11/2021] [Accepted: 12/23/2021] [Indexed: 11/30/2022] Open
Abstract
Because the original description of Bakuella (Bakuella) marina, type of the genus, is based only on protargol-impregnated specimens, one of the important living features, namely the presence or absence of cortical granules, has remained unknown so far. In the present work, a detailed investigation of a Chinese population of B. (Bakuella) marina is carried out using integrated approaches, and the live morphology, ontogenesis, and molecular information of B. (Bakuella) marina are presented for the first time. The infraciliature of this population corresponds perfectly with that of the original description. The in vivo observation indicates that B. (Bakuella) marina possesses colorless cortical granules. The most prominent morphogenetic feature of B. (Bakuella) marina is that the parental adoral zone of membranelles is completely replaced by the newly formed one of the proters. Molecular phylogenetic analysis based on a small subunit ribosomal gene (SSU rDNA) shows that five Bakuella species are clustered with species from six other urostylid genera, namely, Anteholosticha, Apobakuella, Diaxonella, Holosticha, Neobakuella, and Urostyla. The monophyletic probabilities of the family Bakuellidae, genus Bakuella, subgenus B. (Bakuella), and subgenus B. (Pseudobakuella) are rejected by the approximately unbiased test. This study further shows that the family Bakuellidae, genus Bakuella, and subgenus B. (Bakuella) are all nonmonophyletic groups. In order to establish a reasonable classification system, molecular and morphogenetic information on more bakuellids and their related species is urgently needed.
Affiliation(s)
- Rong Zhu
  - College of Life Sciences, Hebei University, Baoding, China
- Qi Zhang
  - College of Life Sciences, Hebei University, Baoding, China
- Lan Tang
  - College of Life Sciences, Hebei University, Baoding, China
- Yan Zhao
  - College of Life Sciences, Capital Normal University, Beijing, China
- Jingbao Li
  - Key Laboratory for Space Bioscience and Biotechnology, School of Life Sciences, Institute of Special Environmental Biophysics, Northwestern Polytechnical University, Xi’an, China
- Fengchao Li
  - College of Life Sciences, Hebei University, Baoding, China
  - Innovation Center for Bioengineering and Biotechnology of Hebei Province, Baoding, China
11
Li H, Xuan J, Wang C, Chen Z, Grégori G, Zhao Y, Zhang W. Summertime Tintinnid Community in the Surface Waters Across the North Pacific Transition Zone. Front Microbiol 2021; 12:697801. [PMID: 34456886 PMCID: PMC8386027 DOI: 10.3389/fmicb.2021.697801] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 04/20/2021] [Accepted: 07/15/2021] [Indexed: 11/26/2022] Open
Abstract
Located from 35° to 45° latitude in both hemispheres, the transition zone is an important region with respect to the planktonic biogeography of the sea. However, to the best of our knowledge, there have been no reports on the existence of a tintinnid community in the transition zone. In this research, tintinnids along two transects across the North Pacific Transition Zone (NPTZ) were investigated in summer 2016 and 2019. Eighty-three oceanic tintinnid species were identified, 41 of which were defined as common oceanic species. The common oceanic species were further divided into five groups: boreal, warm water type I, warm water type II, transition zone, and cosmopolitan species. Undella californiensis and Undella clevei were transition zone species. Other species, such as Amphorides minor, Dadayiella ganymedes, Dictyocysta mitra, Eutintinnus pacificus, Eutintinnus tubulosus, Protorhabdonella simplex, and Steenstrupiella steenstrupii, were the most abundant in the NPTZ but spread over a much larger distribution region. Species richness showed no obvious increase in the NPTZ. Boreal, transition zone, and warm water communities were distinguished along the two transects. The tintinnid transition zone community was mainly distributed in regions with water temperatures between 15 and 20°C. The tintinnid lorica oral diameter size classes were dominated by the 24-28 μm size class in all three communities, but the dominance decreased from 66.26% in the boreal community to 48.85% in the transition zone community and then to 22.72% in the warm water community. Our research confirmed the existence of tintinnid transition zone species and a transition zone community. The abrupt disappearance of warm water type I species below 15°C suggested that this group could be used as an indicator of the northern boundary of the NPTZ.
Affiliation(s)
- Haibo Li
  - CAS Key Laboratory of Marine Ecology and Environmental Sciences, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, China
  - Laboratory for Marine Ecology and Environmental Science, Qingdao National Laboratory for Marine Science and Technology, Qingdao, China
  - Center for Ocean Mega-Science, Chinese Academy of Sciences, Qingdao, China
- Jun Xuan
  - CAS Key Laboratory of Marine Ecology and Environmental Sciences, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, China
  - Laboratory for Marine Ecology and Environmental Science, Qingdao National Laboratory for Marine Science and Technology, Qingdao, China
  - University of Chinese Academy of Sciences, Beijing, China
- Chaofeng Wang
  - CAS Key Laboratory of Marine Ecology and Environmental Sciences, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, China
  - Laboratory for Marine Ecology and Environmental Science, Qingdao National Laboratory for Marine Science and Technology, Qingdao, China
  - Center for Ocean Mega-Science, Chinese Academy of Sciences, Qingdao, China
- Zhaohui Chen
  - Physical Oceanography Laboratory/Frontiers Science Center for Deep Ocean Multispheres and Earth System, Ocean University of China, Qingdao, China
- Gérald Grégori
  - University of Chinese Academy of Sciences, Beijing, China
  - Aix-Marseille University, Université de Toulon, CNRS, IRD, Mediterranean Institute of Oceanography, Marseille, France
- Yuan Zhao
  - CAS Key Laboratory of Marine Ecology and Environmental Sciences, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, China
  - Laboratory for Marine Ecology and Environmental Science, Qingdao National Laboratory for Marine Science and Technology, Qingdao, China
  - Center for Ocean Mega-Science, Chinese Academy of Sciences, Qingdao, China
- Wuchang Zhang
  - CAS Key Laboratory of Marine Ecology and Environmental Sciences, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, China
  - Laboratory for Marine Ecology and Environmental Science, Qingdao National Laboratory for Marine Science and Technology, Qingdao, China
  - Center for Ocean Mega-Science, Chinese Academy of Sciences, Qingdao, China
12
Li HD, Yang C, Zhang Z, Yang M, Wu FX, Omenn GS, Wang J. IsoResolve: predicting splice isoform functions by integrating gene and isoform-level features with domain adaptation. Bioinformatics 2021; 37:522-530. [PMID: 32966552 PMCID: PMC8088322 DOI: 10.1093/bioinformatics/btaa829] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/04/2020] [Revised: 08/12/2020] [Accepted: 09/09/2020] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION High resolution annotation of gene functions is a central goal in functional genomics. A single gene may produce multiple isoforms with different functions through alternative splicing. Conventional approaches, however, consider a gene as a single entity without differentiating these functionally different isoforms. Towards understanding gene functions at higher resolution, recent efforts have focused on predicting the functions of isoforms. However, the performance of existing methods is far from satisfactory mainly because of the lack of isoform-level functional annotation. RESULTS We present IsoResolve, a novel approach for isoform function prediction, which leverages the information from gene function prediction models with domain adaptation (DA). IsoResolve treats gene-level and isoform-level features as source and target domains, respectively. It uses DA to project the two domains into a latent variable space in such a way that the latent variables from the two domains have similar distribution, which enables the gene domain information to be leveraged for isoform function prediction. We systematically evaluated the performance of IsoResolve in predicting functions. Compared with five state-of-the-art methods, IsoResolve achieved significantly better performance. IsoResolve was further validated by case studies of genes with isoform-level functional annotation. AVAILABILITY AND IMPLEMENTATION IsoResolve is freely available at https://github.com/genemine/IsoResolve. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hong-Dong Li
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering
- Changhuo Yang
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering
- Zhimin Zhang
  - College of Chemistry and Chemical Engineering, Central South University, Changsha, Hunan 410083, China
- Mengyun Yang
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering
- Fang-Xiang Wu
  - Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, SK S7N5A9, Canada
- Gilbert S Omenn
  - Institute for Systems Biology, Seattle, WA 98101, USA
  - Department of Computational Medicine & Bioinformatics, University of Michigan, Ann Arbor, MI 48109-2218, USA
- Jianxin Wang
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering
13
Handling imbalance in hierarchical classification problems using local classifiers approaches. Data Min Knowl Discov 2021. [DOI: 10.1007/s10618-021-00762-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/21/2022]
14
Che X, Chen D, Mi J. Feature distribution-based label correlation in multi-label classification. INT J MACH LEARN CYB 2021. [DOI: 10.1007/s13042-020-01268-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 10/22/2022]
16
Paredes O, Romo-Vázquez R, Román-Godínez I, Vélez-Pérez H, Salido-Ruiz RA, Morales JA. Frequency spectra characterization of noncoding human genomic sequences. Genes Genomics 2020; 42:1215-1226. [DOI: 10.1007/s13258-020-00980-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/22/2019] [Accepted: 04/27/2020] [Indexed: 11/28/2022]
18
Mahood EH, Kruse LH, Moghe GD. Machine learning: A powerful tool for gene function prediction in plants. APPLICATIONS IN PLANT SCIENCES 2020; 8:e11376. [PMID: 32765975 PMCID: PMC7394712 DOI: 10.1002/aps3.11376] [Citation(s) in RCA: 51] [Impact Index Per Article: 12.8] [Received: 10/01/2019] [Accepted: 03/19/2020] [Indexed: 05/06/2023]
Abstract
Recent advances in sequencing and informatic technologies have led to a deluge of publicly available genomic data. While it is now relatively easy to sequence, assemble, and identify genic regions in diploid plant genomes, functional annotation of these genes is still a challenge. Over the past decade, there has been a steady increase in studies utilizing machine learning algorithms for various aspects of functional prediction, because these algorithms are able to integrate large amounts of heterogeneous data and detect patterns inconspicuous through rule-based approaches. The goal of this review is to introduce experimental plant biologists to machine learning, by describing how it is currently being used in gene function prediction to gain novel biological insights. In this review, we discuss specific applications of machine learning in identifying structural features in sequenced genomes, predicting interactions between different cellular components, and predicting gene function and organismal phenotypes. Finally, we also propose strategies for stimulating functional discovery using machine learning-based approaches in plants.
Affiliation(s)
- Elizabeth H. Mahood
  - Plant Biology Section, School of Integrative Plant Sciences, Cornell University, Ithaca, New York 14853, USA
- Lars H. Kruse
  - Plant Biology Section, School of Integrative Plant Sciences, Cornell University, Ithaca, New York 14853, USA
- Gaurav D. Moghe
  - Plant Biology Section, School of Integrative Plant Sciences, Cornell University, Ithaca, New York 14853, USA
19
Shaw D, Chen H, Jiang T. DeepIsoFun: a deep domain adaptation approach to predict isoform functions. Bioinformatics 2020; 35:2535-2544. [PMID: 30535380 DOI: 10.1093/bioinformatics/bty1017] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 06/29/2018] [Revised: 11/07/2018] [Accepted: 12/08/2018] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Isoforms are mRNAs produced from the same gene locus by alternative splicing and may have different functions. Although gene functions have been studied extensively, little is known about the specific functions of isoforms. Recently, some computational approaches based on multiple instance learning have been proposed to predict isoform functions from annotated gene functions and expression data, but their performance is far from being desirable primarily due to the lack of labeled training data. To improve the performance on this problem, we propose a novel deep learning method, DeepIsoFun, that combines multiple instance learning with domain adaptation. The latter technique helps to transfer the knowledge of gene functions to the prediction of isoform functions and provides additional labeled training data. Our model is trained on a deep neural network architecture so that it can adapt to different expression distributions associated with different gene ontology terms. RESULTS We evaluated the performance of DeepIsoFun on three expression datasets of human and mouse collected from SRA studies at different times. On each dataset, DeepIsoFun performed significantly better than the existing methods. In terms of area under the receiver operating characteristics curve, our method acquired at least 26% improvement and in terms of area under the precision-recall curve, it acquired at least 10% improvement over the state-of-the-art methods. In addition, we also study the divergence of the functions predicted by our method for isoforms from the same gene and the overall correlation between expression similarity and the similarity of predicted functions. AVAILABILITY AND IMPLEMENTATION https://github.com/dls03/DeepIsoFun/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Dipan Shaw
  - Department of Computer Science and Engineering, University of California, Riverside, CA, USA
- Hao Chen
  - Department of Computer Science and Engineering, University of California, Riverside, CA, USA
- Tao Jiang
  - Department of Computer Science and Engineering, University of California, Riverside, CA, USA
  - Bioinformatics Division, BNRIST/Department of Computer Science and Technology, Tsinghua University, Beijing, China
20
Park JY, Kim JH. Incremental Class Learning for Hierarchical Classification. IEEE TRANSACTIONS ON CYBERNETICS 2020; 50:178-189. [PMID: 30188844 DOI: 10.1109/tcyb.2018.2866869] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/08/2023]
Abstract
Objects can be described in hierarchical semantics, and people also perceive them this way. This leads to the need for hierarchical classification in machine learning. In addition, when new data belonging to a new class are given, existing classification methods must be retrained on all data, including the new data. To deal with these issues, we propose an adaptive resonance theory-supervised predictive mapping for hierarchical classification (ARTMAP-HC) network that allows incremental class learning for raw data without normalization in advance. Our proposed ARTMAP-HC is composed of hierarchically stacked modules, and each module incorporates two fuzzy ARTMAP networks. Regardless of the level of the class hierarchy and the number of classes for each level, ARTMAP-HC is able to incrementally learn sequentially added input data belonging to new classes. By using a novel online normalization process, ARTMAP-HC can classify the new data without prior knowledge of the maximum value of the dataset. By adopting the prior labels appending process, the class dependency between class hierarchy levels is reflected in ARTMAP-HC. The effectiveness of the proposed ARTMAP-HC is validated through experiments on hierarchical classification datasets. To demonstrate the applicability, ARTMAP-HC is applied to a multimedia recommendation system for digital storytelling.
21
Gao J, Liu L, Yao S, Huang X, Mamitsuka H, Zhu S. HPOAnnotator: improving large-scale prediction of HPO annotations by low-rank approximation with HPO semantic similarities and multiple PPI networks. BMC Med Genomics 2019; 12:187. [PMID: 31865916 PMCID: PMC6927106 DOI: 10.1186/s12920-019-0625-1] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 01/04/2023] Open
Abstract
BACKGROUND As a standardized vocabulary of phenotypic abnormalities associated with human diseases, the Human Phenotype Ontology (HPO) has been widely used by researchers to annotate phenotypes of genes/proteins. For saving the cost and time spent on experiments, many computational approaches have been proposed. They are able to alleviate the problem to some extent, but their performances are still far from satisfactory. METHOD For inferring large-scale protein-phenotype associations, we propose HPOAnnotator that incorporates multiple Protein-Protein Interaction (PPI) information and the hierarchical structure of HPO. Specifically, we use a dual graph to regularize Non-negative Matrix Factorization (NMF) in a way that the information from different sources can be seamlessly integrated. In essence, HPOAnnotator solves the sparsity problem of a protein-phenotype association matrix by using a low-rank approximation. RESULTS By combining the hierarchical structure of HPO and co-annotations of proteins, our model can well capture the HPO semantic similarities. Moreover, graph Laplacian regularizations are imposed in the latent space so as to utilize multiple PPI networks. The performance of HPOAnnotator has been validated under cross-validation and independent test. Experimental results have shown that HPOAnnotator outperforms the competing methods significantly. CONCLUSIONS Through extensive comparisons with the state-of-the-art methods, we conclude that the proposed HPOAnnotator is able to achieve the superior performance as a result of using a low-rank approximation with a graph regularization. It is promising in that our approach can be considered as a starting point to study more efficient matrix factorization-based algorithms.
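The low-rank idea mentioned in the abstract can be illustrated with plain non-negative matrix factorization using the standard multiplicative updates. This is a simplified sketch only: it leaves out the dual-graph Laplacian regularization, the HPO semantic similarities, and the PPI networks that HPOAnnotator adds, and the toy protein-by-phenotype matrix, the rank, and all function names are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frob_error(v, w, h):
    """Squared Frobenius norm of V - W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iters=200, seed=0, eps=1e-9):
    """Factor V (n x m) into non-negative W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h


# Hypothetical protein-by-phenotype association matrix (1 = known annotation).
V = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],  # the last entry might be a missing annotation
]

W0, H0 = nmf(V, rank=2, iters=0)      # random initialization only
error_before = frob_error(V, W0, H0)
W, H = nmf(V, rank=2)
error_after = frob_error(V, W, H)
scores = matmul(W, H)                 # dense scores for every protein-phenotype pair
print(f"reconstruction error: {error_before:.3f} -> {error_after:.3f}")
```

The reconstructed matrix `scores` assigns a value to every pair, including those recorded as zero, which is how a low-rank model can rank candidate associations in a sparse annotation matrix.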
Affiliation(s)
- Junning Gao
  - School of Computer Science and Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, 220 Handan Road, Shanghai, 200433 China
- Lizhi Liu
  - School of Computer Science and Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, 220 Handan Road, Shanghai, 200433 China
- Shuwei Yao
  - School of Computer Science and Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, 220 Handan Road, Shanghai, 200433 China
- Xiaodi Huang
  - School of Computing and Mathematics, Charles Sturt University, Elizabeth Mitchell Dr, Albury, NSW 2640 Australia
- Hiroshi Mamitsuka
  - Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kashiwada Gokasho, Uji, Kyoto, 611-0011 Japan
  - Department of Computer Science, Aalto University, Konemiehentie 2, Espoo, 02150 Finland
- Shanfeng Zhu
  - School of Computer Science and Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, 220 Handan Road, Shanghai, 200433 China
  - Shanghai Institute of Artificial Intelligence Algorithms and ISTBI, Fudan University, Shanghai, 200433 China
  - Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, China
|
22
|
Pliakos K, Vens C. Network inference with ensembles of bi-clustering trees. BMC Bioinformatics 2019; 20:525. [PMID: 31660848 PMCID: PMC6819564 DOI: 10.1186/s12859-019-3104-y] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 02/28/2019] [Accepted: 09/20/2019] [Indexed: 02/07/2023]
Abstract
BACKGROUND Network inference is crucial for biomedicine and systems biology. Biological entities and their associations are often modeled as interaction networks. Examples include drug protein interaction or gene regulatory networks. Studying and elucidating such networks can lead to the comprehension of complex biological processes. However, usually we have only partial knowledge of those networks and the experimental identification of all the existing associations between biological entities is very time consuming and particularly expensive. Many computational approaches have been proposed over the years for network inference, nonetheless, efficiency and accuracy are still persisting open problems. Here, we propose bi-clustering tree ensembles as a new machine learning method for network inference, extending the traditional tree-ensemble models to the global network setting. The proposed approach addresses the network inference problem as a multi-label classification task. More specifically, the nodes of a network (e.g., drugs or proteins in a drug-protein interaction network) are modelled as samples described by features (e.g., chemical structure similarities or protein sequence similarities). The labels in our setting represent the presence or absence of links connecting the nodes of the interaction network (e.g., drug-protein interactions in a drug-protein interaction network). RESULTS We extended traditional tree-ensemble methods, such as extremely randomized trees (ERT) and random forests (RF) to ensembles of bi-clustering trees, integrating background information from both node sets of a heterogeneous network into the same learning framework. We performed an empirical evaluation, comparing the proposed approach to currently used tree-ensemble based approaches as well as other approaches from the literature. We demonstrated the effectiveness of our approach in different interaction prediction (network inference) settings. 
For evaluation purposes, we used several benchmark datasets that represent drug-protein and gene regulatory networks. We also applied our proposed method to two versions of a chemical-protein association network extracted from the STITCH database, demonstrating the potential of our model in predicting non-reported interactions. CONCLUSIONS Bi-clustering trees outperform existing tree-based strategies as well as machine learning methods based on other algorithms. Since our approach is based on tree-ensembles it inherits the advantages of tree-ensemble learning, such as handling of missing values, scalability and interpretability.
Affiliation(s)
- Konstantinos Pliakos
  - KU Leuven, Campus KULAK, Department of Public Health and Primary Care, Faculty of Medicine, Kortrijk, Belgium
  - ITEC, imec research group at KU Leuven, Kortrijk, Belgium
- Celine Vens
  - KU Leuven, Campus KULAK, Department of Public Health and Primary Care, Faculty of Medicine, Kortrijk, Belgium
  - ITEC, imec research group at KU Leuven, Kortrijk, Belgium
|
23
|
Nakano FK, Lietaert M, Vens C. Machine learning for discovering missing or wrong protein function annotations: A comparison using updated benchmark datasets. BMC Bioinformatics 2019; 20:485. [PMID: 31547800 PMCID: PMC6755698 DOI: 10.1186/s12859-019-3060-6] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 05/07/2019] [Accepted: 08/27/2019] [Indexed: 12/22/2022]
Abstract
BACKGROUND A massive amount of proteomic data is generated on a daily basis; nonetheless, annotating all sequences is costly and often unfeasible. As a countermeasure, machine learning methods have been used to automatically annotate new protein functions. More specifically, many studies have investigated hierarchical multi-label classification (HMC) methods to predict annotations, using the Functional Catalogue (FunCat) or Gene Ontology (GO) label hierarchies. Most of these studies employed benchmark datasets created more than a decade ago, and thus train their models on outdated information. In this work, we provide an updated version of these datasets. By querying recent versions of FunCat and GO yeast annotations, we provide 24 new datasets in total. We compare four HMC methods, providing baseline results for the new datasets. Furthermore, we also evaluate whether the predictive models are able to discover new or wrong annotations, by training them on the old data and evaluating their results against the most recent information. RESULTS The results demonstrated that the method based on predictive clustering trees, Clus-Ensemble, proposed in 2008, achieved superior results compared to more recent methods on the standard evaluation task. For the discovery of new knowledge, Clus-Ensemble performed better when discovering new annotations in the FunCat taxonomy, whereas hierarchical multi-label classification with genetic algorithm (HMC-GA), a method based on genetic algorithms, was overall superior when detecting annotations that were removed. In the GO datasets, Clus-Ensemble once again had the upper hand when discovering new annotations, whereas HMC-GA performed better for detecting removed annotations. However, in this evaluation, there were fewer significant differences among the methods. CONCLUSIONS The experiments have shown that protein function prediction is a very challenging task that should be further investigated. 
We believe that the baseline results associated with the updated datasets provided in this work should be considered as guidelines for future studies; nonetheless, the old versions of the datasets should not be disregarded, since other tasks in machine learning could benefit from them.
Affiliation(s)
- Felipe Kenji Nakano
  - KU Leuven, Campus KULAK - Department of Public Health and Primary Care, Etienne Sabbelaan 53, Kortrijk, 8500 Belgium
  - ITEC - imec, Etienne Sabbelaan 51, Kortrijk, 8500 Belgium
- Mathias Lietaert
  - Howest University of Applied Sciences, Campus Brugge Station, Rijselstraat 5, Brugge, 8200 Belgium
- Celine Vens
  - KU Leuven, Campus KULAK - Department of Public Health and Primary Care, Etienne Sabbelaan 53, Kortrijk, 8500 Belgium
  - ITEC - imec, Etienne Sabbelaan 51, Kortrijk, 8500 Belgium
|
24
|
Machine learning technology in the application of genome analysis: A systematic review. Gene 2019; 705:149-156. [PMID: 31026571 DOI: 10.1016/j.gene.2019.04.062] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 01/24/2019] [Revised: 04/17/2019] [Accepted: 04/22/2019] [Indexed: 01/17/2023]
Abstract
Machine learning (ML) is a powerful technique to tackle many problems in data mining and predictive analytics. We believe that ML has considerable potential in the field of bioinformatics, since high-throughput technology is producing ever-increasing amounts of biological data. In this review, we summarize in detail the major ML algorithms and the conditions that require attention when applying these algorithms to genomic problems, and we provide a list of examples from different perspectives together with current data analysis challenges.
|
25
|
|
26
|
Sun L, Yang H, Cai Y, Li W, Liu G, Tang Y. In Silico Prediction of Endocrine Disrupting Chemicals Using Single-Label and Multilabel Models. J Chem Inf Model 2019; 59:973-982. [PMID: 30807141 DOI: 10.1021/acs.jcim.8b00551] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Indexed: 01/24/2023]
Abstract
Endocrine disruption (ED) has become a serious public health issue and also poses a significant threat to the ecosystem. Due to complex mechanisms of ED, traditional in silico models focusing on only one mechanism are insufficient for detection of endocrine disrupting chemicals (EDCs), let alone offering an overview of possible action mechanisms for a known EDC. To remove these limitations, in this study both single-label and multilabel models were constructed across six ED targets, namely, AR (androgen receptor), ER (estrogen receptor alpha), TR (thyroid receptor), GR (glucocorticoid receptor), PPARg (peroxisome proliferator-activated receptor gamma), and aromatase. Two machine learning methods were used to build the single-label models, with multiple random under-sampling combining voting classification to overcome the challenge of data imbalance. Four methods were explored to construct the multilabel models that can predict the interaction of one EDC against multiple targets simultaneously. The single-label models of all the six targets have achieved reasonable performance with balanced accuracy (BA) values from 0.742 to 0.816. Each top single-label model was then joined to predict the multilabel test set with BA values from 0.586 to 0.711. The multilabel models could offer a significant boost over the single-label baselines with BA values for the multilabel test set from 0.659 to 0.832. Therefore, we concluded that single-label models could be employed for identification of potential EDCs, while multilabel ones are preferable for prediction of possible mechanisms of known EDCs.
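The evaluation and imbalance-handling ideas mentioned in this abstract can be sketched in a few lines. The sketch assumes binary labels coded as 0/1 and shows balanced accuracy (the mean of sensitivity and specificity), one round of random under-sampling, and majority voting across models trained on different under-sampled subsets. The function names are hypothetical, and the study's actual pipeline may differ in its details.

```python
import random


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for 0/1 labels.

    Assumes both classes occur in y_true; otherwise a division by zero occurs.
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1:
            if pred == 1:
                tp += 1
            else:
                fn += 1
        else:
            if pred == 0:
                tn += 1
            else:
                fp += 1
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def undersample_indices(y, rng):
    """Return indices of a class-balanced subset: all minority samples plus an
    equally sized random draw from the majority class."""
    positives = [i for i, label in enumerate(y) if label == 1]
    negatives = [i for i, label in enumerate(y) if label == 0]
    if len(positives) <= len(negatives):
        minority, majority = positives, negatives
    else:
        minority, majority = negatives, positives
    return sorted(minority + rng.sample(majority, len(minority)))


def majority_vote(predictions):
    """Combine 0/1 predictions from several models (one list per model).

    A sample is labelled 1 only if strictly more than half of the models say 1.
    """
    n_models = len(predictions)
    return [1 if 2 * sum(votes) > n_models else 0 for votes in zip(*predictions)]
```

In use, each model would be trained on the subset selected by a separate call to `undersample_indices` with a differently seeded `random.Random`, and the per-model predictions on the test set would be passed to `majority_vote` before scoring with `balanced_accuracy`.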
Affiliation(s)
- Lixia Sun
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
- Hongbin Yang
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
- Yingchun Cai
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
- Weihua Li
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
- Guixia Liu
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
- Yun Tang
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
|
27
|
|
28
|
Li Z, Liao B, Li Y, Liu W, Chen M, Cai L. Gene function prediction based on combining gene ontology hierarchy with multi-instance multi-label learning. RSC Adv 2018; 8:28503-28509. [PMID: 35542493 PMCID: PMC9083914 DOI: 10.1039/c8ra05122d] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 06/14/2018] [Accepted: 07/12/2018] [Indexed: 12/04/2022]
Abstract
Gene function annotation, an important part of genome annotation, is a main challenge of the post-genome era. The Human Genome Project produced whole-genome data that provide abundant biological information for the study of gene function annotation. However, to obtain useful knowledge from such a large amount of data, a potential strategy is to apply machine learning methods to mine these data and predict gene function. In this study, we improved multi-instance hierarchical clustering by using the gene ontology hierarchy to annotate gene function, combining the gene ontology hierarchy with a multi-instance multi-label learning framework. We then used the multi-label support vector machine (MLSVM) and multi-label k-nearest neighbor (MLKNN) algorithms to predict gene function. Finally, we verified our method on four yeast expression datasets. The performance in the simulated experiments showed that our method is efficient.
Affiliation(s)
- Zejun Li
  - College of Information Science and Engineering, Hunan University Changsha Hunan 410082 China
  - School of Computer and Information Science, Hunan Institute of Technology Hengyang 412002 China
- Bo Liao
  - College of Information Science and Engineering, Hunan University Changsha Hunan 410082 China
- Yun Li
  - College of Information Science and Engineering, Hunan University Changsha Hunan 410082 China
- Wenhua Liu
  - School of Computer and Information Science, Hunan Institute of Technology Hengyang 412002 China
- Min Chen
  - College of Information Science and Engineering, Hunan University Changsha Hunan 410082 China
  - School of Computer and Information Science, Hunan Institute of Technology Hengyang 412002 China
- Lijun Cai
  - College of Information Science and Engineering, Hunan University Changsha Hunan 410082 China
|
29
|
Mining features for biomedical data using clustering tree ensembles. J Biomed Inform 2018; 85:40-48. [PMID: 30012356 DOI: 10.1016/j.jbi.2018.07.012] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 11/24/2017] [Revised: 07/02/2018] [Accepted: 07/12/2018] [Indexed: 01/07/2023]
Abstract
The volume of biomedical data available to the machine learning community grows very rapidly. A reasonable question is how informative these data really are, or how discriminant the features describing the data instances are. Several biomedical datasets suffer from a lack of variance in the instance representation or, even worse, contain instances with identical features and different class labels. Indisputably, this directly affects the performance of machine learning algorithms, as well as the ability to interpret their results. In this article, we emphasize the aforementioned problem and propose a target-informed feature induction method based on tree ensemble learning. The method brings more variance into the data representation, thereby potentially increasing the predictive performance of a learner applied to the induced features. The contribution of this article is twofold. Firstly, a problem affecting the quality of biomedical data is highlighted, and secondly, a method to handle that problem is proposed. The efficiency of the presented approach is validated on multi-target prediction tasks. The obtained results indicate that the proposed approach is able to boost the discrimination between the data instances and increase the predictive performance.
|
30
|
Vidulin V, Šmuc T, Džeroski S, Supek F. The evolutionary signal in metagenome phyletic profiles predicts many gene functions. MICROBIOME 2018; 6:129. [PMID: 29991352 PMCID: PMC6040064 DOI: 10.1186/s40168-018-0506-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 09/07/2017] [Accepted: 06/19/2018] [Indexed: 06/08/2023]
Abstract
BACKGROUND The function of many genes is still not known even in model organisms. An increasing availability of microbiome DNA sequencing data provides an opportunity to infer gene function in a systematic manner. RESULTS We evaluated if the evolutionary signal contained in metagenome phyletic profiles (MPP) is predictive of a broad array of gene functions. The MPPs are an encoding of environmental DNA sequencing data that consists of relative abundances of gene families across metagenomes. We find that such MPPs can accurately predict 826 Gene Ontology functional categories, while drawing on human gut microbiomes, ocean metagenomes, and DNA sequences from various other engineered and natural environments. Overall, in this task, the MPPs are highly accurate, and moreover they provide coverage for a set of Gene Ontology terms largely complementary to standard phylogenetic profiles, derived from fully sequenced genomes. We also find that metagenomes approximated from taxon relative abundance obtained via 16S rRNA gene sequencing may provide surprisingly useful predictive models. Crucially, the MPPs derived from different types of environments can infer distinct, non-overlapping sets of gene functions and therefore complement each other. Consistently, simulations on > 5000 metagenomes indicate that the amount of data is not in itself critical for maximizing predictive accuracy, while the diversity of sampled environments appears to be the critical factor for obtaining robust models. CONCLUSIONS In past work, metagenomics has provided invaluable insight into ecology of various habitats, into diversity of microbial life and also into human health and disease mechanisms. We propose that environmental DNA sequencing additionally constitutes a useful tool to predict biological roles of genes, yielding inferences out of reach for existing comparative genomics approaches.
Affiliation(s)
- Vedrana Vidulin
  - Faculty of Information Studies, 8000 Novo Mesto, Slovenia
  - Division of Electronics, Rudjer Boskovic Institute, 10000 Zagreb, Croatia
  - Department of Knowledge Technologies, Jozef Stefan Institute, 1000 Ljubljana, Slovenia
- Tomislav Šmuc
  - Division of Electronics, Rudjer Boskovic Institute, 10000 Zagreb, Croatia
- Sašo Džeroski
  - Department of Knowledge Technologies, Jozef Stefan Institute, 1000 Ljubljana, Slovenia
- Fran Supek
  - Genome Data Science, Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, 08028 Barcelona, Spain
|
31
|
Papanikolaou Y, Tsoumakas G, Katakis I. Hierarchical partitioning of the output space in multi-label data. DATA KNOWL ENG 2018. [DOI: 10.1016/j.datak.2018.05.003] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 11/16/2022]
|
32
|
Notaro M, Schubach M, Robinson PN, Valentini G. Prediction of Human Phenotype Ontology terms by means of hierarchical ensemble methods. BMC Bioinformatics 2017; 18:449. [PMID: 29025394 PMCID: PMC5639780 DOI: 10.1186/s12859-017-1854-y] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Received: 03/31/2017] [Accepted: 10/02/2017] [Indexed: 03/12/2023]
Abstract
BACKGROUND The prediction of human gene-abnormal phenotype associations is a fundamental step toward the discovery of novel genes associated with human disorders, especially when no genes are known to be associated with a specific disease. In this context the Human Phenotype Ontology (HPO) provides a standard categorization of the abnormalities associated with human diseases. While the problem of the prediction of gene-disease associations has been widely investigated, the related problem of gene-phenotypic feature (i.e., HPO term) associations has been largely overlooked, even if for most human genes no HPO term associations are known and despite the increasing application of the HPO to relevant medical problems. Moreover most of the methods proposed in literature are not able to capture the hierarchical relationships between HPO terms, thus resulting in inconsistent and relatively inaccurate predictions. RESULTS We present two hierarchical ensemble methods that we formally prove to provide biologically consistent predictions according to the hierarchical structure of the HPO. The modular structure of the proposed methods, that consists in a "flat" learning first step and a hierarchical combination of the predictions in the second step, allows the predictions of virtually any flat learning method to be enhanced. The experimental results show that hierarchical ensemble methods are able to predict novel associations between genes and abnormal phenotypes with results that are competitive with state-of-the-art algorithms and with a significant reduction of the computational complexity. CONCLUSIONS Hierarchical ensembles are efficient computational methods that guarantee biologically meaningful predictions that obey the true path rule, and can be used as a tool to improve and make consistent the HPO terms predictions starting from virtually any flat learning method. The implementation of the proposed methods is available as an R package from the CRAN repository.
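The second, hierarchical step described above can be illustrated with a minimal top-down correction that makes flat scores obey the true path rule: a term's score may not exceed the score of any of its ancestors. This is only one simple way to obtain consistency and is not claimed to be either of the two methods the authors propose. The function name, the term identifiers, and the scores are hypothetical.

```python
def enforce_true_path(flat_scores, parents):
    """Return scores in which every term scores no higher than any ancestor.

    flat_scores: dict mapping term -> score from any flat (per-term) learner.
    parents:     dict mapping term -> list of parent terms in the ontology DAG
                 (root terms map to an empty list or are absent).
    """
    consistent = {}

    def resolve(term):
        # Parents are resolved first, so each corrected score already
        # reflects the whole chain of ancestors above the term.
        if term not in consistent:
            score = flat_scores[term]
            for parent in parents.get(term, ()):
                score = min(score, resolve(parent))
            consistent[term] = score
        return consistent[term]

    for term in flat_scores:
        resolve(term)
    return consistent


# Hypothetical four-term DAG: term "C" has two parents, "A" and "B".
toy_parents = {"root": [], "A": ["root"], "B": ["root"], "C": ["A", "B"]}
toy_flat = {"root": 0.9, "A": 0.4, "B": 0.8, "C": 0.7}
toy_consistent = enforce_true_path(toy_flat, toy_parents)
```

Because the correction only reads the flat scores, it can be placed after virtually any flat learner, which mirrors the modular two-step structure the abstract describes.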
Affiliation(s)
- Marco Notaro
  - Anacleto Lab - Dipartimento di Informatica, Università degli Studi di Milano, Via Comelico 39, Milan, 20135 Italy
- Max Schubach
  - Institute for Medical and Human Genetics, Charité - Universitätsmedizin Berlin, Augustenburger Platz 1, Berlin, 13353 Germany
  - Berlin Institute of Health (BIH), Anna-Louisa-Karsch-Str. 2, Berlin, 10178 Germany
- Peter N. Robinson
  - Institute for Medical and Human Genetics, Charité - Universitätsmedizin Berlin, Augustenburger Platz 1, Berlin, 13353 Germany
  - Max Planck Institute for Molecular Genetics, Ihnestraße 63-73, Berlin, 14195 Germany
  - The Jackson Laboratory for Genomic Medicine, 10 Discovery Dr, Farmington, 06032 CT USA
  - Institute for Systems Genomics, University of Connecticut, 10 Discovery Dr, Farmington, 06032 CT USA
- Giorgio Valentini
  - Anacleto Lab - Dipartimento di Informatica, Università degli Studi di Milano, Via Comelico 39, Milan, 20135 Italy
|
33
|
An approximated decision-theoretic algorithm for minimization of the Tversky loss under the multi-label framework. Pattern Anal Appl 2017. [DOI: 10.1007/s10044-017-0651-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
|
34
|
Weißenborn S, Walther D. Metabolic Pathway Assignment of Plant Genes based on Phylogenetic Profiling-A Feasibility Study. FRONTIERS IN PLANT SCIENCE 2017; 8:1831. [PMID: 29163570 PMCID: PMC5664361 DOI: 10.3389/fpls.2017.01831] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 06/20/2017] [Accepted: 10/10/2017] [Indexed: 05/19/2023]
Abstract
Despite many developed experimental and computational approaches, functional gene annotation remains challenging. With the rapidly growing number of sequenced genomes, the concept of phylogenetic profiling, which predicts functional links between genes that share a common co-occurrence pattern across different genomes, has gained renewed attention as it promises to annotate gene functions based on presence/absence calls alone. We applied phylogenetic profiling to the problem of metabolic pathway assignments of plant genes with a particular focus on secondary metabolism pathways. We determined phylogenetic profiles for 40,960 metabolic pathway enzyme genes with assigned EC numbers from 24 plant species based on sequence and pathway annotation data from KEGG and Ensembl Plants. For gene sequence family assignments, needed to determine the presence or absence of particular gene functions in the given plant species, we included data of all 39 species available at the Ensembl Plants database and established gene families based on pairwise sequence identities and annotation information. Aside from performing profiling comparisons, we used machine learning approaches to predict pathway associations from phylogenetic profiles alone. Selected metabolic pathways were indeed found to be composed of gene families of greater than expected phylogenetic profile similarity. This was particularly evident for primary metabolism pathways, whereas for secondary pathways, both the available annotation in different species as well as the abstraction of functional association via distinct pathways proved limiting. While phylogenetic profile similarity was generally not found to correlate with gene co-expression, direct physical interactions of proteins were reflected by a significantly increased profile similarity suggesting an application of phylogenetic profiling methods as a filtering step in the identification of protein-protein interactions. 
This feasibility study highlights the potential and challenges associated with phylogenetic profiling methods for the detection of functional relationships between genes as well as the need to enlarge the set of plant genes with proven secondary metabolism involvement as well as the limitations of distinct pathways as abstractions of relationships between genes.
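A phylogenetic profile, as used above, is a presence/absence vector of a gene family across genomes, and profile similarity can be computed directly on such vectors. The sketch below uses the Jaccard index as the similarity measure, which is an assumption made for illustration; the abstract does not state which measure the study used. The function name is hypothetical.

```python
def jaccard_similarity(profile_a, profile_b):
    """Jaccard index of two presence/absence profiles of equal length.

    Each profile holds one 0/1 entry per species. The result is the number of
    species in which both gene families are present divided by the number in
    which at least one is present (defined as 0.0 if neither occurs anywhere).
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same set of species")
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0
```

Gene families whose profiles score close to 1 co-occur across genomes, which is the signal that phylogenetic profiling interprets as a possible functional link.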
|
35
|
Fabris F, Freitas AA, Tullet JMA. An Extensive Empirical Comparison of Probabilistic Hierarchical Classifiers in Datasets of Ageing-Related Genes. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2016; 13:1045-1058. [PMID: 26661786 DOI: 10.1109/tcbb.2015.2505288] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Indexed: 06/05/2023]
Abstract
This study comprehensively evaluates the performance of five types of probabilistic hierarchical classification methods used for predicting Gene Ontology (GO) terms related to ageing. Of those tested, a new hybrid of a Local Hierarchical Classifier (LHC) and the Predictive Clustering Tree algorithm (LHC-PCT) had the best predictive accuracy results. We also tested the impact of two types of variations in most hierarchical classification algorithms, namely: (a) changing the base algorithm (we tested Naive Bayes and Support Vector Machines), and the impact of (b) using or not the Correlation based Feature Selection (CFS) algorithm in a pre-processing step. In total, we evaluated the predictive performance of 17 variations of hierarchical classifiers across 15 datasets of ageing and longevity-related genes. We conclude that the LHC-PCT algorithm ranks better across several tests (seven out of 12). In addition, we interpreted the models generated by the PCT algorithm to show how hierarchical classification algorithms can be used to extract biological insights out of the ageing-related datasets that we compiled.
|
36
|
Zhu R, Zhang Z, Li Y, Hu Z, Xin D, Qi Z, Chen Q. Discovering Numerical Differences between Animal and Plant microRNAs. PLoS One 2016; 11:e0165152. [PMID: 27768749 PMCID: PMC5074594 DOI: 10.1371/journal.pone.0165152] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 08/14/2016] [Accepted: 10/09/2016] [Indexed: 12/18/2022]
Abstract
Previous studies have confirmed that there are many differences between animal and plant microRNAs (miRNAs), and that numerical features based on sequence and structure can be used to predict the function of individual miRNAs. However, there is little research regarding numerical differences between animal and plant miRNAs, or on whether a single numerical feature or a combination of features could be used to distinguish animal and plant miRNAs. Therefore, in the current study we aimed to discover numerical features that could be used to accomplish this. We performed a large-scale analysis of 132 miRNA numerical features, and identified 17 highly significant distinguishing features. However, none of the features could independently differentiate animal and plant miRNAs clearly. Through further analysis, we found a four-feature subset that included helix number, stack number, length of pre-miRNA, and minimum free energy, and developed a logistic classifier that could distinguish animal and plant miRNAs effectively. The precision of the classifier was greater than 80%. Using this tool, we confirmed that there were universal differences between animal and plant miRNAs, and that a single feature was unable to adequately distinguish the difference. This feature set and classifier represent a valuable tool for identifying differences between animal and plant miRNAs at a molecular level.
Affiliation(s)
- Rongsheng Zhu
  - College of Science, Northeast Agricultural University, Harbin, China
- Zhanguo Zhang
  - College of Science, Northeast Agricultural University, Harbin, China
- Yang Li
  - College of Science, Northeast Agricultural University, Harbin, China
- Zhenbang Hu
  - College of Agronomy, Northeast Agricultural University, Harbin, China
- Dawei Xin
  - College of Agronomy, Northeast Agricultural University, Harbin, China
- Zhaoming Qi
  - College of Agronomy, Northeast Agricultural University, Harbin, China
- Qingshan Chen
  - College of Agronomy, Northeast Agricultural University, Harbin, China
|
37
|
Cerri R, Barros RC, de Carvalho ACPLF, Jin Y. Reduction strategies for hierarchical multi-label classification in protein function prediction. BMC Bioinformatics 2016; 17:373. [PMID: 27627880 PMCID: PMC5024469 DOI: 10.1186/s12859-016-1232-1] [Citation(s) in RCA: 52] [Impact Index Per Article: 6.5] [Received: 01/16/2016] [Accepted: 08/30/2016] [Indexed: 12/21/2022]
Abstract
BACKGROUND Hierarchical Multi-Label Classification is a classification task where the classes to be predicted are hierarchically organized. Each instance can be assigned to classes belonging to more than one path in the hierarchy. This scenario is typically found in protein function prediction, considering that each protein may perform many functions, which can be further specialized into sub-functions. We present a new hierarchical multi-label classification method based on multiple neural networks for the task of protein function prediction. A set of neural networks is incrementally trained, each being responsible for the prediction of the classes belonging to a given level. RESULTS The method proposed here is an extension of our previous work. Here we use the neural network output of a level to complement the feature vectors used as input to train the neural network in the next level. We experimentally compare this novel method with several other reduction strategies, showing that it obtains the best predictive performance. Empirical results also show that the proposed method achieves better or comparable predictive performance when compared with state-of-the-art methods for hierarchical multi-label classification in the context of protein function prediction. CONCLUSIONS The experiments showed that using the output in one level as input to the next level contributed to better classification results. We believe the method was able to learn the relationships between the protein functions during training, and this information was useful for classification. We also identified in which functional classes our method performed better.
Affiliation(s)
- Ricardo Cerri
  - Department of Computer Science, UFSCar Federal University of São Carlos, Rodovia Washington Luís, Km 235, São Carlos, 13565-905 SP Brazil
- Rodrigo C. Barros
  - Faculdade de Informática, Pontifícia Universidade Católica do Rio Grande do Sul, Av. Ipiranga, 6681, Porto Alegre, 90619-900 RS Brazil
- André C. P. L. F. de Carvalho
  - Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, Campus de São Carlos 135, São Carlos, 13566-590 SP Brazil
- Yaochu Jin
  - Department of Computer Science, University of Surrey, GU2 7XH Guildford, Surrey, United Kingdom
|
38
|
Spetale FE, Tapia E, Krsticevic F, Roda F, Bulacio P. A Factor Graph Approach to Automated GO Annotation. PLoS One 2016; 11:e0146986. [PMID: 26771463 PMCID: PMC4714749 DOI: 10.1371/journal.pone.0146986] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 09/22/2015] [Accepted: 12/23/2015] [Indexed: 12/19/2022] Open
Abstract
As the volume of genomic data grows, computational methods become essential for providing a first glimpse into gene annotations. Automated Gene Ontology (GO) annotation methods based on hierarchical ensemble classification techniques are particularly interesting when interpretability of annotation results is a main concern. In these methods, raw GO-term predictions computed by base binary classifiers are leveraged by checking the consistency of predefined GO relationships. Both formal leveraging strategies, with main focus on annotation precision, and heuristic alternatives, with main focus on scalability issues, have been described in the literature. In this contribution, a factor graph approach to the hierarchical ensemble formulation of the automated GO annotation problem is presented. In this formal framework, a core factor graph is first built based on the GO structure and then enriched to take into account the noisy nature of GO-term predictions. Hence, starting from raw GO-term predictions, an iterative message passing algorithm between nodes of the factor graph is used to compute marginal probabilities of target GO-terms. Evaluations on Saccharomyces cerevisiae, Arabidopsis thaliana and Drosophila melanogaster protein sequences from the GO Molecular Function domain showed significant improvements over competing approaches, even when protein sequences were naively characterized by their physicochemical and secondary structure properties or when loose noisy annotation datasets were considered. Based on these promising results and using Arabidopsis thaliana annotation data, we extend our approach to the identification of the most promising molecular function annotations for a set of proteins of unknown function in Solanum lycopersicum.
Affiliation(s)
- Flavio E. Spetale
  - CIFASIS-Conicet Institute, Rosario, Argentina
  - Facultad de Cs. Exactas, Ingeniería y Agrimensura, National University of Rosario, Rosario, Argentina
- Elizabeth Tapia
  - CIFASIS-Conicet Institute, Rosario, Argentina
  - Facultad de Cs. Exactas, Ingeniería y Agrimensura, National University of Rosario, Rosario, Argentina
- Flavia Krsticevic
  - CIFASIS-Conicet Institute, Rosario, Argentina
  - Facultad Regional San Nicolás, National Technological University, San Nicolás, Argentina
- Pilar Bulacio
  - CIFASIS-Conicet Institute, Rosario, Argentina
  - Facultad de Cs. Exactas, Ingeniería y Agrimensura, National University of Rosario, Rosario, Argentina
  - Facultad Regional San Nicolás, National Technological University, San Nicolás, Argentina
|
39
|
Pazos Obregón F, Papalardo C, Castro S, Guerberoff G, Cantera R. Putative synaptic genes defined from a Drosophila whole body developmental transcriptome by a machine learning approach. BMC Genomics 2015; 16:694. [PMID: 26370122 PMCID: PMC4570697 DOI: 10.1186/s12864-015-1888-3] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 02/27/2015] [Accepted: 09/01/2015] [Indexed: 12/02/2022] Open
Abstract
BACKGROUND Assembly and function of neuronal synapses require the coordinated expression of an as yet undetermined set of genes. Although roughly a thousand genes are expected to be important for this function in Drosophila melanogaster, only a few hundred of them are known so far. RESULTS In this work we trained three learning algorithms to predict a "synaptic function" for genes of Drosophila using data from a whole-body developmental transcriptome published by others. Using statistical and biological criteria to analyze and combine the predictions, we obtained a gene catalogue that is highly enriched in genes of relevance for Drosophila synapse assembly and function but not yet recognized as such. CONCLUSIONS The utility of our approach is that it reduces the number of genes to be tested through hypothesis-driven experimentation.
Affiliation(s)
- Flavio Pazos Obregón
  - Departamento de Biología del Neurodesarrollo, Instituto de Investigaciones Biológicas Clemente Estable, Avenida Italia 3318, PC 11600, Montevideo, Uruguay.
- Cecilia Papalardo
  - Instituto de Matemática y Estadística "Prof. Ing. Rafael Laguardia", Facultad de Ingeniería, Universidad de la República, Montevideo, Uruguay.
- Sebastián Castro
  - Instituto de Matemática y Estadística "Prof. Ing. Rafael Laguardia", Facultad de Ingeniería, Universidad de la República, Montevideo, Uruguay.
- Gustavo Guerberoff
  - Instituto de Matemática y Estadística "Prof. Ing. Rafael Laguardia", Facultad de Ingeniería, Universidad de la República, Montevideo, Uruguay.
- Rafael Cantera
  - Departamento de Biología del Neurodesarrollo, Instituto de Investigaciones Biológicas Clemente Estable, Avenida Italia 3318, PC 11600, Montevideo, Uruguay.
  - Zoology Department, Stockholm University, Stockholm, Sweden.
|
40
|
Golzari F, Jalili S. VR-BFDT: A variance reduction based binary fuzzy decision tree induction method for protein function prediction. J Theor Biol 2015; 377:10-24. [PMID: 25865524 DOI: 10.1016/j.jtbi.2015.03.023] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 10/23/2014] [Revised: 03/11/2015] [Accepted: 03/20/2015] [Indexed: 11/20/2022]
Abstract
In the protein function prediction (PFP) problem, the goal is to predict the function of the many well-sequenced proteins whose function is not yet precisely known. PFP is a particularly complex machine learning problem in which a protein (regarded as an instance) may have more than one function simultaneously. Furthermore, the functions (regarded as classes) are interdependent and are organized in a hierarchical structure in the form of a tree or directed acyclic graph. Decision trees are a common learning method for this problem; however, because they partition the data into sets with sharp boundaries, small changes in the attribute values of a new instance may incorrectly change its predicted label and lead to misclassification. In this paper, a Variance Reduction based Binary Fuzzy Decision Tree (VR-BFDT) algorithm is proposed to predict the functions of proteins. The algorithm fuzzifies only the decision boundaries instead of converting the numeric attributes into fuzzy linguistic terms. It can assign multiple functions to each protein simultaneously and preserves the hierarchy consistency between functional classes. It uses label variance reduction as the splitting criterion to select the best "attribute-value" at each node of the decision tree. The experimental results show that the overall performance of the proposed algorithm is promising.
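To make the splitting criterion concrete, here is a minimal sketch of label variance reduction for a crisp binary split on one numeric attribute. It assumes that label variance is the mean squared distance of the multi-label vectors from their mean vector, which is a common choice but is not stated in the abstract. The fuzzified boundaries of VR-BFDT are not modelled, and all names are hypothetical.

```python
from typing import List, Optional, Sequence

LabelVector = Sequence[float]


def label_variance(Y: Sequence[LabelVector]) -> float:
    """Mean squared Euclidean distance of the label vectors from their mean."""
    n = len(Y)
    if n == 0:
        return 0.0
    k = len(Y[0])
    mean = [sum(y[j] for y in Y) / n for j in range(k)]
    return sum(sum((y[j] - mean[j]) ** 2 for j in range(k)) for y in Y) / n


def variance_reduction(values: Sequence[float], Y: Sequence[LabelVector], threshold: float) -> float:
    """Variance of the parent node minus the size-weighted variance of the two children."""
    left = [y for v, y in zip(values, Y) if v <= threshold]
    right = [y for v, y in zip(values, Y) if v > threshold]
    n = len(Y)
    return (
        label_variance(Y)
        - len(left) / n * label_variance(left)
        - len(right) / n * label_variance(right)
    )


def best_threshold(values: Sequence[float], Y: Sequence[LabelVector]) -> Optional[float]:
    """Pick the attribute value whose crisp split reduces label variance the most."""
    candidates: List[float] = sorted(set(values))[:-1]  # the largest value leaves one side empty
    if not candidates:
        return None
    return max(candidates, key=lambda t: variance_reduction(values, Y, t))


# Toy data (assumed): one attribute, two function labels per protein.
values = [0.1, 0.2, 0.8, 0.9]
Y = [[1, 0], [1, 0], [0, 1], [0, 1]]
split = best_threshold(values, Y)
```

In this toy case the split at 0.2 separates the two label patterns perfectly, so the children have zero variance and the reduction equals the parent variance.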
Affiliation(s)
- Fahimeh Golzari
  - SCS Lab, Computer Engineering Department, Tarbiat Modares University, Tehran, Iran.
- Saeed Jalili
  - SCS Lab, Computer Engineering Department, Tarbiat Modares University, Tehran, Iran.
|
41
|
Kahanda I, Funk C, Verspoor K, Ben-Hur A. PHENOstruct: Prediction of human phenotype ontology terms using heterogeneous data sources. F1000Res 2015; 4:259. [PMID: 26834980 PMCID: PMC4722686 DOI: 10.12688/f1000research.6670.1] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.2] [Accepted: 07/06/2015] [Indexed: 01/21/2023] Open
Abstract
The human phenotype ontology (HPO) was recently developed as a standardized vocabulary for describing the phenotype abnormalities associated with human diseases. At present, only a small fraction of human protein coding genes have HPO annotations. However, researchers believe that a large portion of currently unannotated genes are related to disease phenotypes. Therefore, it is important to predict gene-HPO term associations using accurate computational methods. In this work we demonstrate the performance advantage of the structured SVM approach, which was shown to be highly effective for Gene Ontology term prediction, in comparison to several baseline methods. Furthermore, we highlight a collection of informative data sources suitable for the problem of predicting gene-HPO associations, including large scale literature mining data.
Affiliation(s)
- Indika Kahanda
  - Department of Computer Science, Colorado State University, Fort Collins, CO, 80523, USA
- Christopher Funk
  - Computational Bioscience Program, University of Colorado School of Medicine, Aurora, CO, 80045, USA
- Karin Verspoor
  - Department of Computing and Information Systems, University of Melbourne, Parkville, Victoria, 3010, Australia; Health and Biomedical Informatics Centre, University of Melbourne, Parkville, Victoria, 3010, Australia
- Asa Ben-Hur
  - Department of Computer Science, Colorado State University, Fort Collins, CO, 80523, USA
|
42
|
Brbić M, Warnecke T, Kriško A, Supek F. Global Shifts in Genome and Proteome Composition Are Very Tightly Coupled. Genome Biol Evol 2015; 7:1519-32. [PMID: 25971281 PMCID: PMC4494046 DOI: 10.1093/gbe/evv088] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Accepted: 05/09/2015] [Indexed: 02/05/2023] Open
Abstract
The amino acid composition (AAC) of proteomes differs greatly between microorganisms and is associated with the environmental niche they inhabit, suggesting that these changes may be adaptive. Similarly, the oligonucleotide composition of genomes varies and may confer advantages at the DNA/RNA level. These influences overlap in protein-coding sequences, making it difficult to gauge their relative contributions. We disentangle these effects by systematically evaluating the correspondence between intergenic nucleotide composition, where protein-level selection is absent, the AAC, and ecological parameters of 909 prokaryotes. We find that G + C content, the most frequently used measure of genomic composition, cannot capture diversity in AAC and across ecological contexts. However, di-/trinucleotide composition in intergenic DNA predicts amino acid frequencies of proteomes to the point where very little cross-species variability remains unexplained (91% of variance accounted for). Qualitatively similar results were obtained for 49 fungal genomes, where 80% of the variability in AAC could be explained by the composition of introns and intergenic regions. Upon factoring out oligonucleotide composition and phylogenetic inertia, the residual AAC is poorly predictive of the microbes' ecological preferences, in stark contrast with the original AAC. Moreover, highly expressed genes do not exhibit more prominent environment-related AAC signatures than lowly expressed genes, despite contributing more to the effective proteome. Thus, evolutionary shifts in overall AAC appear to occur almost exclusively through factors shaping the global oligonucleotide content of the genome. We discuss these results in light of contravening evidence from biophysical data and further reading frame-specific analyses that suggest that adaptation takes place at the protein level.
Affiliation(s)
- Maria Brbić
  - Division of Electronics, Rudjer Boskovic Institute, Zagreb, Croatia
  - Molecular Basis of Ageing, Mediterranean Institute for Life Sciences (MedILS), Split, Croatia
- Tobias Warnecke
  - MRC Clinical Sciences Centre, Imperial College, Hammersmith Campus, London, United Kingdom
- Anita Kriško
  - Molecular Basis of Ageing, Mediterranean Institute for Life Sciences (MedILS), Split, Croatia
- Fran Supek
  - Division of Electronics, Rudjer Boskovic Institute, Zagreb, Croatia
  - EMBL/CRG Systems Biology Unit, Centre for Genomic Regulation, Barcelona, Spain
|
43
|
Mei K, Peng J, Gao L, Zheng NN, Fan J. Hierarchical classification of large-scale patient records for automatic treatment stratification. IEEE J Biomed Health Inform 2015; 19:1234-45. [PMID: 25807574 DOI: 10.1109/jbhi.2015.2414876] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/08/2022]
Abstract
In this paper, a hierarchical learning algorithm is developed for classifying large-scale patient records, e.g., categorizing large-scale patient records into large numbers of known patient categories (i.e., thousands of known patient categories) for automatic treatment stratification. Our hierarchical learning algorithm can leverage tree structure to train more discriminative max-margin classifiers for high-level nodes and control interlevel error propagation effectively. By ruling out unlikely groups of patient categories (i.e., irrelevant high-level nodes) at an early stage, our hierarchical approach can achieve log-linear computational complexity, which is very attractive for big data applications. Our experiments on one specific medical domain have demonstrated that our hierarchical approach can achieve very competitive results on both classification accuracy and computational efficiency as compared with other state-of-the-art techniques.
|
44
|
Škunca N, Dessimoz C. Phylogenetic profiling: how much input data is enough? PLoS One 2015; 10:e0114701. [PMID: 25679783 PMCID: PMC4332489 DOI: 10.1371/journal.pone.0114701] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.1] [Received: 07/01/2014] [Accepted: 11/10/2014] [Indexed: 12/04/2022] Open
Abstract
Phylogenetic profiling is a well-established approach for predicting gene function based on patterns of gene presence and absence across species. Many recent developments have focused on methodological improvements, but relatively little is known about the effect of input data size on the quality of predictions. In this work, we ask: how many genomes and functional annotations need to be considered for phylogenetic profiling to be effective? Phylogenetic profiling generally benefits from an increased amount of input data. However, by decomposing this improvement in predictive accuracy in terms of the contribution of additional genomes and of additional annotations, we observed diminishing returns in adding more than ∼100 genomes, whereas increasing the number of annotations remained strongly beneficial throughout. We also observed that maximising phylogenetic diversity within a clade of interest improves predictive accuracy, but the effect is small compared to changes in the number of genomes under comparison. Finally, we show that these findings are supported in light of the Open World Assumption, which posits that functional annotation databases are inherently incomplete. All the tools and data used in this work are available for reuse from http://lab.dessimoz.org/14_phylprof. Scripts used to analyse the data are available on request from the authors.
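The basic mechanism of phylogenetic profiling can be illustrated in a few lines: each gene is represented by a presence/absence vector across genomes, and a query gene is matched to the annotated gene with the most similar profile. The sketch below assumes Jaccard similarity as the comparison measure; the abstract does not say which measure the authors use, and the gene names and profiles are invented for illustration.

```python
from typing import Dict, Sequence


def jaccard(a: Sequence[int], b: Sequence[int]) -> float:
    """Shared presences divided by genomes where either gene is present."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def most_similar_gene(query: Sequence[int], profiles: Dict[str, Sequence[int]]) -> str:
    """Name of the annotated gene whose profile is closest to the query profile."""
    return max(profiles, key=lambda gene: jaccard(query, profiles[gene]))


# Toy profiles (assumed): 1 = gene present in that genome, 0 = absent.
profiles = {
    "geneA": [1, 1, 0, 0, 1],
    "geneB": [0, 0, 1, 1, 0],
}
query = [1, 1, 0, 0, 0]
match = most_similar_gene(query, profiles)
```

Each added genome lengthens the profile vectors, while each added annotation enlarges the `profiles` dictionary; these are the two kinds of input data whose contributions the abstract separates.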
Affiliation(s)
- Nives Škunca
  - ETH Zürich, Department of Computer Science, Universitätstr. 19, 8092 Zürich, Switzerland
  - Swiss Institute of Bioinformatics, Universitätstr. 6, 8092 Zürich, Switzerland
  - University College London, Gower St, London WC1E 6BT, UK
- Christophe Dessimoz
  - Swiss Institute of Bioinformatics, Universitätstr. 6, 8092 Zürich, Switzerland
  - University College London, Gower St, London WC1E 6BT, UK
|
45
|
Yu G, Zhu H, Domeniconi C. Predicting protein functions using incomplete hierarchical labels. BMC Bioinformatics 2015; 16:1. [PMID: 25591917 PMCID: PMC4384381 DOI: 10.1186/s12859-014-0430-y] [Citation(s) in RCA: 83] [Impact Index Per Article: 9.2] [Received: 07/24/2014] [Accepted: 12/11/2014] [Indexed: 02/07/2023] Open
Abstract
BACKGROUND Protein function prediction aims to assign biological or biochemical functions to proteins, and it is a challenging computational problem characterized by several factors: (1) the number of function labels (annotations) is large; (2) a protein may be associated with multiple labels; (3) the function labels are structured in a hierarchy; and (4) the labels are incomplete. Current predictive models often assume that the labels of the labeled proteins are complete, i.e. no label is missing. But in real scenarios, we may be aware of only some hierarchical labels of a protein, and we may not know whether additional ones are actually present. The scenario of incomplete hierarchical labels, a challenging and practical problem, is seldom studied in protein function prediction. RESULTS In this paper, we propose an algorithm to Predict protein functions using Incomplete hierarchical LabeLs (PILL for short). PILL takes into account the hierarchical and the flat taxonomy similarity between function labels, and defines a Combined Similarity (ComSim) to measure the correlation between labels. PILL estimates the missing labels for a protein based on ComSim and the known labels of the protein, and uses a regularization to exploit the interactions between proteins for function prediction. PILL is shown to outperform other related techniques in replenishing the missing labels and in predicting the functions of completely unlabeled proteins on publicly available PPI datasets annotated with MIPS Functional Catalogue and Gene Ontology labels. CONCLUSION The empirical study shows that it is important to consider the incomplete annotation for protein function prediction. The proposed method (PILL) can serve as a valuable tool for protein function prediction using incomplete labels. The Matlab code of PILL is available upon request.
Affiliation(s)
- Guoxian Yu
  - Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China.
  - College of Computer and Information Sciences, Southwest University, Chongqing, China.
- Hailong Zhu
  - Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China.
|
46
|
Tanaka EA, Nozawa SR, Macedo AA, Baranauskas JA. A multi-label approach using binary relevance and decision trees applied to functional genomics. J Biomed Inform 2014; 54:85-95. [PMID: 25549937 DOI: 10.1016/j.jbi.2014.12.011] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 08/14/2014] [Revised: 11/18/2014] [Accepted: 12/18/2014] [Indexed: 11/28/2022]
Abstract
Many classification problems, especially in the field of bioinformatics, involve instances associated with more than one class; these are known as multi-label classification problems. In this study, we propose a new adaptation of the Binary Relevance algorithm that takes into account possible relations among labels, focusing on the interpretability of the model, not only on its performance. Experiments were conducted on functional genomic datasets to compare the performance of our approach against others commonly found in the literature. The experimental results show that our proposal has a performance comparable to that of other methods and that, at the same time, it provides an interpretable model for the multi-label problem.
Affiliation(s)
- Erica Akemi Tanaka
  - Department of Computer Science and Mathematics, University of Sao Paulo (USP), Av. Bandeirantes, 3900, Ribeirão Preto, SP 14040-901, Brazil.
- Sérgio Ricardo Nozawa
  - Dow AgroSciences (Seeds, Traits & Oils), Av. Antonio Diederichsen, 400, Ribeirão Preto, SP 14020-250, Brazil.
- Alessandra Alaniz Macedo
  - Department of Computer Science and Mathematics, University of Sao Paulo (USP), Av. Bandeirantes, 3900, Ribeirão Preto, SP 14040-901, Brazil.
- José Augusto Baranauskas
  - Department of Computer Science and Mathematics, University of Sao Paulo (USP), Av. Bandeirantes, 3900, Ribeirão Preto, SP 14040-901, Brazil.
|
47
|
Wu Q, Ye Y, Ho SS, Zhou S. Semi-supervised multi-label collective classification ensemble for functional genomics. BMC Genomics 2014; 15 Suppl 9:S17. [PMID: 25521242 PMCID: PMC4290603 DOI: 10.1186/1471-2164-15-s9-s17] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND With the rapid accumulation of proteomic and genomic datasets in terms of genome-scale features and interaction networks through high-throughput experimental techniques, manually predicting the functional properties of proteins has become increasingly cumbersome, and computational methods to automate this annotation task are urgently needed. Most approaches to predicting the functional properties of proteins require either identifying a reliable set of labeled proteins with attribute features similar to those of the unannotated proteins, or learning from a fully-labeled protein interaction network with a large amount of labeled data. However, acquiring such labels can be very difficult in practice, especially for multi-label protein function prediction problems. Learning from only a few labeled examples can lead to poor performance, as only limited supervision knowledge can be obtained from similar proteins or from connections between them. To effectively annotate proteins even when labeled data are scarce, it is important to take advantage of all data sources that are available in this problem setting, including interaction networks, attribute feature information, correlations of functional labels, and unlabeled data. RESULTS In this paper, we show that the underlying nature of predicting functional properties of proteins using various data sources of relational data is a typical collective classification (CC) problem in machine learning. The protein functional prediction task with limited annotation is then cast into a semi-supervised multi-label collective classification (SMCC) framework. As such, we propose a novel generative model based SMCC algorithm, called GM-SMCC, to effectively compute the label probability distributions of unannotated protein instances and predict their functional properties. To further boost the predictive performance, we extend the method in an ensemble manner, called EGM-SMCC, by utilizing multiple heterogeneous networks with various latent linkages constructed to explicitly model the relationships among the nodes, so as to effectively propagate the supervision knowledge from labeled to unlabeled nodes. CONCLUSION Experimental results on a yeast gene dataset predicting the functions and localization of proteins demonstrate the effectiveness of the proposed method. In the comparison, we find that the proposed algorithms perform better than the other algorithms compared.
|
48
|
Newby D, Freitas AA, Ghafourian T. Comparing multilabel classification methods for provisional biopharmaceutics class prediction. Mol Pharm 2014; 12:87-102. [PMID: 25397721 DOI: 10.1021/mp500457t] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Indexed: 12/26/2022]
Abstract
The biopharmaceutical classification system (BCS) is now well established and utilized for the development and biowaivers of immediate oral dosage forms. The prediction of BCS class can be carried out using multilabel classification. Unlike single label classification, multilabel classification methods predict more than one class label at the same time. This paper compares two multilabel methods, binary relevance and classifier chain, for provisional BCS class prediction. Large data sets of permeability and solubility of drug and drug-like compounds were obtained from the literature and were used to build models using decision trees. The separate permeability and solubility models were validated, and a BCS validation set of 127 compounds where both permeability and solubility were known was used to compare the two aforementioned multilabel classification methods for provisional BCS class prediction. Overall, the results indicate that the classifier chain method, which takes into account label interactions, performed better compared to the binary relevance method. This work offers a comparison of multilabel methods and shows the potential of the classifier chain multilabel method for improved biological property predictions for use in drug discovery and development.
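The difference between the two multilabel methods compared in this abstract can be sketched as follows. Binary relevance fits one independent model per label, while a classifier chain feeds the earlier labels to the later models as extra features, which is how it accounts for label interactions. The abstract says the models were built with decision trees; the sketch instead uses a hypothetical nearest-neighbour stand-in (`nearest_label`) to stay dependency-free, and the toy data and label order (permeability first, solubility second) are assumptions.

```python
from typing import Callable, List, Sequence

Model = Callable[[Sequence[float]], int]


def nearest_label(X: Sequence[Sequence[float]], y: Sequence[int]) -> Model:
    """Hypothetical stand-in learner: return the label of the nearest training row."""
    rows = [list(x) for x in X]

    def predict(x: Sequence[float]) -> int:
        i = min(
            range(len(rows)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(rows[i], x)),
        )
        return y[i]

    return predict


def fit_binary_relevance(X, Y) -> List[Model]:
    """One independent model per label; labels never see each other."""
    return [nearest_label(X, [row[j] for row in Y]) for j in range(len(Y[0]))]


def predict_binary_relevance(models: List[Model], x) -> List[int]:
    return [m(x) for m in models]


def fit_classifier_chain(X, Y) -> List[Model]:
    """Model j is trained on the features plus the true values of labels 0..j-1."""
    models = []
    for j in range(len(Y[0])):
        Xj = [list(x) + list(row[:j]) for x, row in zip(X, Y)]
        models.append(nearest_label(Xj, [row[j] for row in Y]))
    return models


def predict_classifier_chain(models: List[Model], x) -> List[int]:
    """At prediction time, earlier predictions replace the true earlier labels."""
    preds: List[int] = []
    for m in models:
        preds.append(m(list(x) + preds))
    return preds


# Toy data (assumed): one descriptor; labels are [high permeability, high solubility].
X = [[0.0], [1.0], [2.0]]
Y = [[1, 1], [1, 0], [0, 0]]
br = predict_binary_relevance(fit_binary_relevance(X, Y), [0.2])
cc = predict_classifier_chain(fit_classifier_chain(X, Y), [0.2])
```

On this toy input both methods return the same labels; differences between them only appear on data where the second label genuinely depends on the first.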
Affiliation(s)
- Danielle Newby
  - Medway School of Pharmacy, Universities of Kent and Greenwich, Chatham, Kent, ME4 4TB, U.K.
|
49
|
Li HD, Menon R, Omenn GS, Guan Y. The emerging era of genomic data integration for analyzing splice isoform function. Trends Genet 2014; 30:340-7. [PMID: 24951248 DOI: 10.1016/j.tig.2014.05.005] [Citation(s) in RCA: 70] [Impact Index Per Article: 7.0] [Received: 03/18/2014] [Revised: 05/21/2014] [Accepted: 05/23/2014] [Indexed: 01/17/2023]
Abstract
The vast majority of multi-exon genes in humans undergo alternative splicing, which greatly increases the functional diversity of protein species. Predicting functions at the isoform level is essential to further our understanding of developmental abnormalities and cancers, which frequently exhibit aberrant splicing and dysregulation of isoform expression. However, determination of isoform function is very difficult, and efforts to predict isoform function have been limited in the functional genomics field. Deep sequencing of RNA now provides an unprecedented amount of expression data at the transcript level. We describe here emerging computational approaches that integrate such large-scale whole-transcriptome sequencing (RNA-seq) data for predicting the functions of alternatively spliced isoforms, and we discuss their applications in developmental and cancer biology. We outline future directions for isoform function prediction, emphasizing the need for heterogeneous genomic data integration and tissue-specific, dynamic isoform-level network modeling, which will allow the field to realize its full potential.
Affiliation(s)
- Hong-Dong Li
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Rajasree Menon
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Gilbert S Omenn
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA; Department of Internal Medicine, University of Michigan, Ann Arbor, Michigan, MI, USA
- Yuanfang Guan
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA; Department of Internal Medicine, University of Michigan, Ann Arbor, Michigan, MI, USA; Department of Electrical Engineering and Computer Science, Ann Arbor, MI, USA.
|
50
|
Valentini G. Hierarchical ensemble methods for protein function prediction. ISRN BIOINFORMATICS 2014; 2014:901419. [PMID: 25937954 PMCID: PMC4393075 DOI: 10.1155/2014/901419] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Received: 02/02/2014] [Accepted: 02/25/2014] [Indexed: 12/11/2022]
Abstract
Protein function prediction is a complex multiclass multilabel classification problem, characterized by multiple issues such as the incompleteness of the available annotations, the integration of multiple sources of high dimensional biomolecular data, the imbalance of several functional classes, and the difficulty of univocally determining negative examples. Moreover, the hierarchical relationships between functional classes that characterize both the Gene Ontology and FunCat taxonomies motivate the development of hierarchy-aware prediction methods, which have shown significantly better performance than hierarchy-unaware "flat" prediction methods. In this paper, we provide a comprehensive review of hierarchical methods for protein function prediction based on ensembles of learning machines. According to this general approach, a separate learning machine is trained to learn a specific functional term and then the resulting predictions are assembled in a "consensus" ensemble decision, taking into account the hierarchical relationships between classes. The main hierarchical ensemble methods proposed in the literature are discussed in the context of existing computational methods for protein function prediction, highlighting their characteristics, advantages, and limitations. Open problems of this exciting research area of computational biology are finally considered, outlining novel perspectives for future research.
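One simple way to combine per-term predictions into a hierarchy-consistent decision is a top-down correction in which no term may score higher than its parent. The sketch below assumes a tree-shaped hierarchy (each term has at most one parent) and is only one of several possible consensus rules; it is not presented as any specific method from the review, and the term names and scores are invented.

```python
from typing import Dict, Optional


def enforce_hierarchy(
    scores: Dict[str, float], parent: Dict[str, Optional[str]]
) -> Dict[str, float]:
    """Cap every term's score by the corrected score of its parent (tree assumed)."""
    corrected: Dict[str, float] = {}

    def resolve(term: str) -> float:
        if term in corrected:
            return corrected[term]
        p = parent.get(term)
        # Root terms keep their flat score; others are capped by their ancestors.
        value = scores[term] if p is None else min(scores[term], resolve(p))
        corrected[term] = value
        return value

    for term in scores:
        resolve(term)
    return corrected


# Toy flat predictions (assumed): the child outscores its parent, which is inconsistent.
scores = {"root": 0.9, "binding": 0.4, "dna_binding": 0.7}
parent = {"root": None, "binding": "root", "dna_binding": "binding"}
consistent = enforce_hierarchy(scores, parent)
```

After the correction, a protein predicted to carry a specific term is always predicted, with at least the same score, to carry every ancestor of that term.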
Affiliation(s)
- Giorgio Valentini
  - AnacletoLab-Dipartimento di Informatica, Università degli Studi di Milano, Via Comelico 39, 20135 Milano, Italy
|