1
Zhao R, Xie Z, Zhuang Y, Yu PLH. Automated Quality Evaluation of Large-Scale Benchmark Datasets for Vision-Language Tasks. Int J Neural Syst 2024; 34:2450009. [PMID: 38318751 DOI: 10.1142/s0129065724500096] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/07/2024]
Abstract
Large-scale benchmark datasets are crucial in advancing research within the computer science communities. They enable the development of more sophisticated AI models and serve as "golden" benchmarks for evaluating their performance. Thus, ensuring the quality of these datasets is of utmost importance for academic research and the progress of AI systems. For the emerging vision-language tasks, some datasets have been created and frequently used, such as Flickr30k, COCO, and NoCaps, which typically contain a large number of images paired with their ground-truth textual descriptions. In this paper, an automatic method is proposed to assess the quality of large-scale benchmark datasets designed for vision-language tasks. In particular, a new cross-modal matching model is developed, which is capable of automatically scoring the textual descriptions of visual images. Subsequently, this model is employed to evaluate the quality of vision-language datasets by automatically assigning a score to each 'ground-truth' description for every image. With good agreement between manual and automated scoring results on the datasets, our findings reveal significant disparities in the quality of the ground-truth descriptions included in the benchmark datasets. Even more surprising, it is evident that a small portion of the descriptions are unsuitable for serving as reliable ground-truth references. These discoveries emphasize the need for careful utilization of these publicly accessible benchmark databases.
Affiliation(s)
- Ruibin Zhao
- Department of Mathematics and Information Technology, The Education University of Hong Kong, Hong Kong SAR, P. R. China
- School of Computer Science and Information Engineering, Chuzhou University, Chuzhou, P. R. China
- Zhiwei Xie
- Department of Mathematics and Information Technology, The Education University of Hong Kong, Hong Kong SAR, P. R. China
- Yipeng Zhuang
- Department of Mathematics and Information Technology, The Education University of Hong Kong, Hong Kong SAR, P. R. China
- Philip L H Yu
- Department of Mathematics and Information Technology, The Education University of Hong Kong, Hong Kong SAR, P. R. China
2
Majidian S, Agustinho DP, Chin CS, Sedlazeck FJ, Mahmoud M. Genomic variant benchmark: if you cannot measure it, you cannot improve it. Genome Biol 2023; 24:221. [PMID: 37798733 PMCID: PMC10552390 DOI: 10.1186/s13059-023-03061-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 11/28/2022] [Accepted: 09/18/2023] [Indexed: 10/07/2023] Open
Abstract
Genomic benchmark datasets are essential to driving the field of genomics and bioinformatics. They provide a snapshot of the performances of sequencing technologies and analytical methods and highlight future challenges. However, they depend on sequencing technology, reference genome, and available benchmarking methods. Thus, creating a genomic benchmark dataset is laborious and highly challenging, often involving multiple sequencing technologies, different variant calling tools, and laborious manual curation. In this review, we discuss the available benchmark datasets and their utility. Additionally, we focus on the most recent benchmark of genes with medical relevance and challenging genomic complexity.
Affiliation(s)
- Sina Majidian
- Department of Computational Biology, University of Lausanne, 1015, Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
- Fritz J Sedlazeck
- Baylor College of Medicine, Human Genome Sequencing Center, Houston, TX, 77030, USA.
- Department of Computer Science, Rice University, 6100 Main Street, Houston, TX, 77005, USA.
- Medhat Mahmoud
- Baylor College of Medicine, Human Genome Sequencing Center, Houston, TX, 77030, USA.
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA.
3
Tian S, Zhan D, Yu Y, Wang Y, Liu M, Tan S, Li Y, Song L, Qin Z, Li X, Liu Y, Li Y, Ji S, Wang S, Zheng Y, He F, Qin J, Ding C. Quartet protein reference materials and datasets for multi-platform assessment of label-free proteomics. Genome Biol 2023; 24:202. [PMID: 37674236 PMCID: PMC10483797 DOI: 10.1186/s13059-023-03048-y] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 11/09/2022] [Accepted: 08/23/2023] [Indexed: 09/08/2023] Open
Abstract
BACKGROUND Quantitative proteomics is an indispensable tool in life science research. However, there is a lack of reference materials for evaluating the reproducibility of label-free liquid chromatography-tandem mass spectrometry (LC-MS/MS)-based measurements among different instruments and laboratories. RESULTS Here, we develop the Quartet standard as a proteome reference material with built-in truths, and distribute the same aliquots to 15 laboratories with nine conventional LC-MS/MS platforms across six cities in China. Relative abundance of over 12,000 proteins on 816 mass spectrometry files are obtained and compared for reproducibility among the instruments and laboratories to ultimately generate proteomics benchmark datasets. There is a wide dynamic range of proteomes spanning about 7 orders of magnitude, and the injection order has marked effects on quantitative instead of qualitative characteristics. CONCLUSION Overall, the Quartet offers valuable standard materials and data resources for improving the quality control of proteomic analyses as well as the reproducibility and reliability of research findings.
Affiliation(s)
- Sha Tian
- State Key Laboratory of Genetic Engineering and Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Institutes of Biomedical Sciences, Human Phenome Institute, Zhongshan Hospital, Fudan University, Shanghai, 200433, China
- Dongdong Zhan
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing, 102206, China
- Ying Yu
- State Key Laboratory of Genetic Engineering and Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Institutes of Biomedical Sciences, Human Phenome Institute, Zhongshan Hospital, Fudan University, Shanghai, 200433, China
- Yunzhi Wang
- State Key Laboratory of Genetic Engineering and Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Institutes of Biomedical Sciences, Human Phenome Institute, Zhongshan Hospital, Fudan University, Shanghai, 200433, China
- Mingwei Liu
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing, 102206, China
- Subei Tan
- State Key Laboratory of Genetic Engineering and Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Institutes of Biomedical Sciences, Human Phenome Institute, Zhongshan Hospital, Fudan University, Shanghai, 200433, China
- Yan Li
- State Key Laboratory of Genetic Engineering and Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Institutes of Biomedical Sciences, Human Phenome Institute, Zhongshan Hospital, Fudan University, Shanghai, 200433, China
- Lei Song
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing, 102206, China
- Zhaoyu Qin
- State Key Laboratory of Genetic Engineering and Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Institutes of Biomedical Sciences, Human Phenome Institute, Zhongshan Hospital, Fudan University, Shanghai, 200433, China
- Xianju Li
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing, 102206, China
- Yang Liu
- State Key Laboratory of Genetic Engineering and Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Institutes of Biomedical Sciences, Human Phenome Institute, Zhongshan Hospital, Fudan University, Shanghai, 200433, China
- Yao Li
- State Key Laboratory of Genetic Engineering and Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Institutes of Biomedical Sciences, Human Phenome Institute, Zhongshan Hospital, Fudan University, Shanghai, 200433, China
- Shuhui Ji
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing, 102206, China
- Shanshan Wang
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing, 102206, China
- Yuanting Zheng
- State Key Laboratory of Genetic Engineering and Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Institutes of Biomedical Sciences, Human Phenome Institute, Zhongshan Hospital, Fudan University, Shanghai, 200433, China.
- Fuchu He
- State Key Laboratory of Genetic Engineering and Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Institutes of Biomedical Sciences, Human Phenome Institute, Zhongshan Hospital, Fudan University, Shanghai, 200433, China.
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing, 102206, China.
- Jun Qin
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing, 102206, China.
- Chen Ding
- State Key Laboratory of Genetic Engineering and Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Institutes of Biomedical Sciences, Human Phenome Institute, Zhongshan Hospital, Fudan University, Shanghai, 200433, China.
4
Nader N, El-Gamal FEZ, El-Sappagh S, Kwak KS, Elmogy M. Kinship verification and recognition based on handcrafted and deep learning feature-based techniques. PeerJ Comput Sci 2021; 7:e735. [PMID: 34977344 PMCID: PMC8670373 DOI: 10.7717/peerj-cs.735] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/31/2021] [Accepted: 09/13/2021] [Indexed: 06/14/2023]
Abstract
BACKGROUND AND OBJECTIVES Kinship verification and recognition (KVR) is a machine's ability to identify the genetic and blood relationship, and its degree, between people from their facial images. The face is used because it is one of the most significant ways to recognize each other. Automatic KVR is an interesting area for investigation. It greatly affects real-world applications, such as searching for lost family members, forensics, and historical and genealogical studies. This paper presents a comprehensive survey that describes KVR applications and kinship types. It presents a literature review of current studies, starting from handcrafted techniques, passing through shallow metric learning, and ending with deep learning feature-based techniques. Furthermore, the most commonly used kinship datasets are discussed, which in turn opens the way for future research directions in this field. Also, the KVR limitations are discussed, such as insufficient illumination, noise, occlusion, and age variation problems. Finally, future research directions are presented, such as age and gender variation problems. METHODS We applied a literature survey methodology to retrieve data from academic databases. Inclusion and exclusion criteria were set. Three stages were followed to select articles. Finally, the main KVR stages, along with the main methods in each stage, were presented. We believe that surveys can help researchers easily detect areas that require more development and investigation. RESULTS It was found that handcrafted, metric learning, and deep learning techniques were widely utilized for the kinship verification and recognition problem using facial images. CONCLUSIONS Despite the scientific efforts that aim to address this hot research topic, many future research areas require investigation, such as age and gender variation. In the end, the presented survey makes it easier for researchers to identify the new areas that require more investigation and research.
Affiliation(s)
- Nermeen Nader
- Information Technology Department, Faculty of Computers and Information, Mansoura University, Mansoura, Egypt
- Fatma El-Zahraa El-Gamal
- Information Technology Department, Faculty of Computers and Information, Mansoura University, Mansoura, Egypt
- Shaker El-Sappagh
- Information Systems Department, Faculty of Computers and Artificial Intelligence, Benha University, Banha, Egypt
- Faculty of Computer Science and Engineering, Galala University, Suez, Egypt
- Kyung Sup Kwak
- Department of Information and Communication Engineering, Inha University, Incheon, South Korea
- Mohammed Elmogy
- Information Technology Department, Faculty of Computers and Information, Mansoura University, Mansoura, Egypt
5
Hong D, Hu J, Yao J, Chanussot J, Zhu XX. Multimodal remote sensing benchmark datasets for land cover classification with a shared and specific feature learning model. ISPRS J Photogramm Remote Sens 2021; 178:68-80. [PMID: 34433999 PMCID: PMC8336649 DOI: 10.1016/j.isprsjprs.2021.05.011] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 11/26/2020] [Revised: 05/13/2021] [Accepted: 05/17/2021] [Indexed: 06/13/2023]
Abstract
As remote sensing (RS) data obtained from different sensors become available largely and openly, multimodal data processing and analysis techniques have been garnering increasing interest in the RS and geoscience community. However, due to the gap between different modalities in terms of imaging sensors, resolutions, and contents, embedding their complementary information into a consistent, compact, accurate, and discriminative representation, to a great extent, remains challenging. To this end, we propose a shared and specific feature learning (S2FL) model. S2FL is capable of decomposing multimodal RS data into modality-shared and modality-specific components, enabling the information blending of multi-modalities more effectively, particularly for heterogeneous data sources. Moreover, to better assess multimodal baselines and the newly-proposed S2FL model, three multimodal RS benchmark datasets, i.e., Houston2013 - hyperspectral and multispectral data, Berlin - hyperspectral and synthetic aperture radar (SAR) data, Augsburg - hyperspectral, SAR, and digital surface model (DSM) data, are released and used for land cover classification. Extensive experiments conducted on the three datasets demonstrate the superiority and advancement of our S2FL model in the task of land cover classification in comparison with previously-proposed state-of-the-art baselines. Furthermore, the baseline codes and datasets used in this paper will be made available freely at https://github.com/danfenghong/ISPRS_S2FL.
Affiliation(s)
- Danfeng Hong
- Remote Sensing Technology Institute, German Aerospace Center, 82234 Wessling, Germany
- Jingliang Hu
- Data Science in Earth Observation, Technical University of Munich, 80333 Munich, Germany
- Jing Yao
- Aerospace Information Research Institute, Chinese Academy of Sciences, 100094 Beijing, China
- Jocelyn Chanussot
- Aerospace Information Research Institute, Chinese Academy of Sciences, 100094 Beijing, China
- Univ. Grenoble Alpes, INRIA, CNRS, Grenoble INP, LJK, 38000 Grenoble, France
- Xiao Xiang Zhu
- Remote Sensing Technology Institute, German Aerospace Center, 82234 Wessling, Germany
- Data Science in Earth Observation, Technical University of Munich, 80333 Munich, Germany
6
Mittal H, Pandey AC, Saraswat M, Kumar S, Pal R, Modwel G. A comprehensive survey of image segmentation: clustering methods, performance parameters, and benchmark datasets. Multimed Tools Appl 2021; 81:35001-35026. [PMID: 33584121 PMCID: PMC7870780 DOI: 10.1007/s11042-021-10594-9] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 06/10/2020] [Revised: 01/07/2021] [Accepted: 01/21/2021] [Indexed: 06/12/2023]
Abstract
Image segmentation is an essential phase of computer vision in which useful information is extracted from an image, with applications ranging from finding objects while moving across a room to detecting abnormalities in a medical image. As image pixels are generally unlabelled, the commonly used approach for segmentation is clustering. This paper reviews various existing clustering-based image segmentation methods. Two main clustering methods have been surveyed, namely hierarchical and partitional clustering methods. As partitional clustering is computationally better, further study is done from the perspective of methods belonging to this class. Further, the literature divides the partitional clustering methods into three categories, namely K-means based methods, histogram-based methods, and meta-heuristic based methods. A survey of various performance parameters for the quantitative evaluation of segmentation results is also included. Further, the publicly available benchmark datasets for image segmentation are briefly described.
Affiliation(s)
- Himanshu Mittal
- Jaypee Institute of Information Technology, Noida, Uttar Pradesh India
- Mukesh Saraswat
- Jaypee Institute of Information Technology, Noida, Uttar Pradesh India
- Sumit Kumar
- Amity University, Noida, Uttar Pradesh India
- Raju Pal
- Jaypee Institute of Information Technology, Noida, Uttar Pradesh India
- Garv Modwel
- Valeo India Private Limited, Chennai, Tamil Nadu India
7
Nakano FK, Lietaert M, Vens C. Machine learning for discovering missing or wrong protein function annotations: A comparison using updated benchmark datasets. BMC Bioinformatics 2019; 20:485. [PMID: 31547800 PMCID: PMC6755698 DOI: 10.1186/s12859-019-3060-6] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 05/07/2019] [Accepted: 08/27/2019] [Indexed: 12/22/2022] Open
Abstract
BACKGROUND A massive amount of proteomic data is generated on a daily basis; nonetheless, annotating all sequences is costly and often unfeasible. As a countermeasure, machine learning methods have been used to automatically annotate new protein functions. More specifically, many studies have investigated hierarchical multi-label classification (HMC) methods to predict annotations, using the Functional Catalogue (FunCat) or Gene Ontology (GO) label hierarchies. Most of these studies employed benchmark datasets created more than a decade ago, and thus train their models on outdated information. In this work, we provide an updated version of these datasets. By querying recent versions of FunCat and GO yeast annotations, we provide 24 new datasets in total. We compare four HMC methods, providing baseline results for the new datasets. Furthermore, we also evaluate whether the predictive models are able to discover new or wrong annotations, by training them on the old data and evaluating their results against the most recent information. RESULTS The results demonstrated that the method based on predictive clustering trees, Clus-Ensemble, proposed in 2008, achieved superior results compared to more recent methods on the standard evaluation task. For the discovery of new knowledge, Clus-Ensemble performed better when discovering new annotations in the FunCat taxonomy, whereas hierarchical multi-label classification with genetic algorithm (HMC-GA), a method based on genetic algorithms, was overall superior when detecting annotations that were removed. In the GO datasets, Clus-Ensemble once again had the upper hand when discovering new annotations, while HMC-GA performed better for detecting removed annotations. However, in this evaluation, there were less significant differences among the methods. CONCLUSIONS The experiments have shown that protein function prediction is a very challenging task which should be further investigated. We believe that the baseline results associated with the updated datasets provided in this work should be considered as guidelines for future studies; nonetheless, the old versions of the datasets should not be disregarded, since other tasks in machine learning could benefit from them.
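The discovery evaluation described in this abstract (train on old annotations, then score against the most recent ones) can be pictured with plain set operations. The following is a minimal sketch under stated assumptions, not the authors' evaluation code: the label names, the `annotation_changes` helper, and the example predictions are all hypothetical.

```python
# Hypothetical illustration; not the evaluation code from the paper.

def annotation_changes(old, new):
    """Return (added, removed) labels between two annotation versions."""
    return new - old, old - new

# Made-up label sets for one protein in the old and new dataset versions.
old_labels = {"label_a", "label_b", "label_c"}
new_labels = {"label_a", "label_c", "label_d"}

added, removed = annotation_changes(old_labels, new_labels)

# Made-up output of a model trained only on the old version.
predicted = {"label_a", "label_d"}

# New annotations the model anticipated before they were curated.
discovered_new = predicted & added
# Old annotations that were later removed and that the model did not predict.
flagged_removed = removed - predicted

print(sorted(discovered_new))
print(sorted(flagged_removed))
```

In a real study these sets would be computed per protein and aggregated into a metric; the aggregation used in the paper is not stated in the abstract.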
Affiliation(s)
- Felipe Kenji Nakano
- KU Leuven, Campus KULAK - Department of Public Health and Primary Care, Etienne Sabbelaan 53, Kortrijk, 8500 Belgium
- ITEC - imec, Etienne Sabbelaan 51, Kortrijk, 8500 Belgium
- Mathias Lietaert
- Howest University of Applied Sciences, Campus Brugge Station, Rijselstraat 5, Brugge, 8200 Belgium
- Celine Vens
- KU Leuven, Campus KULAK - Department of Public Health and Primary Care, Etienne Sabbelaan 53, Kortrijk, 8500 Belgium
- ITEC - imec, Etienne Sabbelaan 51, Kortrijk, 8500 Belgium
8
Abstract
Background Benchmark datasets are essential for both method development and performance assessment. These datasets have numerous requirements, representativeness being one. In the case of variant tolerance/pathogenicity prediction, representativeness means that the dataset covers the space of variations and their effects. Results We performed the first analysis of the representativeness of variation benchmark datasets. We used statistical approaches to investigate how representative the proteins in the benchmark datasets were of the entire human protein universe. We investigated the distributions of variants in chromosomes, protein structures, CATH domains and classes, Pfam protein families, Enzyme Commission (EC) classifications and Gene Ontology annotations in 24 datasets that have been used for training and testing variant tolerance prediction methods. All the datasets were available in VariBench or VariSNP databases. We also tested whether the pathogenic variant datasets contained neutral variants defined as those that have high minor allele frequency in the ExAC database. The distributions of variants over the chromosomes and proteins varied greatly between the datasets. Conclusions None of the datasets was found to be well representative. Many of the tested datasets had quite good coverage of the different protein characteristics. Dataset size correlates with representativeness but only weakly with the performance of methods trained on them. The results imply that dataset representativeness is an important factor and should be taken into account in predictor development and testing. Electronic supplementary material The online version of this article (10.1186/s12859-018-2478-6) contains supplementary material, which is available to authorized users.
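The abstract does not name the statistical tests used, so the following is only a sketch of one common way to compare a dataset's distribution with a background distribution: a chi-square goodness-of-fit statistic. The function name, the category names, and all counts and fractions are made up for illustration.

```python
# Hypothetical illustration; the tests used in the paper are not stated
# in the abstract, and every number below is invented.

def chi_square_statistic(observed_counts, expected_fractions):
    """Chi-square goodness-of-fit statistic of counts against fractions."""
    total = sum(observed_counts.values())
    statistic = 0.0
    for category, fraction in expected_fractions.items():
        expected = total * fraction
        observed = observed_counts.get(category, 0)
        statistic += (observed - expected) ** 2 / expected
    return statistic

# Made-up share of the background protein universe in two categories.
background = {"class_1": 0.5, "class_2": 0.5}

# Made-up variant counts per category in two benchmark datasets.
balanced_dataset = {"class_1": 50, "class_2": 50}
skewed_dataset = {"class_1": 60, "class_2": 40}

print(chi_square_statistic(balanced_dataset, background))
print(chi_square_statistic(skewed_dataset, background))
```

A larger statistic means the dataset deviates more from the background; turning it into a p-value would need the chi-square distribution with the appropriate degrees of freedom.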
Affiliation(s)
- Gerard C P Schaafsma
- Protein Structure and Bioinformatics, Department of Experimental Medical Science, Lund University, BMC B13, SE-221 84, Lund, Sweden
- Mauno Vihinen
- Protein Structure and Bioinformatics, Department of Experimental Medical Science, Lund University, BMC B13, SE-221 84, Lund, Sweden.
9
Dalkiran A, Rifaioglu AS, Martin MJ, Cetin-Atalay R, Atalay V, Doğan T. ECPred: a tool for the prediction of the enzymatic functions of protein sequences based on the EC nomenclature. BMC Bioinformatics 2018; 19:334. [PMID: 30241466 PMCID: PMC6150975 DOI: 10.1186/s12859-018-2368-y] [Citation(s) in RCA: 72] [Impact Index Per Article: 12.0] [Received: 06/20/2018] [Accepted: 09/10/2018] [Indexed: 11/29/2022] Open
Abstract
BACKGROUND The automated prediction of the enzymatic functions of uncharacterized proteins is a crucial topic in bioinformatics. Although several methods and tools have been proposed to classify enzymes, most of these studies are limited to specific functional classes and levels of the Enzyme Commission (EC) number hierarchy. Besides, most of the previous methods incorporated only a single input feature type, which limits the applicability to the wide functional space. Here, we proposed a novel enzymatic function prediction tool, ECPred, based on ensemble of machine learning classifiers. RESULTS In ECPred, each EC number constituted an individual class and therefore, had an independent learning model. Enzyme vs. non-enzyme classification is incorporated into ECPred along with a hierarchical prediction approach exploiting the tree structure of the EC nomenclature. ECPred provides predictions for 858 EC numbers in total including 6 main classes, 55 subclass classes, 163 sub-subclass classes and 634 substrate classes. The proposed method is tested and compared with the state-of-the-art enzyme function prediction tools by using independent temporal hold-out and no-Pfam datasets constructed during this study. CONCLUSIONS ECPred is presented both as a stand-alone and a web based tool to provide probabilistic enzymatic function predictions (at all five levels of EC) for uncharacterized protein sequences. Also, the datasets of this study will be a valuable resource for future benchmarking studies. ECPred is available for download, together with all of the datasets used in this study, at: https://github.com/cansyl/ECPred . ECPred webserver can be accessed through http://cansyl.metu.edu.tr/ECPred.html .
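The tree structure of the EC nomenclature that the abstract refers to can be sketched as follows: each EC number has up to four dot-separated fields, and every prefix is a node that a hierarchical predictor can refine level by level. This is an illustration only, not ECPred's code; the `ec_prefixes` helper and the example EC number are assumptions introduced here, while the per-level class counts are the ones reported in the abstract.

```python
# Illustrative only; not taken from ECPred.
from typing import List


def ec_prefixes(ec_number: str) -> List[str]:
    """Return the prefixes of an EC number, from main class to full number."""
    parts = ec_number.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


# Number of classes per level, as reported in the abstract.
reported_counts = {
    "main": 6,
    "subclass": 55,
    "sub-subclass": 163,
    "substrate": 634,
}

# Walk down the hierarchy for one example EC number.
print(ec_prefixes("3.4.21.4"))

# The per-level counts add up to the stated total of 858 EC numbers.
print(sum(reported_counts.values()))
```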
Affiliation(s)
- Alperen Dalkiran
- Department of Computer Engineering, Middle East Technical University, 06800 Ankara, Turkey
- Department of Computer Engineering, Adana Science and Technology University, 01250 Adana, Turkey
- Ahmet Sureyya Rifaioglu
- Department of Computer Engineering, Middle East Technical University, 06800 Ankara, Turkey
- Department of Computer Engineering, Iskenderun Technical University, Hatay, 31200 İskenderun, Turkey
- Maria Jesus Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, CB10 1SD UK
- Rengul Cetin-Atalay
- KanSiL, Graduate School of Informatics, Middle East Technical University, 06800 Ankara, Turkey
- Graduate School of Informatics, Middle East Technical University, 06800 Ankara, Turkey
- Volkan Atalay
- Department of Computer Engineering, Middle East Technical University, 06800 Ankara, Turkey
- KanSiL, Graduate School of Informatics, Middle East Technical University, 06800 Ankara, Turkey
- Tunca Doğan
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, CB10 1SD UK
- KanSiL, Graduate School of Informatics, Middle East Technical University, 06800 Ankara, Turkey
- Graduate School of Informatics, Middle East Technical University, 06800 Ankara, Turkey
10
Timme RE, Rand H, Shumway M, Trees EK, Simmons M, Agarwala R, Davis S, Tillman GE, Defibaugh-Chavez S, Carleton HA, Klimke WA, Katz LS. Benchmark datasets for phylogenomic pipeline validation, applications for foodborne pathogen surveillance. PeerJ 2017; 5:e3893. [PMID: 29372115 PMCID: PMC5782805 DOI: 10.7717/peerj.3893] [Citation(s) in RCA: 40] [Impact Index Per Article: 5.7] [Received: 07/21/2017] [Accepted: 09/15/2017] [Indexed: 11/20/2022] Open
Abstract
Background As next generation sequence technology has advanced, there have been parallel advances in genome-scale analysis programs for determining evolutionary relationships as proxies for epidemiological relationship in public health. Most new programs skip traditional steps of ortholog determination and multi-gene alignment, instead identifying variants across a set of genomes, then summarizing results in a matrix of single-nucleotide polymorphisms or alleles for standard phylogenetic analysis. However, public health authorities need to document the performance of these methods with appropriate and comprehensive datasets so they can be validated for specific purposes, e.g., outbreak surveillance. Here we propose a set of benchmark datasets to be used for comparison and validation of phylogenomic pipelines. Methods We identified four well-documented foodborne pathogen events in which the epidemiology was concordant with routine phylogenomic analyses (reference-based SNP and wgMLST approaches). These are ideal benchmark datasets, as the trees, WGS data, and epidemiological data for each are all in agreement. We have placed these sequence data, sample metadata, and “known” phylogenetic trees in publicly-accessible databases and developed a standard descriptive spreadsheet format describing each dataset. To facilitate easy downloading of these benchmarks, we developed an automated script that uses the standard descriptive spreadsheet format. Results Our “outbreak” benchmark datasets represent the four major foodborne bacterial pathogens (Listeria monocytogenes, Salmonella enterica, Escherichia coli, and Campylobacter jejuni) and one simulated dataset where the “known tree” can be accurately called the “true tree”. The downloading script and associated table files are available on GitHub: https://github.com/WGS-standards-and-analysis/datasets. 
Discussion These five benchmark datasets will help standardize comparison of current and future phylogenomic pipelines, and facilitate important cross-institutional collaborations. Our work is part of a global effort to provide collaborative infrastructure for sequence data and analytic tools—we welcome additional benchmark datasets in our recommended format, and, if relevant, we will add these on our GitHub site. Together, these datasets, dataset format, and the underlying GitHub infrastructure present a recommended path for worldwide standardization of phylogenomic pipelines.
Affiliation(s)
- Ruth E Timme
- Center for Food Safety and Applied Nutrition, US Food and Drug Administration, College Park, MD, United States of America
- Hugh Rand
- Center for Food Safety and Applied Nutrition, US Food and Drug Administration, College Park, MD, United States of America
| | - Martin Shumway
- National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD, United States of America
| | - Eija K Trees
- Enteric Diseases Laboratory Branch, Centers for Disease Control and Prevention, Atlanta, GA, United States of America
| | - Mustafa Simmons
- Food Safety and Inspection Service, US Department of Agriculture, Athens, GA, United States of America
| | - Richa Agarwala
- National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD, United States of America
| | - Steven Davis
- Center for Food Safety and Applied Nutrition, US Food and Drug Administration, College Park, MD, United States of America
| | - Glenn E Tillman
- Food Safety and Inspection Service, US Department of Agriculture, Athens, GA, United States of America
| | - Stephanie Defibaugh-Chavez
- Food Safety and Inspection Service, US Department of Agriculture, Washington, D.C., United States of America
| | - Heather A Carleton
- Enteric Diseases Laboratory Branch, Centers for Disease Control and Prevention, Atlanta, GA, United States of America
| | - William A Klimke
- National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD, United States of America
| | - Lee S Katz
- Enteric Diseases Laboratory Branch, Centers for Disease Control and Prevention, Atlanta, GA, United States of America; Center for Food Safety, College of Agricultural and Environmental Sciences, University of Georgia, Griffin, GA, United States of America
| |
|
11
|
Bylinskii Z, DeGennaro EM, Rajalingham R, Ruda H, Zhang J, Tsotsos JK. Towards the quantitative evaluation of visual attention models. Vision Res 2015; 116:258-68. [PMID: 25951756 DOI: 10.1016/j.visres.2015.04.007] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.1] [Received: 08/01/2014] [Revised: 03/15/2015] [Accepted: 04/02/2015] [Indexed: 11/17/2022]
Abstract
Scores of visual attention models have been developed over the past several decades of research. Differences in implementation, assumptions, and evaluations have made comparison of these models very difficult. Taxonomies have been constructed in an attempt at the organization and classification of models, but are not sufficient at quantifying which classes of models are most capable of explaining available data. At the same time, a multitude of physiological and behavioral findings have been published, measuring various aspects of human and non-human primate visual attention. All of these elements highlight the need to integrate the computational models with the data by (1) operationalizing the definitions of visual attention tasks and (2) designing benchmark datasets to measure success on specific tasks, under these definitions. In this paper, we provide some examples of operationalizing and benchmarking different visual attention tasks, along with the relevant design considerations.
Affiliation(s)
- Z Bylinskii
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge 02141, USA; Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge 02141, USA.
| | - E M DeGennaro
- McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge 02141, USA
| | - R Rajalingham
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge 02141, USA
| | - H Ruda
- Computational Vision Laboratory, Department of Communication Sciences and Disorders, Northeastern University, Boston 02115, USA
| | - J Zhang
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China; Visual Attention Lab, Brigham and Women's Hospital, Cambridge, MA 02139, USA
| | - J K Tsotsos
- McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge 02141, USA; Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge 02141, USA; Electrical Engineering and Computer Science, Centre for Vision Research, York University, Toronto M3J 1P3, Canada
| |
|
12
|
Oetjen J, Veselkov K, Watrous J, McKenzie JS, Becker M, Hauberg-Lotte L, Kobarg JH, Strittmatter N, Mróz AK, Hoffmann F, Trede D, Palmer A, Schiffler S, Steinhorst K, Aichler M, Goldin R, Guntinas-Lichius O, von Eggeling F, Thiele H, Maedler K, Walch A, Maass P, Dorrestein PC, Takats Z, Alexandrov T. Benchmark datasets for 3D MALDI- and DESI-imaging mass spectrometry. Gigascience 2015; 4:20. [PMID: 25941567 PMCID: PMC4418095 DOI: 10.1186/s13742-015-0059-4] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.9] [Received: 11/05/2014] [Accepted: 04/09/2015] [Indexed: 01/16/2023] Open
Abstract
Background Three-dimensional (3D) imaging mass spectrometry (MS) is an analytical chemistry technique for the 3D molecular analysis of a tissue specimen, entire organ, or microbial colonies on an agar plate. 3D-imaging MS has unique advantages over existing 3D imaging techniques, offers novel perspectives for understanding the spatial organization of biological processes, and has growing potential to be introduced into routine use in both biology and medicine. Owing to the sheer quantity of data generated, the visualization, analysis, and interpretation of 3D imaging MS data remain a significant challenge. Bioinformatics research in this field is hampered by the lack of publicly available benchmark datasets needed to evaluate and compare algorithms.
Findings High-quality 3D imaging MS datasets from different biological systems at several labs were acquired, supplied with overview images and scripts demonstrating how to read them, and deposited into MetaboLights, an open repository for metabolomics data. 3D imaging MS data were collected from five samples using two types of 3D imaging MS. 3D matrix-assisted laser desorption/ionization imaging (MALDI) MS data were collected from murine pancreas, murine kidney, human oral squamous cell carcinoma, and interacting microbial colonies cultured in Petri dishes. 3D desorption electrospray ionization (DESI) imaging MS data were collected from a human colorectal adenocarcinoma.
Conclusions With the aim to stimulate computational research in the field of computational 3D imaging MS, selected high-quality 3D imaging MS datasets are provided that could be used by algorithm developers as benchmark datasets.
Electronic supplementary material The online version of this article (doi:10.1186/s13742-015-0059-4) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Janina Oetjen
- MALDI Imaging Lab, University of Bremen, Bremen, Germany
| | - Kirill Veselkov
- Department of Surgery and Cancer, Faculty of Medicine, Imperial College London, London, UK
| | - Jeramie Watrous
- Department of Medicine, Biomedical Research Facility II, University of California, San Diego, USA
| | - James S McKenzie
- Department of Surgery and Cancer, Faculty of Medicine, Imperial College London, London, UK
| | - Nicole Strittmatter
- Department of Surgery and Cancer, Faculty of Medicine, Imperial College London, London, UK
| | - Anna K Mróz
- Department of Surgery and Cancer, Faculty of Medicine, Imperial College London, London, UK
| | - Franziska Hoffmann
- Institute of Physical Chemistry, Friedrich-Schiller-University Jena, Jena, Germany; Department of Otorhinolaryngology, Jena University Hospital, Jena, Germany
| | - Dennis Trede
- Steinbeis Center SCiLS Research, Bremen, Germany; SCiLS GmbH, Bremen, Germany
| | - Andrew Palmer
- European Molecular Biology Laboratory, Heidelberg, Germany
| | - Michaela Aichler
- Research Unit Analytical Pathology, Institute of Pathology, Helmholtz Center Munich, Munich, Germany
| | - Robert Goldin
- Department of Medicine, Faculty of Medicine, Imperial College London, London, UK
| | - Ferdinand von Eggeling
- Institute of Physical Chemistry, Friedrich-Schiller-University Jena, Jena, Germany; Department of Otorhinolaryngology, Jena University Hospital, Jena, Germany; Leibniz Institute of Photonic Technology (IPHT), Jena, Germany; Jena Center for Soft Matter (JCSM), Friedrich-Schiller-University Jena, Jena, Germany
| | - Kathrin Maedler
- MALDI Imaging Lab, University of Bremen, Bremen, Germany; Islet Research Lab, Center for Biomolecular Interactions, University of Bremen, Bremen, Germany
| | - Axel Walch
- Research Unit Analytical Pathology, Institute of Pathology, Helmholtz Center Munich, Munich, Germany
| | - Peter Maass
- Center for Industrial Mathematics, University of Bremen, Bremen, Germany
| | - Pieter C Dorrestein
- Skaggs School of Pharmacy & Pharmaceutical Sciences, University of California, San Diego, USA
| | - Zoltan Takats
- Department of Surgery and Cancer, Faculty of Medicine, Imperial College London, London, UK
| | - Theodore Alexandrov
- Steinbeis Center SCiLS Research, Bremen, Germany; SCiLS GmbH, Bremen, Germany; European Molecular Biology Laboratory, Heidelberg, Germany; Skaggs School of Pharmacy & Pharmaceutical Sciences, University of California, San Diego, USA
| |
|