1
Mduma N, Leo J. Dataset of banana leaves and stem images for object detection, classification and segmentation: A case of Tanzania. Data Brief 2023; 49:109322. [PMID: 37441627; PMCID: PMC10333424; DOI: 10.1016/j.dib.2023.109322] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2023] [Revised: 05/28/2023] [Accepted: 06/12/2023] [Indexed: 07/15/2023] Open
Abstract
Banana is among the major crops cultivated by most smallholder farmers in Tanzania and other parts of Africa. The crop is important to both the household economy and food security because it serves as a food crop and a cash crop. Despite these benefits, the majority of smallholder farmers experience low yields, which are attributed to diseases. The most problematic diseases are Black Sigatoka and Fusarium Wilt Race 1. Black Sigatoka produces spots on banana leaves and is caused by an air-borne fungus, Pseudocercospora fijiensis, formerly known as Mycosphaerella fijiensis. Fusarium Wilt Race 1 is one of the most destructive banana diseases and is caused by a soil-borne fungus, Fusarium oxysporum f. sp. cubense (Foc). This article presents a curated dataset of banana crop images. The dataset includes images of both healthy and diseased banana leaves and stems, taken in Tanzania with smartphone cameras. With 16,092 images, it is the largest publicly accessible dataset of banana leaves and stems. The dataset can be used to develop machine learning models for early detection of banana diseases and supports a number of computer vision applications, including object detection, classification, and image segmentation. The motivation for generating this dataset is to contribute to the development of machine learning tools and to spur innovations that help address crop diseases and reduce food insecurity in Africa.
2
Preeti P, Nath SK, Arambam N, Sharma T, Choudhury PR, Choudhury A, Khanna V, Strych U, Hotez PJ, Bottazzi ME, Rawal K. Vaxi-DL: An Artificial Intelligence-Enabled Platform for Vaccine Development. Methods Mol Biol 2023; 2673:305-316. [PMID: 37258923; DOI: 10.1007/978-1-0716-3239-0_21] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/02/2023]
Abstract
Vaccine development is a long and complex process. It involves several steps, including computational studies, experimental analyses, animal model system studies, and clinical trials. This process can be accelerated by using in silico antigen screening to identify potential vaccine candidates. In this chapter, we describe a deep learning-based technique that uses 18 biological and 9154 physicochemical properties of proteins to find potential vaccine candidates. Using this technique, we developed a new web-based system, Vaxi-DL, which helped identify new vaccine candidates from bacteria, protozoa, viruses, and fungi. Vaxi-DL is available at https://vac.kamalrawal.in/vaxidl/.
Affiliation(s)
- P Preeti
  - Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India
- Swarsat Kaushik Nath
  - Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India
- Nevidita Arambam
  - Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India
- Trapti Sharma
  - Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India
- Priyanka Ray Choudhury
  - Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India
- Alakto Choudhury
  - Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India
- Vrinda Khanna
  - Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India
- Ulrich Strych
  - Department of Pediatrics, Division of Tropical Medicine, Baylor College of Medicine, Houston, TX, USA
  - Texas Children's Hospital Center for Vaccine Development, Houston, TX, USA
- Peter J Hotez
  - Department of Pediatrics, Division of Tropical Medicine, Baylor College of Medicine, Houston, TX, USA
  - Texas Children's Hospital Center for Vaccine Development, Houston, TX, USA
  - Department of Molecular Virology and Microbiology, Baylor College of Medicine, Houston, TX, USA
  - Department of Biology, Baylor University, Waco, TX, USA
- Maria Elena Bottazzi
  - Department of Pediatrics, Division of Tropical Medicine, Baylor College of Medicine, Houston, TX, USA
  - Texas Children's Hospital Center for Vaccine Development, Houston, TX, USA
  - Department of Molecular Virology and Microbiology, Baylor College of Medicine, Houston, TX, USA
  - Department of Biology, Baylor University, Waco, TX, USA
- Kamal Rawal
  - Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India
3
Rawal K, Sinha R, Nath SK, Preeti P, Kumari P, Gupta S, Sharma T, Strych U, Hotez P, Bottazzi ME. Vaxi-DL: A web-based deep learning server to identify potential vaccine candidates. Comput Biol Med 2022; 145:105401. [PMID: 35381451; DOI: 10.1016/j.compbiomed.2022.105401] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 08/24/2021] [Revised: 03/10/2022] [Accepted: 03/10/2022] [Indexed: 11/19/2022]
Abstract
The development of a new vaccine is a challenging exercise involving several steps including computational studies, experimental work, and animal studies followed by clinical studies. To accelerate the process, in silico screening is frequently used for antigen identification. Here, we present Vaxi-DL, web-based deep learning (DL) software that evaluates the potential of protein sequences to serve as vaccine target antigens. Four different DL pathogen models were trained to predict target antigens in bacteria, protozoa, fungi, and viruses that cause infectious diseases in humans. Datasets containing antigenic and non-antigenic sequences were derived from known vaccine candidates and the Protegen database. Biological and physicochemical properties were computed for the datasets using publicly available bioinformatics tools. For each of the four pathogen models, the datasets were divided into training, validation, and testing subsets and then scaled and normalised. The models were constructed using Fully Connected Layers (FCLs), hyper-tuned, and trained using the training subset. Accuracy, sensitivity, specificity, precision, recall, and AUC (Area under the Curve) were used as metrics to assess the performance of these models. The models were benchmarked using independent datasets of known target antigens against other prediction tools such as VaxiJen and Vaxign-ML. We also tested Vaxi-DL on 219 known potential vaccine candidates (PVC) from 37 different pathogens. Our tool predicted 175 PVCs correctly out of 219 sequences. We also tested Vaxi-DL on different datasets obtained from multiple resources. Our tool has demonstrated an average sensitivity of 93% and will thus be a useful tool for prioritising PVCs for preclinical studies.
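The evaluation metrics named in this abstract can be illustrated with a minimal sketch. The confusion-matrix counts below are hypothetical and are not taken from the paper; only the figure of 175 correct predictions out of 219 comes from the abstract. Note that sensitivity and recall are two names for the same quantity.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts from a binary antigen / non-antigen classifier."""
    tp: int  # antigens predicted as antigens
    fp: int  # non-antigens predicted as antigens
    tn: int  # non-antigens predicted as non-antigens
    fn: int  # antigens predicted as non-antigens

    def sensitivity(self) -> float:
        # Also called recall: fraction of true antigens that were found.
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Fraction of true non-antigens that were rejected.
        return self.tn / (self.tn + self.fp)

    def precision(self) -> float:
        # Fraction of predicted antigens that are true antigens.
        return self.tp / (self.tp + self.fp)

    def accuracy(self) -> float:
        total = self.tp + self.fp + self.tn + self.fn
        return (self.tp + self.tn) / total


# Hypothetical counts, chosen only for illustration.
toy = Confusion(tp=90, fp=20, tn=80, fn=10)

# Figure from the abstract: 175 of 219 known candidates predicted correctly.
pvc_hit_rate = 175 / 219  # roughly 0.799
```

Because all 219 sequences in that test are known candidates, the 175/219 ratio is a sensitivity-style measure; it says nothing about how the tool handles non-antigens.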
Affiliation(s)
- Kamal Rawal
  - Amity Institute of Biotechnology, Amity University, Uttar Pradesh, India
- Robin Sinha
  - Amity Institute of Biotechnology, Amity University, Uttar Pradesh, India
- P Preeti
  - Amity Institute of Biotechnology, Amity University, Uttar Pradesh, India
- Priya Kumari
  - Amity Institute of Biotechnology, Amity University, Uttar Pradesh, India
- Srijanee Gupta
  - Amity Institute of Biotechnology, Amity University, Uttar Pradesh, India
- Trapti Sharma
  - Amity Institute of Biotechnology, Amity University, Uttar Pradesh, India
- Ulrich Strych
  - Texas Children's Center for Vaccine Development, Departments of Pediatrics and Molecular Virology and Microbiology, National School of Tropical Medicine, Baylor College of Medicine, Houston, TX, USA
- Peter Hotez
  - Texas Children's Center for Vaccine Development, Departments of Pediatrics and Molecular Virology and Microbiology, National School of Tropical Medicine, Baylor College of Medicine, Houston, TX, USA
  - Department of Biology, Baylor University, Waco, TX, USA
- Maria Elena Bottazzi
  - Texas Children's Center for Vaccine Development, Departments of Pediatrics and Molecular Virology and Microbiology, National School of Tropical Medicine, Baylor College of Medicine, Houston, TX, USA
  - Department of Biology, Baylor University, Waco, TX, USA
4
Quality Matters: Biocuration Experts on the Impact of Duplication and Other Data Quality Issues in Biological Databases. Genomics Proteomics & Bioinformatics 2020; 18:91-103. [PMID: 32652120; PMCID: PMC7646089; DOI: 10.1016/j.gpb.2018.11.006] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 12/08/2017] [Revised: 10/24/2018] [Accepted: 12/14/2018] [Indexed: 11/27/2022]
5
Chen Q, Du J, Kim S, Wilbur WJ, Lu Z. Deep learning with sentence embeddings pre-trained on biomedical corpora improves the performance of finding similar sentences in electronic medical records. BMC Med Inform Decis Mak 2020; 20:73. [PMID: 32349758; PMCID: PMC7191680; DOI: 10.1186/s12911-020-1044-0] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 12/15/2022] Open
Abstract
Background: Capturing sentence semantics plays a vital role in a range of text mining applications. Despite continuous efforts to develop related datasets and models in the general domain, both datasets and models remain limited in the biomedical and clinical domains. The BioCreative/OHNLP2018 organizers made the first attempt to annotate 1068 sentence pairs from clinical notes and called for a community effort to tackle the Semantic Textual Similarity (BioCreative/OHNLP STS) challenge.
Methods: We developed models using traditional machine learning and deep learning approaches. For the post challenge, we focused on two models: the Random Forest and the Encoder Network. We applied sentence embeddings pre-trained on PubMed abstracts and MIMIC-III clinical notes and updated the Random Forest and the Encoder Network accordingly.
Results: The official results showed that our best submission was the ensemble of eight models. It achieved a Pearson correlation coefficient of 0.8328, the highest performance among 13 submissions from 4 teams. For the post challenge, the performance of both the Random Forest and the Encoder Network improved; in particular, the correlation of the Encoder Network improved by ~13%. During the challenge task, no end-to-end deep learning model performed better than machine learning models that take manually crafted features. In contrast, with the sentence embeddings pre-trained on biomedical corpora, the Encoder Network now achieves a correlation of ~0.84, which is higher than the original best model. The ensemble model that takes the improved versions of the Random Forest and the Encoder Network as inputs further increased performance to 0.8528.
Conclusions: Deep learning models with sentence embeddings pre-trained on biomedical corpora achieve the highest performance on the test set. Through error analysis, we find that end-to-end deep learning models and traditional machine learning models with manually crafted features complement each other by finding different types of sentences. We suggest that a combination of these models can better find similar sentences in practice.
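The correlation scores reported in this abstract compare model-predicted similarity against human-annotated similarity for each sentence pair. The sketch below shows how such a Pearson correlation coefficient is computed; the score lists are hypothetical and do not come from the challenge data.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Covariance and variances, all left unnormalised: the 1/n factors cancel.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical annotated and predicted similarity scores for five sentence pairs.
gold = [0.0, 1.5, 3.0, 4.5, 5.0]
predicted = [0.4, 1.0, 3.2, 4.1, 4.9]
score = pearson(gold, predicted)
```

A value near 1 means the model ranks and spaces the pairs much as the annotators did; the coefficient is insensitive to a constant offset or rescaling of the predictions.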
Affiliation(s)
- Qingyu Chen
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA
- Jingcheng Du
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA
  - School of Biomedical Informatics, UTHealth, Houston, USA
- Sun Kim
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA
- W John Wilbur
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA
- Zhiyong Lu
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA
6
Chen Q, Zhang X, Wan Y, Zobel J, Verspoor K. Search Effectiveness in Nonredundant Sequence Databases: Assessments and Solutions. J Comput Biol 2018; 26:605-617. [PMID: 30585742; DOI: 10.1089/cmb.2018.0198] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 12/23/2022] Open
Abstract
Duplicate sequence records (that is, records having similar or identical sequences) are a challenge when searching biological sequence databases. They significantly increase database search time and can lead to uninformative search results containing similar sequences. Sequence clustering methods have been used to address this issue by grouping similar sequences into clusters. These clusters form a nonredundant database consisting of representatives (one record per cluster) and members (the remaining records in a cluster). In this approach, for nonredundant database search, users search against representatives first and optionally expand search results by exploring member records from matching clusters. Existing studies used Precision and Recall to assess the search effectiveness of nonredundant databases. However, the use of Precision and Recall does not model user behavior in practice and thus may not reflect practical search effectiveness. In this study, we first propose innovative evaluation metrics to measure search effectiveness. The findings are that (1) the Precision of expanded sets is consistently lower than that of representatives, with a decrease of up to 7% at top ranks; and (2) Recall is uninformative because, for most queries, expanded sets return more records than does search of the original unclustered databases. Motivated by these findings, we propose a solution that returns a user-specified proportion of top similar records, modeled by a ranking function that aggregates sequence and annotation similarities. In experiments undertaken on UniProtKB/Swiss-Prot, the largest expert-curated protein database, we show that our method dramatically reduces the number of returned sequences, increases Precision by 3%, and does not impact effective search time.
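The representatives-first search with optional expansion described in this abstract can be sketched as follows. Everything here is a toy illustration under stated assumptions: the cluster contents, record identifiers, relevance judgments, and the equal-weight ranking function (`alpha`) are hypothetical, and the abstract does not specify how the authors aggregate sequence and annotation similarity.

```python
import math
from typing import Dict, List, Set, Tuple

# Hypothetical nonredundant database: representative id -> member ids.
clusters: Dict[str, List[str]] = {"R1": ["M1a", "M1b"], "R2": ["M2a"], "R3": []}


def expand(rep_hits: List[str], db: Dict[str, List[str]]) -> List[str]:
    """Append the members of each matching cluster after its representative."""
    expanded: List[str] = []
    for rep in rep_hits:
        expanded.append(rep)
        expanded.extend(db.get(rep, []))
    return expanded


def precision(results: List[str], relevant: Set[str]) -> float:
    """Fraction of returned records that are relevant to the query."""
    if not results:
        return 0.0
    return sum(1 for r in results if r in relevant) / len(results)


def top_members(members: List[Tuple[str, float, float]],
                proportion: float, alpha: float = 0.5) -> List[str]:
    """Keep a user-specified proportion of members, ranked by a weighted
    mix of sequence and annotation similarity (weights are an assumption)."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    ranked = sorted(members,
                    key=lambda m: alpha * m[1] + (1 - alpha) * m[2],
                    reverse=True)
    keep = math.ceil(proportion * len(ranked))
    return [record_id for record_id, _, _ in ranked[:keep]]


rep_hits = ["R1", "R2"]            # hypothetical hits against representatives
relevant = {"R1", "R2", "M1a"}     # hypothetical relevance judgments
expanded_hits = expand(rep_hits, clusters)
```

In this toy case the representatives alone have a precision of 1.0, while the expanded set drops to 0.6 because two of the added members are not relevant, which mirrors the direction of the first finding. Returning only a top proportion of members limits that dilution.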
Affiliation(s)
- Qingyu Chen
  - School of Computing and Information Systems, The University of Melbourne, Parkville, Australia
- Xiuzhen Zhang
  - School of Science, RMIT University, Melbourne, Australia
- Yu Wan
  - Department of Biochemistry and Molecular Biology, Bio21 Molecular Science and Biotechnology Institute, The University of Melbourne, Parkville, Australia
- Justin Zobel
  - School of Computing and Information Systems, The University of Melbourne, Parkville, Australia
- Karin Verspoor
  - School of Computing and Information Systems, The University of Melbourne, Parkville, Australia
7
Chen Q, Wan Y, Zhang X, Lei Y, Zobel J, Verspoor K. Comparative Analysis of Sequence Clustering Methods for Deduplication of Biological Databases. ACM Journal of Data and Information Quality 2017. [DOI: 10.1145/3131611] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 12/14/2022]
Abstract
The massive volumes of data in biological sequence databases provide a remarkable resource for large-scale biological studies. However, the underlying data quality of these resources is a critical concern. A particular challenge is duplication, in which multiple records have similar sequences, creating a high level of redundancy that impacts database storage, curation, and search. Biological database deduplication has two direct applications: for database curation, where detected duplicates are removed to improve curation efficiency, and for database search, where detected duplicate sequences may be flagged but remain available to support analysis.
Clustering methods have been widely applied to biological sequences for database deduplication. Since an exhaustive all-by-all pairwise comparison of sequences cannot scale for a high volume of data, heuristic approaches have been recruited, such as the use of simple similarity thresholds. In this article, we present a comparison between CD-HIT and UCLUST, the two best-known clustering tools for sequence database deduplication. Our contributions include a detailed assessment of the redundancy remaining after deduplication, application of standard clustering evaluation metrics to quantify the cohesion and separation of the clusters generated by each method, and a biological case study that assesses intracluster function annotation consistency to demonstrate the impact of these factors on a practical application of the sequence clustering methods. Our results show that the trade-off between efficiency and accuracy becomes acute when low threshold values are used and when cluster sizes are large. This evaluation leads to practical recommendations for users for more effective uses of the sequence clustering tools for deduplication.
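The threshold-based heuristic mentioned in this abstract can be illustrated with a toy greedy clustering routine. This is a sketch of the general greedy incremental idea, not the implementation of CD-HIT or UCLUST: the `toy_identity` function is a hypothetical stand-in that compares sequences position by position, whereas real tools rely on alignment and fast word-based filters.

```python
from typing import Dict, List


def toy_identity(a: str, b: str) -> float:
    """Hypothetical similarity: matching positions divided by shorter length."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / shorter


def greedy_cluster(seqs: List[str], threshold: float) -> Dict[str, List[str]]:
    """Greedy incremental clustering: longest sequences are processed first;
    each sequence joins the first representative it matches at or above the
    threshold, otherwise it becomes a new representative."""
    clusters: Dict[str, List[str]] = {}  # representative -> members
    for seq in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if toy_identity(rep, seq) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            # No existing representative was similar enough.
            clusters[seq] = []
    return clusters


seqs = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "ACGTAC"]
loose = greedy_cluster(seqs, threshold=0.85)   # two clusters
strict = greedy_cluster(seqs, threshold=0.90)  # three clusters
```

Each sequence is compared only against representatives, which avoids the all-by-all comparison but means members of one cluster are never compared with each other. Lowering the threshold yields fewer, larger clusters whose members may be less alike, which is consistent with the trade-off the abstract reports.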
Affiliation(s)
- Yu Wan
  - University of Melbourne, Victoria, Australia
- Yang Lei
  - University of Melbourne, Australia
8
Chen Q, Zobel J, Verspoor K. Benchmarks for measurement of duplicate detection methods in nucleotide databases. Database (Oxford) 2017; 2023:2870676. [PMID: 28334741; PMCID: PMC10755258; DOI: 10.1093/database/baw164] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 10/10/2016] [Revised: 11/17/2016] [Accepted: 11/21/2016] [Indexed: 01/01/2023]
Abstract
Duplication of information in databases is a major data quality challenge. The presence of duplicates, implying either redundancy or inconsistency, can have a range of impacts on the quality of analyses that use the data. To provide a sound basis for research on this issue in databases of nucleotide sequences, we have developed new, large-scale validated collections of duplicates, which can be used to test the effectiveness of duplicate detection methods. Previous collections were either designed primarily to test efficiency, or contained only a limited number of duplicates of limited kinds. To date, duplicate detection methods have been evaluated on separate, inconsistent benchmarks, leading to results that cannot be compared and, due to limitations of the benchmarks, of questionable generality. In this study, we present three nucleotide sequence database benchmarks, based on information drawn from a range of resources, including information derived from mapping to two data sections within the UniProt Knowledgebase (UniProtKB), UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. Each benchmark has distinct characteristics. We quantify these characteristics and argue for their complementary value in evaluation. The benchmarks collectively contain a vast number of validated biological duplicates; the largest has nearly half a billion duplicate pairs (although this is probably only a tiny fraction of the total that is present). They are also the first benchmarks targeting the primary nucleotide databases. The records include the 21 most heavily studied organisms in molecular biology research. Our quantitative analysis shows that duplicates in the different benchmarks, and in different organisms, have different characteristics. It is thus unreliable to evaluate duplicate detection methods against any single benchmark. For example, the benchmark derived from UniProtKB/Swiss-Prot mappings identifies more diverse types of duplicates, showing the importance of expert curation, but is limited to coding sequences. Overall, these benchmarks form a resource that we believe will be of great value for development and evaluation of the duplicate detection or record linkage methods that are required to help maintain these essential resources. Database URL: https://bitbucket.org/biodbqual/benchmarks.
Affiliation(s)
- Qingyu Chen
  - Department of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia
- Justin Zobel
  - Department of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia
- Karin Verspoor
  - Department of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia