1
Wang J, Xue Q, Zhang CWJ, Wong KKL, Liu Z. Explainable coronary artery disease prediction model based on AutoGluon from AutoML framework. Front Cardiovasc Med 2024; 11:1360548. [PMID: 39011494 PMCID: PMC11246996 DOI: 10.3389/fcvm.2024.1360548] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2023] [Accepted: 06/11/2024] [Indexed: 07/17/2024] Open
Abstract
Objective: This study focuses on the innovative application of Automated Machine Learning (AutoML) technology in cardiovascular medicine to construct an explainable Coronary Artery Disease (CAD) prediction model to support the clinical diagnosis of CAD. Methods: This study utilizes a combined data set of five public data sets related to CAD. An ensemble model is constructed using the AutoML open-source framework AutoGluon to evaluate the feasibility of AutoML in constructing a disease prediction model in cardiovascular medicine. The performance of the ensemble model is compared against individual baseline models. Finally, the disease prediction ensemble model is explained using SHapley Additive exPlanations (SHAP). Results: The experimental results show that the AutoGluon-based ensemble model performs better than the individual baseline models in predicting CAD. It achieved an accuracy of 0.9167 and an AUC of 0.9562 in 4-fold cross-bagging. SHAP measures the importance of each feature to the prediction of the model and explains the prediction results of the model. Conclusion: This study demonstrates the feasibility and efficacy of AutoML technology in cardiovascular medicine and highlights its potential in disease prediction. AutoML reduces the barriers to model building and significantly improves prediction accuracy. Additionally, the integration of SHAP enhances model transparency and explainability, which is critical to ensuring model credibility and widespread adoption in cardiovascular medicine.
Affiliation(s)
- Jianghong Wang
- Faculty of Information Engineering and Automation, Center for Precision Medicine, Yan'an Hospital of Kunming City & Kunming University of Science and Technology, Kunming, China
- Qiang Xue
- Faculty of Information Engineering and Automation, Center for Precision Medicine, Yan'an Hospital of Kunming City & Kunming University of Science and Technology, Kunming, China
- Chris W J Zhang
- Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK, Canada
- Zhihua Liu
- Faculty of Information Engineering and Automation, Center for Precision Medicine, Yan'an Hospital of Kunming City & Kunming University of Science and Technology, Kunming, China
- Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK, Canada
- Bayer HealthCare & Dana-Farber Cancer Institute, Harvard University, Boston, MA, United States
2
Liu Z, Zhao X. piRNAs as emerging biomarkers and physiological regulatory molecules in cardiovascular disease. Biochem Biophys Res Commun 2024; 711:149906. [PMID: 38640879 DOI: 10.1016/j.bbrc.2024.149906] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2024] [Revised: 04/01/2024] [Accepted: 04/05/2024] [Indexed: 04/21/2024]
Abstract
Cardiovascular diseases (CVD) represent one of the most considerable global health threats, owing to their high incidence and mortality rates. Despite the ongoing advancements in detection, prevention, treatment, and prognosis of CVD, which have resulted in a decline in both incidence and mortality rates, CVD remains a major public health concern. Therefore, novel diagnostic biomarkers and therapeutic interventions are imperative to minimise the risk of CVD. Non-coding RNAs (ncRNAs) have recently gained increasing attention, with PIWI-interacting RNAs (piRNAs) emerging as a class of small ncRNAs traditionally recognised for their role in silencing transposons within cells. Although the functional roles of PIWI proteins and piRNAs in human cells remain unclear, growing evidence suggests that these molecules are gradually becoming valuable biomarkers for the diagnosis and treatment of CVD. This review provides a comprehensive summary of the latest studies on piRNAs in CVD and discusses their roles in various cardiovascular settings, including myocardial hypertrophy, heart failure, myocardial infarction, and cardiac regeneration. The insights gained may offer novel perspectives for the diagnosis and treatment of CVD.
Affiliation(s)
- Zhihua Liu
- School of Basic Medical Sciences, Center for Precision Medicine, Kunming YanAn Hospital & Kunming University of Science and Technology, Kunming, China; Department of Biostatistics and Computational Biology, Bayer HealthCare, Harvard University, Boston, MA, USA.
- Xi Zhao
- School of Basic Medical Sciences, Center for Precision Medicine, Kunming YanAn Hospital & Kunming University of Science and Technology, Kunming, China
3
Ferreira LM, Sáfadi T, Ferreira JL. K-mer applied in Mycobacterium tuberculosis genome cluster analysis. Braz J Biol 2024; 84:e258258. [DOI: 10.1590/1519-6984.258258] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2021] [Accepted: 05/26/2022] [Indexed: 11/22/2022] Open
Abstract
According to studies carried out, approximately 10 million people developed tuberculosis in 2018. Of this total, 1.5 million people died from the disease. To study the behavior of the genome sequences of Mycobacterium tuberculosis (MTB), the bacterium responsible for tuberculosis (TB), an analysis was performed using k-mers (DNA word frequencies). The k values ranged from 1 to 10: because the analysis was performed on the full length of the sequences, each composed of approximately 4 million base pairs, the analysis is interrupted for k values above 10 as a consequence of the program's capacity. The aim of this work was to verify the formation of the phylogenetic tree in each k-mer analyzed. The results showed the formation of distinct groups in some k-mers analyzed, taking into account the threshold line. However, in all groups, the multidrug-resistant (MDR) and extensively drug-resistant (XDR) strains remained together and separated from the other strains.
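The k-mer (DNA word frequency) counting described in this abstract can be illustrated with a minimal Python sketch. The function name `count_kmers` and the toy sequence are assumptions made here for illustration; the abstract does not name the program the authors used. The sketch also prints 4^k, the number of possible DNA words of length k, which grows quickly and may be one reason large k values become costly.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping word of length k in a DNA sequence.

    Hypothetical helper for illustration; not the authors' program.
    """
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Toy sequence (an assumption); a real MTB genome has roughly 4 million bases.
toy = "ATGCGATGCA"
for k in range(1, 4):
    counts = count_kmers(toy, k)
    # 4 ** k is the number of distinct words that can exist for this k.
    print(k, 4 ** k, dict(counts))
```

With k = 10 there are already 4^10 = 1,048,576 possible words, so a dense frequency table per genome grows fast as k increases.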
4
Jiménez-Gaona Y, Vivanco-Galván O, Cruz D, Armijos-Carrión A, Suárez JP. Compensatory Base Changes in ITS2 Secondary Structure Alignment, Modelling, and Molecular Phylogeny: An Integrated Approach to Improve Species Delimitation in Tulasnella (Basidiomycota). J Fungi (Basel) 2023; 9:894. [PMID: 37755002 PMCID: PMC10532482 DOI: 10.3390/jof9090894] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2023] [Revised: 08/08/2023] [Accepted: 08/17/2023] [Indexed: 09/28/2023] Open
Abstract
BACKGROUND: The delimitation of species of Tulasnella has been extensively studied, mainly at the morphological (sexual and asexual states) and molecular levels, which show ambiguity between them. An integrative species concept that includes characteristics such as molecular, ecological, morphological, and other information is crucial for species delimitation in complex groups such as Tulasnella. OBJECTIVES: The aim of this study is to test evolutionary relationships using a combination of alignment-based and alignment-free distance matrices as an alternative molecular tool to traditional methods, and to consider the secondary structures and compensatory base changes (CBCs) from ITS2 (internal transcribed spacer) sequences for species delimitation in Tulasnella. METHODOLOGY: Three phylogenetic approaches were plotted: (i) alignment-based, (ii) alignment-free, and (iii) a combination of both distance matrices using the DISTATIS and pvclust libraries in R. Finally, the secondary structure consensus was modeled by Mfold, and a CBC analysis was obtained to complement the species delimitation using 4Sale. RESULTS AND CONCLUSIONS: The phylogenetic tree results showed delimited monophyletic clades in Tulasnella spp., where all 142 Tulasnella sequences were divided into two main clades A and B and assigned to seven species (T. asymmetrica, T. andina, T. eichleriana ECU6, T. eichleriana ECU4, T. pinicola, T. violea), supported by bootstrap values from 72% to 100%. From the 2D secondary structure alignment, three types of consensus models with helices and loops were obtained. Thus, T. albida belongs to type I; T. eichleriana, T. tomaculum, and T. violea belong to type II; and T. asymmetrica, T. andina, T. pinicola, and T. spp. (GER) belong to type III; each type contains four to six domains, with nine CBCs among these that corroborate different species.
Affiliation(s)
- Yuliana Jiménez-Gaona
- Departamento de Química, Universidad Técnica Particular de Loja (UTPL), San Cayetano Alto s/n, Loja 1101608, Ecuador
- Oscar Vivanco-Galván
- Departamento de Ciencias Biológicas, Universidad Técnica Particular de Loja (UTPL), San Cayetano Alto s/n, Loja 1101608, Ecuador
- Darío Cruz
- Departamento de Ciencias Biológicas, Universidad Técnica Particular de Loja (UTPL), San Cayetano Alto s/n, Loja 1101608, Ecuador
- Angelo Armijos-Carrión
- Department of Biology, Memorial University of Newfoundland, St. John’s, NL A1B 3X9, Canada
- Juan Pablo Suárez
- Departamento de Ciencias Biológicas, Universidad Técnica Particular de Loja (UTPL), San Cayetano Alto s/n, Loja 1101608, Ecuador
5
Mao H, Wang H. Resolution of deep divergence of club fungi (phylum Basidiomycota). Synth Syst Biotechnol 2019; 4:225-231. [PMID: 31890927 PMCID: PMC6926304 DOI: 10.1016/j.synbio.2019.12.001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 08/13/2019] [Revised: 11/18/2019] [Accepted: 12/04/2019] [Indexed: 11/05/2022] Open
Abstract
A long-standing question about the early evolution of club fungi (phylum Basidiomycota) is the relationship between the three major groups, Pucciniomycotina, Ustilaginomycotina and Agaricomycotina. It is unresolved whether Agaricomycotina are more closely related to Ustilaginomycotina or to Pucciniomycotina. Here we reconstructed the branching order of the three subphyla through two sources of phylogenetic signals, i.e. standard phylogenomic analysis and alignment-free phylogenetic approach. Overall, beyond congruency within the frame of standard phylogenomic analysis, our results consistently and robustly supported the early divergence of Ustilaginomycotina and a closer relationship between Agaricomycotina and Pucciniomycotina.
Affiliation(s)
- Hongliang Mao
- T-Life Research Center, Department of Physics, Fudan University, Shanghai, 200433, PR China
- Hao Wang
- T-Life Research Center, Department of Physics, Fudan University, Shanghai, 200433, PR China
6
Liu Z, Ma C, Gu J, Yu M. Potential biomarkers of acute myocardial infarction based on weighted gene co-expression network analysis. Biomed Eng Online 2019; 18:9. [PMID: 30683112 PMCID: PMC6347746 DOI: 10.1186/s12938-019-0625-6] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 01/26/2018] [Accepted: 03/01/2018] [Indexed: 12/31/2022] Open
Abstract
BACKGROUND: Acute myocardial infarction (AMI) is a common cause of mortality in developed countries. The feasibility of whole-genome gene expression analysis to identify outcome-related genes and dysregulated pathways remains unknown. Molecular markers such as BNP, CRP and other serum inflammatory markers have attracted attention. However, these biomarkers exhibit elevated levels in patients with thyroid disease, renal failure and congestive heart failure. In this study, three groups of microarray data sets (GSE66360, GSE48060, GSE29532) were collected from GEO, a total of 99, 52 and 55 samples, respectively. Weighted gene co-expression network analysis (WGCNA) was performed to obtain a classifier composed of related genes that best characterize AMI. RESULTS: Here, this study obtained three groups of microarray data sets (GSE66360, GSE48060, GSE29532) on AMI blood samples, a total of 99, 52 and 24 samples, respectively. In all, 4672 genes, 3185 genes, 3660 genes were identified in GSE66360, GSE48060, GSE60993 modules, respectively. We performed WGCNA, GO and KEGG pathway enrichment analysis on these three data sets, finding functional enrichment of the differentially expressed genes in inflammation and immune response. Transcriptome analysis was performed in AMI patients at four time points compared to CAD patients with no history of MI, to determine gene expression profiles and their possible changes during the recovery from myocardial infarction. CONCLUSIONS: The results suggested that three overlapping genes (FGFBP2, GFOD1 and MLC1) between two modules could potentially serve as gene biomarkers for the diagnosis of AMI.
Affiliation(s)
- Zhihua Liu
- Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China; Beijing Yuqiu Medical Research Institute, Beijing, 100022, China; Shenzhen Yuqiu Biological Big Data Research Institute, Shenzhen, 518033, China; Nanjing Yuqiu Biotechnology Co., Ltd., Nanjing, 210009, China
- Chenguang Ma
- Tsinghua University, Beijing, 100084, China; Beijing Yuqiu Medical Research Institute, Beijing, 100022, China; Shenzhen Yuqiu Biological Big Data Research Institute, Shenzhen, 518033, China; Nanjing Yuqiu Biotechnology Co., Ltd., Nanjing, 210009, China
- Junhua Gu
- Shenzhen Yuqiu Biological Big Data Research Institute, Shenzhen, 518033, China; Nanjing Yuqiu Biotechnology Co., Ltd., Nanjing, 210009, China; Hebei University of Technology, Tianjin, 300130, China
- Ming Yu
- Shenzhen Yuqiu Biological Big Data Research Institute, Shenzhen, 518033, China; Nanjing Yuqiu Biotechnology Co., Ltd., Nanjing, 210009, China; Hebei University of Technology, Tianjin, 300130, China
7
PGAdb-builder: A web service tool for creating pan-genome allele database for molecular fine typing. Sci Rep 2016; 6:36213. [PMID: 27824078 PMCID: PMC5099940 DOI: 10.1038/srep36213] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.9] [Received: 06/15/2016] [Accepted: 10/12/2016] [Indexed: 01/07/2023] Open
Abstract
With the advance of next generation sequencing techniques, whole genome sequencing (WGS) is expected to become the optimal method for molecular subtyping of bacterial isolates. To use WGS as a general subtyping method for disease outbreak investigation and surveillance, the layout of WGS-based typing must be comparable among laboratories. Whole genome multilocus sequence typing (wgMLST) is an approach that achieves this requirement. To apply wgMLST as a standard subtyping approach, a pan-genome allele database (PGAdb) for the population of a bacterial organism must first be established. We present a free web service tool, PGAdb-builder (http://wgmlstdb.imst.nsysu.edu.tw), for the construction of bacterial PGAdb. The effectiveness of PGAdb-builder was tested by constructing a pan-genome allele database for Salmonella enterica serovar Typhimurium, with the database being applied to create a wgMLST tree for a panel of epidemiologically well-characterized S. Typhimurium isolates. The performance of the wgMLST-based approach was as high as that of the SNP-based approach in Leekitcharoenphon’s study used for discerning among epidemiologically related and non-related isolates.
8
Franz E, Gras LM, Dallman T. Significance of whole genome sequencing for surveillance, source attribution and microbial risk assessment of foodborne pathogens. Curr Opin Food Sci 2016. [DOI: 10.1016/j.cofs.2016.04.004] [Citation(s) in RCA: 66] [Impact Index Per Article: 8.3] [Indexed: 11/24/2022]
9
Wang D, Xu J, Yu J. KGCAK: a K-mer based database for genome-wide phylogeny and complexity evaluation. Biol Direct 2015; 10:53. [PMID: 26376976 PMCID: PMC4573299 DOI: 10.1186/s13062-015-0083-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 06/05/2015] [Accepted: 09/11/2015] [Indexed: 11/28/2022] Open
Abstract
Background: The K-mer approach, treating genomic sequences as simple characters and counting the relative abundance of each string upon a fixed K, has been extensively applied to phylogeny inference for genome assembly, annotation, and comparison. Results: To meet increasing demands for comparing large genome sequences and to promote the use of the K-mer approach, we develop a versatile database, KGCAK (http://kgcak.big.ac.cn/KGCAK/), containing ~8,000 genomes that include genome sequences of diverse life forms (viruses, prokaryotes, protists, animals, and plants) and cellular organelles of eukaryotic lineages. It builds phylogeny based on genomic elements in an alignment-free fashion and provides in-depth data processing enabling users to compare the complexity of genome sequences based on K-mer distribution. Conclusion: We hope that KGCAK becomes a powerful tool for exploring relationships within and among groups of species in a tree of life based on genomic data. Reviewers: This article was reviewed by Prof Mark Ragan and Dr Yuri Wolf.
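As a rough illustration of comparing sequences by the relative abundance of each string at a fixed K, the Python sketch below builds a relative-frequency profile per sequence and measures the Euclidean distance between two profiles. The function names and the choice of Euclidean distance are assumptions for illustration only; the abstract does not state which distance measure KGCAK uses.

```python
import math
from collections import Counter
from itertools import product


def kmer_profile(sequence: str, k: int) -> list[float]:
    """Relative abundance of every possible A/C/G/T word of length k.

    Hypothetical helper; assumes the sequence contains only A, C, G, T.
    """
    seq = sequence.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence is shorter than k")
    # Fixed word order so that two profiles are directly comparable.
    words = ("".join(p) for p in product("ACGT", repeat=k))
    return [counts[w] / total for w in words]


def profile_distance(p: list[float], q: list[float]) -> float:
    """Euclidean distance between two profiles (a stand-in measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Toy sequences (assumptions); no alignment is needed to compare them.
a = kmer_profile("ATGCGATGCAATGC", 2)
b = kmer_profile("TTGCGATGGAATCC", 2)
print(profile_distance(a, b))
```

A matrix of such pairwise distances could then be passed to a standard tree-building method, which is the general idea behind alignment-free phylogeny.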
Affiliation(s)
- Dapeng Wang
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, PR China; Stem Cell Laboratory, UCL Cancer Institute, University College London, London, WC1E 6BT, UK
- Jiayue Xu
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, PR China; University of Chinese Academy of Sciences, Beijing, 100049, China
- Jun Yu
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, PR China
10
Abstract
Fifty complete Bacillus genome sequences and associated plasmids were compared using the “feature frequency profile” (FFP) method. The resulting whole-genome phylogeny supports the placement of three Bacillus species (B. thuringiensis, B. anthracis and B. cereus) as a single clade. The monophyletic status of B. anthracis was strongly supported by the analysis. FFP proved to be more effective in inferring the phylogeny of Bacillus than methods based on the analysis of single gene sequences [16S rRNA gene, GyrB (gyrase subunit B) and AroE (shikimate-5-dehydrogenase)]. The findings of FFP analysis were verified using kSNP v2 (alignment-free sequence analysis method) and Harvest suite (core genome sequence alignment method).
11
Zou Q, Hu Q, Guo M, Wang G. HAlign: Fast multiple similar DNA/RNA sequence alignment based on the centre star strategy. Bioinformatics 2015; 31:2475-81. [DOI: 10.1093/bioinformatics/btv177] [Citation(s) in RCA: 121] [Impact Index Per Article: 13.4] [Received: 02/03/2015] [Accepted: 03/23/2015] [Indexed: 12/26/2022] Open
12
Wen J, Zhang Y, Yau SS. k-mer Sparse matrix model for genetic sequence and its applications in sequence comparison. J Theor Biol 2014; 363:145-50. [DOI: 10.1016/j.jtbi.2014.08.028] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 04/06/2014] [Revised: 07/14/2014] [Accepted: 08/17/2014] [Indexed: 10/24/2022]
13
Fan L, Hui JHL, Yu ZG, Chu KH. VIP Barcoding: composition vector-based software for rapid species identification based on DNA barcoding. Mol Ecol Resour 2014; 14:871-81. [PMID: 24479510 DOI: 10.1111/1755-0998.12235] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 08/09/2013] [Revised: 01/22/2014] [Accepted: 01/24/2014] [Indexed: 12/17/2022]
Abstract
Species identification based on short sequences of DNA markers, that is, DNA barcoding, has emerged as an integral part of modern taxonomy. However, software for the analysis of large and multilocus barcoding data sets is scarce. The Basic Local Alignment Search Tool (BLAST) is currently the fastest tool capable of handling large databases (e.g. >5000 sequences), but its accuracy is a concern, and it has been criticized for its local optimization. However, the more accurate software currently available requires sequence alignment or complex calculations, which are time-consuming when dealing with large data sets during data preprocessing or during the search stage. Therefore, it is imperative to develop a practical program for both accurate and scalable species identification for DNA barcoding. In this context, we present VIP Barcoding: user-friendly software with a graphical user interface for rapid DNA barcoding. It adopts a hybrid, two-stage algorithm. First, an alignment-free composition vector (CV) method is utilized to reduce searching space by screening a reference database. The alignment-based K2P distance nearest-neighbour method is then employed to analyse the smaller data set generated in the first stage. In comparison with other software, we demonstrate that VIP Barcoding has (i) higher accuracy than Blastn and several alignment-free methods and (ii) higher scalability than alignment-based distance methods and character-based methods. These results suggest that this platform is able to deal with both large-scale and multilocus barcoding data with accuracy and can contribute to DNA barcoding for modern taxonomy. VIP Barcoding is free and available at http://msl.sls.cuhk.edu.hk/vipbarcoding/.
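The second stage described above, a K2P (Kimura two-parameter) distance nearest-neighbour search, can be sketched in Python as follows. This is only an illustration under stated assumptions: the sequences are taken to be already aligned and of equal length, the function names are hypothetical, and none of this is VIP Barcoding's own code. The K2P distance is d = -(1/2) ln((1 - 2P - Q) * sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a: str, seq_b: str) -> float:
    """Kimura two-parameter distance between two aligned sequences.

    Sites with gaps or ambiguous bases are skipped. math.log raises
    ValueError when the sequences are too divergent for the formula.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1    # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def nearest_reference(query: str, references: dict[str, str]) -> str:
    """Name of the reference with the smallest K2P distance to the query."""
    return min(references, key=lambda name: k2p_distance(query, references[name]))


# Toy reference set (an assumption); in the two-stage design this would be
# the reduced set left after the alignment-free screening step.
refs = {"species_1": "ACGTACGTAC", "species_2": "ACGTTTGTCC"}
print(nearest_reference("ACGTACGTAT", refs))
```

Running the alignment-based step only on the shortlisted references is what keeps the costlier distance calculation affordable on large databases.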
Affiliation(s)
- Long Fan
- School of Life Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
14
Evaluation of whole genome sequencing for outbreak detection of Salmonella enterica. PLoS One 2014; 9:e87991. [PMID: 24505344 PMCID: PMC3913712 DOI: 10.1371/journal.pone.0087991] [Citation(s) in RCA: 186] [Impact Index Per Article: 18.6] [Received: 10/21/2013] [Accepted: 01/02/2014] [Indexed: 11/19/2022] Open
Abstract
Salmonella enterica is a common cause of minor and large food borne outbreaks. To achieve successful and nearly ‘real-time’ monitoring and identification of outbreaks, reliable sub-typing is essential. Whole genome sequencing (WGS) shows great promise for use as a routine epidemiological typing tool. Here we evaluate WGS for typing of S. Typhimurium including different approaches for analyzing and comparing the data. A collection of 34 S. Typhimurium isolates was sequenced. This consisted of 18 isolates from six outbreaks and 16 epidemiologically unrelated background strains. In addition, 8 S. Enteritidis and 5 S. Derby were also sequenced and used for comparison. A number of different bioinformatics approaches were applied to the data, including pan-genome tree, k-mer tree, nucleotide difference tree and SNP tree. The outcome of each approach was evaluated in relation to the association of the isolates to specific outbreaks. The pan-genome tree clustered 65% of the S. Typhimurium isolates according to the pre-defined epidemiology, the k-mer tree 88%, the nucleotide difference tree 100% and the SNP tree 100% of the strains within S. Typhimurium. The resulting outcomes of the four phylogenetic analyses were also compared to PFGE, revealing that WGS typing achieved greater performance than the traditional method. In conclusion, for S. Typhimurium, SNP analysis and the nucleotide difference approach of WGS data seem to be the superior methods for epidemiological typing compared to other phylogenetic analytic approaches that may be used on WGS. These approaches were also superior to the more classical typing method, PFGE. Our study also indicates that WGS alone is insufficient to determine whether strains are related or un-related to outbreaks. This still requires the combination of epidemiological data and whole genome sequencing results.
15
Abstract
Phylogenetics and population genetics are central disciplines in evolutionary biology. Both are based on comparative data, today usually DNA sequences. These have become so plentiful that alignment-free sequence comparison is of growing importance in the race between scientists and sequencing machines. In phylogenetics, efficient distance computation is the major contribution of alignment-free methods. A distance measure should reflect the number of substitutions per site, which underlies classical alignment-based phylogeny reconstruction. Alignment-free distance measures are either based on word counts or on match lengths, and I apply examples of both approaches to simulated and real data to assess their accuracy and efficiency. While phylogeny reconstruction is based on the number of substitutions, in population genetics, the distribution of mutations along a sequence is also considered. This distribution can be explored by match lengths, thus opening the prospect of alignment-free population genomics.
16
Abstract
A plethora of biologically useful information lies obscured in the genomes of organisms. Encoded within the genome of an organism is the information about its evolutionary history. Evolutionary signals are scattered throughout the genome. Bioinformatics approaches are frequently invoked to deconstruct the evolutionary patterns underlying genomes, which are difficult to decipher using traditional laboratory experiments. However, interpreting constantly evolving genomes is a non-trivial task for bioinformaticians. Processes such as mutations, recombinations, insertions and deletions not only make genomes heterogeneous and difficult to decipher but also render direct sequence comparison less effective. Here we present a brief overview of the sequence comparison methods with a focus on recently proposed alignment-free sequence comparison methods based on Shannon information entropy. Many of these sequence comparison methods have been adapted to construct phylogenetic trees to infer relationships among organisms.
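To make the Shannon entropy idea concrete, the Python sketch below computes the entropy, in bits, of the k-mer distribution of a single sequence. It is a basic building block only: the function name is hypothetical, and the specific entropy-based comparison methods covered by the review are not detailed in this abstract, so nothing here should be read as one of those methods.

```python
import math
from collections import Counter


def kmer_entropy(sequence: str, k: int) -> float:
    """Shannon entropy (bits) of the k-mer distribution of a sequence.

    Hypothetical helper for illustration. H = -sum(p * log2(p)) over the
    observed words, where p is each word's relative frequency.
    """
    seq = sequence.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence is shorter than k")
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy sequences (assumptions): a repetitive sequence carries less
# information per word than a compositionally balanced one.
print(kmer_entropy("AAAAAAAA", 1))
print(kmer_entropy("ACGTACGT", 1))
```

For single bases the entropy ranges from 0 bits (one base only) to 2 bits (all four bases equally frequent).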
Affiliation(s)
- Mehul Jani
- University of North Texas, Denton, Texas
17
Cheng J, Zeng X, Ren G, Liu Z. CGAP: a new comprehensive platform for the comparative analysis of chloroplast genomes. BMC Bioinformatics 2013; 14:95. [PMID: 23496817 PMCID: PMC3636126 DOI: 10.1186/1471-2105-14-95] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Received: 10/22/2012] [Accepted: 02/11/2013] [Indexed: 02/06/2023] Open
Abstract
Background: The chloroplast is an essential plant organelle that contains its own independent genome. Chloroplast genomes have recently been widely used for plant phylogenetic inference. The number of complete chloroplast genomes is increasing rapidly with the development of various genome sequencing projects. However, no comprehensive platform or tool has been developed for the comparative and phylogenetic analysis of chloroplast genomes. Thus, we constructed a comprehensive platform for the comparative and phylogenetic analysis of complete chloroplast genomes, named the chloroplast genome analysis platform (CGAP). Results: CGAP is an interactive web-based platform designed for the comparative analysis of complete chloroplast genomes. CGAP integrates genome collection, visualization, content comparison, phylogeny analysis and annotation functions. CGAP implements four web servers: creating high-quality complete and regional genome maps, comparing genome features, constructing phylogenetic trees using complete genome sequences, and annotating draft chloroplast genomes submitted by users. Conclusions: Both CGAP and source code are available at http://www.herbbol.org:8000/chloroplast. CGAP will facilitate the collection, visualization, comparison and annotation of complete chloroplast genomes. Users can customize the comparative and phylogenetic analysis using their own unpublished chloroplast genomes.
Affiliation(s)
- Jinkui Cheng
- Department of Computational Biology and Bioinformatics, Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100193, China