1
Algarni S, Foley SL, Tang H, Zhao S, Gudeta DD, Khajanchi BK, Ricke SC, Han J. Development of an antimicrobial resistance plasmid transfer gene database for enteric bacteria. Frontiers in Bioinformatics 2023; 3:1279359. [PMID: 38033626 PMCID: PMC10682676 DOI: 10.3389/fbinf.2023.1279359] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/17/2023] [Accepted: 10/30/2023] [Indexed: 12/02/2023] Open
Abstract
Introduction: Type IV secretion systems (T4SSs) are integral parts of the conjugation process in enteric bacteria. These secretion systems are encoded within the transfer (tra) regions of plasmids, including those that harbor antimicrobial resistance (AMR) genes. The conjugal transfer of resistance plasmids can lead to the dissemination of AMR among bacterial populations. Methods: To facilitate the analysis of conjugation-associated genes, transfer-related genes associated with key groups of AMR plasmids were identified, extracted from GenBank, and used to generate a plasmid transfer gene dataset. The dataset is part of the Virulence and Plasmid Transfer Factor Database at FDA and serves as the foundation for computational tools that compare conjugal transfer genes. To characterize the genetic features of the transfer gene database, genes/proteins of the same name (e.g., traI/TraI) or predicted function (VirD4 ATPase homologs) were compared across the different plasmid types to assess sequence diversity. Two analysis tools, the Plasmid Transfer Factor Profile Assessment and Plasmid Transfer Factor Comparison tools, were developed to evaluate the transfer genes located on plasmids and to facilitate the comparison of plasmids from multiple sequence files. To assess the database and associated tools, plasmid and whole genome sequencing (WGS) data were extracted from GenBank and from previous WGS experiments in our lab and assessed using the analysis tools. Results: Overall, the plasmid transfer database and associated tools proved to be very useful for evaluating the different plasmid types and their association with T4SSs, and increased our understanding of how conjugative plasmids contribute to the dissemination of AMR genes.
Affiliation(s)
- Suad Algarni
  - Division of Microbiology, National Center for Toxicological Research, Food and Drug Administration, Jefferson, AR, United States
  - Cellular and Molecular Biology Graduate Program, University of Arkansas, Fayetteville, AR, United States
- Steven L. Foley
  - Division of Microbiology, National Center for Toxicological Research, Food and Drug Administration, Jefferson, AR, United States
- Hailin Tang
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, Food and Drug Administration, Jefferson, AR, United States
- Shaohua Zhao
  - Office of Applied Science, Center for Veterinary Medicine, Food and Drug Administration, Laurel, MD, United States
- Dereje D. Gudeta
  - Division of Microbiology, National Center for Toxicological Research, Food and Drug Administration, Jefferson, AR, United States
- Bijay K. Khajanchi
  - Division of Microbiology, National Center for Toxicological Research, Food and Drug Administration, Jefferson, AR, United States
- Steven C. Ricke
  - Meat Science and Animal Biologics Discovery Program, Animal and Dairy Sciences Department, University of Wisconsin, Madison, WI, United States
- Jing Han
  - Division of Microbiology, National Center for Toxicological Research, Food and Drug Administration, Jefferson, AR, United States
2
Identifying anticancer peptides by using a generalized chaos game representation. J Math Biol 2018; 78:441-463. [PMID: 30291366 DOI: 10.1007/s00285-018-1279-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 09/25/2017] [Revised: 08/01/2018] [Indexed: 10/28/2022]
Abstract
We generalize chaos game representation (CGR) to higher-dimensional spaces while maintaining its bijection, keeping the method sufficiently representative and mathematically rigorous compared to previous attempts. We first state and prove the asymptotic property of CGR and our generalized chaos game representation (GCGR) method. It follows that the dissimilarity of sequences that possess identical subsequences but at distinct positions is lowered exponentially in the length of the identical subsequence; this effect had been taking place unbeknownst to researchers. By shining a spotlight on it now, we show that the effect fundamentally supports (G)CGR as a similarity measure or feature extraction technique. We develop two feature extraction techniques: GCGR-Centroid and GCGR-Variance. We use GCGR-Centroid to analyze the similarity between protein sequences on datasets of 9 ND5, 24 TF, and 50 beta-globin proteins. We obtain results consistent with previous studies, which demonstrates the significance of the method. Finally, using support vector machines, we train an anticancer peptide prediction model with both GCGR-Centroid and GCGR-Variance and achieve significantly higher prediction performance on three well-studied anticancer peptide datasets.
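The contraction behind the asymptotic property can be illustrated with the classical two-dimensional CGR for nucleotide sequences. The sketch below is not the authors' GCGR: the corner assignment, the starting point, and the function names are assumptions made for illustration. Each step moves halfway toward the corner of the current symbol, so every shared trailing symbol halves the gap between the end points of two sequences.

```python
# Classical 2-D chaos game representation (CGR) for DNA.
# Assumption: this corner layout and the centre start point are one common
# convention; the abstract above does not specify them.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq, start=(0.5, 0.5)):
    """Return the CGR trajectory of seq, one point per symbol."""
    x, y = start
    points = []
    for base in seq:
        cx, cy = CORNERS[base]
        # Move halfway from the current point toward the symbol's corner.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def endpoint_distance(a, b):
    """Chebyshev distance between the final CGR points of two sequences."""
    (xa, ya), (xb, yb) = cgr_points(a)[-1], cgr_points(b)[-1]
    return max(abs(xa - xb), abs(ya - yb))


if __name__ == "__main__":
    shared = "GATTACA"  # identical suffix of length 7
    d = endpoint_distance("AAAA" + shared, "TTTT" + shared)
    # The gap before the shared suffix is at most 1, so d <= 2**-7.
    print(d, d <= 2 ** -len(shared))
```

Because the gap is halved once per shared symbol, two sequences with a common suffix of length k end within 2^-k of each other regardless of what precedes the suffix.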
3
Abstract
Advances in sequencing technologies have led to a rapid increase in the number and diversity of biological sequences, which has facilitated developments in sequence research. In this paper, we present a new method for analyzing protein sequence similarity. We calculated the spectral radii of the 20 amino acids (AAs) and put forward a novel 2-D graphical representation of protein sequences. To characterize protein sequences numerically, three groups of features were extracted, relating to statistical measurements, dynamics measurements, and the fluctuation complexity of the sequences. With the obtained feature vector, two models utilizing Gaussian kernel similarity and cosine similarity were built to measure the similarity between sequences. We applied our method to analyze the similarities/dissimilarities of four data sets. Both proposed models yielded consistent results, with improvements over those obtained by ClustalW analysis. The novel approach we present in this study may therefore benefit protein research in medical and scientific fields.
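The two similarity measures named in the abstract have standard textbook definitions, sketched below for arbitrary numeric feature vectors. The feature extraction itself is not reproduced here, and the function names and the bandwidth parameter `sigma` (with its default of 1.0) are assumptions for illustration rather than the authors' settings.

```python
import math


def _check_lengths(u, v):
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    _check_lengths(u, v)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def gaussian_kernel_similarity(u, v, sigma=1.0):
    """exp(-||u - v||^2 / (2 * sigma^2)); sigma is a hypothetical bandwidth."""
    _check_lengths(u, v)
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


if __name__ == "__main__":
    f1, f2 = [0.2, 1.5, 3.0], [0.1, 1.4, 3.3]  # made-up feature vectors
    print(cosine_similarity(f1, f2), gaussian_kernel_similarity(f1, f2))
```

Cosine similarity ignores vector magnitude, whereas the Gaussian kernel depends on Euclidean distance, so features on very different scales would normally be standardized before the kernel is applied.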
4
Protein Sequence Comparison Based on Physicochemical Properties and the Position-Feature Energy Matrix. Sci Rep 2017; 7:46237. [PMID: 28393857 PMCID: PMC5385872 DOI: 10.1038/srep46237] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 11/21/2016] [Accepted: 03/14/2017] [Indexed: 11/08/2022] Open
Abstract
We develop a novel position-feature-based model for protein sequences by employing physicochemical properties of the 20 amino acids and the measure of graph energy. The method emphasizes sequence order information and describes local dynamic distributions of sequences, from which one can obtain a characteristic B-vector. Afterwards, we apply the relative entropy to the B-vectors representing the sequences to measure their similarity/dissimilarity. The numerical results obtained in this study show that the proposed method leads to meaningful results compared with competitors such as Clustal W.
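Relative entropy (Kullback-Leibler divergence) between two characteristic vectors can be sketched as follows. The construction of the B-vector is not shown in the abstract, so this example assumes the vectors are non-negative and normalizes them to sum to one; the symmetrized variant and all function names are likewise illustrative assumptions.

```python
import math


def normalize(vec):
    """Scale a non-negative vector so that its entries sum to 1."""
    total = sum(vec)
    if total <= 0 or any(x < 0 for x in vec):
        raise ValueError("expected a non-negative vector with positive sum")
    return [x / total for x in vec]


def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) in nats, after normalization."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    p, q = normalize(p), normalize(q)
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # 0 * log(0 / q) is taken as 0
        if qi == 0.0:
            raise ValueError("D(p || q) is infinite when q is 0 where p > 0")
        total += pi * math.log(pi / qi)
    return total


def symmetric_relative_entropy(p, q):
    """Assumed symmetrization: D(p || q) + D(q || p)."""
    return relative_entropy(p, q) + relative_entropy(q, p)


if __name__ == "__main__":
    b1, b2 = [1.0, 1.0], [1.0, 3.0]  # made-up stand-ins for B-vectors
    print(relative_entropy(b1, b2), symmetric_relative_entropy(b1, b2))
```

Relative entropy is not symmetric, so a pairwise dissimilarity matrix built from it directly depends on argument order; the symmetrized form is one common way to avoid that.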
5
El-Lakkani A, Lashin M. An efficient method for measuring the similarity of protein sequences. SAR and QSAR in Environmental Research 2016; 27:363-370. [PMID: 27103219 DOI: 10.1080/1062936x.2016.1174735] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2016] [Accepted: 04/01/2016] [Indexed: 06/05/2023]
Abstract
An accurate numerical descriptor for protein sequences is introduced. It is basically the set of each three successive amino acids in the sequence (triplets), read from left to right, together with the distances between each two successive amino acids in the triplet, such that the sum of these distances does not exceed 8. This numerical descriptor combines two features: the amino acid composition and the position of each amino acid relative to the other nearby amino acids. This numerical descriptor is used to measure the similarity between protein sequences in three sets: NADH dehydrogenase subunit 5 (ND5) proteins of different species, 24 transferrin proteins from vertebrates and 12 proteins of baculoviruses. High correlation coefficient values between our results and the results of the ClustalW program are obtained. These values are higher than the values obtained in many other related works.
Affiliation(s)
- A El-Lakkani
  - Faculty of Science, Department of Biophysics, Cairo University, Giza, Egypt
- M Lashin
  - Faculty of Science, Department of Biophysics, Cairo University, Giza, Egypt
6
Thankachan SV, Apostolico A, Aluru S. A Provably Efficient Algorithm for the k-Mismatch Average Common Substring Problem. J Comput Biol 2016; 23:472-82. [PMID: 27058840 DOI: 10.1089/cmb.2015.0235] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Indexed: 11/13/2022] Open
Abstract
Alignment-free sequence comparison methods are attracting persistent interest, driven by data-intensive applications in genome-wide molecular taxonomy and phylogenetic reconstruction. Among all the methods based on substring composition, the average common substring (ACS) measure admits a straightforward linear-time sequence comparison algorithm, while yielding impressive results in multiple applications. An important direction of this research is to extend the approach to permit a bounded edit/Hamming distance between substrings, so as to reflect the evolutionary process more accurately. To date, however, algorithms designed to incorporate k ≥ 1 mismatches have O(n^2) worst-case time complexity, where n is the total length of the input sequences. On the other hand, accounting for mismatches has been shown to lead to much improved classification, while heuristics can improve practical performance. In this article, we close the gap by presenting the first provably efficient algorithm for the k-mismatch average common substring (ACS_k) problem that takes O(n) space and O(n log^k n) time in the worst case for any constant k. Our method extends the generalized suffix tree model to incorporate a carefully selected bounded set of perturbed suffixes, and can be applied to other complex approximate sequence matching problems.
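To make the quantity being computed concrete, the brute-force sketch below evaluates the k-mismatch average common substring length directly from its definition: for every start position in x, it finds the longest substring that matches somewhere in y with at most k mismatches, then averages those lengths. This is deliberately the naive approach, not the suffix-tree algorithm of the paper, and the function names are assumptions.

```python
def longest_match_k(x, i, y, j, k):
    """Length of the longest common prefix of x[i:] and y[j:] with <= k mismatches."""
    length = 0
    mismatches = 0
    while i + length < len(x) and j + length < len(y):
        if x[i + length] != y[j + length]:
            mismatches += 1
            if mismatches > k:
                break
        length += 1
    return length


def acs_k(x, y, k=0):
    """Average over positions i of x of the best k-mismatch match length in y."""
    if not x or not y:
        raise ValueError("both sequences must be non-empty")
    total = 0
    for i in range(len(x)):
        # Try every start position in y; this is what makes the sketch slow.
        total += max(longest_match_k(x, i, y, j, k) for j in range(len(y)))
    return total / len(x)


if __name__ == "__main__":
    print(acs_k("ACGT", "AGGT", k=0))  # exact matches only
    print(acs_k("ACGT", "AGGT", k=1))  # one mismatch allowed
```

The nested loops try every pair of start positions and then extend each match, which is why efficient algorithms for k ≥ 1 are of interest for genome-scale inputs.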
Affiliation(s)
- Alberto Apostolico
  - College of Computing, Georgia Institute of Technology, Atlanta, Georgia
- Srinivas Aluru
  - College of Computing, Georgia Institute of Technology, Atlanta, Georgia
7
El-Lakkani A, Mahran H. An efficient numerical method for protein sequences similarity analysis based on a new two-dimensional graphical representation. SAR and QSAR in Environmental Research 2015; 26:125-137. [PMID: 25650529 DOI: 10.1080/1062936x.2014.995700] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 06/04/2023]
Abstract
A new two-dimensional graphical representation of protein sequences is introduced. Twenty concentric evenly spaced circles divided by n radial lines into equal divisions are selected to represent any protein sequence of length n. Each circle represents one of the different 20 amino acids, and each radial line represents a single amino acid of the protein sequence. An efficient numerical method based on the graph is proposed to measure the similarity between two protein sequences. To prove the accuracy of our approach, the method is applied to NADH dehydrogenase subunit 5 (ND5) proteins of nine different species and 24 transferrin sequences from vertebrates. High values of correlation coefficient between our results and the results of ClustalW are obtained (approximately perfect correlations). These values are higher than the values obtained in many other related works.
Affiliation(s)
- A El-Lakkani
  - Department of Biophysics, Faculty of Science, Cairo University, Giza, Egypt
8
Graham DJ. A new bioinformatics approach to natural protein collections: permutation structure contrasts of viral and cellular systems. Protein J 2013; 32:275-87. [PMID: 23605224 DOI: 10.1007/s10930-013-9485-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/28/2022]
Abstract
Biological cells and viruses operate by different replication and symmetry paradigms. Cells are able to replicate independently and express little spatial symmetry; viruses require cells for replication while manifesting high symmetry. The author inquired whether different paradigms were reflected in the permutations of amino acid sequences. The hypothesis was that the permutation structure level and symmetry within viral protein collections exceed that of living cells. The rationale was that one symmetry aspect generally accompanies and promotes others in a system. The inquiry was readily answered given abundant sequence archives for proteins. The analysis of collections from diverse viral and cellular sources lends strong support. Additional insights into protein primary structure, the design of collections, and the role of information are provided as well.
Affiliation(s)
- Daniel J Graham
  - Department of Chemistry, Loyola University Chicago, 6525 North Sheridan Road, Chicago, IL 60626, USA.
9
Huang DS, Yu HJ. Normalized feature vectors: a novel alignment-free sequence comparison method based on the numbers of adjacent amino acids. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2013; 10:457-467. [PMID: 23929869 DOI: 10.1109/tcbb.2013.10] [Citation(s) in RCA: 57] [Impact Index Per Article: 5.2] [Indexed: 06/02/2023]
Abstract
Based on all kinds of adjacent amino acids (AAA), we map each protein primary sequence into a 400 × (L-1) matrix M. We then derive a normalized 400-tuple mathematical descriptor D, which is extracted from the primary protein sequence via singular value decomposition (SVD) of the matrix. The obtained 400-D normalized feature vectors (NFVs) facilitate our quantitative analysis of protein sequences. Using the normalized representation of the primary protein sequences, we analyze the similarity of different sequences on two data sets: 1) ND5 sequences from nine species and 2) transferrin sequences of 24 vertebrates. We also compare the results of this study with those from other related works. These two experiments illustrate that our proposed NFV-AAA approach performs well in sequence similarity analysis.
Affiliation(s)
- De-Shuang Huang
  - School of Electronics and Information Engineering, Tongji University, 4800 Caoan Road, Shanghai 201804, China
10
Information properties of naturally-occurring proteins: Fourier analysis and complexity phase plots. Protein J 2012; 31:550-63. [PMID: 22814572 DOI: 10.1007/s10930-012-9432-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 10/28/2022]
Abstract
In previous work from this lab, the information in natural proteins was investigated with Ribonuclease A (RNase A) serving as the source. The signature traits were investigated at three structure levels: primary through tertiary. The present paper travels further by charting the primary structure information of about half a million molecules. This was feasible given abundant sequence archives for both living and viral systems. Notably, a method is presented for evaluating primary structure information, based on Fourier analysis and spectral complexity. Significantly, the results show certain complexity traits to be universal for living sources. Viruses, by contrast, encode protein collections which are case-specific and complexity-divergent. The results have ramifications for discriminating collections on the basis of sequence information. This discrimination offers new strategies for selecting drug targets.
11
Pandit A, Dasanna AK, Sinha S. Multifractal analysis of HIV-1 genomes. Mol Phylogenet Evol 2011; 62:756-63. [PMID: 22155711 DOI: 10.1016/j.ympev.2011.11.017] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 10/23/2010] [Revised: 10/29/2011] [Accepted: 11/18/2011] [Indexed: 10/14/2022]
Abstract
Pathogens like HIV-1, which evolve into many closely related variants displaying differential infectivity and evolutionary dynamics on a short time scale, require fast and accurate classification. Conventional whole genome sequence alignment-based methods are computationally expensive and involve complex analysis. Alignment-free methodologies are increasingly being used to effectively differentiate genomic variations between viral species. Multifractal analysis, which explores the self-similar nature of genomes, is an alignment-free methodology that has been applied to study such variations. However, whether multifractal analysis can quantify variations between closely related genomes, such as the HIV-1 subtypes, is an open question. Here we address this question by applying multifractal analysis to four retroviral genomes (HIV-1, HIV-2, SIVcpz, and HTLV-1), and demonstrate that individual multifractal properties can easily differentiate between retrovirus types. However, the individual multifractal measures do not resolve within-group variations for the known subtypes of the HIV-1 M group. We show here that these known subtypes can instead be classified correctly using a combination of the crucial multifractal measures. This method is simple and computationally fast in comparison to the conventional alignment-based methods for whole genome phylogenetic analysis.
Affiliation(s)
- Aridaman Pandit
  - Mathematical Modeling and Computational Biology Group, Centre for Cellular and Molecular Biology (CSIR), Hyderabad 500007, India
12
Yang L, Zhang X, Zhu H. Alignment free comparison: similarity distribution between the DNA primary sequences based on the shortest absent word. J Theor Biol 2011; 295:125-31. [PMID: 22138094 DOI: 10.1016/j.jtbi.2011.11.021] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 08/09/2011] [Revised: 11/18/2011] [Accepted: 11/19/2011] [Indexed: 11/15/2022]
Abstract
This work proposes an alignment-free comparison model for DNA primary sequences. In this paper, we treat both strands of the DNA rather than a single strand. We define the shortest absent word of the double strands between DNA sequences, and study some of its properties to speed up the algorithm for searching for the shortest absent word. We present a novel model for comparison, in which the similarity distribution is introduced to describe the similarity between the sequences. A distance measure based on the Shannon entropy is derived and used in phylogenetic analysis. Experiments show that our model performs well in sequence analysis.
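The notion of a shortest absent word on both strands can be illustrated with a brute-force search. The sketch below handles a single sequence: it collects every word of a given length from the sequence and its reverse complement, then returns the first word (in alphabetical order) that occurs on neither strand. The paper's definition between pairs of sequences and its speed-up properties are not reproduced, and the function names are assumptions.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence over A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def shortest_absent_word(seq, alphabet="ACGT"):
    """Shortest word that occurs on neither strand of seq (brute force)."""
    strands = (seq, reverse_complement(seq))
    length = 1
    while True:
        # Collect all words of the current length from both strands.
        present = set()
        for strand in strands:
            for i in range(len(strand) - length + 1):
                present.add(strand[i:i + length])
        # Candidates are enumerated in alphabetical order.
        for letters in product(alphabet, repeat=length):
            word = "".join(letters)
            if word not in present:
                return word
        length += 1


if __name__ == "__main__":
    print(shortest_absent_word("ACGT"))
```

The search always terminates because the number of possible words grows as 4^L while the number of words of length L on both strands of a sequence of length n is at most 2(n - L + 1).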
Affiliation(s)
- Lianping Yang
  - College of Sciences, Northeastern University, Shenyang, China