1
Ibtehaz N, Sourav SMSH, Bayzid MS, Rahman MS. Align-gram: Rethinking the Skip-gram Model for Protein Sequence Analysis. Protein J 2023; 42:135-146. [PMID: 36977849 DOI: 10.1007/s10930-023-10096-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Accepted: 02/13/2023] [Indexed: 03/29/2023]
Abstract
The advent of next-generation sequencing technologies has exponentially increased the volume of biological sequence data. Protein sequences, often described as the 'language of life', have been analyzed for a multitude of applications and inferences. Owing to the rapid development of deep learning, recent years have seen a number of breakthroughs in Natural Language Processing. Since these methods can perform different tasks when trained with a sufficient amount of data, off-the-shelf models are used for various biological applications. In this study, we investigated the applicability of the popular Skip-gram model to protein sequence analysis and attempted to incorporate some biological insights into it. We propose a novel k-mer embedding scheme, Align-gram, which is capable of mapping similar k-mers close to each other in a vector space. Furthermore, we experiment with other sequence-based protein representations and observe that the embeddings derived from Align-gram aid the modeling and training of deep learning models better. Our experiments with a simple baseline LSTM model and the much more complex CNN model of DeepGoPlus show the potential of Align-gram for different types of deep learning applications in protein sequence analysis.
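As a rough illustration of the input a Skip-gram model consumes when applied to proteins, the sketch below splits a sequence into overlapping k-mers and lists (center, context) training pairs. It shows the generic Skip-gram setup only, not the Align-gram scheme itself; the function names `kmers` and `skipgram_pairs`, the toy sequence, and the values of `k` and `window` are assumptions made for this example.

```python
from typing import List, Tuple


def kmers(sequence: str, k: int = 3) -> List[str]:
    """Return all overlapping k-mers of a protein sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Pair each token with every neighbour at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical toy sequence, chosen only for illustration.
tokens = kmers("MKVLAT", k=3)
pairs = skipgram_pairs(tokens, window=1)
```

A Skip-gram model would then be trained to predict the context k-mer from the center k-mer for each pair, so that k-mers sharing contexts end up close together in the embedding space.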
2
Xing L, Xia YY, Zhang QY, Xia ZF, Wan CX, Zhang LL, Luo XX. Streptomyces griseicoloratus sp. nov., isolated from soil in cotton fields in Xinjiang, China. Arch Microbiol 2022; 204:254. [PMID: 35412082 DOI: 10.1007/s00203-022-02818-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2021] [Revised: 01/26/2022] [Accepted: 02/15/2022] [Indexed: 11/29/2022]
Abstract
A novel bacterium of the genus Streptomyces, designated TRM S81-3T, was isolated from soil in cotton fields of Xinjiang, China. Comparative 16S rRNA gene sequence analysis indicated that strain TRM S81-3T is most closely related to Streptomyces viridiviolaceus NBRC 13359T (98.9% sequence similarity); however, the average nucleotide identity (ANI) between strains TRM S81-3T and S. viridiviolaceus NBRC 13359T is relatively low (91.6%). Strain TRM S81-3T possesses LL-diaminopimelic acid as the diagnostic cell-wall diamino acid, MK-9(H4), MK-9(H6), and MK-9(H10) as the major menaquinones, and polar lipids including diphosphatidylglycerol (DPG), phosphatidylcholine (PC), phosphatidylethanolamine (PE), phosphatidylmethyl ethanolamine (PME), phosphotidylinositolone (PI), phospholipid of unknown structure containing glucosamine (NPG), and two unidentified phospholipids (PLs). The major fatty acids are iso-C16:0, anteiso-C15:0, anteiso-C17:1 ω9c, anteiso-C17:0, iso-C15:0, and C14:0. The genomic DNA G + C content is 72.1%. Based on the evidence from this polyphasic study, strain TRM S81-3T represents a novel species of Streptomyces, for which the name Streptomyces griseicoloratus is proposed. The type strain is TRM S81-3T (= CCTCC AA 2020002T = LMG 31942T).
Affiliation(s)
- Li Xing
  - Key Laboratory of Protection and Utilization of Biological Resources in Tarim Basin of Xinjiang Production & Construction Corps/College of Life Science, Tarim University, Alar, 843300, People's Republic of China
- Ying-Ying Xia
  - Key Laboratory of Protection and Utilization of Biological Resources in Tarim Basin of Xinjiang Production & Construction Corps/College of Life Science, Tarim University, Alar, 843300, People's Republic of China
- Qiao-Yan Zhang
  - Key Laboratory of Protection and Utilization of Biological Resources in Tarim Basin of Xinjiang Production & Construction Corps/College of Life Science, Tarim University, Alar, 843300, People's Republic of China
- Zhan-Feng Xia
  - Key Laboratory of Protection and Utilization of Biological Resources in Tarim Basin of Xinjiang Production & Construction Corps/College of Life Science, Tarim University, Alar, 843300, People's Republic of China
- Chuan-Xing Wan
  - Key Laboratory of Protection and Utilization of Biological Resources in Tarim Basin of Xinjiang Production & Construction Corps/College of Life Science, Tarim University, Alar, 843300, People's Republic of China
- Li-Li Zhang
  - Key Laboratory of Protection and Utilization of Biological Resources in Tarim Basin of Xinjiang Production & Construction Corps/College of Life Science, Tarim University, Alar, 843300, People's Republic of China
- Xiao-Xia Luo
  - Key Laboratory of Protection and Utilization of Biological Resources in Tarim Basin of Xinjiang Production & Construction Corps/College of Life Science, Tarim University, Alar, 843300, People's Republic of China
3
Achom M, Roy P, Lagunas B, Picot E, Richards L, Bonyadi-Pour R, Pardal AJ, Baxter L, Richmond BL, Aschauer N, Fletcher EM, Rowson M, Blackwell J, Rich-Griffin C, Mysore KS, Wen J, Ott S, Carré IA, Gifford ML. Plant circadian clock control of Medicago truncatula nodulation via regulation of nodule cysteine-rich peptides. Journal of Experimental Botany 2022; 73:2142-2156. [PMID: 34850882 PMCID: PMC8982390 DOI: 10.1093/jxb/erab526] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2021] [Accepted: 11/30/2021] [Indexed: 06/13/2023]
Abstract
Legumes house nitrogen-fixing endosymbiotic rhizobia in specialized polyploid cells within root nodules, which undergo tightly regulated metabolic activity. By carrying out expression analysis of transcripts over time in Medicago truncatula nodules, we found that the circadian clock enables coordinated control of metabolic and regulatory processes linked to nitrogen fixation. This involves the circadian clock-associated transcription factor LATE ELONGATED HYPOCOTYL (LHY), with lhy mutants being affected in nodulation. Rhythmic transcripts in root nodules include a subset of nodule-specific cysteine-rich peptides (NCRs) that have the LHY-bound conserved evening element in their promoters. Until now, studies have suggested that NCRs act to regulate bacteroid differentiation and keep the rhizobial population in check. However, these conclusions came from the study of a few members of this very large gene family that has complex diversified spatio-temporal expression. We suggest that rhythmic expression of NCRs may be important for temporal coordination of bacterial activity with the rhythms of the plant host, in order to ensure optimal symbiosis.
Affiliation(s)
- Mingkee Achom
  - School of Life Sciences, Gibbet Hill Road, University of Warwick, Coventry CV4 7AL, UK
- Proyash Roy
  - School of Life Sciences, Gibbet Hill Road, University of Warwick, Coventry CV4 7AL, UK
  - Department of Genetic Engineering & Biotechnology, University of Dhaka, Dhaka, Bangladesh
- Beatriz Lagunas
  - School of Life Sciences, Gibbet Hill Road, University of Warwick, Coventry CV4 7AL, UK
- Emma Picot
  - School of Life Sciences, Gibbet Hill Road, University of Warwick, Coventry CV4 7AL, UK
- Luke Richards
  - School of Life Sciences, Gibbet Hill Road, University of Warwick, Coventry CV4 7AL, UK
- Roxanna Bonyadi-Pour
  - School of Life Sciences, Gibbet Hill Road, University of Warwick, Coventry CV4 7AL, UK
- Alonso J Pardal
  - Warwick Medical School, University of Warwick, Coventry CV4 7AL, UK
- Laura Baxter
  - School of Life Sciences, Gibbet Hill Road, University of Warwick, Coventry CV4 7AL, UK
- Bethany L Richmond
  - School of Life Sciences, Gibbet Hill Road, University of Warwick, Coventry CV4 7AL, UK
- Nadine Aschauer
  - School of Life Sciences, Gibbet Hill Road, University of Warwick, Coventry CV4 7AL, UK
- Eleanor M Fletcher
  - School of Life Sciences, Gibbet Hill Road, University of Warwick, Coventry CV4 7AL, UK
  - School of Biological Sciences, University of Bristol, 24 Tyndall Avenue, Bristol BS8 1TQ, UK
- Monique Rowson
  - School of Life Sciences, Gibbet Hill Road, University of Warwick, Coventry CV4 7AL, UK
- Joseph Blackwell
  - School of Life Sciences, Gibbet Hill Road, University of Warwick, Coventry CV4 7AL, UK
- Charlotte Rich-Griffin
  - School of Life Sciences, Gibbet Hill Road, University of Warwick, Coventry CV4 7AL, UK
  - Wellcome Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK
- Kirankumar S Mysore
  - Institute for Agricultural Biosciences, Oklahoma State University, Ardmore, OK 73401, USA
- Jiangqi Wen
  - Institute for Agricultural Biosciences, Oklahoma State University, Ardmore, OK 73401, USA
- Sascha Ott
  - Warwick Medical School, University of Warwick, Coventry CV4 7AL, UK
- Isabelle A Carré
  - School of Life Sciences, Gibbet Hill Road, University of Warwick, Coventry CV4 7AL, UK
- Miriam L Gifford
  - School of Life Sciences, Gibbet Hill Road, University of Warwick, Coventry CV4 7AL, UK
  - Warwick Integrative Synthetic Biology Centre, University of Warwick, Coventry CV4 7AL, UK
4
Lee H, Shuaibi A, Bell JM, Pavlichin DS, Ji HP. Unique k-mer sequences for validating cancer-related substitution, insertion and deletion mutations. NAR Cancer 2020; 2:zcaa034. [PMID: 33345188 PMCID: PMC7727745 DOI: 10.1093/narcan/zcaa034] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 06/23/2020] [Revised: 10/23/2020] [Accepted: 11/12/2020] [Indexed: 12/26/2022]
Abstract
Cancer genome sequencing has led to important discoveries such as the identification of cancer genes. However, challenges remain in the analysis of cancer genome sequencing. One significant issue is that mutations identified by multiple variant callers are frequently discordant even when using the same genome sequencing data. For insertion and deletion mutations, oftentimes there is no agreement among different callers. Identifying somatic mutations involves read mapping and variant calling, a complicated process that uses many parameters and model tuning. To validate the identification of true mutations, we developed a method using k-mer sequences. First, we characterized the landscape of unique versus non-unique k-mers in the human genome. Second, we developed a software package, KmerVC, to validate the given somatic mutations from sequencing data. Our program validates the occurrence of a mutation based on a statistically significant difference in the frequency of k-mers with and without a mutation from matched normal and tumor sequences. Third, we tested our method on both simulated and cancer genome sequencing data. Counting k-mers involving mutations effectively validated true positive mutations, including insertions and deletions, across different individual samples in a reproducible manner. Thus, we demonstrated a straightforward approach for rapidly validating mutations from cancer genome sequencing data.
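The counting idea described in this abstract can be sketched as follows: collect the k-mers that span a substitution in the mutated sequence, keep those that do not occur in the reference, and count how often they appear in tumor versus normal reads. This is a minimal sketch and not KmerVC's implementation; the helper names, the toy reference, and the toy reads are assumptions, and the statistical test on the resulting counts is left out.

```python
from collections import Counter
from typing import Iterable, List, Set


def spanning_kmers(seq: str, pos: int, k: int) -> List[str]:
    """All k-mers of `seq` that cover the 0-based position `pos`."""
    start = max(0, pos - k + 1)
    stop = min(pos, len(seq) - k)
    return [seq[i:i + k] for i in range(start, stop + 1)]


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count every k-mer occurring in a collection of reads."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def support(reads: Iterable[str], kmer_set: Set[str], k: int) -> int:
    """Total occurrences in `reads` of the k-mers in `kmer_set`."""
    counts = count_kmers(reads, k)
    return sum(counts[km] for km in kmer_set)


# Hypothetical toy data: a C>T substitution at 0-based position 5.
k = 4
reference = "ACGTACGGTCA"
mutant = reference[:5] + "T" + reference[6:]

# Keep only mutation-spanning k-mers that are absent from the reference.
reference_kmers = set(count_kmers([reference], k))
alt_kmers = set(spanning_kmers(mutant, 5, k)) - reference_kmers

tumor_reads = ["CGTATGGT", "GTACGGTC"]
normal_reads = ["GTACGGTC", "CGTACGGT"]
tumor_alt = support(tumor_reads, alt_kmers, k)
normal_alt = support(normal_reads, alt_kmers, k)
```

A large count in the tumor reads together with a near-zero count in the matched normal reads is the kind of difference that a significance test would then evaluate.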
Affiliation(s)
- HoJoon Lee
  - Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
- Ahmed Shuaibi
  - Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
- John M Bell
  - Stanford Genome Technology Center, Stanford University, Palo Alto, CA 94304, USA
- Dmitri S Pavlichin
  - Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
- Hanlee P Ji
  - Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
5
Chiner-Oms A, González-Candelas F. EvalMSA: A Program to Evaluate Multiple Sequence Alignments and Detect Outliers. Evol Bioinform Online 2016; 12:277-284. [PMID: 27920488 PMCID: PMC5127606 DOI: 10.4137/ebo.s40583] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 07/18/2016] [Revised: 10/02/2016] [Accepted: 10/05/2016] [Indexed: 12/01/2022]
Abstract
We present EvalMSA, a software tool for evaluating and detecting outliers in multiple sequence alignments (MSAs). This tool allows the identification of divergent sequences in MSAs by scoring the contribution of each row in the alignment to its quality using a sum-of-pair-based method and additional analyses. Our main goal is to provide users with objective data in order to take informed decisions about the relevance and/or pertinence of including/retaining a particular sequence in an MSA. EvalMSA is written in standard Perl and also uses some routines from the statistical language R. Therefore, it is necessary to install the R-base package in order to get full functionality. Binary packages are freely available from http://sourceforge.net/projects/evalmsa/ for Linux and Windows.
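To make the sum-of-pairs idea concrete, the sketch below scores each row of a toy alignment by summing its column-wise scores against every other row; a row whose total is far below the rest is a candidate outlier. This is an illustrative Python sketch, not EvalMSA's Perl/R code; the match, mismatch, and gap values and the toy alignment are assumptions.

```python
from typing import List


def pair_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Score two equal-length aligned rows column by column."""
    total = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # gap-gap columns are skipped
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def row_contributions(msa: List[str]) -> List[int]:
    """Sum of pairwise scores of each row against all other rows."""
    return [
        sum(pair_score(row, other) for j, other in enumerate(msa) if j != i)
        for i, row in enumerate(msa)
    ]


# Hypothetical toy alignment; the last row is deliberately divergent.
msa = ["ACGT", "ACGT", "ACGA", "TTCA"]
scores = row_contributions(msa)
outlier_index = scores.index(min(scores))
```

In practice a substitution matrix would replace the flat match and mismatch values, and a statistical criterion rather than a simple minimum would decide which rows count as outliers.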
Affiliation(s)
- Alvaro Chiner-Oms
  - Joint Research Unit "Infection and Public Health" FISABIO, Cavanilles Institute for Biodiversity and Evolutionary Biology, University of Valencia, Paterna, Valencia, Spain; CIBER in Epidemiology and Public Health, Madrid, Spain
- Fernando González-Candelas
  - Joint Research Unit "Infection and Public Health" FISABIO, Cavanilles Institute for Biodiversity and Evolutionary Biology, University of Valencia, Paterna, Valencia, Spain; CIBER in Epidemiology and Public Health, Madrid, Spain
6
Eric SD, Nicholas TKDD, Theophilus KA. Bioinformatics with basic local alignment search tool (BLAST) and fast alignment (FASTA). ACTA ACUST UNITED AC 2014. [DOI: 10.5897/ijbc2013.0086] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Indexed: 10/31/2022]
7
Abstract
INTRODUCTION: The original Dayhoff percent accepted mutation (PAM) matrices were developed based on a small number of protein sequences and an evolutionary model of protein change. By extrapolating from the observed changes at small evolutionary distances to large ones, it was possible to establish a PAM250 scoring matrix for sequences that were highly divergent. Another approach to finding a scoring matrix for divergent sequences is to start with a more divergent set of sequences and produce a scoring matrix from the substitutions found in those less-related sequences. The blocks amino acid substitution matrices (BLOSUM) scoring matrices were prepared this way. This article explains how BLOSUM scoring matrices were created and how they can best be used.
8
Abstract
INTRODUCTION: The percent accepted mutation (PAM) scoring matrix is based on the Dayhoff model of protein evolution, which is a Markov process. In the Markov model of amino acid change, the probability of mutation at each site is independent of the previous history of mutations. Use of this model makes it possible to extrapolate amino acid substitutions observed over a relatively short period of evolutionary time to longer periods of evolutionary time. One criticism of the PAM scoring matrix is that the frequency of amino acid changes that require two nucleotide changes is higher than would be expected by chance. This article describes a test of the Markov model of protein evolution, which shows that the model can be valid if certain changes are made in the way that PAM matrices are calculated.
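Under a Markov model, extrapolating from a short evolutionary interval to a longer one amounts to raising the one-step transition matrix to a power. The sketch below illustrates this with a toy two-letter alphabet; the transition probabilities are invented for the example and are not Dayhoff's values, and the real matrices cover the 20 amino acids.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two square matrices of the same size."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def matrix_power(m: Matrix, n: int) -> Matrix:
    """Return m multiplied by itself n times (identity for n = 0)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = matmul(result, m)
    return result


# Hypothetical one-step matrix: row i holds the probabilities that
# letter i stays the same or changes during one unit of evolutionary time.
step = [[0.99, 0.01],
        [0.02, 0.98]]

# Transition probabilities after 250 such units.
far = matrix_power(step, 250)
```

Each row of `far` still sums to one, and the off-diagonal entries grow with the number of steps, which is why matrices extrapolated to large distances suit highly divergent sequences.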
9
Mount DW. Studies of varying alignment algorithm, amino acid scoring matrix, and gap penalties. Cold Spring Harb Protoc 2008; 2008:pdb.ip60. [PMID: 21356841 DOI: 10.1101/pdb.ip60] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/25/2022]
Abstract
INTRODUCTION: Comparing different amino acid scoring matrix-gap penalty combinations poses several problems. For example, the analysis often overlooks the purposes of different matrices; e.g., protein family or domain searching, evolutionary analysis, or structural alignment. In the past, gap penalties were usually not published or well known, thus throwing a level of uncertainty into the results. More recently, when investigators publish a new scoring matrix, they usually provide suitable choices for gap penalties that may be used for comparisons with other matrices. This article summarizes a number of reports that have examined combinations of alignment algorithm, scoring matrix, and gap penalties used to align sequences for various purposes.
10
Mount DW. Comparison of the PAM and BLOSUM Amino Acid Substitution Matrices. Cold Spring Harb Protoc 2008; 2008:pdb.ip59. [PMID: 21356840 DOI: 10.1101/pdb.ip59] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.0] [Indexed: 05/24/2023]
Abstract
INTRODUCTION: The choice of a scoring system including scores for matches, mismatches, substitutions, insertions, and deletions influences the alignment of both DNA and protein sequences. To score matches and mismatches in alignments of proteins, it is necessary to know how often one amino acid is substituted for another in related proteins. Percent accepted mutation (PAM) matrices list the likelihood of change from one amino acid to another in homologous protein sequences during evolution and thus are focused on tracking the evolutionary origins of proteins. In contrast, the blocks amino acid substitution matrices (BLOSUM) are based on scoring substitutions found over a range of evolutionary periods. There are important differences in the ways that the PAM and BLOSUM scoring matrices were derived. These differences, which are discussed in this article, should be appreciated when interpreting the results of protein sequence alignments obtained with these matrices.
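Substitution scores of this kind are commonly expressed as log-odds: the observed frequency of an aligned amino acid pair divided by the frequency expected if the two residues were paired by chance. The sketch below shows that calculation with invented frequencies; the function name `log_odds_score`, the input numbers, and the default scaling factor of 2 (half-bit units) are assumptions for illustration rather than values taken from the article.

```python
import math


def log_odds_score(pair_freq: float, freq_a: float, freq_b: float, scale: float = 2.0) -> int:
    """Rounded, scaled log2 ratio of observed to chance-expected pair frequency."""
    expected = freq_a * freq_b
    return round(scale * math.log2(pair_freq / expected))


# Hypothetical frequencies: two residues that each make up 10% of the data.
favoured = log_odds_score(pair_freq=0.02, freq_a=0.1, freq_b=0.1)
disfavoured = log_odds_score(pair_freq=0.005, freq_a=0.1, freq_b=0.1)
```

A pair seen twice as often as chance predicts gets a positive score and a pair seen half as often gets a negative one, so summing scores along an alignment rewards substitutions that are common in related proteins.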
11
Abstract
INTRODUCTION: Certain amino acid substitutions commonly occur in related proteins from different species. Because a protein still functions with these substitutions, the substituted amino acids are compatible with protein structure and function. Knowing the types of changes that are most and least common in a large number of proteins can assist with predicting alignments for any set of protein sequences. If related protein sequences are quite similar, they are easy to align, and one can readily determine the single-step amino acid changes. If ancestor relationships among a group of proteins are assessed, the most likely amino acid changes that occurred during evolution can be predicted. This type of analysis was pioneered by Margaret Dayhoff and used by her to produce a type of scoring matrix called a percent accepted mutation (PAM) matrix. This article introduces Dayhoff PAM matrices, explains how they are constructed and how they can be used for sequence alignments, and highlights their strengths and limitations.