1
Praveen M. Characterizing the West Nile Virus's polyprotein from nucleotide sequence to protein structure - Computational tools. J Taibah Univ Med Sci 2024; 19:338-350. [PMID: 38304694 PMCID: PMC10831166 DOI: 10.1016/j.jtumed.2024.01.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2023] [Revised: 11/27/2023] [Accepted: 01/08/2024] [Indexed: 02/03/2024] Open
Abstract
Objectives: West Nile virus (WNV) belongs to the Flaviviridae family and causes West Nile fever. It is transmitted by Culex mosquito species. Infected individuals are primarily asymptomatic, and few exhibit common symptoms. Moreover, 10% of neuronal infections caused by this virus are fatal. The proteins encoded by the viral genome had been uncharacterized, although understanding their function and structure is important for formulating antiviral drugs.
Methods: Herein, we used in silico approaches, including various bioinformatic tools and databases, to analyse the proteins from the WNV polyprotein individually. The characterization included GC content, physicochemical properties, conserved domains, soluble and transmembrane regions, signal localization, protein disorder, and secondary structure features and their respective 3D protein structures.
Results: Among 11 proteins, eight had >50% GC content, eight had basic pI values, three were unstable under in vitro conditions, four were thermostable (aliphatic index, AI, >100) and some had negative GRAVY values in physicochemical analyses. All protein-conserved domains were shared among Flaviviridae family members. Five proteins were soluble and lacked transmembrane regions. Two proteins had signals for localization in the host endoplasmic reticulum. Non-structural (NS) 2A showed low protein disorder. The secondary structural features and tertiary structure models provide a valuable biochemical resource for designing selective substrates and synthetic inhibitors.
Conclusions: WNV proteins NS2A, NS2B, PM, NS3 and NS5 can be used as drug targets for the pharmacological design of lead antiviral compounds.
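The physicochemical measures named in this abstract (GC content, aliphatic index, GRAVY) have simple closed-form definitions. The sketch below computes them from the standard Kyte-Doolittle hydropathy scale and Ikai's aliphatic-index formula; the abstract does not say which tool the author used, so the function names and the toy peptide are assumptions made for illustration only.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(protein: str) -> float:
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    seq = protein.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aliphatic_index(protein: str) -> float:
    """Ikai's aliphatic index: X(Ala) + 2.9*X(Val) + 3.9*(X(Ile) + X(Leu)),
    where X is the mole percent of each residue."""
    seq = protein.upper()

    def mole_percent(aa: str) -> float:
        return 100.0 * seq.count(aa) / len(seq)

    return (mole_percent("A")
            + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


def gc_percent(nucleotides: str) -> float:
    """Percentage of G and C bases in a nucleotide sequence."""
    seq = nucleotides.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


if __name__ == "__main__":
    toy_peptide = "MKVLAAGIV"  # hypothetical sequence, not a WNV protein
    print(f"GRAVY = {gravy(toy_peptide):.3f}")
    print(f"AI    = {aliphatic_index(toy_peptide):.1f}")
    print(f"GC%   = {gc_percent('ATGGCCAAG'):.1f}")
```

A negative GRAVY value indicates an overall hydrophilic protein, and an aliphatic index above 100 is the thermostability criterion the abstract refers to.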
Affiliation(s)
- Mallari Praveen
- Department of Zoology, Indira Gandhi National Tribal University, Amarkantak, Madhya Pradesh, India
2
Zhang S, Zhang T, Fu Y. Proteome-wide structural analysis quantifies structural conservation across distant species. Genome Res 2023; 33:1975-1993. [PMID: 37993136 PMCID: PMC10760455 DOI: 10.1101/gr.277771.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2023] [Accepted: 10/16/2023] [Indexed: 11/24/2023]
Abstract
Traditional evolutionary biology research mainly relies on sequence information to infer evolutionary relationships between genes or proteins. In contrast, protein structural information has long been overlooked, although structures are more conserved and closely linked to the functions than the sequences. To address this gap, we conducted a proteome-wide structural analysis using experimental and computed protein structures for organisms from the three distinct domains, including Homo sapiens (eukarya), Escherichia coli (bacteria), and Methanocaldococcus jannaschii (archaea). We reveal the distribution of structural similarity and sequence identity at the genomic level and characterize the twilight zone, where signals obtained from sequence alignment are blurred and evolutionary relationships cannot be inferred unambiguously. We find that structurally similar homologous protein pairs in the twilight zone account for ∼0.004%-0.021% of all possible protein pair combinations, which translates to ∼8%-32% of the protein-coding genes, depending on the species under comparison. In addition, by comparing the structural homologs, we show that human proteins involved in the energy supply are more similar to their E. coli homologs, whereas proteins relating to the central dogma are more similar to their M. jannaschii homologs. We also identify a bacterial GPCR homolog in the E. coli proteome that displays distinctive domain architecture. Our results shed light on the characteristics of the twilight zone and the origin of different pathways from a protein structure perspective, highlighting an exciting new frontier in evolutionary biology.
Affiliation(s)
- Shijie Zhang
- Department of Pharmacology and Tianjin Key Laboratory of Inflammation Biology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China
- Teng Zhang
- Department of Pharmacology and Tianjin Key Laboratory of Inflammation Biology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China
- Yuan Fu
- Department of Pharmacology and Tianjin Key Laboratory of Inflammation Biology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China
3
Feng H, Wang S, Wang Y, Ni X, Yang Z, Hu X, Yang S. LncCat: An ORF attention model to identify LncRNA based on ensemble learning strategy and fused sequence information. Comput Struct Biotechnol J 2023; 21:1433-1447. [PMID: 36824229 PMCID: PMC9941877 DOI: 10.1016/j.csbj.2023.02.012] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 08/11/2022] [Revised: 02/06/2023] [Accepted: 02/06/2023] [Indexed: 02/10/2023] Open
Abstract
Background: Long non-coding RNA (lncRNA) is one of the most essential forms of transcripts; despite lacking protein-coding ability, it plays crucial regulatory roles in the development of cancers and other diseases. It was assumed that short ORFs (sORFs) in lncRNA had little capacity to be translated into proteins. However, recent research has shown that sORFs can encode peptides, which makes lncRNA harder to identify. Therefore, identifying lncRNAs with sORFs facilitates finding novel regulatory factors.
Results: In this paper, we propose LncCat for identifying lncRNA based on category boosting (CatBoost) and ORF-attention features. LncCat combines five types of features to encode transcript sequences and employs CatBoost to build a prediction model. In addition, the visualization comparison reveals that the ORF-attention features between lncRNAs and protein-coding transcripts are significantly distinct. The comparison results show that LncCat outperforms competing methods on several benchmark datasets. For the Matthews correlation coefficient (MCC), LncCat achieves 0.9503, 0.9219, 0.8591, 0.8672, and 0.9047 on the human, mouse, zebrafish, wheat, and chicken datasets, with improvements ranging from 1.90-7.82%, 1.49-17.63%, 6.11-21.50%, 3.02-51.64% and 5.35-26.90%, respectively. Moreover, LncCat dramatically improves the MCC by at least 11.90%, 12.96% and 42.61% on sORF test datasets of human, mouse, and zebrafish, respectively.
Conclusions: Experiments indicate that LncCat performs better on both long ORF and sORF datasets, and ORF-attention features show positive effects on predicting lncRNA. In brief, LncCat is a reliable method for identifying lncRNA. Additionally, a user-friendly web server is developed for academics at http://cczubio.top/lnccat.
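For readers unfamiliar with the headline metric, the Matthews correlation coefficient can be computed directly from the four confusion-matrix counts. The sketch below uses the standard definition; the function name and the example counts are hypothetical and are not taken from the LncCat benchmarks.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC from binary confusion-matrix counts.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined 0/0 case.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    # Hypothetical counts for a lncRNA (positive) vs. mRNA (negative) test set.
    print(f"MCC = {matthews_corrcoef(tp=90, tn=85, fp=15, fn=10):.4f}")
```

MCC ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (perfect), and it stays informative when the two classes are imbalanced, which is one reason it is often preferred to accuracy for this kind of classifier.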
Affiliation(s)
- Hongqi Feng
- School of Computer Science and Artificial Intelligence Aliyun School of Big Data School of Software, Changzhou University, Changzhou 213164, China
- Shaocong Wang
- School of Computer Science and Artificial Intelligence Aliyun School of Big Data School of Software, Changzhou University, Changzhou 213164, China
- Yan Wang
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- School of Artificial Intelligence, Jilin University, Changchun 130012, China
- Xinye Ni
- The Affiliated Changzhou No.2 People’s Hospital of Nanjing Medical University, Changzhou 213164, China
- Zexi Yang
- School of Computer Science and Artificial Intelligence Aliyun School of Big Data School of Software, Changzhou University, Changzhou 213164, China
- Xuemei Hu
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Sen Yang
- School of Computer Science and Artificial Intelligence Aliyun School of Big Data School of Software, Changzhou University, Changzhou 213164, China
- The Affiliated Changzhou No.2 People’s Hospital of Nanjing Medical University, Changzhou 213164, China
4
Chepkwony M, Wragg D, Latré de Laté P, Paxton E, Cook E, Ndambuki G, Kitala P, Gathura P, Toye P, Prendergast J. Longitudinal transcriptome analysis of cattle infected with Theileria parva. Int J Parasitol 2022; 52:799-813. [PMID: 36244429 DOI: 10.1016/j.ijpara.2022.07.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2022] [Revised: 07/01/2022] [Accepted: 07/14/2022] [Indexed: 11/05/2022]
Abstract
The apicomplexan cattle parasite Theileria parva is a major barrier to improving the livelihoods of smallholder farmers in Africa, killing over one million cattle on the continent each year. Although exotic breeds not native to Africa are highly susceptible to the disease, previous studies have illustrated that indigenous breeds often show innate tolerance to infection by the parasite. The mechanisms underlying this tolerance remain largely unclear. To better understand the host response to T. parva infection, we characterised the transcriptional response over 15 days in tolerant and susceptible cattle (n = 29) naturally exposed to the parasite. We identify key genes and pathways activated in response to infection as well as, importantly, several genes differentially expressed between the animals that ultimately survived or succumbed to infection. These include genes linked to key cell proliferation and infection pathways. Furthermore, we identify response expression quantitative trait loci containing genetic variants whose impact on the expression level of nearby genes changes in response to the infection. These therefore provide an indication of the genetic basis of differential host responses. Together these results provide a comprehensive analysis of the host transcriptional response to this under-studied pathogen, providing clues as to the mechanisms underlying natural tolerance to the disease.
Affiliation(s)
- M Chepkwony
- Centre for Tropical Livestock Genetics and Health (CTLGH), ILRI Kenya, P.O. Box 30709, Nairobi 00100, Kenya
- D Wragg
- Centre for Tropical Livestock Genetics and Health (CTLGH), Easter Bush Campus, EH25 9RG, UK
- P Latré de Laté
- Centre for Tropical Livestock Genetics and Health (CTLGH), ILRI Kenya, P.O. Box 30709, Nairobi 00100, Kenya
- E Paxton
- Centre for Tropical Livestock Genetics and Health (CTLGH), Easter Bush Campus, EH25 9RG, UK
- E Cook
- Centre for Tropical Livestock Genetics and Health (CTLGH), ILRI Kenya, P.O. Box 30709, Nairobi 00100, Kenya
- G Ndambuki
- Centre for Tropical Livestock Genetics and Health (CTLGH), ILRI Kenya, P.O. Box 30709, Nairobi 00100, Kenya
- P Kitala
- College of Agriculture and Veterinary Sciences (CAVS), University of Nairobi, P.O. Box 29053-00624, Kangemi, Nairobi, Kenya
- P Gathura
- College of Agriculture and Veterinary Sciences (CAVS), University of Nairobi, P.O. Box 29053-00624, Kangemi, Nairobi, Kenya
- P Toye
- Centre for Tropical Livestock Genetics and Health (CTLGH), ILRI Kenya, P.O. Box 30709, Nairobi 00100, Kenya
- J Prendergast
- Centre for Tropical Livestock Genetics and Health (CTLGH), Easter Bush Campus, EH25 9RG, UK
5
Dsouza KB, Li AY, Bhargava VK, Libbrecht MW. Latent Representation of the Human Pan-Celltype Epigenome Through a Deep Recurrent Neural Network. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:2313-2323. [PMID: 34043510 DOI: 10.1109/tcbb.2021.3084147] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/12/2023]
Abstract
The availability of thousands of assays of epigenetic activity necessitates compressed representations of these data sets that summarize the epigenetic landscape of the genome. Until recently, most such representations were cell type-specific, applying to a single tissue or cell state. Recently, neural networks have made it possible to summarize data across tissues to produce a pan-cell type representation. In this work, we propose Epi-LSTM, a deep long short-term memory (LSTM) recurrent neural network autoencoder to capture the long-term dependencies in the epigenomic data. The latent representations from Epi-LSTM capture a variety of genomic phenomena, including gene-expression, promoter-enhancer interactions, replication timing, frequently interacting regions, and evolutionary conservation. These representations outperform existing methods in a majority of cell types while yielding smoother representations along the genomic axis due to their sequential nature.
6
Zhu Y, Chen L, Hong X, Shi H, Li X. Revealing the novel complexity of plant long non-coding RNA by strand-specific and whole transcriptome sequencing for evolutionarily representative plant species. BMC Genomics 2022; 23:381. [PMID: 35590257 PMCID: PMC9118565 DOI: 10.1186/s12864-022-08602-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/30/2022] [Accepted: 05/03/2022] [Indexed: 12/02/2022] Open
Abstract
Background: Previous studies on plant long noncoding RNAs (lncRNAs) lacked consistency and suffered from many factors such as heterogeneous data sources and experimental protocols, different plant tissues, and inconsistent bioinformatics pipelines. For example, the sequencing of RNAs with poly(A) tails excluded a large portion of lncRNAs without poly(A), and the use of regular RNA-sequencing techniques did not distinguish the direction of lncRNA transcripts. The current study was designed to systematically discover and analyze lncRNAs across eight evolutionarily representative plant species, using strand-specific (directional) and whole transcriptome sequencing (RiboMinus) techniques.
Results: A total of 39,945 lncRNAs (25,350 lincRNAs and 14,595 lncNATs) were identified, which showed molecular features of lncRNAs that are consistent across divergent plant species but different from those of mRNA. Further, transposable elements (TEs) were found to play key roles in the origination of lncRNA, as a significantly large number of lncRNAs were found to contain TEs in the gene body and promoter region, and transcription of many lncRNAs was driven by TE promoters. The lncRNA sequences were divergent even in closely related species, and most plant lncRNAs were genus/species-specific, amid rapid turnover in evolution. Evaluated with PhastCons scores, plant lncRNAs showed a conservation level similar to that of intergenic sequences, suggesting that most lincRNAs were young and with short evolutionary age. INDUCED BY PHOSPHATE STARVATION (IPS) was found so far to be the only plant lncRNA group with conserved motifs, which may play important roles in adaptation during the migration from aquatic to terrestrial life. Most highly and specifically expressed lncRNAs formed co-expression networks with coding genes, and their functions were believed to be closely related to their co-expressed genes.
Conclusion: The study revealed novel features and complexity of lncRNAs in plants through systematic analysis, providing important insights into the origination and evolution of plant lncRNAs.
Affiliation(s)
- Yan Zhu
- Key Laboratory of Synthetic Biology, Center for Excellence in Molecular Plant Sciences/Institute of Plant Physiology and Ecology, Chinese Academy of Sciences, Shanghai, China
- Longxian Chen
- Key Laboratory of Synthetic Biology, Center for Excellence in Molecular Plant Sciences/Institute of Plant Physiology and Ecology, Chinese Academy of Sciences, Shanghai, China; University of Chinese Academy of Sciences, Beijing, China
- Xiangna Hong
- Key Laboratory of Synthetic Biology, Center for Excellence in Molecular Plant Sciences/Institute of Plant Physiology and Ecology, Chinese Academy of Sciences, Shanghai, China; Henan University, Kaifeng, China
- Han Shi
- Key Laboratory of Synthetic Biology, Center for Excellence in Molecular Plant Sciences/Institute of Plant Physiology and Ecology, Chinese Academy of Sciences, Shanghai, China; University of Chinese Academy of Sciences, Beijing, China
- Xuan Li
- Key Laboratory of Synthetic Biology, Center for Excellence in Molecular Plant Sciences/Institute of Plant Physiology and Ecology, Chinese Academy of Sciences, Shanghai, China; University of Chinese Academy of Sciences, Beijing, China
7
Klapproth C, Sen R, Stadler PF, Findeiß S, Fallmann J. Common Features in lncRNA Annotation and Classification: A Survey. Noncoding RNA 2021; 7:77. [PMID: 34940758 PMCID: PMC8708962 DOI: 10.3390/ncrna7040077] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 11/12/2021] [Revised: 12/03/2021] [Accepted: 12/06/2021] [Indexed: 12/29/2022] Open
Abstract
Long non-coding RNAs (lncRNAs) are widely recognized as important regulators of gene expression. Their molecular functions range from miRNA sponging to chromatin-associated mechanisms, leading to effects in disease progression and establishing them as diagnostic and therapeutic targets. Still, only a few representatives of this diverse class of RNAs are well studied, while the vast majority is poorly described beyond the existence of their transcripts. In this review we survey common in silico approaches for lncRNA annotation. We focus on the well-established sets of features used for classification and discuss their specific advantages and weaknesses. While the available tools perform very well for the task of distinguishing coding sequence from other RNAs, we find that current methods are not well suited to distinguish lncRNAs or parts thereof from other non-protein-coding input sequences. We conclude that the distinction of lncRNAs from intronic sequences and untranslated regions of coding mRNAs remains a pressing research gap.
Affiliation(s)
- Christopher Klapproth
- Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstraße 16-18, D-04107 Leipzig, Germany
- Rituparno Sen
- Helmholtz Institute for RNA-Based Infection Research (HIRI), Helmholtz-Center for Infection Research (HZI), D-97080 Würzburg, Germany
- Peter F. Stadler
- Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstraße 16-18, D-04107 Leipzig, Germany
- German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Competence Center for Scalable Data Services and Solutions, and Leipzig Research Center for Civilization Diseases, University Leipzig, D-04103 Leipzig, Germany
- Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, D-04103 Leipzig, Germany
- Institute for Theoretical Chemistry, University of Vienna, Währingerstraße 17, A-1090 Vienna, Austria
- Facultad de Ciencias, Universidad Nacional de Colombia, Bogotá CO-111321, Colombia
- Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM 87501, USA
- Sven Findeiß
- Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstraße 16-18, D-04107 Leipzig, Germany
- Jörg Fallmann
- Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstraße 16-18, D-04107 Leipzig, Germany
8
Javaid N, Choi S. CRISPR/Cas System and Factors Affecting Its Precision and Efficiency. Front Cell Dev Biol 2021; 9:761709. [PMID: 34901007 PMCID: PMC8652214 DOI: 10.3389/fcell.2021.761709] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 08/20/2021] [Accepted: 11/01/2021] [Indexed: 12/20/2022] Open
Abstract
The diverse applications of genetically modified cells and organisms require more precise and efficient genome-editing tools such as clustered regularly interspaced short palindromic repeats/CRISPR-associated protein (CRISPR/Cas). The CRISPR/Cas system was originally discovered in bacteria as part of an adaptive immune system with multiple types. Its engineered versions rely on multiple host DNA-repair pathways to perform genome editing in host cells. However, it is still challenging to achieve maximum genome-editing efficiency with few or no off-target effects. Here, we focus on factors affecting the genome-editing efficiency and precision of the CRISPR/Cas system, along with its defense mechanism, orthologues, and applications.
Affiliation(s)
- Nasir Javaid
- Department of Molecular Science and Technology, Ajou University, Suwon, South Korea
- Sangdun Choi
- Department of Molecular Science and Technology, Ajou University, Suwon, South Korea
- S&K Therapeutics, Ajou University Campus Plaza, Suwon, South Korea
9
Zhao P, Du H, Jiang L, Zheng X, Feng W, Diao C, Zhou L, Liu GE, Zhang H, Chamba Y, Zhang Q, Li B, Liu JF. PRE-1 Revealed Previous Unknown Introgression Events in Eurasian Boars during the Middle Pleistocene. Genome Biol Evol 2021; 12:1751-1764. [PMID: 33151306 PMCID: PMC7643367 DOI: 10.1093/gbe/evaa142] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Accepted: 07/03/2020] [Indexed: 12/22/2022] Open
Abstract
Introgression events and population admixture occurred among Sus species across the Eurasian mainland in the Middle Pleistocene, which reflects the local adaption of different populations and contributes to evolutionary novelty. Previous findings on these population introgressions were largely based on extensive genome-wide single-nucleotide polymorphism information, ignoring structural variants (SVs) as an important alternative resource of genetic variations. Here, we profiled the genome-wide SVs and explored the formation of pattern-related SVs, indicating that PRE1-SS is a recently active subfamily that was strongly associated with introgression events in multiple Asian and European pig populations. As reflected by the three different combination haplotypes from two specific patterns and known phylogenetic relationships in Eurasian boars, we identified the Asian Northern wild pigs as having experienced introgression from European wild boars around 0.5–0.2 Ma and having received latitude-related selection. During further exploration of the influence of pattern-related SVs on gene functions, we found substantial sequence changes in 199 intron regions of 54 genes and 3 exon regions of 3 genes (HDX, TRO, and SMIM1), implying that the pattern-related SVs were highly related to positive selection and adaption of pigs. Our findings revealed novel introgression events in Eurasian wild boars, providing a timeline of population admixture and divergence across the Eurasian mainland in the Middle Pleistocene.
Affiliation(s)
- Pengju Zhao
- National Engineering Laboratory for Animal Breeding; Key Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture; College of Animal Science and Technology, China Agricultural University, Beijing, China
- Heng Du
- National Engineering Laboratory for Animal Breeding; Key Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture; College of Animal Science and Technology, China Agricultural University, Beijing, China
- Lin Jiang
- Institute of Animal Science, Chinese Academy of Agricultural Sciences (CAAS), Beijing, China
- Xianrui Zheng
- National Engineering Laboratory for Animal Breeding; Key Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture; College of Animal Science and Technology, China Agricultural University, Beijing, China
- Wen Feng
- National Engineering Laboratory for Animal Breeding; Key Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture; College of Animal Science and Technology, China Agricultural University, Beijing, China
- Chenguang Diao
- National Engineering Laboratory for Animal Breeding; Key Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture; College of Animal Science and Technology, China Agricultural University, Beijing, China
- Lei Zhou
- National Engineering Laboratory for Animal Breeding; Key Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture; College of Animal Science and Technology, China Agricultural University, Beijing, China
- George E Liu
- Animal Genomics and Improvement Laboratory, BARC, USDA-ARS, Maryland
- Hao Zhang
- National Engineering Laboratory for Animal Breeding; Key Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture; College of Animal Science and Technology, China Agricultural University, Beijing, China
- Yangzom Chamba
- College of Animal Science and Technology, Tibet Agriculture and Animal Husbandry College, Linzhi, Tibet, China
- Qin Zhang
- National Engineering Laboratory for Animal Breeding; Key Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture; College of Animal Science and Technology, China Agricultural University, Beijing, China; College of Animal Science and Technology, Shandong Agricultural University, Taian, Shandong, PR China
- Bugao Li
- Department of Animal Sciences and Veterinary Medicine, Shanxi Agricultural University, Taigu, China
- Jian-Feng Liu
- National Engineering Laboratory for Animal Breeding; Key Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture; College of Animal Science and Technology, China Agricultural University, Beijing, China
10
Gao NL, He Z, Zhu Q, Jiang P, Hu S, Chen WH. Selection for Cheaper Amino Acids Drives Nucleotide Usage at the Start of Translation in Eukaryotic Genes. Genomics Proteomics Bioinformatics 2021; 19:949-957. [PMID: 33741525 PMCID: PMC9403032 DOI: 10.1016/j.gpb.2021.03.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2018] [Revised: 05/30/2019] [Accepted: 08/18/2019] [Indexed: 12/04/2022]
Abstract
Coding regions have complex interactions among multiple selective forces, which are manifested as biases in nucleotide composition. Previous studies have revealed a decreasing GC gradient from the 5′-end to 3′-end of coding regions in various organisms. We confirmed that this gradient is universal in eukaryotic genes, but the decrease only starts from the ∼ 25th codon. This trend is mostly found in nonsynonymous (ns) sites at which the GC gradient is universal across the eukaryotic genome. Increased GC contents at ns sites result in cheaper amino acids, indicating a universal selection for energy efficiency toward the N-termini of encoded proteins. Within a genome, the decreasing GC gradient is intensified from lowly to highly expressed genes (more and more protein products), further supporting this hypothesis. This reveals a conserved selective constraint for cheaper amino acids at the translation start that drives the increased GC contents at ns sites. Elevated GC contents can facilitate transcription but result in a more stable local secondary structure around the start codon and subsequently impede translation initiation. Conversely, the GC gradients at four-fold and two-fold synonymous sites vary across species. They could decrease or increase, suggesting different constraints acting at the GC contents of different codon sites in different species. This study reveals that the overall GC contents at the translation start are consequences of complex interactions among several major biological processes that shape the nucleotide sequences, especially efficient energy usage.
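As a rough illustration of the GC gradient described in this abstract, the sketch below averages GC content at each codon index across a set of coding sequences. It is not the authors' pipeline: it pools all three codon positions instead of separating nonsynonymous from synonymous sites, and the function name, the `max_codons` parameter, and the toy sequences are assumptions made for the example.

```python
from typing import List, Optional


def gc_by_codon_index(cds_list: List[str],
                      max_codons: int = 50) -> List[Optional[float]]:
    """Mean GC fraction at each codon index (0 = start codon).

    Sequences shorter than max_codons simply contribute nothing to the
    later indices; an index with no data is reported as None.
    """
    gc_counts = [0] * max_codons
    base_counts = [0] * max_codons
    for cds in cds_list:
        seq = cds.upper()
        n_codons = min(len(seq) // 3, max_codons)  # ignore trailing partial codon
        for i in range(n_codons):
            codon = seq[3 * i:3 * i + 3]
            gc_counts[i] += sum(base in "GC" for base in codon)
            base_counts[i] += 3
    return [gc / total if total else None
            for gc, total in zip(gc_counts, base_counts)]


if __name__ == "__main__":
    # Two toy coding sequences (hypothetical, not real genes).
    profile = gc_by_codon_index(["ATGGCC", "ATGAAA"], max_codons=3)
    print(profile)
```

Plotting such a profile against codon index for a real gene set would show whether GC content stays elevated over the first codons before declining, which is the pattern the abstract reports starting around the 25th codon.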
Affiliation(s)
- Na L Gao
- Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China; Institute for Computer Science and Cluster of Excellence on Plant Sciences, Heinrich Heine University, Duesseldorf 40225, Germany
- Zilong He
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100029, China; State Key Laboratory of Microbial Resources, Institute of Microbiology, Chinese Academy of Sciences, Beijing 100101, China; Beijing Advanced Innovation Center for Big Data-Based Precision Medicine, Interdisciplinary Innovation Institute of Medicine and Engineering, Beihang University, Beijing 100191, China
- Qianhui Zhu
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100029, China; State Key Laboratory of Microbial Resources, Institute of Microbiology, Chinese Academy of Sciences, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Puzi Jiang
- Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
- Songnian Hu
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100029, China; State Key Laboratory of Microbial Resources, Institute of Microbiology, Chinese Academy of Sciences, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China.
- Wei-Hua Chen
- Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China.
11
Auboeuf D. Physicochemical Foundations of Life that Direct Evolution: Chance and Natural Selection are not Evolutionary Driving Forces. Life (Basel) 2020; 10:7. [PMID: 31973071 PMCID: PMC7175370 DOI: 10.3390/life10020007] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 11/22/2019] [Revised: 01/15/2020] [Accepted: 01/16/2020] [Indexed: 12/11/2022] Open
Abstract
The current framework of evolutionary theory postulates that evolution relies on random mutations generating a diversity of phenotypes on which natural selection acts. This framework was established using a top-down approach as it originated from Darwinism, which is based on observations made of complex multicellular organisms and, then, modified to fit a DNA-centric view. In this article, it is argued that based on a bottom-up approach starting from the physicochemical properties of nucleic and amino acid polymers, we should reject the assumptions that (i) natural selection plays a dominant role in evolution and (ii) the probability of mutations is independent of the generated phenotype. It is shown that the adaptation of a phenotype to an environment does not correspond to organism fitness, but rather corresponds to maintaining the genome stability and integrity. In a stable environment, the phenotype maintains the stability of its originating genome and both (genome and phenotype) are reproduced identically. In an unstable environment (i.e., corresponding to variations in physicochemical parameters above a physiological range), the phenotype no longer maintains the stability of its originating genome, but instead influences its variations. Indeed, environment- and cellular-dependent physicochemical parameters define the probability of mutations in terms of frequency, nature, and location in a genome. Evolution is non-deterministic because it relies on probabilistic physicochemical rules, and evolution is driven by a bidirectional interplay between genome and phenotype in which the phenotype ensures the stability of its originating genome in a cellular and environmental physicochemical parameter-dependent manner.
Affiliation(s)
- Didier Auboeuf
- Laboratory of Biology and Modelling of the Cell, Univ Lyon, ENS de Lyon, Univ Claude Bernard, CNRS UMR 5239, INSERM U1210, 46 Allée d'Italie, Site Jacques Monod, F-69007, Lyon, France
|
12
|
Zenil H, Minary P. Training-free measures based on algorithmic probability identify high nucleosome occupancy in DNA sequences. Nucleic Acids Res 2019; 47:e129. [PMID: 31511887 PMCID: PMC6846163 DOI: 10.1093/nar/gkz750] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 07/10/2019] [Revised: 07/10/2019] [Accepted: 08/27/2019] [Indexed: 01/01/2023] Open
Abstract
We introduce and study a set of training-free methods of an information-theoretic and algorithmic complexity nature that we apply to DNA sequences to assess their potential to identify nucleosomal binding sites. We test the measures on well-studied genomic sequences of different sizes drawn from different sources. The measures reveal the known in vivo versus in vitro predictive discrepancies and uncover their potential to pinpoint high and low nucleosome occupancy. We explore different possible signals within and beyond the nucleosome length and find that the complexity indices are informative of nucleosome occupancy. We found that, while it is clear that the gold standard Kaplan model is driven by GC content (by design) and by k-mer training, for high occupancy, entropy and complexity-based scores are also informative and can complement the Kaplan model.
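As a rough illustration of what a training-free, entropy-based score can look like, the sketch below computes the Shannon entropy of the k-mer distribution in sliding windows along a sequence. It is not the authors' implementation: the function names, the default window of 147 bp (chosen here only because it is close to the nucleosome core length), and the choice of k are assumptions made for this example.

```python
import math
from collections import Counter


def kmer_entropy(seq, k=2):
    """Shannon entropy (in bits) of the k-mer frequency distribution of seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    counts = Counter(kmers)
    total = len(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def sliding_entropy(seq, window=147, step=1, k=2):
    """Return (start, entropy) pairs for each window along seq.

    window, step and k are illustrative defaults, not values from the paper.
    """
    return [
        (start, kmer_entropy(seq[start:start + window], k))
        for start in range(0, len(seq) - window + 1, step)
    ]
```

Because no parameters are fitted, a profile like this can be computed for any sequence and then compared with measured occupancy.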
Affiliation(s)
- Hector Zenil
- Oxford Immune Algorithmics, Oxford University Innovation, Oxford, UK
- Algorithmic Dynamics Lab, Unit of Computational Medicine, SciLifeLab, Center for Molecular Medicine, Karolinska Institute, Stockholm, Sweden
- Algorithmic Nature Group, LABORES for the Natural and Digital Sciences, Paris, France
- Department of Computer Science, University of Oxford, Oxford, UK
| | - Peter Minary
- Department of Computer Science, University of Oxford, Oxford, UK
|
13
|
Piovesan A, Pelleri MC, Antonaros F, Strippoli P, Caracausi M, Vitale L. On the length, weight and GC content of the human genome. BMC Res Notes 2019; 12:106. [PMID: 30813969 PMCID: PMC6391780 DOI: 10.1186/s13104-019-4137-z] [Citation(s) in RCA: 114] [Impact Index Per Article: 22.8] [Received: 12/07/2018] [Accepted: 02/15/2019] [Indexed: 01/08/2023] Open
Abstract
Objective Basic parameters commonly used to describe genomes including length, weight and relative guanine-cytosine (GC) content are widely cited in the absence of a primary source. By using updated data and original software we determined these values to the best of our knowledge as a standard reference for the whole human nuclear genome, for each chromosome and for mitochondrial DNA. We also devised a method to calculate the relative GC content in the whole messenger RNA sequence set and in transcriptomes by multiplying the GC content of each gene by its mean expression level. Results The male nuclear diploid genome extends for 6.27 Gigabase pairs (Gbp), is 205.00 centimeters (cm) long and weighs 6.41 picograms (pg). Female values are 6.37 Gbp, 208.23 cm, 6.51 pg. The individual variability and the implication for the DNA informational density in terms of bits/volume were discussed. The genomic GC content is 40.9%. Following analysis in different transcriptomes and species, we showed that the greatest deviation was observed in the pathological condition analysed (trisomy 21 leukaemic cells) and in Caenorhabditis elegans. Our results may represent a solid basis for further investigation on human structural and functional genomics while also providing a framework for other genome comparative analysis.
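The expression-weighted GC content described above can be sketched as a weighted mean. This is a minimal example under stated assumptions, not the authors' software: the function names are hypothetical, and normalising by the sum of expression levels is an assumption, since the abstract only says that each gene's GC content is multiplied by its mean expression level.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def expression_weighted_gc(transcripts):
    """Weighted mean GC content over (sequence, mean_expression) pairs.

    Each gene's GC fraction is multiplied by its mean expression level;
    dividing by the total expression is an assumption of this sketch.
    """
    weighted_sum = 0.0
    total_expression = 0.0
    for seq, expression in transcripts:
        weighted_sum += gc_fraction(seq) * expression
        total_expression += expression
    if total_expression == 0:
        raise ValueError("total expression is zero")
    return weighted_sum / total_expression
```

For instance, a GC-rich gene expressed three times as strongly as an AT-rich one pulls the transcriptome-level value towards its own GC content.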
Affiliation(s)
- Allison Piovesan
- Department of Experimental, Diagnostic and Specialty Medicine (DIMES), Unit of Histology, Embryology and Applied Biology, University of Bologna, Via Belmeloro 8, 40126, Bologna, BO, Italy
| | - Maria Chiara Pelleri
- Department of Experimental, Diagnostic and Specialty Medicine (DIMES), Unit of Histology, Embryology and Applied Biology, University of Bologna, Via Belmeloro 8, 40126, Bologna, BO, Italy
| | - Francesca Antonaros
- Department of Experimental, Diagnostic and Specialty Medicine (DIMES), Unit of Histology, Embryology and Applied Biology, University of Bologna, Via Belmeloro 8, 40126, Bologna, BO, Italy
| | - Pierluigi Strippoli
- Department of Experimental, Diagnostic and Specialty Medicine (DIMES), Unit of Histology, Embryology and Applied Biology, University of Bologna, Via Belmeloro 8, 40126, Bologna, BO, Italy
| | - Maria Caracausi
- Department of Experimental, Diagnostic and Specialty Medicine (DIMES), Unit of Histology, Embryology and Applied Biology, University of Bologna, Via Belmeloro 8, 40126, Bologna, BO, Italy.
| | - Lorenza Vitale
- Department of Experimental, Diagnostic and Specialty Medicine (DIMES), Unit of Histology, Embryology and Applied Biology, University of Bologna, Via Belmeloro 8, 40126, Bologna, BO, Italy
|
14
|
Frank A, Froese T. The Standard Genetic Code can Evolve from a Two-Letter GC Code Without Information Loss or Costly Reassignments. Orig Life Evol Biosph 2018; 48:259-272. [PMID: 29959584 DOI: 10.1007/s11084-018-9559-4] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 04/10/2018] [Accepted: 06/21/2018] [Indexed: 11/27/2022]
Abstract
It is widely agreed that the standard genetic code must have been preceded by a simpler code that encoded fewer amino acids. How this simpler code could have expanded into the standard genetic code is not well understood because most changes to the code are costly. Taking inspiration from the recently synthesized six-letter code, we propose a novel hypothesis: the initial genetic code consisted of only two letters, G and C, and then expanded the number of available codons via the introduction of an additional pair of letters, A and U. Various lines of evidence, including the relative prebiotic abundance of the earliest assigned amino acids, the balance of their hydrophobicity, and the higher GC content in genome coding regions, indicate that the original two nucleotides were indeed G and C. This process of code expansion probably started with the third base, continued with the second base, and ended up as the standard genetic code when the second pair of letters was introduced into the first base. The proposed process is consistent with the available empirical evidence, and it uniquely avoids the problem of costly code changes by positing instead that the code expanded its capacity via the creation of new codons with extra letters.
Affiliation(s)
- Alejandro Frank
- Institute for Nuclear Sciences (ICN), National Autonomous University of Mexico (UNAM), Mexico City, Mexico
- Center for the Sciences of Complexity (C3), National Autonomous University of Mexico (UNAM), Mexico City, Mexico
- El Colegio Nacional, Mexico City, Mexico
| | - Tom Froese
- Center for the Sciences of Complexity (C3), National Autonomous University of Mexico (UNAM), Mexico City, Mexico.
- Institute for Applied Mathematics and Systems Research (IIMAS), National Autonomous University of Mexico (UNAM), Mexico City, Mexico.
|
15
|
The evolution of genomic and epigenomic features in two Pleurotus fungi. Sci Rep 2018; 8:8313. [PMID: 29844491 PMCID: PMC5974365 DOI: 10.1038/s41598-018-26619-7] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 11/30/2017] [Accepted: 04/23/2018] [Indexed: 12/17/2022] Open
Abstract
Pleurotus tuoliensis (Bailinggu, designated Pt) and P. eryngii var. eryngii (Xingbaogu, designated Pe) are highly valued edible mushrooms. We report de novo assemblies of high-quality genomes for both mushrooms based on PacBio RS II sequencing and annotation of all identified genes. A comparative genomics analysis between Pt and Pe with P. ostreatus as an outgroup taxon revealed extensive genomic divergence between the two mushroom genomes primarily due to the rapid gain of taxon-specific genes and disruption of synteny in either taxon. The re-appraised phylogenetic relationship between Pt and Pe at the genome-wide level validates earlier proposals to designate Pt as an independent species. Variation of the identified wood-decay-related gene content can largely explain the variable adaptation and host specificity of the two mushrooms. On the basis of the two assembled genome sequences, methylomes and the regulatory roles of DNA methylation in gene expression were characterized and compared. The genome, methylome and transcriptome data of these two important mushrooms will provide valuable information for advancing our understanding of the evolution of Pleurotus and related genera and for facilitating genome- and epigenome-based strategies for mushroom breeding.
|
16
|
Raphidocelis subcapitata (=Pseudokirchneriella subcapitata) provides an insight into genome evolution and environmental adaptations in the Sphaeropleales. Sci Rep 2018; 8:8058. [PMID: 29795299 PMCID: PMC5966456 DOI: 10.1038/s41598-018-26331-6] [Citation(s) in RCA: 37] [Impact Index Per Article: 6.2] [Received: 11/18/2017] [Accepted: 05/08/2018] [Indexed: 11/08/2022] Open
Abstract
The Sphaeropleales are a dominant group of green algae, which contain species important to freshwater ecosystems and those that have potential applied usages. In particular, Raphidocelis subcapitata is widely used worldwide for bioassays in toxicological risk assessments. However, there are few comparative genome analyses of the Sphaeropleales. To reveal genome evolution in the Sphaeropleales based on well-resolved phylogenetic relationships, nuclear, mitochondrial, and plastid genomes were sequenced in this study. The plastid genome provides insights into the phylogenetic relationships of R. subcapitata, which is located in the most basal lineage of the four species in the family Selenastraceae. The mitochondrial genome shows dynamic evolutionary histories with intron expansion in the Selenastraceae. The 51.2 Mbp nuclear genome of R. subcapitata, encoding 13,383 protein-coding genes, is more compact than the genome of its closely related oil-rich species, Monoraphidium neglectum (Selenastraceae), Tetradesmus obliquus (Scenedesmaceae), and Chromochloris zofingiensis (Chromochloridaceae); however, the four species share most of their genes. The Sphaeropleales possess a large number of genes for glycerolipid metabolism and sugar assimilation, which suggests that this order is capable of both heterotrophic and mixotrophic lifestyles in nature. Comparison of transporter genes suggests that the Sphaeropleales can adapt to different natural environmental conditions, such as salinity and low metal concentrations.
|
17
|
Li M, Ponce-Gordo F, Grim JN, Li C, Zou H, Li W, Wu S, Wang G. Morphological Redescription of Opalina undulata Nie 1932 from Fejervarya limnocharis with Molecular Phylogenetic Study of Opalinids (Heterokonta, Opalinea). J Eukaryot Microbiol 2018; 65:783-791. [DOI: 10.1111/jeu.12520] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 07/02/2017] [Revised: 03/20/2018] [Accepted: 03/23/2018] [Indexed: 11/26/2022]
Affiliation(s)
- Ming Li
- Key Laboratory of Aquaculture Disease Control; Ministry of Agriculture, and State Key Laboratory of Freshwater Ecology and Biotechnology; Institute of Hydrobiology; Chinese Academy of Sciences; Wuhan 430072 China
| | - Francisco Ponce-Gordo
- Departamento de Microbiología y Parasitología; Facultad de Farmacia; Universidad Complutense de Madrid; Plaza Ramón y Cajal s/n 28040 Madrid Spain
| | - J. Norman Grim
- Department of Biological Sciences; Northern Arizona University; Flagstaff Arizona 86011
| | - Can Li
- Key Laboratory of Aquaculture Disease Control; Ministry of Agriculture, and State Key Laboratory of Freshwater Ecology and Biotechnology; Institute of Hydrobiology; Chinese Academy of Sciences; Wuhan 430072 China
| | - Hong Zou
- Key Laboratory of Aquaculture Disease Control; Ministry of Agriculture, and State Key Laboratory of Freshwater Ecology and Biotechnology; Institute of Hydrobiology; Chinese Academy of Sciences; Wuhan 430072 China
| | - Wenxiang Li
- Key Laboratory of Aquaculture Disease Control; Ministry of Agriculture, and State Key Laboratory of Freshwater Ecology and Biotechnology; Institute of Hydrobiology; Chinese Academy of Sciences; Wuhan 430072 China
| | - Shangong Wu
- Key Laboratory of Aquaculture Disease Control; Ministry of Agriculture, and State Key Laboratory of Freshwater Ecology and Biotechnology; Institute of Hydrobiology; Chinese Academy of Sciences; Wuhan 430072 China
| | - Guitang Wang
- Key Laboratory of Aquaculture Disease Control; Ministry of Agriculture, and State Key Laboratory of Freshwater Ecology and Biotechnology; Institute of Hydrobiology; Chinese Academy of Sciences; Wuhan 430072 China
|
18
|
Labuhn M, Adams FF, Ng M, Knoess S, Schambach A, Charpentier EM, Schwarzer A, Mateo JL, Klusmann JH, Heckl D. Refined sgRNA efficacy prediction improves large- and small-scale CRISPR-Cas9 applications. Nucleic Acids Res 2018; 46:1375-1385. [PMID: 29267886 PMCID: PMC5814880 DOI: 10.1093/nar/gkx1268] [Citation(s) in RCA: 161] [Impact Index Per Article: 26.8] [Received: 09/12/2017] [Revised: 11/27/2017] [Accepted: 12/11/2017] [Indexed: 12/26/2022] Open
Abstract
Genome editing with the CRISPR-Cas9 system has enabled unprecedented efficacy for reverse genetics and gene correction approaches. While off-target effects have been successfully tackled, the effort to eliminate variability in sgRNA efficacies, which affects experimental sensitivity, is in its infancy. To address this issue, studies have analyzed the molecular features of highly active sgRNAs, but independent cross-validation is lacking. Utilizing fluorescent reporter knock-out assays with verification at selected endogenous loci, we experimentally quantified the target efficacies of 430 sgRNAs. Based on this dataset we tested the predictive value of five recently established prediction algorithms. Our analysis revealed a moderate correlation (r = 0.04 to r = 0.20) between the predicted and measured activity of the sgRNAs, and modest concordance between the different algorithms. We uncovered a strong PAM-distal GC-content-dependent activity, which enabled the exclusion of inactive sgRNAs. By deriving nine additional predictive features we generated a linear model-based discrete system for the efficient selection (r = 0.4) of effective sgRNAs (CRISPRater). We proved our algorithms' efficacy on small and large external datasets, and provide a versatile combined on- and off-target sgRNA scanning platform. Altogether, our study highlights current issues and efforts in sgRNA efficacy prediction, and provides an easily applicable discrete system for selecting efficient sgRNAs.
Affiliation(s)
- Maurice Labuhn
- Pediatric Hematology & Oncology, Hannover Medical School, Hannover, Germany
| | - Felix F Adams
- Institute of Experimental Hematology, Hannover Medical School, Hannover, Germany
| | - Michelle Ng
- Pediatric Hematology & Oncology, Hannover Medical School, Hannover, Germany
| | - Sabine Knoess
- Pediatric Hematology & Oncology, Hannover Medical School, Hannover, Germany
| | - Axel Schambach
- Institute of Experimental Hematology, Hannover Medical School, Hannover, Germany
- REBIRTH Cluster of Excellence, Hannover Medical School, Hannover, Germany
| | - Emmanuelle M Charpentier
- Department of Regulation in Infection Biology, Max Planck Institute for Infection Biology, Berlin, Germany
- The Laboratory for Molecular Infection Medicine Sweden, Umeå University, Umeå, Sweden
| | - Adrian Schwarzer
- Institute of Experimental Hematology, Hannover Medical School, Hannover, Germany
- Department of Hematology, Hemostasis, Oncology and Stem Cell Transplantation, Hannover Medical School, Hannover, Germany
| | - Juan L Mateo
- Centre for Organismal Studies (COS), Heidelberg University, Heidelberg, Germany
- Department of Information Technology, University of Oviedo, Oviedo, Asturias, Spain
| | - Jan-Henning Klusmann
- Pediatric Hematology & Oncology, Hannover Medical School, Hannover, Germany
- Department of Pediatrics I, Pediatric Hematology and Oncology, University of Halle, Halle, Germany
| | - Dirk Heckl
- Pediatric Hematology & Oncology, Hannover Medical School, Hannover, Germany
|
19
|
Sievers A, Bosiek K, Bisch M, Dreessen C, Riedel J, Froß P, Hausmann M, Hildenbrand G. K-mer Content, Correlation, and Position Analysis of Genome DNA Sequences for the Identification of Function and Evolutionary Features. Genes (Basel) 2017; 8:E122. [PMID: 28422050 PMCID: PMC5406869 DOI: 10.3390/genes8040122] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 02/10/2017] [Revised: 03/24/2017] [Accepted: 04/04/2017] [Indexed: 12/26/2022] Open
Abstract
In genome analysis, k-mer-based comparison methods have become standard tools. However, even though they are able to deliver reliable results, other algorithms seem to work better in some cases. To improve k-mer-based DNA sequence analysis and comparison, we successfully checked whether adding positional resolution is beneficial for finding and/or comparing interesting organizational structures. A simple but efficient algorithm for extracting and saving local k-mer spectra (frequency distribution of k-mers) was developed and used. The results were analyzed by including positional information based on visualizations as genomic maps and by applying basic vector correlation methods. This analysis was concentrated on small word lengths (1 ≤ k ≤ 4) on relatively small viral genomes of Papillomaviridae and Herpesviridae, while also checking its usability for larger sequences, namely human chromosome 2 and the homologous chromosomes (2A, 2B) of a chimpanzee. Using this alignment-free analysis, several regions with specific characteristics in Papillomaviridae and Herpesviridae formerly identified by independent, mostly alignment-based methods, were confirmed. Correlations between the k-mer content and several genes in these genomes have been found, showing similarities between classified and unclassified viruses, which may be potentially useful for further taxonomic research. Furthermore, unknown k-mer correlations in the genomes of Human Herpesviruses (HHVs), which are probably of major biological function, are found and described. Using the chromosomes of a chimpanzee and human that are currently known, identities between the species on every analyzed chromosome were reproduced. This demonstrates the feasibility of our approach for large data sets of complex genomes. 
Based on these results, we suggest k-mer analysis with positional resolution as a method for closing a gap between the effectiveness of alignment-based methods (like NCBI BLAST) and the high pace of standard k-mer analysis.
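The idea of local k-mer spectra with positional resolution, compared by basic vector correlation, can be sketched as follows. This is an illustrative reading of the abstract rather than the published algorithm: the function names, the fixed-size sliding window, and the use of Pearson correlation as the "basic vector correlation method" are assumptions.

```python
import math
from collections import Counter
from itertools import product


def kmer_spectrum(seq, k):
    """Relative frequencies of all 4**k DNA k-mers in seq, in lexicographic order."""
    keys = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[key] for key in keys)  # ignores k-mers with ambiguous bases
    return [counts[key] / total if total else 0.0 for key in keys]


def local_spectra(seq, k, window, step):
    """Return (start, spectrum) pairs, keeping the position of each window."""
    return [
        (start, kmer_spectrum(seq[start:start + window], k))
        for start in range(0, len(seq) - window + 1, step)
    ]


def pearson(u, v):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return cov / (norm_u * norm_v)
```

Correlating the spectrum of one window against all others gives a position-resolved similarity map, and small k (1 to 4, as in the abstract) keeps each spectrum at no more than 256 entries.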
Affiliation(s)
- Aaron Sievers
- Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany.
| | - Katharina Bosiek
- Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany.
| | - Marc Bisch
- Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany.
| | - Chris Dreessen
- Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany.
| | - Jascha Riedel
- Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany.
| | - Patrick Froß
- Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany.
| | - Michael Hausmann
- Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany.
| | - Georg Hildenbrand
- Kirchhoff-Institute for Physics, Heidelberg University, INF 227, 69117 Heidelberg, Germany.
- Department of Radiation Oncology, Universitätsmedizin Mannheim, Medical Faculty Mannheim, Heidelberg University, Theodor-Kutzer-Ufer 1-3, 68167 Mannheim, Germany.
|
20
|
Alkhateeb A, Rueda L. Zseq: An Approach for Preprocessing Next-Generation Sequencing Data. J Comput Biol 2017; 24:746-755. [PMID: 28414515 PMCID: PMC5563921 DOI: 10.1089/cmb.2017.0021] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Indexed: 01/10/2023] Open
Abstract
Next-generation sequencing technology generates a huge number of reads (short sequences), which contain a vast amount of genomic data. The sequencing process, however, comes with artifacts. Preprocessing of sequences is mandatory for further downstream analysis. We present Zseq, a linear method that identifies the most informative genomic sequences and reduces the number of biased sequences, sequence duplications, and ambiguous nucleotides. Zseq finds the complexity of the sequences by counting the number of unique k-mers in each sequence as its corresponding score and also takes into account other factors such as ambiguous nucleotides or high GC-content percentage in k-mers. Based on a z-score threshold, Zseq sweeps through the sequences again and filters those with a z-score less than the user-defined threshold. The Zseq algorithm is able to provide a better mapping rate; it reduces the number of ambiguous bases significantly in comparison with other methods. Evaluation of the filtered reads has been conducted by aligning the reads and assembling the transcripts using the reference genome as well as de novo assembly. The assembled transcripts show a better discriminative ability to separate cancer and normal samples in comparison with another state-of-the-art method. Moreover, de novo assembled transcripts from the reads filtered by Zseq have longer genomic sequences than other tested methods. Estimating the threshold of the cutoff point is introduced using labeling rules with optimistic results.
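The two-pass scheme described in the abstract (score each read by its number of unique k-mers, then drop reads whose z-score falls below a threshold) can be sketched as below. This is a simplified illustration, not the Zseq code: the function names are hypothetical, k-mers containing N are simply skipped as a stand-in for the handling of ambiguous nucleotides, and the GC-content factor mentioned in the abstract is left out.

```python
import statistics


def unique_kmer_score(read, k):
    """Number of distinct k-mers in read, skipping k-mers that contain N (assumption)."""
    return len({
        read[i:i + k]
        for i in range(len(read) - k + 1)
        if "N" not in read[i:i + k]
    })


def zscore_filter(reads, k=3, threshold=-1.0):
    """Keep reads whose unique-k-mer z-score is at least threshold.

    First pass: score every read. Second pass: filter by z-score.
    k and threshold are illustrative defaults, not values from the paper.
    """
    scores = [unique_kmer_score(read, k) for read in reads]
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    if sd == 0:
        return list(reads)  # all reads equally complex; nothing to filter
    return [read for read, s in zip(reads, scores) if (s - mean) / sd >= threshold]
```

Both passes are linear in the total read length, which matches the abstract's description of Zseq as a linear method.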
Affiliation(s)
| | - Luis Rueda
- School of Computer Science, University of Windsor, Windsor, Canada
|
21
|
Zhu Y, Chen L, Zhang C, Hao P, Jing X, Li X. Global transcriptome analysis reveals extensive gene remodeling, alternative splicing and differential transcription profiles in non-seed vascular plant Selaginella moellendorffii. BMC Genomics 2017; 18:1042. [PMID: 28198676 PMCID: PMC5310277 DOI: 10.1186/s12864-016-3266-1] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Indexed: 11/24/2022] Open
Abstract
Background Selaginella moellendorffii, a lycophyte, is a model plant to study the early evolution and development of vascular plants. As the first and only sequenced lycophyte to date, the genome of S. moellendorffii revealed many conserved genes and pathways, as well as specialized genes different from flowering plants. Despite the progress made, little is known about long noncoding RNAs (lncRNA) and the alternative splicing (AS) of coding genes in S. moellendorffii. Its coding gene models have not been fully validated with transcriptome data. Furthermore, it remains important to understand whether the regulatory mechanisms similar to flowering plants are used, and how they operate in a non-seed primitive vascular plant. Results RNA-sequencing (RNA-seq) was performed for three S. moellendorffii tissues, root, stem, and leaf, by constructing strand-specific RNA-seq libraries from RNA purified using RiboMinus isolation protocol. A total of 176 million reads (44 Gbp) were obtained from three tissue types, and were mapped to S. moellendorffii genome. By comparing with 22,285 existing gene models of S. moellendorffii, we identified 7930 high-confidence novel coding genes (a 35.6% increase), and for the first time reported 4422 lncRNAs in a lycophyte. Further, we refined 2461 (11.0%) of existing gene models, and identified 11,030 AS events (for 5957 coding genes) revealed for the first time for lycophytes. Tissue-specific gene expression with functional implication was analyzed, and 1031, 554, and 269 coding genes, and 174, 39, and 17 lncRNAs were identified in root, stem, and leaf tissues, respectively. The expression of critical genes for vascular development stages, i.e. formation of provascular cells, xylem specification and differentiation, and phloem specification and differentiation, was compared in S. moellendorffii tissues, indicating a less complex regulatory mechanism in lycophytes than in flowering plants. 
The results were further strengthened by the evolutionary trend of seven transcription factor families related to vascular development, which was observed among four representative species of seed and non-seed vascular plants, and nonvascular land and aquatic plants. Conclusions The deep RNA-seq study of S. moellendorffii discovered extensive new gene contents, including novel coding genes, lncRNAs, AS events, and refined gene models. Compared to flowering vascular plants, S. moellendorffii displayed less complexity in gene structure, alternative splicing, and regulatory elements of vascular development. The study offered important insight into the evolution of vascular plants, and the regulation mechanism of vascular development in a non-seed plant.
Affiliation(s)
- Yan Zhu
- Key Laboratory of Synthetic Biology, CAS Center for Excellence in Molecular Plant Sciences, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, 200032, China
| | - Longxian Chen
- Key Laboratory of Synthetic Biology, CAS Center for Excellence in Molecular Plant Sciences, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, 200032, China; University of Chinese Academy of Sciences, Beijing, 100049, China
| | - Chengjun Zhang
- Germplasm Bank of Wild species in Southwest China, Kunming Institute of Botany, Chinese Academy of Science, Kunming, Yunnan, 650201, China
| | - Pei Hao
- Key Laboratory of Molecular Virology and Immunology, Institute Pasteur of Shanghai, Chinese Academy of Sciences, Shanghai, 200031, China
| | - Xinyun Jing
- Key Laboratory of Synthetic Biology, CAS Center for Excellence in Molecular Plant Sciences, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, 200032, China.
| | - Xuan Li
- Key Laboratory of Synthetic Biology, CAS Center for Excellence in Molecular Plant Sciences, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, 200032, China.
|
22
|
Lau YL, Lee WC, Xia J, Zhang G, Razali R, Anwar A, Fong MY. Draft genome of Brugia pahangi: high similarity between B. pahangi and B. malayi. Parasit Vectors 2015; 8:451. [PMID: 26350613 PMCID: PMC4562187 DOI: 10.1186/s13071-015-1064-2] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 07/10/2015] [Accepted: 09/01/2015] [Indexed: 12/19/2022] Open
Abstract
BACKGROUND Efforts to completely eradicate lymphatic filariasis from human population may be challenged by the emergence of Brugia pahangi as another zoonotic lymphatic filarial nematode. In this report, a genomic study was conducted to understand this species at molecular level. METHODS After blood meal on a B. pahangi-harbouring cat, the Aedes togoi mosquitoes were maintained to harvest infective third stage larvae, which were then injected into male Mongolian gerbils. Subsequently, adult B. pahangi were obtained from the infected gerbil for genomic DNA extraction. Sequencing and subsequently, construction of genomic libraries were performed. This was followed by genomic analyses and gene annotation analysis. By using archived protein sequences of B. malayi and a few other nematodes, clustering of gene orthologs and phylogenetics were conducted. RESULTS A total of 9687 coding genes were predicted. The genome of B. pahangi shared high similarity to that B. malayi genome, particularly genes annotated to fundamental processes. Nevertheless, 166 genes were considered to be unique to B. pahangi, which may be responsible for the distinct properties of B. pahangi as compared to other filarial nematodes. In addition, 803 genes were deduced to be derived from Wolbachia, an endosymbiont bacterium, with 44 of these genes intercalate into the nematode genome. CONCLUSIONS The reporting of B. pahangi draft genome contributes to genomic archive. Albeit with high similarity to B. malayi genome, the B. pahangi-unique genes found in this study may serve as new focus to study differences in virulence, vector selection and host adaptability among different Brugia spp.
Affiliation(s)
- Yee-Ling Lau
- Department of Parasitology, Faculty of Medicine, University of Malaya, 50603, Kuala Lumpur, Malaysia.
- Wenn-Chyau Lee
- Singapore Immunology Network (SIgN), Agency for Science, Technology and Research (A*STAR), Singapore, 138648, Singapore
- Rozaimi Razali
- Sengenics HIR, University of Malaya, 50603, Kuala Lumpur, Malaysia
- Arif Anwar
- Sengenics HIR, University of Malaya, 50603, Kuala Lumpur, Malaysia
- Mun-Yik Fong
- Department of Parasitology, Faculty of Medicine, University of Malaya, 50603, Kuala Lumpur, Malaysia
23
Sadat A, Jeon J, Mir AA, Kim S, Choi J, Lee YH. Analysis of in planta Expressed Orphan Genes in the Rice Blast Fungus Magnaporthe oryzae. Plant Pathol J 2014; 30:367-74. [PMID: 25506301 PMCID: PMC4262289 DOI: 10.5423/ppj.oa.08.2014.0072] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 08/01/2014] [Revised: 08/25/2014] [Accepted: 08/29/2014] [Indexed: 06/04/2023]
Abstract
Genomes contain a large number of unique genes which have not been found in other species. Although the origin of such "orphan" genes remains unclear, they are thought to be involved in species-specific adaptive processes. Here, we analyzed seven orphan genes (MoSPC1 to MoSPC7) prioritized based on in planta expressed sequence tag data in the rice blast fungus, Magnaporthe oryzae. Expression analysis using qRT-PCR confirmed the expression of four genes (MoSPC1, MoSPC2, MoSPC3 and MoSPC7) during plant infection. However, individual deletion mutants of these four genes did not differ from the wild-type strain for all phenotypes examined, including pathogenicity. The length, GC contents, codon adaptation index and expression during mycelial growth of the four genes suggest that these genes formed during the evolutionary history of M. oryzae. Synteny analyses using closely related fungal species corroborated the notion that these genes evolved de novo in the M. oryzae genome. In this report, we discuss our inability to detect phenotypic changes in the four deletion mutants. Based on these results, the four orphan genes may be products of de novo gene birth processes, and their adaptive potential is in the course of being tested for retention or extinction through natural selection.
Affiliation(s)
- Abu Sadat
- Department of Agricultural Biotechnology, Plant Genomics and Breeding Institute, and Research Institute for Agriculture and Life Sciences, Seoul National University, Seoul 151-921, Korea
- Junhyun Jeon
- Department of Agricultural Biotechnology, Plant Genomics and Breeding Institute, and Research Institute for Agriculture and Life Sciences, Seoul National University, Seoul 151-921, Korea
- Albely Afifa Mir
- Department of Agricultural Biotechnology, Plant Genomics and Breeding Institute, and Research Institute for Agriculture and Life Sciences, Seoul National University, Seoul 151-921, Korea
- Seongbeom Kim
- Department of Agricultural Biotechnology, Plant Genomics and Breeding Institute, and Research Institute for Agriculture and Life Sciences, Seoul National University, Seoul 151-921, Korea
- Jaeyoung Choi
- Department of Agricultural Biotechnology, Plant Genomics and Breeding Institute, and Research Institute for Agriculture and Life Sciences, Seoul National University, Seoul 151-921, Korea
- Yong-Hwan Lee
- Department of Agricultural Biotechnology, Plant Genomics and Breeding Institute, and Research Institute for Agriculture and Life Sciences, Seoul National University, Seoul 151-921, Korea
- Center for Fungal Pathogenesis, Plant Genomics and Breeding Institute, and Research Institute for Agriculture and Life Sciences, Seoul National University, Seoul 151-921, Korea
- Center for Fungal Genetic Resources, Plant Genomics and Breeding Institute, and Research Institute for Agriculture and Life Sciences, Seoul National University, Seoul 151-921, Korea
24
Jiang N, Wang L, Chen J, Wang L, Leach L, Luo Z. Conserved and divergent patterns of DNA methylation in higher vertebrates. Genome Biol Evol 2014; 6:2998-3014. [PMID: 25355807 PMCID: PMC4255770 DOI: 10.1093/gbe/evu238] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Accepted: 10/20/2014] [Indexed: 02/07/2023] Open
Abstract
DNA methylation in the genome plays a fundamental role in the regulation of gene expression and is widespread in the genome of eukaryotic species. For example, in higher vertebrates, there is a "global" methylation pattern involving complete methylation of CpG sites genome-wide, except in promoter regions that are typically enriched for CpG dinucleotides, or so called "CpG islands." Here, we comprehensively examined and compared the distribution of CpG sites within ten model eukaryotic species and linked the observed patterns to the role of DNA methylation in controlling gene transcription. The analysis revealed two distinct but conserved methylation patterns for gene promoters in human and mouse genomes, involving genes with distinct distributions of promoter CpGs and gene expression patterns. Comparative analysis with four other higher vertebrates revealed that the primary regulatory role of the DNA methylation system is highly conserved in higher vertebrates.
Affiliation(s)
- Ning Jiang
- Department of Biostatistics & Computational Biology, SKLG, School of Life Sciences, Fudan University, Shanghai, China; School of Biosciences, The University of Birmingham, Birmingham B15 2TT United Kingdom
- Lin Wang
- Department of Biostatistics & Computational Biology, SKLG, School of Life Sciences, Fudan University, Shanghai, China
- Jing Chen
- School of Biosciences, The University of Birmingham, Birmingham B15 2TT United Kingdom
- Luwen Wang
- Department of Biostatistics & Computational Biology, SKLG, School of Life Sciences, Fudan University, Shanghai, China
- Lindsey Leach
- School of Biosciences, The University of Birmingham, Birmingham B15 2TT United Kingdom
- Zewei Luo
- Department of Biostatistics & Computational Biology, SKLG, School of Life Sciences, Fudan University, Shanghai, China; School of Biosciences, The University of Birmingham, Birmingham B15 2TT United Kingdom
25
Clustering of giant virus-DNA based on variations in local entropy. Viruses 2014; 6:2259-67. [PMID: 24887142 PMCID: PMC4074927 DOI: 10.3390/v6062259] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/05/2014] [Revised: 05/19/2014] [Accepted: 05/21/2014] [Indexed: 11/17/2022] Open
Abstract
We present a method for clustering genomic sequences based on variations in local entropy. We have analyzed the distributions of the block entropies of viruses and plant genomes. A distinct pattern for viruses and plant genomes is observed. These distributions, which describe the local entropic variability of the genomes, are used for clustering the genomes based on the Jensen-Shannon (JS) distances. The analysis of the JS distances between all genomes that infect the chlorella algae shows the host specificity of the viruses. We illustrate the efficacy of this entropy-based clustering technique by the segregation of plant and virus genomes into separate bins.
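The entropy-based comparison described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the block size, the number of histogram bins, the word length `k`, and the choice to take the square root of the base-2 Jensen-Shannon divergence as the "JS distance" are all assumptions made here for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def block_entropy(block, k=1):
    """Shannon entropy (in bits) of the k-mer frequencies within one block."""
    kmers = [block[i:i + k] for i in range(len(block) - k + 1)]
    counts = Counter(kmers)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_histogram(seq, block_size=100, bins=10, k=1):
    """Normalized histogram of block entropies along a sequence.

    block_size, bins and k are illustrative defaults, not values from the paper.
    """
    n_blocks = len(seq) // block_size
    if n_blocks == 0:
        raise ValueError("sequence is shorter than one block")
    max_h = 2.0 * k  # upper bound for a 4-letter alphabet: log2(4 ** k)
    hist = [0] * bins
    for b in range(n_blocks):
        h = block_entropy(seq[b * block_size:(b + 1) * block_size], k)
        idx = min(int(h / max_h * bins), bins - 1)
        hist[idx] += 1
    return [c / n_blocks for c in hist]


def js_distance(p, q):
    """Square root of the base-2 Jensen-Shannon divergence of two histograms."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x == 0 contribute nothing to the divergence.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return math.sqrt(max(0.0, 0.5 * kl(p, m) + 0.5 * kl(q, m)))
```

Pairwise `js_distance` values between the `entropy_histogram` outputs of several genomes give a distance matrix that any standard clustering routine could consume.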
26
Implications of human genome structural heterogeneity: functionally related genes tend to reside in organizationally similar genomic regions. BMC Genomics 2014; 15:252. [PMID: 24684786 PMCID: PMC4234528 DOI: 10.1186/1471-2164-15-252] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 08/15/2012] [Accepted: 03/21/2014] [Indexed: 01/30/2023] Open
Abstract
Background In an earlier study, we hypothesized that genomic segments with different sequence organization patterns (OPs) might display functional specificity despite their similar GC content. Here we tested this hypothesis by dividing the human genome into 100 kb segments, classifying these segments into five compositional groups according to GC content, and then characterizing each segment within the five groups by oligonucleotide counting (k-mer analysis; also referred to as compositional spectrum analysis, or CSA), to examine the distribution of sequence OPs in the segments. We performed the CSA on the entire DNA, i.e., its coding and non-coding parts, the latter being much more abundant in the genome than the former. Results We identified 38 OP-type clusters of segments that differ in their compositional spectrum (CS) organization. Many of the segments that shared the same OP type were enriched with genes related to the same biological processes (developmental, signaling, etc.), components of biochemical complexes, or organelles. Thirteen OP-type clusters showed significant enrichment in genes connected to specific gene-ontology terms. Some of these clusters seemed to reflect certain events during periods of horizontal gene transfer and genome expansion, and subsequent evolution of genomic regions requiring coordinated regulation. Conclusions There may be a tendency for genes that are involved in the same biological process, complex or organelle to use the same OP, even at a distance of ~100 kb from the genes. Although the intergenic DNA is non-coding, the general pattern of sequence organization (e.g., reflected in over-represented oligonucleotide “words”) may be important and was protected, to some extent, in the course of evolution.
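The segmentation step described in this abstract (fixed-length segments, grouping by GC content, then k-mer counting per segment) could be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the GC boundaries of the five compositional groups or the word length, so the `gc_edges` defaults and `k=3` below are hypothetical placeholders, as are all function names.

```python
from bisect import bisect_right
from collections import Counter


def gc_content(seq):
    """Fraction of G+C among unambiguous bases (0.0 if there are none)."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def kmer_spectrum(seq, k=3):
    """Relative frequencies of all k-mers made only of A, C, G, T."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    counts = {w: c for w, c in counts.items() if set(w) <= set("ACGT")}
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def segment_profiles(genome, segment_len=100_000, k=3,
                     gc_edges=(0.37, 0.41, 0.46, 0.53)):
    """Split a sequence into segments and profile each one.

    gc_edges are hypothetical group boundaries (four edges give five groups);
    the abstract does not state the thresholds actually used.
    Returns a list of (start, gc_group, spectrum) tuples.
    """
    profiles = []
    for start in range(0, len(genome) - segment_len + 1, segment_len):
        seg = genome[start:start + segment_len]
        group = bisect_right(gc_edges, gc_content(seg))
        profiles.append((start, group, kmer_spectrum(seg, k)))
    return profiles
```

Segments that fall in the same GC group can then be compared by their spectra, which is the point at which organization patterns with similar GC content would be told apart.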
27
Li XQ, Du D. Variation, evolution, and correlation analysis of C+G content and genome or chromosome size in different kingdoms and phyla. PLoS One 2014; 9:e88339. [PMID: 24551092 PMCID: PMC3923770 DOI: 10.1371/journal.pone.0088339] [Citation(s) in RCA: 63] [Impact Index Per Article: 6.3] [Received: 09/21/2013] [Accepted: 01/06/2014] [Indexed: 12/05/2022] Open
Abstract
C+G content (GC content or G+C content) is known to be correlated with genome/chromosome size in bacteria but the relationship for other kingdoms remains unclear. This study analyzed genome size, chromosome size, and base composition in most of the available sequenced genomes in various kingdoms. Genome size tends to increase during evolution in plants and animals, and the same is likely true for bacteria. The genomic C+G contents were found to vary greatly in microorganisms but were quite similar within each animal or plant subkingdom. In animals and plants, the C+G contents are ranked as follows: monocot plants>mammals>non-mammalian animals>dicot plants. The variation in C+G content between chromosomes within species is greater in animals than in plants. The correlation between average chromosome C+G content and chromosome length was found to be positive in Proteobacteria, Actinobacteria (but not in other analyzed bacterial phyla), Ascomycota fungi, and likely also in some plants; negative in some animals, insignificant in two protist phyla, and likely very weak in Archaea. Clearly, correlations between C+G content and chromosome size can be positive, negative, or not significant depending on the kingdoms/groups or species. Different phyla or species exhibit different patterns of correlation between chromosome-size and C+G content. Most chromosomes within a species have a similar pattern of variation in C+G content but outliers are common. The data presented in this study suggest that the C+G content is under genetic control by both trans- and cis- factors and that the correlation between C+G content and chromosome length can be positive, negative, or not significant in different phyla.
Affiliation(s)
- Xiu-Qing Li
- Molecular Genetics Laboratory, Potato Research Centre, Agriculture and Agri-Food Canada, Fredericton, New Brunswick, Canada
- Donglei Du
- Quantitative Methods Research Group, Faculty of Business Administration, University of New Brunswick, Fredericton, New Brunswick, Canada
28
Nabiyouni M, Prakash A, Fedorov A. Vertebrate codon bias indicates a highly GC-rich ancestral genome. Gene 2013; 519:113-9. [PMID: 23376453 DOI: 10.1016/j.gene.2013.01.033] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Received: 11/06/2012] [Revised: 01/10/2013] [Accepted: 01/17/2013] [Indexed: 11/16/2022]
Abstract
Two factors are thought to have contributed to the origin of codon usage bias in eukaryotes: 1) genome-wide mutational forces that shape overall GC-content and create context-dependent nucleotide bias, and 2) positive selection for codons that maximize efficient and accurate translation. Particularly in vertebrates, these two explanations contradict each other and cloud the origin of codon bias in the taxon. On the one hand, mutational forces fail to explain GC-richness (~60%) of third codon positions, given the GC-poor overall genomic composition among vertebrates (~40%). On the other hand, positive selection cannot easily explain strict regularities in codon preferences. Large-scale bioinformatic assessment, of nucleotide composition of coding and non-coding sequences in vertebrates and other taxa, suggests a simple possible resolution for this contradiction. Specifically, we propose that the last common vertebrate ancestor had a GC-rich genome (~65% GC). The data suggest that whole-genome mutational bias is the major driving force for generating codon bias. As the bias becomes prominent, it begins to affect translation and can result in positive selection for optimal codons. The positive selection can, in turn, significantly modulate codon preferences.
Affiliation(s)
- Maryam Nabiyouni
- Program in Bioinformatics and Proteomics/Genomics, University of Toledo, Health Science Campus, Toledo, OH 43614, USA.
29
Berná L, Chaurasia A, Angelini C, Federico C, Saccone S, D'Onofrio G. The footprint of metabolism in the organization of mammalian genomes. BMC Genomics 2012; 13:174. [PMID: 22568857 PMCID: PMC3384468 DOI: 10.1186/1471-2164-13-174] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Received: 07/26/2011] [Accepted: 05/08/2012] [Indexed: 01/02/2023] Open
Abstract
Background At present, five evolutionary hypotheses have been proposed to explain the great variability of the genomic GC content among and within genomes: the mutational bias, the biased gene conversion, the DNA breakpoints distribution, the thermal stability and the metabolic rate. Several studies carried out on bacteria and teleostean fish pointed towards the critical role played by the environment on the metabolic rate in shaping the base composition of genomes. In mammals the debate is still open, and evidence has been produced in favor of each evolutionary hypothesis. Human genes were assigned to three large functional categories (as well as to the corresponding functional classes) according to the KOG database: (i) information storage and processing, (ii) cellular processes and signaling, and (iii) metabolism. The classification was extended to the organisms so far analyzed by performing a reciprocal Blastp and selecting the best reciprocal hit. The base composition was calculated for each sequence of the whole CDS dataset. Results The GC3 level of the above functional categories increased from (i) to (iii). This specific compositional pattern was found, as a footprint, in all mammalian genomes, but not in the frog and lizard ones. Comparative analysis of human versus both frog and lizard functional categories showed that genes involved in the metabolic processes underwent the highest GC3 increment. Analyzing the KOG functional classes of genes, again a well-defined intra-genomic pattern was found in all mammals. Not only genes of metabolic pathways, but also genes involved in chromatin structure and dynamics, transcription, signal transduction mechanisms and cytoskeleton, showed an average GC3 level higher than that of the whole genome. In the case of the human genome, the genes of the aforementioned functional categories showed a high probability to be associated with the chromosomal bands. Conclusions In the light of the different evolutionary hypotheses proposed so far, which contribute with different potential to the compositional heterogeneity of mammalian genomes, the one based on the metabolic rate seems to play not a minor role. Keeping in mind similar results reported in bacteria and in teleosts, the specific compositional patterns observed in mammals highlight metabolic rate as a unifying factor that fits over a wide range of living organisms.
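For readers unfamiliar with the GC3 measure used in this abstract, the following minimal Python sketch computes the G+C fraction at third codon positions of a coding sequence. It assumes the input is an in-frame CDS starting at the first codon; the function name `gc3` and the handling of ambiguous bases and trailing partial codons are choices made here, not details from the paper.

```python
def gc3(cds):
    """G+C fraction at third codon positions of an in-frame coding sequence.

    Trailing bases that do not complete a codon are ignored, and third
    positions holding an ambiguous base (anything but A, C, G, T) are skipped.
    Returns 0.0 if no usable third position exists.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    third = [b for b in cds[2:usable:3] if b in "ACGT"]
    if not third:
        return 0.0
    return sum(b in "GC" for b in third) / len(third)
```

Averaging `gc3` over the genes assigned to each functional category would give per-category GC3 levels of the kind the abstract compares.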
Affiliation(s)
- Luisa Berná
- Genome Evolution and Organization - Department Animal Physiology and Evolution, Stazione Zoologica Anton Dohrn, Villa Comunale, 80121 Naples, Italy
30
Lee Y, Seifert SN, Nieman CC, McAbee RD, Goodell P, Fryxell RT, Lanzaro GC, Cornel AJ. High degree of single nucleotide polymorphisms in California Culex pipiens (Diptera: Culicidae) sensu lato. J Med Entomol 2012; 49:299-306. [PMID: 22493847 PMCID: PMC3553656 DOI: 10.1603/me11108] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Indexed: 05/14/2023]
Abstract
Resolution of systematic relationships among members of the Culex pipiens (L.) complex has important implications for public health as well as for studies on the evolution of sibling species. Currently held views contend that in California considerable genetic introgression occurs between Cx. pipiens and Cx. quinquefasciatus Say, and as such, these taxa behave as if they are a single species. Development of high throughput SNP genotyping tools for the analysis of Cx. pipiens complex population structure is therefore desirable. As a first step toward this goal, we sequenced 12 gene fragments from specimens collected in Marin and Fresno counties. On average, we found a higher single nucleotide polymorphism (SNP) density than any other mosquito species reported thus far. Coding regions contained significantly higher GC content (median 54.7%) than noncoding regions (42.4%; Wilcoxon rank sum test, P = 5.29 × 10⁻⁵). Differences in SNP allele frequencies observed between mosquitoes from Marin and Fresno counties indicated significant genetic divergence and suggest that SNP markers will be useful for future detailed population genetic studies of this group. The high density of SNPs highlights the difficulty in identifying species within the complex and may be associated with the large degree of phenotypic variation observed in this group of mosquitoes.
Affiliation(s)
- Yoosook Lee
- School of Veterinary Medicine, Department of Pathology, Microbiology and Immunology, University of California-Davis, Davis, CA 95616, USA.
31
Luo H, Lin K, David A, Nijveen H, Leunissen JAM. ProRepeat: an integrated repository for studying amino acid tandem repeats in proteins. Nucleic Acids Res 2011; 40:D394-9. [PMID: 22102581 PMCID: PMC3245022 DOI: 10.1093/nar/gkr1019] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Indexed: 01/13/2023] Open
Abstract
ProRepeat (http://prorepeat.bioinformatics.nl/) is an integrated curated repository and analysis platform for in-depth research on the biological characteristics of amino acid tandem repeats. ProRepeat collects repeats from all proteins included in the UniProt knowledgebase, together with 85 completely sequenced eukaryotic proteomes contained within the RefSeq collection. It contains non-redundant perfect tandem repeats, approximate tandem repeats and simple, low-complexity sequences, covering the majority of the amino acid tandem repeat patterns found in proteins. The ProRepeat web interface allows querying the repeat database using repeat characteristics like repeat unit and length, number of repetitions of the repeat unit and position of the repeat in the protein. Users can also search for repeats by the characteristics of repeat-containing proteins, such as entry ID, protein description, sequence length, gene name and taxon. ProRepeat offers powerful analysis tools for finding biologically interesting properties of repeats, such as the strong position bias of leucine repeats in the N-terminus of eukaryotic protein sequences, the differences of repeat abundance among proteomes, the functional classification of repeat-containing proteins and the GC content constraints of the repeats' corresponding codons.
Affiliation(s)
- Hong Luo
- Laboratory of Bioinformatics, Wageningen University and Research Centre, PO Box 569, 6700 AN Wageningen, Netherlands
32
Wang K, Wernersson R, Brunak S. The strength of intron donor splice sites in human genes displays a bell-shaped pattern. Bioinformatics 2011; 27:3079-84. [PMID: 21994226 DOI: 10.1093/bioinformatics/btr532] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION The gene concept has recently changed from the classical one protein notion into a much more diverse picture, where overlapping or fused transcripts, alternative transcription initiation, and genes within genes add to the complexity generated by alternative splicing. Increased understanding of the mechanisms controlling pre-mRNA splicing is thus important for a wide range of aspects relating to gene expression. RESULTS We have discovered a convex gene-delineating pattern in the strength of 5' intron splice sites. When comparing the strengths of >18,000 intron-containing human genes, we found that when analysing them separately according to the number of introns they contain, initial splice sites were always stronger on average than subsequent ones, and that a similar reversed trend exists towards the terminal gene part. The convex pattern is strongest for genes with up to 10 introns. Interestingly, when analysing the intron-containing gene pool from mouse, consisting of >15,000 genes, we found the convex pattern to be conserved despite >75 million years of evolutionary divergence between the two organisms. We also analysed an interesting, novel class of chimeric genes which during spliceosome assembly are fused and in tandem are transcribed and spliced into a single mature mRNA sequence. In their splice site patterns, these genes individually seem to deviate from the convex pattern, offering a possible rationale behind their fusion into a single transcript.
Affiliation(s)
- Kai Wang
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, DK-2800 Lyngby, Denmark
33
Raiz J, Damert A, Chira S, Held U, Klawitter S, Hamdorf M, Löwer J, Strätling WH, Löwer R, Schumann GG. The non-autonomous retrotransposon SVA is trans-mobilized by the human LINE-1 protein machinery. Nucleic Acids Res 2011; 40:1666-83. [PMID: 22053090 PMCID: PMC3287187 DOI: 10.1093/nar/gkr863] [Citation(s) in RCA: 164] [Impact Index Per Article: 12.6] [Indexed: 01/30/2023] Open
Abstract
SINE-VNTR-Alu (SVA) elements are non-autonomous, hominid-specific non-LTR retrotransposons and distinguished by their organization as composite mobile elements. They represent the evolutionarily youngest, currently active family of human non-LTR retrotransposons, and sporadically generate disease-causing insertions. Since preexisting, genomic SVA sequences are characterized by structural hallmarks of Long Interspersed Elements 1 (LINE-1, L1)-mediated retrotransposition, it has been hypothesized for several years that SVA elements are mobilized by the L1 protein machinery in trans. To test this hypothesis, we developed an SVA retrotransposition reporter assay in cell culture using three different human-specific SVA reporter elements. We demonstrate that SVA elements are mobilized in HeLa cells only in the presence of both L1-encoded proteins, ORF1p and ORF2p. SVA trans-mobilization rates exceeded pseudogene formation frequencies by 12- to 300-fold in HeLa-HA cells, indicating that SVA elements represent a preferred substrate for L1 proteins. Acquisition of an AluSp element increased the trans-mobilization frequency of the SVA reporter element by ~25-fold. Deletion of (CCCTCT)n repeats and Alu-like region of a canonical SVA reporter element caused significant attenuation of the SVA trans-mobilization rate. SVA de novo insertions were predominantly full-length, occurred preferentially in G+C-rich regions, and displayed all features of L1-mediated retrotransposition which are also observed in preexisting genomic SVA insertions.
Affiliation(s)
- Julija Raiz
- Section PR2/Retroelements, Division of Medical Biotechnology, Paul-Ehrlich-Institut, Paul-Ehrlich-Strasse 51-59, D-63225 Langen, Germany
34
Clément Y, Arndt PF. Substitution patterns are under different influences in primates and rodents. Genome Biol Evol 2011; 3:236-45. [PMID: 21339508 PMCID: PMC3068003 DOI: 10.1093/gbe/evr011] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Indexed: 11/13/2022] Open
Abstract
There are large-scale variations of the GC-content along mammalian chromosomes that have been called isochore structures. Primates and rodents have different isochore structures, which suggests that these lineages exhibit different modes of GC-content evolution. It has been shown that, in the human lineage, GC-biased gene conversion (gBGC), a neutral process associated with meiotic recombination, acts on GC-content evolution by influencing A or T to G or C substitution rates. We computed genome-wide substitution patterns in the mouse lineage from multiple alignments and compared them with substitution patterns in the human lineage. We found that in the mouse lineage, gBGC is active but weaker than in the human lineage and that male-specific recombination better predicts GC-content evolution than female-specific recombination. Furthermore, we were able to show that G or C to A or T substitution rates are predicted by a combination of different factors in both lineages. A or T to G or C substitution rates are most strongly predicted by meiotic recombination in the human lineage but by CpG odds ratio (the observed CpG frequency normalized by the expected CpG frequency) in the mouse lineage, suggesting that substitution patterns are under different influences in primates and rodents.
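The CpG odds ratio mentioned in this abstract (observed CpG frequency normalized by the expected CpG frequency) can be made concrete with a short Python sketch. It assumes the common single-strand definition in which the expected frequency is the product of the C and G base frequencies; the abstract does not spell out the exact estimator, and the function name is hypothetical.

```python
def cpg_odds_ratio(seq):
    """Observed/expected CpG ratio for one sequence.

    Observed: CG dinucleotides divided by the number of dinucleotide positions.
    Expected: frequency of C times frequency of G (an assumed convention).
    Returns 0.0 when the ratio is undefined (no C, no G, or length < 2).
    """
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        return 0.0
    f_c = seq.count("C") / n
    f_g = seq.count("G") / n
    if f_c == 0 or f_g == 0:
        return 0.0
    f_cpg = seq.count("CG") / (n - 1)
    return f_cpg / (f_c * f_g)
```

Values well below 1 indicate CpG depletion relative to base composition, while values near or above 1 are what one would expect in regions such as CpG islands.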
Affiliation(s)
- Yves Clément
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany.
35
Piskol R, Stephan W. Selective constraints in conserved folded RNAs of drosophilid and hominid genomes. Mol Biol Evol 2010; 28:1519-29. [PMID: 21172832 DOI: 10.1093/molbev/msq343] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 12/15/2022] Open
Abstract
Small noncoding RNAs as well as folded RNA structures in genic regions are crucial for many cellular processes. They are involved in posttranscriptional gene regulation (microRNAs), RNA modification (small nucleolar RNAs), regulation of splicing, correct localization of proteins, and many other processes. In most cases, a distinct secondary structure of the molecule is necessary for its correct function. Hence, selection should act to retain the structure of the molecule, although the underlying sequence is allowed to vary. Here, we present the first genome-wide estimates of selective constraints in folded RNA molecules in the nuclear genomes of drosophilids and hominids. In comparison to putatively neutrally evolving sites, we observe substantially reduced rates of substitutions at paired and unpaired sites of folded molecules. We estimated evolutionary constraints to be in the ranges of (0.974, 0.991) and (0.895, 1.000) for paired nucleotides in drosophilids and hominids, respectively. These values are significantly higher than for constraints at nonsynonymous sites of protein-coding genes in both genera. Nonetheless, valleys of only moderately reduced fitness (s ≈ 10⁻⁴) are sufficient to generate the observed fraction of nucleotide changes that are removed by purifying selection. In addition, a comparison of selective coefficients between drosophilids and hominids revealed significantly higher constraints in drosophilids, which can be attributed to the difference in long-term effective population size between these two groups of species. This difference is particularly apparent at the independently evolving (unpaired) sites.
Affiliation(s)
- Robert Piskol
- Department of Biology II, Section of Evolutionary Biology, Ludwig-Maximilian-University, Munich, Germany.
36
Du H, Hu H, Meng Y, Zheng W, Ling F, Wang J, Zhang X, Nie Q, Wang X. The correlation coefficient of GC content of the genome-wide genes is positively correlated with animal evolutionary relationships. FEBS Lett 2010; 584:3990-4. [PMID: 20691688 DOI: 10.1016/j.febslet.2010.08.003] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 06/27/2010] [Revised: 07/29/2010] [Accepted: 08/02/2010] [Indexed: 11/16/2022]
Abstract
In this study, we present a new method for evaluating animal evolutionary relationships. We used the GC% levels of genome-wide genes to determine the correlation between the GC% content and evolutionary relationship. The correlation coefficients of the GC% content of the orthologous genes of the paired animal species were calculated for a total of 21 species, and the evolutionary branching dates of these 21 species were derived from fossil records. The correlation coefficient of the GC% content of the orthologous genes of the species pair under study served as an indicator of their evolutionary relationship. Moreover, there was a decreasing linear relationship between the correlation coefficient and evolutionary branching date (R² = 0.930).
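The per-pair statistic described in this abstract, a correlation coefficient between the GC% values of orthologous genes in two species, might be computed along the following lines. This is a sketch under assumptions: the abstract does not say which correlation coefficient was used, so Pearson's r is assumed here, and orthologs are matched simply by a shared dictionary key, which stands in for a real ortholog table. All names are hypothetical.

```python
import math


def gc_percent(seq):
    """G+C percentage of a sequence, counting only A, C, G, T."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length value lists."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two lists of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def ortholog_gc_correlation(genes_a, genes_b):
    """Correlate GC% across genes present in both {ortholog_id: sequence} maps."""
    shared = sorted(set(genes_a) & set(genes_b))
    xs = [gc_percent(genes_a[g]) for g in shared]
    ys = [gc_percent(genes_b[g]) for g in shared]
    return pearson(xs, ys)
```

Repeating this for every species pair and regressing the coefficients on divergence dates would reproduce the kind of linear relationship the abstract reports.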
Affiliation(s)
- Hongli Du
- School of Bioscience and Bioengineering, South China University of Technology, Guangzhou, China
37
Stover DA, Verrelli BC. Comparative Vertebrate Evolutionary Analyses of Type I Collagen: Potential of COL1a1 Gene Structure and Intron Variation for Common Bone-Related Diseases. Mol Biol Evol 2010; 28:533-42. [DOI: 10.1093/molbev/msq221] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.1] [Indexed: 01/27/2023] Open
|
38
|
Künstner A, Wolf JBW, Backström N, Whitney O, Balakrishnan CN, Day L, Edwards SV, Janes DE, Schlinger BA, Wilson RK, Jarvis ED, Warren WC, Ellegren H. Comparative genomics based on massive parallel transcriptome sequencing reveals patterns of substitution and selection across 10 bird species. Mol Ecol 2010; 19 Suppl 1:266-76. [PMID: 20331785 DOI: 10.1111/j.1365-294x.2009.04487.x] [Citation(s) in RCA: 97] [Impact Index Per Article: 6.9] [Indexed: 01/15/2023]
Abstract
Next-generation sequencing technology provides an attractive means to obtain large-scale sequence data necessary for comparative genomic analysis. To analyse the patterns of mutation rate variation and selection intensity across the avian genome, we performed brain transcriptome sequencing of 10 different non-model avian species using Roche 454 technology. Contigs from de novo assemblies were aligned to the two available avian reference genomes, chicken and zebra finch. In total, we identified 6499 different genes across all 10 species, with approximately 1000 genes found in each full run per species. We found evidence for a higher mutation rate of the Z chromosome than of autosomes (male-biased mutation) and a negative correlation between the neutral substitution rate (dS) and chromosome size. Analyses of the mean dN/dS ratio (omega) of genes across chromosomes supported the Hill-Robertson effect (the effect of selection at linked loci) and point to stochastic problems with omega as an independent measure of selection. Overall, this study demonstrates the usefulness of next-generation sequencing for obtaining genomic resources for comparative genomic analysis of non-model organisms.
Affiliation(s)
- Axel Künstner
- Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Sweden
|
39
|
Sabbía V, Romero H, Musto H, Naya H. Composition Profile of the Human Genome at the Chromosome Level. J Biomol Struct Dyn 2009; 27:361-70. [DOI: 10.1080/07391102.2009.10507322] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Indexed: 12/16/2022]
|
40
|
Qi YJ, Qiu WY. Symmetry Analysis of an X-palindrome in Human and Chimpanzee. Chin J Chem Phys 2009. [DOI: 10.1088/1674-0068/22/04/401-405] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]
|
41
|
Freudenberg J, Wang M, Yang Y, Li W. Partial correlation analysis indicates causal relationships between GC-content, exon density and recombination rate in the human genome. BMC Bioinformatics 2009; 10 Suppl 1:S66. [PMID: 19208170 PMCID: PMC2648766 DOI: 10.1186/1471-2105-10-s1-s66] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.6] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Several features are known to correlate with the GC-content in the human genome, including recombination rate, gene density and distance to telomere. However, by testing for pairwise correlation only, it is impossible to distinguish direct associations from indirect ones and to distinguish between causes and effects. RESULTS We use partial correlations to construct partially directed graphs for the following four variables: GC-content, recombination rate, exon density and distance-to-telomere. Recombination rate and exon density are unconditionally uncorrelated, but become inversely correlated by conditioning on GC-content. This pattern indicates a model where recombination rate and exon density are two independent causes of GC-content variation. CONCLUSION Causal inference and graphical models are useful methods to understand genome evolution and the mechanisms of isochore evolution in the human genome.
Affiliation(s)
- Jan Freudenberg
- The Robert S Boas Center for Genomics and Human Genetics, Feinstein Institute for Medical Research, North Shore LIJ Health System, Manhasset, NY 11030, USA.
|