1
|
Halama A, Zaghlool S, Thareja G, Kader S, Al Muftah W, Mook-Kanamori M, Sarwath H, Mohamoud YA, Stephan N, Ameling S, Pucic Baković M, Krumsiek J, Prehn C, Adamski J, Schwenk JM, Friedrich N, Völker U, Wuhrer M, Lauc G, Najafi-Shoushtari SH, Malek JA, Graumann J, Mook-Kanamori D, Schmidt F, Suhre K. A roadmap to the molecular human linking multiomics with population traits and diabetes subtypes. Nat Commun 2024; 15:7111. [PMID: 39160153 PMCID: PMC11333501 DOI: 10.1038/s41467-024-51134-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/01/2023] [Accepted: 07/26/2024] [Indexed: 08/21/2024] Open
Abstract
In-depth multiomic phenotyping provides molecular insights into complex physiological processes and their pathologies. Here, we report on integrating 18 diverse deep molecular phenotyping (omics-) technologies applied to urine, blood, and saliva samples from 391 participants of the multiethnic diabetes Qatar Metabolomics Study of Diabetes (QMDiab). Using 6,304 quantitative molecular traits with 1,221,345 genetic variants, methylation at 470,837 DNA CpG sites, and gene expression of 57,000 transcripts, we determine (1) within-platform partial correlations, (2) between-platform mutual best correlations, and (3) genome-, epigenome-, transcriptome-, and phenome-wide associations. Combined into a molecular network of >34,000 statistically significant trait-trait links in biofluids, our study portrays "The Molecular Human". We describe the variances explained by each omics in the phenotypes (age, sex, BMI, and diabetes state), platform complementarity, and the inherent correlation structures of multiomics data. Further, we construct a multi-molecular network of diabetes subtypes. Finally, we generated an open-access web interface to "The Molecular Human" (http://comics.metabolomix.com), providing interactive data exploration and hypotheses generation possibilities.
Affiliation(s)
- Anna Halama
  - Bioinformatics Core, Weill Cornell Medicine-Qatar, Education City, Doha, Qatar
  - Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY, USA
- Shaza Zaghlool
  - Bioinformatics Core, Weill Cornell Medicine-Qatar, Education City, Doha, Qatar
  - Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY, USA
- Gaurav Thareja
  - Bioinformatics Core, Weill Cornell Medicine-Qatar, Education City, Doha, Qatar
  - Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY, USA
- Sara Kader
  - Bioinformatics Core, Weill Cornell Medicine-Qatar, Education City, Doha, Qatar
  - Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY, USA
- Wadha Al Muftah
  - Qatar Genome Program, Qatar Foundation, Qatar Science and Technology Park, Innovation Center, Doha, Qatar
  - Department of Genetic Medicine, Weill Cornell Medicine, Doha, Qatar
- Hina Sarwath
  - Proteomics Core, Weill Cornell Medicine-Qatar, Education City, Doha, Qatar
- Nisha Stephan
  - Bioinformatics Core, Weill Cornell Medicine-Qatar, Education City, Doha, Qatar
  - Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY, USA
- Sabine Ameling
  - German Centre for Cardiovascular Research, Partner Site Greifswald, University Medicine Greifswald, Greifswald, Germany
  - Department of Functional Genomics, Interfaculty Institute for Genetics and Functional Genomics, University Medicine Greifswald, Greifswald, Germany
- Jan Krumsiek
  - Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY, USA
  - Englander Institute for Precision Medicine, Weill Cornell Medicine, New York, NY, USA
- Cornelia Prehn
  - Metabolomics and Proteomics Core, Helmholtz Zentrum München, Neuherberg, Germany
- Jerzy Adamski
  - Institute of Experimental Genetics, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany
  - Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
  - Institute of Biochemistry, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia
- Jochen M Schwenk
  - Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, KTH Royal Institute of Technology, Solna, Sweden
- Nele Friedrich
  - German Centre for Cardiovascular Research, Partner Site Greifswald, University Medicine Greifswald, Greifswald, Germany
  - Institute of Clinical Chemistry and Laboratory Medicine, University Medicine Greifswald, Greifswald, Germany
- Uwe Völker
  - German Centre for Cardiovascular Research, Partner Site Greifswald, University Medicine Greifswald, Greifswald, Germany
  - Department of Functional Genomics, Interfaculty Institute for Genetics and Functional Genomics, University Medicine Greifswald, Greifswald, Germany
- Manfred Wuhrer
  - Center for Proteomics and Metabolomics, Leiden University Medical Center, Leiden, The Netherlands
- Gordan Lauc
  - Genos Glycoscience Research Laboratory, Zagreb, Croatia
  - Faculty of Pharmacy and Biochemistry, University of Zagreb, Zagreb, Croatia
- S Hani Najafi-Shoushtari
  - MicroRNA Core Laboratory, Division of Research, Weill Cornell Medicine-Qatar, Education City, Doha, Qatar
  - Department of Cell and Developmental Biology, Weill Cornell Medicine, New York, NY, USA
- Joel A Malek
  - Department of Genetic Medicine, Weill Cornell Medicine, Doha, Qatar
  - Genomics Core, Weill Cornell Medicine-Qatar, Education City, Doha, Qatar
- Johannes Graumann
  - Institute of Translational Proteomics, Department of Medicine, Philipps-Universität Marburg, Marburg, Germany
- Dennis Mook-Kanamori
  - Department of Clinical Epidemiology, Leiden University Medical Center, Leiden, the Netherlands
  - Department of Public Health and Primary Care, Leiden University Medical Center, Leiden, the Netherlands
- Frank Schmidt
  - Proteomics Core, Weill Cornell Medicine-Qatar, Education City, Doha, Qatar
  - Department of Biochemistry, Weill Cornell Medicine, New York, NY, USA
- Karsten Suhre
  - Bioinformatics Core, Weill Cornell Medicine-Qatar, Education City, Doha, Qatar
  - Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY, USA
  - Englander Institute for Precision Medicine, Weill Cornell Medicine, New York, NY, USA
|
2
|
Cheng C, An L, Li F, Ahmad W, Aslam M, Ul Haq MZ, Yan Y, Ahmad RM. Wide-Range Portrayal of AP2/ERF Transcription Factor Family in Maize (Zea mays L.) Development and Stress Responses. Genes (Basel) 2023; 14:194. [PMID: 36672935 PMCID: PMC9859492 DOI: 10.3390/genes14010194] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Received: 11/27/2022] [Revised: 01/03/2023] [Accepted: 01/06/2023] [Indexed: 01/13/2023] Open
Abstract
The APETALA2/Ethylene-Responsive Transcriptional Factors containing conservative AP2/ERF domains constituted a plant-specific transcription factor (TF) superfamily, called AP2/ERF. The configuration of the AP2/ERF superfamily in maize has remained unresolved. In this study, we identified the 229 AP2/ERF genes in the latest (B73 RefGen_v5) maize reference genome. Phylogenetic classification of the ZmAP2/ERF family members categorized it into five clades, including 27 AP2 (APETALA2), 5 RAV (Related to ABI3/VP), 89 DREB (dehydration responsive element binding), 105 ERF (ethylene responsive factors), and a soloist. The duplication events of the paralogous genes occurred from 1.724-25.855 MYA, a key route to maize evolution. Structural analysis reveals that they have more introns and few exons. The results showed that 32 ZmAP2/ERFs regulate biotic stresses, and 24 ZmAP2/ERFs are involved in responses towards abiotic stresses. Additionally, the expression analysis showed that DREB family members are involved in plant sex determination. The real-time quantitative expression profiling of ZmAP2/ERFs in the leaves of the maize inbred line B73 under ABA, JA, salt, drought, heat, and wounding stress revealed their specific expression patterns. Conclusively, this study unveiled the evolutionary pathway of ZmAP2/ERFs and its essential role in stress and developmental processes. The generated information will be useful for stress resilience maize breeding programs.
Affiliation(s)
- Cheng Cheng
  - State Key Laboratory of Crop Genetics and Germplasm Enhancement, Nanjing Agricultural University, Nanjing 210095, China
- Likun An
  - College of Agriculture and Forestry Sciences, Qinghai University, Xining 810016, China
- Fangzhe Li
  - State Key Laboratory of Crop Genetics and Germplasm Enhancement, Nanjing Agricultural University, Nanjing 210095, China
- Wahaj Ahmad
  - Institute of Soil and Environmental Sciences, COMSATS University Islamabad, Abbottabad 22020, Pakistan
- Muhammad Aslam
  - Department of Plant Breeding and Genetics, University of Agriculture Faisalabad, Faisalabad 38040, Pakistan
- Muhammad Zia Ul Haq
  - Department of Agronomy, University of Agriculture Faisalabad, Faisalabad 38040, Pakistan
- Yuanxin Yan
  - State Key Laboratory of Crop Genetics and Germplasm Enhancement, Nanjing Agricultural University, Nanjing 210095, China
- Ramala Masood Ahmad
  - State Key Laboratory of Crop Genetics and Germplasm Enhancement, Nanjing Agricultural University, Nanjing 210095, China
  - Department of Plant Breeding and Genetics, University of Agriculture Faisalabad, Faisalabad 38040, Pakistan
|
3
|
Matroud A, Tuffley C, Hendy M. An Asymmetric Alignment Algorithm for Estimating Ancestor-Descendant Edit Distance for Tandem Repeats. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2080-2091. [PMID: 33587704 DOI: 10.1109/tcbb.2021.3059239] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
Tandem repeats are repetitive structures present in some DNA sequences, consisting of many repeated copies of a single motif. They can serve as important markers for phylogenetic and population genetic studies, due to the high polymorphism in the number of motif copies as well as variations in the motif. The first step in using tandem repeats for phylogenetic studies is to estimate the evolutionary distance between a pair D1 and D2 of tandem repeat sequences with homologous motifs. This problem can be broken into two sub-problems: 1) Construct the most recent common ancestor of the sequences. 2) Calculate the evolutionary distance between each sequence and the hypothesised common ancestor. We present an algorithm that estimates the solution to the second problem. This takes the form of an asymmetric alignment algorithm to estimate the evolutionary distance between two tandem repeat sequences A and D, where D is assumed to have descended from A, under a model that allows block duplication, deletion, and variant substitution. The algorithm is asymmetric in the sense that the two input sequences A and D play different roles in the calculations, reflecting the assumption that D descends from A. Our model assumes static motif boundaries, meaning that motif duplication and deletion events must respect the motif boundaries. The algorithm may also be applied without modification to more complex repetitive structures with two or more motifs, such as nested tandem repeats.
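As a rough illustration of what an asymmetric ancestor-to-descendant alignment can look like, the Python sketch below computes a toy distance between two tandem repeats given as lists of motif variants. It is not the algorithm of Matroud et al.: it assumes static motif boundaries, allows only single-motif duplications (a new copy is made from the motif immediately to its left), and uses hypothetical unit costs. The function name `ancestor_descendant_distance` and the cost parameters are illustrative assumptions.

```python
import math


def ancestor_descendant_distance(ancestor, descendant,
                                 dup_cost=1.0, del_cost=1.0, sub_cost=1.0):
    """Toy asymmetric edit distance from `ancestor` to `descendant`.

    Both inputs are lists of motif-variant labels (motif boundaries are fixed).
    Allowed events, all applied in the ancestor -> descendant direction:
      * delete an ancestral motif,
      * keep an ancestral motif, optionally substituting its variant,
      * duplicate the preceding descendant motif, optionally substituting.
    Simplifying assumption: duplications copy one motif at a time.
    """
    n, m = len(ancestor), len(descendant)
    # dp[i][j] = cheapest way to derive descendant[:j] from ancestor[:i]
    dp = [[math.inf] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i * del_cost  # everything in ancestor[:i] was deleted

    for i in range(n + 1):
        for j in range(1, m + 1):
            best = math.inf
            if i >= 1:
                # ancestral motif i was deleted
                best = min(best, dp[i - 1][j] + del_cost)
                # ancestral motif i survives as descendant motif j
                mismatch = 0.0 if ancestor[i - 1] == descendant[j - 1] else sub_cost
                best = min(best, dp[i - 1][j - 1] + mismatch)
            if j >= 2:
                # descendant motif j is a copy of descendant motif j-1
                mismatch = 0.0 if descendant[j - 2] == descendant[j - 1] else sub_cost
                best = min(best, dp[i][j - 1] + dup_cost + mismatch)
            dp[i][j] = best
    return dp[n][m]
```

The asymmetry shows up in the boundary cases: an empty descendant can always be reached by deleting every ancestral motif, but a non-empty descendant cannot be derived from an empty ancestor, because duplication needs an existing motif to copy. Swapping the two arguments therefore generally changes the result whenever `dup_cost` and `del_cost` differ.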
|
4
|
Lefranc MP, Lefranc G. IMGT® Homo sapiens IG and TR Loci, Gene Order, CNV and Haplotypes: New Concepts as a Paradigm for Jawed Vertebrates Genome Assemblies. Biomolecules 2022; 12:381. [PMID: 35327572 PMCID: PMC8945572 DOI: 10.3390/biom12030381] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/25/2022] [Revised: 02/21/2022] [Accepted: 02/24/2022] [Indexed: 02/04/2023] Open
Abstract
IMGT®, the international ImMunoGeneTics information system®, created in 1989 by Marie-Paule Lefranc (Université de Montpellier and CNRS), marked the advent of immunoinformatics, a new science which emerged at the interface between immunogenetics and bioinformatics for the study of the adaptive immune responses. IMGT® is based on a standardized nomenclature of the immunoglobulin (IG) and T cell receptor (TR) genes and alleles from fish to humans and on the IMGT unique numbering for the variable (V) and constant (C) domains of the immunoglobulin superfamily (IgSF) of vertebrates and invertebrates, and for the groove (G) domain of the major histocompatibility (MH) and MH superfamily (MhSF) proteins. IMGT® comprises 7 databases, 17 tools and more than 25,000 pages of web resources for sequences, genes and structures, based on the IMGT Scientific chart rules generated from the IMGT-ONTOLOGY axioms and concepts. IMGT® reference directories are used for the analysis of the NGS high-throughput expressed IG and TR repertoires (natural, synthetic and/or bioengineered) and for bridging sequences, two-dimensional (2D) and three-dimensional (3D) structures. This manuscript focuses on the IMGT® Homo sapiens IG and TR loci, gene order, copy number variation (CNV) and haplotypes new concepts, as a paradigm for jawed vertebrates genome assemblies.
Affiliation(s)
- Marie-Paule Lefranc
  - IMGT®, The International ImMunoGeneTics Information System®, Laboratoire d’Immuno Génétique Moléculaire (LIGM), Institut de Génétique Humaine (IGH), Université de Montpellier (UM), Centre National de la Recherche Scientifique (CNRS), UMR 9002 CNRS-UM, 141 rue de la Cardonille, CEDEX 5, 34396 Montpellier, France
- Gérard Lefranc
  - IMGT®, The International ImMunoGeneTics Information System®, Laboratoire d’Immuno Génétique Moléculaire (LIGM), Institut de Génétique Humaine (IGH), Université de Montpellier (UM), Centre National de la Recherche Scientifique (CNRS), UMR 9002 CNRS-UM, 141 rue de la Cardonille, CEDEX 5, 34396 Montpellier, France
|
5
|
Barragan AC, Weigel D. Plant NLR diversity: the known unknowns of pan-NLRomes. THE PLANT CELL 2021; 33:814-831. [PMID: 33793812 PMCID: PMC8226294 DOI: 10.1093/plcell/koaa002] [Citation(s) in RCA: 69] [Impact Index Per Article: 23.0] [Received: 07/13/2020] [Accepted: 10/23/2020] [Indexed: 05/20/2023]
Abstract
Plants and pathogens constantly adapt to each other. As a consequence, many members of the plant immune system, and especially the intracellular nucleotide-binding site leucine-rich repeat receptors, also known as NOD-like receptors (NLRs), are highly diversified, both among family members in the same genome, and between individuals in the same species. While this diversity has long been appreciated, its true extent has remained unknown. With pan-genome and pan-NLRome studies becoming more and more comprehensive, our knowledge of NLR sequence diversity is growing rapidly, and pan-NLRomes provide powerful platforms for assigning function to NLRs. These efforts are an important step toward the goal of comprehensively predicting from sequence alone whether an NLR provides disease resistance, and if so, to which pathogens.
Affiliation(s)
- A Cristina Barragan
  - Department of Molecular Biology, Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany
|
6
|
Ahmed S, Rashid MAR, Zafar SA, Azhar MT, Waqas M, Uzair M, Rana IA, Azeem F, Chung G, Ali Z, Atif RM. Genome-wide investigation and expression analysis of APETALA-2 transcription factor subfamily reveals its evolution, expansion and regulatory role in abiotic stress responses in Indica Rice (Oryza sativa L. ssp. indica). Genomics 2020; 113:1029-1043. [PMID: 33157261 DOI: 10.1016/j.ygeno.2020.10.037] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 07/14/2020] [Revised: 10/08/2020] [Accepted: 10/30/2020] [Indexed: 12/18/2022]
Abstract
Rice is an important cereal crop that serves as staple food for more than half of the world population. Abiotic stresses resulting from changing climatic conditions are continuously threatening its yield and production. Genes in APETALA-2 (AP2) family encode transcriptional regulators implicated during regulation of developmental processes and abiotic stress responses but their identification and characterization in indica rice was still missing. In this context, twenty-six genes distributed among eleven chromosomes in Indica rice encoding AP2 transcription-factor subfamily were identified and their diverse haplotypes were studied. Phylogenetic analysis of OsAP2 TF family-members grouped them into three clades indicating conservation of clades among cereals. Segmental duplications were observed to be principal route of evolution, supporting the higher positive selection-pressure, which were estimated to be originated about 10.57 to 56.72 million years ago (MYA). Conserved domain analysis and intron-exon distribution pattern of identified OsAP2s revealed their exclusive distribution among the specific clades of the phylogenetic tree. Moreover, the members of osa-miR172 family were also identified potentially targeting four OsAP2 genes. The real-time quantitative expression profiling of OsAP2s under heat stress conditions in contrasting indica rice genotypes revealed the differential expression pattern of OsAP2s (6 genes up-regulated and 4 genes down-regulated) in stress- and genotype-dependent manner. These findings unveiled the evolutionary pathways of AP2-TF in rice, and can help the functional characterization under developmental and stress responses.
Affiliation(s)
- Sohaib Ahmed
  - Department of Plant Breeding and Genetics, University of Agriculture Faisalabad, Faisalabad 38040, Pakistan
- Muhammad Abdul Rehman Rashid
  - State Key Laboratory for Conservation and Utilization of Bio-Resources in Yunnan, Research Center of Perennial Rice Engineering and Technology in Yunnan, School of Agriculture, Yunnan University, Kunming 650500, China
  - Industrial Crops Research Institute, Yunnan Academy of Agricultural Sciences, Kunming 650200, China
  - Department of Bioinformatics and Biotechnology, Government College University, Faisalabad 38000, Pakistan
- Syed Adeel Zafar
  - National key facility for Crop Gene Resources and Genetic Improvement, Institute of Crop Science, Chinese Academy of Agricultural Sciences, Beijing, China
- Muhammad Tehseen Azhar
  - Department of Plant Breeding and Genetics, University of Agriculture Faisalabad, Faisalabad 38040, Pakistan
  - School of Agriculture Sciences, Zhengzhou University, Zhengzhou 450000, China
- Muhammad Waqas
  - Department of Plant Breeding and Genetics, University of Agriculture Faisalabad, Faisalabad 38040, Pakistan
- Muhammad Uzair
  - National key facility for Crop Gene Resources and Genetic Improvement, Institute of Crop Science, Chinese Academy of Agricultural Sciences, Beijing, China
- Iqrar Ahmad Rana
  - Center for Agricultural Biochemistry and Biotechnology, University of Agriculture, Faisalabad 38040, Pakistan
- Farrukh Azeem
  - Department of Bioinformatics and Biotechnology, Government College University, Faisalabad 38000, Pakistan
- Gyuhwa Chung
  - Department of Biotechnology, Chonnam National University, Chonnam 59626, Republic of Korea
- Zulfiqar Ali
  - Institute of Plant Breeding and Biotechnology, Muhammad Nawaz Shareef University of Agriculture, Multan 66000, Pakistan
- Rana Muhammad Atif
  - Department of Plant Breeding and Genetics, University of Agriculture Faisalabad, Faisalabad 38040, Pakistan
  - Center for Advanced Studies in Agriculture and Food Security (CAS-AFS), University of Agriculture Faisalabad, Faisalabad-38040 Pakistan
|
7
|
Greenman CD, Penso-Dolfin L, Wu T. The complexity of genome rearrangement combinatorics under the infinite sites model. J Theor Biol 2020; 501:110335. [DOI: 10.1016/j.jtbi.2020.110335] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/17/2019] [Revised: 04/16/2020] [Accepted: 05/14/2020] [Indexed: 11/30/2022]
|
8
|
Lefranc MP, Lefranc G. Immunoglobulins or Antibodies: IMGT® Bridging Genes, Structures and Functions. Biomedicines 2020; 8:E319. [PMID: 32878258 PMCID: PMC7555362 DOI: 10.3390/biomedicines8090319] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Received: 08/07/2020] [Revised: 08/23/2020] [Accepted: 08/25/2020] [Indexed: 12/18/2022] Open
Abstract
IMGT®, the international ImMunoGeneTics® information system founded in 1989 by Marie-Paule Lefranc (Université de Montpellier and CNRS), marked the advent of immunoinformatics, a new science at the interface between immunogenetics and bioinformatics. For the first time, the immunoglobulin (IG) or antibody and T cell receptor (TR) genes were officially recognized as 'genes' as well as were conventional genes. This major breakthrough has allowed the entry, in genomic databases, of the IG and TR variable (V), diversity (D) and joining (J) genes and alleles of Homo sapiens and of other jawed vertebrate species, based on the CLASSIFICATION axiom. The second major breakthrough has been the IMGT unique numbering and the IMGT Collier de Perles for the V and constant (C) domains of the IG and TR and other proteins of the IG superfamily (IgSF), based on the NUMEROTATION axiom. IMGT-ONTOLOGY axioms and concepts bridge genes, sequences, structures and functions, between biological and computational spheres in the IMGT® system (Web resources, databases and tools). They provide the IMGT Scientific chart rules to identify, to describe and to analyse the IG complex molecular data, the huge diversity of repertoires, the genetic (alleles, allotypes, CNV) polymorphisms, the IG dual function (paratope/epitope, effector properties), the antibody humanization and engineering.
Affiliation(s)
- Marie-Paule Lefranc
  - IMGT, The International ImMunoGeneTics Information System, Laboratoire d’ImmunoGénétique Moléculaire LIGM, Institut de Génétique Humaine IGH, Université de Montpellier UM, Centre National de la Recherche Scientifique CNRS, UMR 9002 CNRS-UM, 141 Rue de la Cardonille, CEDEX 5, 34396 Montpellier, France
- Gérard Lefranc
  - IMGT, The International ImMunoGeneTics Information System, Laboratoire d’ImmunoGénétique Moléculaire LIGM, Institut de Génétique Humaine IGH, Université de Montpellier UM, Centre National de la Recherche Scientifique CNRS, UMR 9002 CNRS-UM, 141 Rue de la Cardonille, CEDEX 5, 34396 Montpellier, France
|
9
|
Brejová B, Kravec M, Landau GM, Vinař T. Fast computation of a string duplication history under no-breakpoint-reuse. PHILOSOPHICAL TRANSACTIONS. SERIES A, MATHEMATICAL, PHYSICAL, AND ENGINEERING SCIENCES 2014; 372:20130133. [PMID: 24751867 PMCID: PMC3996574 DOI: 10.1098/rsta.2013.0133] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/03/2023]
Abstract
In this paper, we provide an O(n log² n log log n log* n) algorithm to compute a duplication history of a string under the no-breakpoint-reuse condition. The motivation of this problem stems from computational biology, in particular, from analysis of complex gene clusters. The problem is also related to computing edit distance with block operations, but, in our scenario, the start of the history is not fixed, but chosen to minimize the distance measure.
Affiliation(s)
- Broňa Brejová
  - Faculty of Mathematics, Physics and Informatics, Comenius University, Mlynská dolina, 842 48 Bratislava, Slovakia
- Martin Kravec
  - Faculty of Mathematics, Physics and Informatics, Comenius University, Mlynská dolina, 842 48 Bratislava, Slovakia
- Gad M. Landau
  - Department of Computer Science, University of Haifa, Haifa 31905, Israel
  - Department of Computer Science and Engineering, NYU-Poly, Six MetroTech Center, Brooklyn, NY 11201-3840, USA
- Tomáš Vinař
  - Faculty of Mathematics, Physics and Informatics, Comenius University, Mlynská dolina, 842 48 Bratislava, Slovakia
|
10
|
Schaper E, Gascuel O, Anisimova M. Deep conservation of human protein tandem repeats within the eukaryotes. Mol Biol Evol 2014; 31:1132-48. [PMID: 24497029 PMCID: PMC3995336 DOI: 10.1093/molbev/msu062] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.8] [Indexed: 01/01/2023] Open
Abstract
Tandem repeats (TRs) are a major element of protein sequences in all domains of life. They are particularly abundant in mammals, where by conservative estimates one in three proteins contain a TR. High generation-scale duplication and deletion rates were reported for nucleic TR units. However, it is not known whether protein TR units can also be frequently lost or gained providing a source of variation for rapid adaptation of protein function, or alternatively, tend to have conserved TR unit configurations over long evolutionary times. To obtain a systematic picture, we performed a proteome-wide analysis of the mode of evolution for human protein TRs. For this purpose, we propose a novel method for the detection of orthologous TRs based on circular profile hidden Markov models. For all detected TRs, we reconstructed bispecies TR unit phylogenies across 61 eukaryotes ranging from human to yeast. Moreover, we performed additional analyses to correlate functional and structural annotations of human TRs with their mode of evolution. Surprisingly, we find that the vast majority of human TRs are ancient, with TR unit number and order preserved intact since distant speciation events. For example, ≥61% of all human TRs have been strongly conserved at least since the root of all mammals, approximately 300 Ma. Further, we find no human protein TR that shows evidence for strong recent duplications and deletions. The results are in contrast to the high generation-scale mutability of nucleic TRs. Presumably, most protein TRs fold into stable and conserved structures that are indispensable for the function of the TR-containing protein. All of our data and results are available for download from http://www.atgc-montpellier.fr/TRE.
Affiliation(s)
- Elke Schaper
  - Department of Computer Science, ETH Zürich, Zürich, Switzerland
|
11
|
Chauve C, El-Mabrouk N, Guéguen L, Semeria M, Tannier E. Duplication, Rearrangement and Reconciliation: A Follow-Up 13 Years Later. MODELS AND ALGORITHMS FOR GENOME EVOLUTION 2013. [DOI: 10.1007/978-1-4471-5298-9_4] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Indexed: 02/04/2023]
|
12
|
Schaper E, Kajava AV, Hauser A, Anisimova M. Repeat or not repeat?--Statistical validation of tandem repeat prediction in genomic sequences. Nucleic Acids Res 2012; 40:10005-17. [PMID: 22923522 PMCID: PMC3488214 DOI: 10.1093/nar/gks726] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.2] [Indexed: 01/11/2023] Open
Abstract
Tandem repeats (TRs) represent one of the most prevalent features of genomic sequences. Due to their abundance and functional significance, a plethora of detection tools has been devised over the last two decades. Despite the longstanding interest, TR detection is still not resolved. Our large-scale tests reveal that current detectors produce different, often nonoverlapping inferences, reflecting characteristics of the underlying algorithms rather than the true distribution of TRs in genomic data. Our simulations show that the power of detecting TRs depends on the degree of their divergence, and repeat characteristics such as the length of the minimal repeat unit and their number in tandem. To reconcile the diverse predictions of current algorithms, we propose and evaluate several statistical criteria for measuring the quality of predicted repeat units. In particular, we propose a model-based phylogenetic classifier, entailing a maximum-likelihood estimation of the repeat divergence. Applied in conjunction with the state of the art detectors, our statistical classification scheme for inferred repeats allows to filter out false-positive predictions. Since different algorithms appear to specialize at predicting TRs with certain properties, we advise applying multiple detectors with subsequent filtering to obtain the most complete set of genuine repeats.
Affiliation(s)
- Elke Schaper
  - Computer Science Department, ETH Zürich, Universitätsstrasse 6, CH-8092 Zürich, Switzerland
|
13
|
Li Q, Jin X, Zhu YX. Identification and analyses of miRNA genes in allotetraploid Gossypium hirsutum fiber cells based on the sequenced diploid G. raimondii genome. J Genet Genomics 2012; 39:351-60. [PMID: 22835981 DOI: 10.1016/j.jgg.2012.04.008] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.2] [Received: 02/13/2012] [Revised: 04/25/2012] [Accepted: 04/25/2012] [Indexed: 01/19/2023]
Abstract
The plant genome possesses a large number of microRNAs (miRNAs) mainly 21-24 nucleotides in length. They play a vital role in regulation of target gene expression at various stages throughout the whole plant life cycle. Here we sequenced and analyzed ≈10 million non-coding RNAs (ncRNAs) derived from fiber tissue of the allotetraploid cotton (Gossypium hirsutum) 7 days post-anthesis using ncRNA-seq technology. In terms of distinct reads, 24 nt ncRNA is by far the dominant species, followed by 21 nt and 23 nt ncRNAs. Using ab initio prediction, we identified and characterized a total of 562 candidate miRNA gene loci on the recently assembled D5 genome of the diploid cotton G. raimondii. Of all the 562 predicted miRNAs, 22 were previously discovered in cotton species and 187 had sequence conservation and homology to homologous miRNAs of other plant species. Nucleotide bias analysis showed that the 9th and 1st positions were significantly conserved among different types of miRNA genes. Among the 463 putative miRNA target genes, most significant up/down-regulation occurred in 10-20 days post-anthesis, indicating that miRNAs played an important role during the elongation and secondary cell wall synthesis stages of cotton fiber development. The discovery of new miRNA genes will help understand the mechanisms of miRNA generation and regulation in cotton.
Affiliation(s)
- Qin Li
  - The State Key Laboratory of Protein and Plant Gene Research, College of Life Sciences, Peking University, Beijing 100871, China
|
14
|
Abstract
The purpose of this chapter is to provide a comprehensive review of the field of genome rearrangement, i.e., comparative genomics, based on the representation of genomes as ordered sequences of signed genes. We specifically focus on the "hard part" of genome rearrangement, how to handle duplicated genes. The main questions are: how have present-day genomes evolved from a common ancestor? What are the most realistic evolutionary scenarios explaining the observed gene orders? What was the content and structure of ancestral genomes? We aim to provide a concise but complete overview of the field, starting with the practical problem of finding an appropriate representation of a genome as a sequence of ordered genes or blocks, namely the problems of orthology, paralogy, and synteny block identification. We then consider three levels of gene organization: the gene family level (evolution by duplication, loss, and speciation), the cluster level (evolution by tandem duplications), and the genome level (all types of rearrangement events, including whole genome duplication).
Affiliation(s)
- Nadia El-Mabrouk
- Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, Montréal, QC, Canada
|
15
|
Tremblay Savard O, Bertrand D, El-Mabrouk N. Evolution of orthologous tandemly arrayed gene clusters. BMC Bioinformatics 2011; 12 Suppl 9:S2. [PMID: 22152029 PMCID: PMC3283317 DOI: 10.1186/1471-2105-12-s9-s2] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.3] [Indexed: 12/26/2022] Open
Abstract
BACKGROUND Tandemly Arrayed Gene (TAG) clusters are groups of paralogous genes that are found adjacent on a chromosome. TAGs represent an important repertoire of genes in eukaryotes. In addition to tandem duplication events, TAG clusters are affected during their evolution by other mechanisms, such as inversion and deletion events, that affect the order and orientation of genes. The DILTAG algorithm developed in [1] makes it possible to infer a set of optimal evolutionary histories explaining the evolution of a single TAG cluster, from an ancestral single gene, through tandem duplications (simple or multiple, direct or inverted), deletions and inversion events. RESULTS We present a general methodology, which is an extension of DILTAG, for the study of the evolutionary history of a set of orthologous TAG clusters in multiple species. In addition to the speciation events reflected by the phylogenetic tree of the considered species, the evolutionary events that are taken into account are simple or multiple tandem duplications, direct or inverted, simple or multiple deletions, and inversions. We analysed the performance of our algorithm on simulated data sets and we applied it to the protocadherin gene clusters of human, chimpanzee, mouse and rat. CONCLUSIONS Our results obtained on simulated data sets showed a good performance in inferring the total number and size distribution of duplication events. A limitation of the algorithm, however, lies in dealing with multiple gene deletions, as the algorithm is highly exponential in this case and quickly becomes intractable.
Affiliation(s)
- Denis Bertrand
- Computational and Mathematical Biology, Genome Institute of Singapore, Singapore
- Nadia El-Mabrouk
- Department of Computer Science (DIRO), University of Montreal, Montreal, Quebec, Canada
|
16
|
Song G, Zhang L, Vinar T, Miller W. CAGE: Combinatorial Analysis of Gene-cluster Evolution. J Comput Biol 2011; 17:1227-42. [PMID: 20874406 DOI: 10.1089/cmb.2010.0094] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 11/12/2022] Open
Abstract
Much important evolutionary activity occurs in gene clusters, where a copy of a gene may be free to acquire new functions. Current computational methods to extract evolutionary information from sequence data for such clusters are suboptimal, in part because accurate sequence data are often lacking in these genomic regions, making existing methods difficult to apply. We describe a new method for reconstructing the recent evolutionary history of gene clusters, and evaluate its performance on both simulated data and actual human gene clusters.
Affiliation(s)
- Giltae Song
- Center for Comparative Genomics and Bioinformatics, Penn State University, University Park, PA 16802, USA.
|
17
|
Vinar T, Brejová B, Song G, Siepel A. Reconstructing histories of complex gene clusters on a phylogeny. J Comput Biol 2011; 17:1267-79. [PMID: 20874408 DOI: 10.1089/cmb.2010.0090] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 11/13/2022] Open
Abstract
Clusters of genes that have evolved by repeated segmental duplication present difficult challenges throughout genomic analysis, from sequence assembly to functional analysis. These clusters are one of the major sources of evolutionary innovation, and they are linked to multiple diseases, including HIV and a variety of cancers. Understanding their evolutionary histories is a key to the application of comparative genomics methods in these regions of the genome. We propose a probabilistic model of gene cluster evolution on a phylogeny, and an MCMC algorithm for reconstruction of duplication histories from genomic sequences in multiple species. Several projects are underway to obtain high quality BAC-based assemblies of duplicated clusters in multiple species, and we anticipate use of our methods in their analysis.
Affiliation(s)
- Tomás Vinar
- Faculty of Mathematics, Physics and Informatics, Comenius University , Bratislava, Slovakia
|
18
|
Kahn CL, Mozes S, Raphael BJ. Efficient algorithms for analyzing segmental duplications with deletions and inversions in genomes. Algorithms Mol Biol 2010; 5:11. [PMID: 20047668 PMCID: PMC2820476 DOI: 10.1186/1748-7188-5-11] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 08/11/2009] [Accepted: 01/04/2010] [Indexed: 02/06/2023] Open
Abstract
Background Segmental duplications, or low-copy repeats, are common in mammalian genomes. In the human genome, most segmental duplications are mosaics comprised of multiple duplicated fragments. This complex genomic organization complicates analysis of the evolutionary history of these sequences. One model proposed to explain these mosaic patterns is a model of repeated aggregation and subsequent duplication of genomic sequences. Results We describe a polynomial-time exact algorithm to compute duplication distance, a genomic distance defined as the most parsimonious way to build a target string by repeatedly copying substrings of a fixed source string. This distance models the process of repeated aggregation and duplication. We also describe extensions of this distance to include certain types of substring deletions and inversions. Finally, we provide a description of a sequence of duplication events as a context-free grammar (CFG). Conclusion These new genomic distances will permit more biologically realistic analyses of segmental duplications in genomes.
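As a rough illustration of the idea of building a target string by copying substrings of a fixed source, the sketch below counts copies under a deliberately simplified, append-only model. It is not the polynomial-time algorithm of the paper: the function name `append_only_copy_count` and the restriction that every copy is appended at the right end of the growing string are assumptions made here for brevity.

```python
def append_only_copy_count(source: str, target: str) -> int:
    """Minimum number of substrings of `source` whose concatenation is `target`.

    Simplifying assumption (not from the paper): each copied substring is
    appended at the right end. Under that restriction, greedily taking the
    longest matching substring at each position is optimal, because every
    prefix of a substring of `source` is itself a substring of `source`.
    """
    count = 0
    i = 0
    while i < len(target):
        length = 0
        # Extend the current copy while it still occurs somewhere in source.
        while i + length < len(target) and target[i:i + length + 1] in source:
            length += 1
        if length == 0:
            raise ValueError(f"character {target[i]!r} does not occur in source")
        i += length
        count += 1
    return count


print(append_only_copy_count("abcde", "abcabde"))  # 3: "abc" + "ab" + "de"
```

If copies may also be inserted inside the string built so far, fewer operations can suffice (here, copying `abcde` and then inserting `ab` after `c` also yields the target), so this count is best read as an upper bound on any such distance.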
|
19
|
Lajoie M, Bertrand D, El-Mabrouk N. Inferring the evolutionary history of gene clusters from phylogenetic and gene order data. Mol Biol Evol 2009; 27:761-72. [PMID: 19903657 DOI: 10.1093/molbev/msp271] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Indexed: 01/03/2023] Open
Abstract
Gene duplication is frequent within gene clusters and plays a fundamental role in evolution by providing a source of new genetic material upon which natural selection can act. Although classical phylogenetic inference methods provide some insight into the evolutionary history of a gene cluster, they are not sufficient alone to differentiate single- from multiple-gene duplication events and to answer other questions regarding the nature and size of evolutionary events. In this paper, we present an algorithm that infers a set of optimal evolutionary histories for a gene cluster in a single species, according to a general cost model involving variable length duplications (in tandem or inverted), deletions, and inversions. We applied our algorithm to the human olfactory receptor and protocadherin gene clusters, showing that the duplication size distribution differs significantly between the two gene families. The algorithm is available through a web interface at http://www-lbit.iro.umontreal.ca/DILTAG/.
Affiliation(s)
- Mathieu Lajoie
- Département d'informatique et de recherche opérationnelle Université de Montréal, Montréal, Canada.
|
20
|
Zhang Y, Song G, Vinar T, Green ED, Siepel A, Miller W. Evolutionary history reconstruction for Mammalian complex gene clusters. J Comput Biol 2009; 16:1051-70. [PMID: 19645598 DOI: 10.1089/cmb.2009.0040] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/13/2022] Open
Abstract
Clusters of genes that evolved from single progenitors via repeated segmental duplications present significant challenges to the generation of a truly complete human genome sequence. Such clusters can confound both accurate sequence assembly and downstream computational analysis, yet they represent a hotbed of functional innovation, making them of extreme interest. We have developed an algorithm for reconstructing the evolutionary history of gene clusters using only human genomic sequence data, which allows the tempo of large-scale evolutionary events in human gene clusters to be estimated. We further propose an extension of the method to simultaneously reconstruct the evolutionary histories of orthologous gene clusters in multiple primates, which will facilitate primate comparative sequencing studies that aim to reconstruct their evolutionary history more fully.
Affiliation(s)
- Yu Zhang
- Center for Comparative Genomics and Bioinformatics, Penn State University , University Park, PA 16802, USA.
|
21
|
Abstract
A number of biological processes can lead to genes being copied within the genome of some given species. Duplicate genes of this form are called paralogs, and such genes share a high degree of sequence similarity and often have closely related functions. Some genes have become widely duplicated to form multigene families in which the copies are distributed both within the genomes of individual species and across different species. Statistical modelling of gene duplication and the evolution of multi-gene families currently lags behind well-established models of DNA sequence evolution despite an increasing volume of available data, but the analysis of multi-gene families is important as part of a wider effort to understand evolution at the genomic level. This article reviews existing approaches to modelling multi-gene families and presents various challenges and possibilities for this exciting area of research.
Affiliation(s)
- Tom M W Nye
- School of Mathematics and Statistics, Newcastle University, Newcastle, UK.
|
22
|
Vinař T, Brejová B, Song G, Siepel A. Reconstructing Histories of Complex Gene Clusters on a Phylogeny. Comparative Genomics 2009. [DOI: 10.1007/978-3-642-04744-2_13] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 01/29/2023]
|
23
|
Bushley KE, Ripoll DR, Turgeon BG. Module evolution and substrate specificity of fungal nonribosomal peptide synthetases involved in siderophore biosynthesis. BMC Evol Biol 2008; 8:328. [PMID: 19055762 PMCID: PMC2644324 DOI: 10.1186/1471-2148-8-328] [Citation(s) in RCA: 75] [Impact Index Per Article: 4.7] [Received: 07/20/2008] [Accepted: 12/03/2008] [Indexed: 11/20/2022] Open
Abstract
Background Most filamentous ascomycete fungi produce high affinity iron chelators called siderophores, biosynthesized nonribosomally by multimodular adenylating enzymes called nonribosomal peptide synthetases (NRPSs). While genes encoding the majority of NRPSs are intermittently distributed across the fungal kingdom, those encoding ferrichrome synthetase NRPSs, responsible for biosynthesis of ferrichrome siderophores, are conserved, which offers an opportunity to trace their evolution and the genesis of their multimodular domain architecture. Furthermore, since the chemistry of many ferrichromes is known, the biochemical and structural 'rules' guiding NRPS substrate choice can be addressed using protein structural modeling and evolutionary approaches. Results A search of forty-nine complete fungal genome sequences revealed that, with the exception of Schizosaccharomyces pombe, none of the yeast, chytrid, or zygomycete genomes contained a candidate ferrichrome synthetase. In contrast, all filamentous ascomycetes queried contained at least one, while presence and numbers in basidiomycetes varied. Genes encoding ferrichrome synthetases were monophyletic when analyzed with other NRPSs. Phylogenetic analyses provided support for an ancestral duplication event resulting in two main lineages. They also supported the proposed hypothesis that ferrichrome synthetases derive from an ancestral hexamodular gene, likely created by tandem duplication of complete NRPS modules. Recurrent losses of individual domains or complete modules from this ancestral gene best explain the diversity of extant domain architectures observed. Key residues and regions in the adenylation domain pocket involved in substrate choice and for binding the amino and carboxy termini of the substrate were identified. Conclusion Iron-chelating ferrichrome synthetases appear restricted to fission yeast, filamentous ascomycetes, and basidiomycetes and fall into two main lineages. 
Phylogenetic analyses suggest that loss of domains or modules led to evolution of iterative biosynthetic mechanisms that allow flexibility in biosynthesis of the ferrichrome product. The 10 amino acid NRPS code, proposed earlier, failed when we tried to infer substrate preference. Instead, our analyses point to several regions of the binding pocket important in substrate choice and suggest that two positions of the code are involved in substrate anchoring, not substrate choice.
Affiliation(s)
- Kathryn E Bushley
- Department of Plant Pathology & Plant-Microbe Biology, 334 Plant Science Building, Cornell University, Ithaca, NY 14853, USA.
|
24
|
Bertrand D, Lajoie M, El-Mabrouk N. Inferring ancestral gene orders for a family of tandemly arrayed genes. J Comput Biol 2008; 15:1063-77. [PMID: 18781832 DOI: 10.1089/cmb.2008.0025] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Indexed: 11/13/2022] Open
Abstract
Tandemly arrayed genes (TAGs) constitute a large fraction of most genomes and play important biological roles. They evolve through unequal recombination, which places duplicated genes next to the original ones (tandem duplications). Many algorithms have been proposed to infer a tandem duplication history for a TAG cluster. However, the presence of different transcriptional orientations in many clusters highlights the fact that processes such as inversions also contribute to their evolution. Moreover, existing algorithms are restricted to the study of TAG evolution in a single species (only paralogous genes are considered). To circumvent these limitations, we consider an evolutionary model for TAGs involving duplication, gene loss, inversion, and speciation events. A general framework to infer ancestral gene orders that minimize the number of inversions in the whole evolutionary history is presented. At the methodological level, this paper integrates three approaches to genome evolution: the duplication tree reconstruction, the gene tree/species tree reconciliation theory, and the concept of inversion median used in order-based phylogeny reconstruction. An application on a cluster of olfactory receptor genes in four mammals is presented.
|
25
|
Lajoie M, Bertrand D, El-Mabrouk N, Gascuel O. Duplication and inversion history of a tandemly repeated genes family. J Comput Biol 2007; 14:462-78. [PMID: 17572024 DOI: 10.1089/cmb.2007.a007] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.3] [Indexed: 11/12/2022] Open
Abstract
Given a phylogenetic tree for a family of tandemly repeated genes and their signed order on the chromosome, we aim to find the minimum number of inversions compatible with an evolutionary history of this family. This is the first attempt to account for inversions in an evolutionary model of tandemly repeated genes. We present a branch-and-bound algorithm that finds the exact solution, and a polynomial-time heuristic based on the breakpoint distance. We show, on simulated data, that those algorithms can be used to improve phylogenetic inference of tandemly repeated gene families. An application on a published phylogeny of KRAB zinc finger genes is presented.
Affiliation(s)
- Mathieu Lajoie
- DIRO, Université de Montréal, Montréal H3C 3J7, QC, Canada.
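The breakpoint distance mentioned in the abstract above can be illustrated with a short sketch for signed gene orders. This is the common textbook definition, not necessarily the exact variant used by the authors' heuristic; the function name `breakpoint_distance`, the encoding of genes as nonzero signed integers, and the choice not to frame the orders with artificial end markers are all assumptions made here.

```python
def breakpoint_distance(order_a, order_b):
    """Count adjacencies of order_a that are not conserved in order_b.

    Genes are nonzero integers; the sign encodes transcriptional orientation.
    An adjacency (x, y) is treated as conserved if order_b contains x followed
    by y, or -y followed by -x (the same adjacency read on the other strand).
    """
    def adjacencies(order):
        adj = set()
        for x, y in zip(order, order[1:]):
            adj.add((x, y))
            adj.add((-y, -x))
        return adj

    conserved = adjacencies(order_b)
    return sum((x, y) not in conserved for x, y in zip(order_a, order_a[1:]))


# Inverting the segment (2, 3) of the first order gives the second order
# and creates one breakpoint on each side of the inverted segment.
print(breakpoint_distance([1, 2, 3, 4], [1, -3, -2, 4]))  # 2
```

A single inversion changes at most two adjacencies, which is why breakpoint counts are often used as a cheap proxy when an exact inversion count is too expensive to compute.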
|
26
|
Sammeth M, Stoye J. Comparing tandem repeats with duplications and excisions of variable degree. IEEE/ACM Trans Comput Biol Bioinform 2006; 3:395-407. [PMID: 17085848 DOI: 10.1109/tcbb.2006.46] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 05/12/2023]
Abstract
Traditional sequence comparison by alignment employs a mutation model comprised of two events, substitutions and indels (insertions or deletions) of single positions. However, modern genetic analysis knows a variety of more complex mutation events (e.g., duplications, excisions, and rearrangements), especially regarding DNA. With ever more DNA sequence data becoming available, the need to accurately compare sequences which have clearly undergone more complicated types of mutational processes is becoming critical. Herein we introduce a new method for pairwise alignment and comparison of sequences with respect to the special evolution of tandem repeats: substitutions and indels of single positions and, additionally, duplications and excisions of variable degree (i.e., of one or more repeat copies simultaneously) are taken into account. To evaluate our method, we apply it to the spa VNTR (variable number of tandem repeats) cluster of Staphylococcus aureus, a bacterium of high medical importance.
|
27
|
Guddeti S, Zhang DC, Li AL, Leseberg CH, Kang H, Li XG, Zhai WX, Johns MA, Mao L. Molecular evolution of the rice miR395 gene family. Cell Res 2006; 15:631-8. [PMID: 16117853 DOI: 10.1038/sj.cr.7290333] [Citation(s) in RCA: 92] [Impact Index Per Article: 5.1] [Indexed: 11/09/2022] Open
Abstract
MicroRNAs (miRNAs) are 20-22 nucleotide non-coding RNAs that play important roles in plant and animal development. They are usually processed from larger precursors that can form stem-loop structures. Among 20 miRNA families that are conserved between Arabidopsis and rice, the rice miR395 gene family was unique because it was organized into compact clusters that could be transcribed as one single transcript. We show here that this family in fact had four clusters with a total of 24 genes. Three of these clusters were segmental duplications. They contained miR395 genes both 120 bp and 66 bp long. However, only the latter was repeatedly duplicated. The fourth cluster contained miR395 genes of two different sizes that could be the consequence of intergenic recombination of genes from the first three clusters. On each cluster, both 1-duplication and 2-duplication histories were observed based on the sequence similarity between miR395 genes, some of which were nearly identical, suggesting a recent origin. This was supported by a miR395 locus survey among several species of the genus Oryza, where two clusters were only found in species with an AA genome, the genome of the cultivated rice. A comparative study of the genomic organization of the Medicago truncatula miR395 gene family showed significant expansion of intergenic spaces, indicating that the originally clustered genes were drifting away from each other. The diverse genomic organizations of a conserved microRNA gene family in different plant genomes indicated that this important negative gene regulation system has undergone dramatic tune-ups in plant genomes.
Affiliation(s)
- Sreelatha Guddeti
- Department of Biological Sciences, Northern Illinois University, DeKalb, IL 60115, USA
|
28
|
An exact and polynomial distance-based algorithm to reconstruct single copy tandem duplication trees. ACTA ACUST UNITED AC 2005. [DOI: 10.1016/j.jda.2004.08.013] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/30/2022]
|
29
|
Kelley J, de Bono B, Trowsdale J. IRIS: A database surveying known human immune system genes. Genomics 2005; 85:503-11. [PMID: 15780753 DOI: 10.1016/j.ygeno.2005.01.009] [Citation(s) in RCA: 48] [Impact Index Per Article: 2.5] [Received: 12/11/2004] [Revised: 01/19/2005] [Accepted: 01/20/2005] [Indexed: 11/29/2022]
Abstract
We have compiled an online database of known human defense genes: the Immunogenetic Related Information Source (IRIS). As of October 1, 2004, there are 1562 immune genes recorded in IRIS, representing 7% of the human genome. This resource contains searchable information including chromosomal location, sequence data, and a curated functional annotation for each entry. We used IRIS as a basis for analyzing the composition and characteristics of the immune genome, such as gene clustering, polymorphism, and relationship to disease. High protein sequence similarity correlated inversely with distance between immune genes, consistent with clustering of duplicated loci. We also found that, even though some immune genes exhibit high levels of polymorphism, such as MHC class I, the range of levels of polymorphism in immune genes is similar to that of nonimmune genes. Approximately 20% of immune genes have a known disease association. IRIS is available online at .
Affiliation(s)
- James Kelley
- Department of Pathology, Immunology Division, University of Cambridge, Cambridge CB2 1QP, UK
|
30
|
|
31
|
Bertrand D, Gascuel O. Topological rearrangements and local search method for tandem duplication trees. IEEE/ACM Trans Comput Biol Bioinform 2005; 2:15-28. [PMID: 17044161 DOI: 10.1109/tcbb.2005.15] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 05/12/2023]
Abstract
The problem of reconstructing the duplication history of a set of tandemly repeated sequences was first introduced by Fitch. Many recent studies deal with this problem, showing the validity of the unequal recombination model proposed by Fitch, describing numerous inference algorithms, and exploring the combinatorial properties of these new mathematical objects, which are duplication trees. In this paper, we deal with the topological rearrangement of these trees. Classical rearrangements used in phylogeny (NNI, SPR, TBR, ...) cannot be applied directly on duplication trees. We show that restricting the neighborhood defined by the SPR (Subtree Pruning and Regrafting) rearrangement to valid duplication trees allows exploring the whole duplication tree space. We use these restricted rearrangements in a local search method which improves an initial tree via successive rearrangements. This method is applied to the optimization of parsimony and minimum evolution criteria. We show through simulations that this method improves on all existing programs for both reconstructing the topology of the true tree and recovering its duplication events. We apply this approach to tandemly repeated human Zinc finger genes and observe that a much better duplication tree is obtained by our method than by any other program.
Affiliation(s)
- Denis Bertrand
- Projet Méthodes et Algorithmes pour la Bioinformatique, LIRMM (UMR 5506, CNRS-Univ. Montpellier 2), 161 rue Ada, 34392 Montpellier 5, France
|
32
|
Alkan C, Eichler EE, Bailey JA, Sahinalp SC, Tüzün E. The Role of Unequal Crossover in Alpha-Satellite DNA Evolution: A Computational Analysis. J Comput Biol 2004; 11:933-44. [PMID: 15700410 DOI: 10.1089/cmb.2004.11.933] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Indexed: 11/13/2022] Open
Abstract
Human DNA consists of a large number of tandem repeat sequences. Such sequences are usually called satellites, with the primary example being the centromeric alpha-satellite DNA. The basic repeat unit of the alpha-satellite DNA is a 171 bp monomer. Arbitrary monomer pairs usually have considerable sequence divergence (20-40%). However, with the exception of peripheral alpha-satellite DNA, monomers can be grouped into blocks of k-monomers (4 ≤ k ≤ 20) between which the divergence rate is much smaller (e.g., 5%). Perhaps the simplest and best understood mechanism for tandem repeat array evolution is unequal crossover. Although it is possible that alpha-satellite sequences developed as a result of subsequent unequal crossovers only, no formal computational framework seems to have been developed to verify this possibility. In this paper, we develop such a framework and report on experiments which imply that pericentromeric alpha-satellite segments (which are devoid of higher order structure) are evolutionarily distinct from the higher order repeat segments. It is likely that the higher order repeats developed independently in distinct regions of the genome and were carried into their current locations through an unknown mechanism of transposition.
Affiliation(s)
- Can Alkan
- Department of EECS, Case Western Reserve University, Cleveland, OH 44106, USA
|
33
|
Abstract
In the class of repeated sequences that occur in DNA, minisatellites have been found polymorphic and became useful tools in genetic mapping and forensic studies. They consist of a heterogeneous tandem array of a short repeat unit. The slightly different units along the array are called variants. Minisatellites evolve mainly through tandem duplications and tandem deletions of variants. Jeffreys et al. (1997) devised a method to obtain the sequence of variants along the array in a digital code and called such sequences maps. Minisatellite maps give access to the detail of mutation processes at work on such loci. In this paper, we design an algorithm to compare two maps under an evolutionary model that includes deletion, insertion, mutation, tandem duplication, and tandem deletion of a variant. Our method computes an optimal alignment in reasonable time; and the alignment score, i.e., the weighted sum of its elementary operations, is a distance metric between maps. The main difficulty is that the optimal sequence of operations depends on the order in which they are applied to the map. Taking the maps of the minisatellite MSY1 of 609 men, we computed all pairwise distances and reconstructed an evolutionary tree of these individuals. MSY1 (DYF155S1) is a hypervariable locus on the Y chromosome. In our tree, the populations of some haplogroups are monophyletic, showing that one can decipher a microevolutionary signal using minisatellite maps comparison.
Affiliation(s)
- Sèverine Bérard
- L.I.R.M.M., UMR CNRS 5506, 161 rue Ada, F34392 Montpellier Cedex 5, France
|
34
|
|
35
|
Pisanti N, Marangoni R, Ferragina P, Frangioni A, Savona A, Pisanelli C, Luccio F. PaTre: a method for paralogy trees construction. J Comput Biol 2003; 10:791-802. [PMID: 14633400 DOI: 10.1089/106652703322539105] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Indexed: 11/12/2022] Open
Abstract
Genomes can be described as a collection of clusters, the gene families, whose members are called paralogs. Paralogs are genes that most probably share a duplication history and show significant similarity in their sequences, even if they perform slightly different biological functions. Among the different mechanisms that have led to an increase of the genomic information during biological evolution, gene duplication is probably the most important. To better understand duplication events, the first step is to investigate the history of the gene families in order to detect which duplication events have taken place, and in which relative (partial) order. Here we present a method, called PaTre, that, given a gene family, attempts to construct the paralogy tree of the family. We work under the hypothesis that every family member derives from a duplication process of another member. By the term paralogy tree, we mean a directed tree in which the root represents the most ancient paralog of the family and each oriented arc (a, b) represents the existence of a duplication event from the template gene a to its copy b. Notice that gene a survives the event and can serve as a template of more than one duplication event; in fact, there can be more than one arc leaving a. PaTre uses new algorithmic techniques motivated by the specific application at hand. The reliability of the inferential process has been tested by means of a simulator that implements different hypotheses on the duplication-with-modification paradigm and on three examples of different biological gene families, belonging to both lower and higher organisms.
Affiliation(s)
- N Pisanti
- Department of Computer Science, University of Pisa, Italy.
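The paralogy tree defined in the abstract above (a directed tree whose arcs (a, b) go from a template gene to its copy) can be represented with a small data structure. The sketch below only stores and validates such a tree; it does not attempt PaTre's inference step, and the function name `build_paralogy_tree` and the gene labels are hypothetical.

```python
from collections import defaultdict


def build_paralogy_tree(arcs):
    """Return (root, children) for a list of (template, copy) arcs.

    Checks the properties stated in the definition: every copy has exactly
    one template, there is a single root (the most ancient paralog), and
    every gene is reachable from that root.
    """
    children = defaultdict(list)
    parent = {}
    for template, copy in arcs:
        if copy in parent:
            raise ValueError(f"{copy!r} has more than one template gene")
        parent[copy] = template
        children[template].append(copy)

    roots = {gene for gene in children if gene not in parent}
    if len(roots) != 1:
        raise ValueError("expected exactly one root (most ancient paralog)")
    root = roots.pop()

    # Walk down from the root to make sure the arcs form one connected tree.
    seen, stack = set(), [root]
    while stack:
        gene = stack.pop()
        seen.add(gene)
        stack.extend(children.get(gene, []))
    if len(seen) != len(parent) + 1:
        raise ValueError("arcs do not form a single tree")
    return root, dict(children)


# g1 serves as the template for two duplication events, as the definition allows.
root, children = build_paralogy_tree([("g1", "g2"), ("g1", "g3"), ("g3", "g4")])
print(root, children)
```

Storing arcs as a template-to-copies map keeps the "a gene can be the template of several duplications" property explicit, while the single-parent check enforces that each copy arises from one event.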
|
36
|
Elemento O, Lefranc MP. IMGT/PhyloGene: an on-line tool for comparative analysis of immunoglobulin and T cell receptor genes. Dev Comp Immunol 2003; 27:763-779. [PMID: 12818634 DOI: 10.1016/s0145-305x(03)00078-8] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.3] [Indexed: 05/24/2023]
Abstract
IMGT/PhyloGene is an on-line software package for comparative analysis of immunoglobulin (IG) and T cell receptor (TR) variable genes of all vertebrate species, newly implemented in IMGT, the international ImMunoGeneTics information system®. IMGT/PhyloGene is strongly associated with the IMGT gene and allele nomenclature and with the IMGT unique numbering for V-REGION, which directly creates standardized alignments from IMGT reference sequences. IMGT/PhyloGene is the first tool to use the IMGT expertized and standardized data for automated comparative analyses, and the first on-line software package for phylogenetic reconstruction to be integrated with a sequence database. Starting from a standardized alignment of selected sequences, IMGT/PhyloGene computes a matrix of evolutionary distances, builds a tree using the Neighbor-Joining (NJ) algorithm, and outputs various graphical tree representations. The resulting IMGT/PhyloGene tree is then used as a support for studying the evolution of particular subregions, such as the CDR-IMGT (Complementarity Determining Regions) or the V-RS (Variable gene Recombination Signals). IMGT/PhyloGene is freely available at http://imgt.cines.fr.
Affiliation(s)
- Olivier Elemento
- IMGT, the International ImMunoGeneTics Information System, Laboratoire d'ImmunoGénétique Moléculaire (LIGM), Université Montpellier II, UPR CNRS 1142, Institut de Génétique Humaine (IGH), 141 rue de la Cardonille, 34396 Cedex 5, Montpellier, France
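The first step of the pipeline described in the abstract above, computing a matrix of evolutionary distances from a standardized alignment, can be sketched as follows. The abstract does not say which distance IMGT/PhyloGene uses, so the uncorrected p-distance (fraction of differing ungapped columns) is an assumption chosen for simplicity; the function names and the toy sequences are likewise hypothetical, and the Neighbor-Joining step is not implemented here.

```python
from itertools import combinations


def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of differing positions among columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for x, y in zip(seq_a, seq_b):
        if x == gap or y == gap:
            continue  # skip columns that contain a gap in either sequence
        compared += 1
        differing += x != y
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def distance_matrix(alignment):
    """Symmetric matrix (dict of dicts) of pairwise p-distances, keyed by sequence name."""
    names = list(alignment)
    matrix = {name: {name: 0.0} for name in names}
    for a, b in combinations(names, 2):
        matrix[a][b] = matrix[b][a] = p_distance(alignment[a], alignment[b])
    return matrix


toy_alignment = {"v1": "ACGT", "v2": "ACGA", "v3": "AC-T"}
print(distance_matrix(toy_alignment)["v1"]["v2"])  # 0.25
```

A matrix in this form is the kind of input a Neighbor-Joining implementation expects, which is why the distance step and the tree-building step are usually kept separate.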
|