1
Yeung C, Qu X, Sala-Torra O, Woolston D, Radich J, Fang M. Mutational profiling in acute lymphoblastic leukemia by RNA sequencing and chromosomal genomic array testing. Cancer Med 2021; 10:5629-5642. [PMID: 34288525 PMCID: PMC8366081 DOI: 10.1002/cam4.4101] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/24/2020] [Revised: 04/29/2021] [Accepted: 05/02/2021] [Indexed: 12/03/2022]
Abstract
Background Comprehensive molecular and cytogenetic profiling of acute lymphoblastic leukemia (ALL) is critical to the current standard of care for patients with B‐acute lymphoblastic leukemia (B‐ALL). Here we propose a rapid process for detecting gene fusions whereby FusionPlex RNA next‐generation sequencing (NGS) and DNA chromosome genomic array testing (CGAT) are combined for a more efficient approach in the management of patients with B‐ALL. Methods We performed RNA NGS and CGAT on 28 B‐ALL samples and, in four patients, compared fixed cell pellets to paired cryopreserved samples as a starting material to further assess the utility of cytogenetic fixed pellets for gene expression analysis. Results Among the fixed specimens, when using alternative techniques as references, including karyotype, fluorescence in situ hybridization, CGAT, and RT‐qPCR, fusions were detected by RNA NGS with 100% sensitivity and specificity. In the four paired fixed versus fresh cryopreserved samples, fusions were also 100% concordant. Four of the 28 patients showed mutations that were detected by RNA sequencing, and three of these four mutations had well‐known drug resistance implications. Conclusions We conclude that FusionPlex is a robust and reliable anchored multiplex RNA sequencing platform for use in the detection of fusions in both fresh cryopreserved and cytogenetic fixed pellets. Gene expression data could only be obtained from fresh samples and, although limited variant data are available, critical hotspot variants can be determined in conjunction with the fusions.
Affiliation(s)
- Cecilia Yeung: Fred Hutchinson Cancer Research Center, Clinical Research Division, Seattle, WA, USA; University of Washington, Seattle, WA, USA; Seattle Cancer Care Alliance, Seattle, WA, USA
- Xiaoyu Qu: Seattle Cancer Care Alliance, Seattle, WA, USA
- Olga Sala-Torra: Fred Hutchinson Cancer Research Center, Clinical Research Division, Seattle, WA, USA
- David Woolston: Fred Hutchinson Cancer Research Center, Clinical Research Division, Seattle, WA, USA
- Jerry Radich: Fred Hutchinson Cancer Research Center, Clinical Research Division, Seattle, WA, USA; University of Washington, Seattle, WA, USA; Seattle Cancer Care Alliance, Seattle, WA, USA
- Min Fang: Fred Hutchinson Cancer Research Center, Clinical Research Division, Seattle, WA, USA; University of Washington, Seattle, WA, USA; Seattle Cancer Care Alliance, Seattle, WA, USA
2
Biswas N, Chakrabarti S. Artificial Intelligence (AI)-Based Systems Biology Approaches in Multi-Omics Data Analysis of Cancer. Front Oncol 2020; 10:588221. [PMID: 33154949 PMCID: PMC7591760 DOI: 10.3389/fonc.2020.588221] [Citation(s) in RCA: 53] [Impact Index Per Article: 13.3] [Received: 07/28/2020] [Accepted: 09/21/2020] [Indexed: 12/13/2022]
Abstract
Cancer is the manifestation of abnormalities of different physiological processes involving genes, DNAs, RNAs, proteins, and other biomolecules whose profiles are reflected in different omics data types. As these bio-entities are highly correlated, integrative analysis of different types of omics data (multi-omics data) is required to understand the disease from tumorigenesis to disease progression. Artificial intelligence (AI), specifically machine learning algorithms, has the ability to make decisive interpretations of "big"-sized complex data and, hence, appears to be the most effective tool for the analysis and understanding of multi-omics data for patient-specific observations. In this review, we have discussed the recent outcomes of employing AI in multi-omics data analysis of different types of cancer. Based on the research trends and significance in patient treatment, we have primarily focused on AI-based analysis for determining cancer subtypes, disease prognosis, and therapeutic targets. We have also discussed AI analysis of some non-canonical types of omics data, as they may play a determining role in cancer patient care. Additionally, we have briefly discussed the data repositories because of their pivotal role in multi-omics data storage, processing, and analysis.
Affiliation(s)
- Nupur Biswas: Structural Biology and Bioinformatics Division, CSIR-Indian Institute of Chemical Biology, IICB TRUE Campus, Kolkata, India
- Saikat Chakrabarti: Structural Biology and Bioinformatics Division, CSIR-Indian Institute of Chemical Biology, IICB TRUE Campus, Kolkata, India
3
Worthey EA. Analysis and Annotation of Whole-Genome or Whole-Exome Sequencing Derived Variants for Clinical Diagnosis. Curr Protoc Hum Genet 2017; 95:9.24.1-9.24.28. [PMID: 29044471 DOI: 10.1002/cphg.49] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Indexed: 12/17/2022]
Abstract
Over the last 10 years, next-generation sequencing (NGS) has transformed genomic research through substantial advances in technology and reduction in the cost of sequencing, and also in the systems required for analysis of these large volumes of data. This technology is now being used as a standard molecular diagnostic test in some clinical settings. The advances in sequencing have come so rapidly that the major bottleneck in identification of causal variants is no longer the sequencing or analysis (given access to appropriate tools), but rather clinical interpretation. Interpretation of genetic findings in a complex and ever changing clinical setting is scarcely a new challenge, but the task is increasingly complex in clinical genome-wide sequencing given the dramatic increase in dataset size and complexity. This increase requires application of appropriate interpretation tools, as well as development and application of appropriate methodologies and standard procedures. This unit provides an overview of these items. Specific challenges related to implementation of genome-wide sequencing in a clinical setting are discussed. © 2017 by John Wiley & Sons, Inc.
4
de Brot S, Schade B, Croci M, Dettwiler M, Guscetti F. Sequence and partial functional analysis of canine Bcl-2 family proteins. Res Vet Sci 2015; 104:126-35. [PMID: 26850551 DOI: 10.1016/j.rvsc.2015.12.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 09/05/2014] [Revised: 11/08/2015] [Accepted: 12/04/2015] [Indexed: 12/26/2022]
Abstract
Dogs present with spontaneous neoplasms biologically similar to human cancers. Apoptotic pathways are deregulated during cancer genesis and progression and are important for therapy. We have assessed the degree of conservation of a set of canine Bcl-2 family members with the human and murine orthologs. To this end, seven complete canine open reading frames from this family, four of which are novel for the dog, were cloned; their sequences were analyzed, and their functional interactions were studied in yeast. We found a high degree of overall and domain sequence homology between canine and human proteins, slightly higher than that between murine and human proteins. Functional interactions between canine pro-apoptotic Bax and Bak and anti-apoptotic Bcl-xL, Bcl-w, and Mcl-1 were recapitulated in yeast. Our data provide support for the notion that systems based on canine-derived proteins might faithfully reproduce Bcl-2 family member interactions known from other species and establish yeast as a useful tool for functional studies with canine proteins.
Affiliation(s)
- S de Brot: Institute of Veterinary Pathology, Vetsuisse Faculty, University of Zurich, Winterthurerstrasse 268, CH-8057 Zurich, Switzerland
- B Schade: Institute of Veterinary Pathology, Vetsuisse Faculty, University of Zurich, Winterthurerstrasse 268, CH-8057 Zurich, Switzerland
- M Croci: Institute of Veterinary Pathology, Vetsuisse Faculty, University of Zurich, Winterthurerstrasse 268, CH-8057 Zurich, Switzerland
- M Dettwiler: Institute of Veterinary Pathology, Vetsuisse Faculty, University of Zurich, Winterthurerstrasse 268, CH-8057 Zurich, Switzerland
- F Guscetti: Institute of Veterinary Pathology, Vetsuisse Faculty, University of Zurich, Winterthurerstrasse 268, CH-8057 Zurich, Switzerland
5
Worthey EA. Analysis and annotation of whole-genome or whole-exome sequencing-derived variants for clinical diagnosis. Curr Protoc Hum Genet 2013; 79:9.24.1-9.24.24. [PMID: 24510652 DOI: 10.1002/0471142905.hg0924s79] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Indexed: 01/07/2023]
Abstract
Over the last several years, next-generation sequencing (NGS) has transformed genomic research through substantial advances in technology and reduction in the cost of sequencing, and also in the systems required for analysis of these large volumes of data. This technology is now being used as a standard molecular diagnostic test under particular circumstances in some clinical settings. The advances in sequencing have come so rapidly that the major bottleneck in identification of causal variants is no longer the sequencing but rather the analysis and interpretation. Interpretation of genetic findings in a clinical setting is scarcely a new challenge, but the task is increasingly complex in clinical genome-wide sequencing given the dramatic increase in dataset size and complexity. This increase requires the development of novel or repositioned analysis tools, methodologies, and processes. This unit provides an overview of these items. Specific challenges related to implementation in a clinical setting are discussed.
Affiliation(s)
- Elizabeth A Worthey: Department of Pediatrics, Medical College of Wisconsin, Milwaukee, Wisconsin; The Human and Molecular Genetics Center, Medical College of Wisconsin, Milwaukee, Wisconsin; Department of Computer Science, University of Wisconsin, Milwaukee, Wisconsin
6
Kiran A, Loughran G, O'Mahony JJ, Baranov PV. Identification of A-to-I RNA editing: dotting the i's in the human transcriptome. Biochemistry (Mosc) 2012; 76:915-23. [PMID: 22022965 DOI: 10.1134/s0006297911080074] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 01/13/2023]
Abstract
The phenomenon of adenosine-to-inosine (A-to-I) RNA editing has attracted considerable attention from the scientific community due to its potential relationship to the evolution of cognition in animals. While A-to-I editing exists in all organisms with neurons, including those with primitive neuronal systems (hydra and nematodes), it is particularly frequent in organisms with a highly developed central nervous system (primates, especially humans). Diversification of RNA transcript sequences via A-to-I editing serves a number of different functional roles, such as altering the genome-templated identity of particular amino acids in proteins or altering splice site junctions and modulating regulation of alternatively spliced mRNA variants. Here we provide an overview of current computational and experimental methods for the high-throughput discovery of edited RNA nucleotides in the human transcriptome, as well as a survey of the existing RNA editing bioinformatics resources and an outlook of future perspectives.
Affiliation(s)
- A Kiran: Biochemistry Department, University College Cork, Cork, Ireland
7
Baertsch R, Diekhans M, Kent WJ, Haussler D, Brosius J. Retrocopy contributions to the evolution of the human genome. BMC Genomics 2008; 9:466. [PMID: 18842134 PMCID: PMC2584115 DOI: 10.1186/1471-2164-9-466] [Citation(s) in RCA: 92] [Impact Index Per Article: 5.8] [Received: 03/17/2008] [Accepted: 10/08/2008] [Indexed: 02/06/2023]
Abstract
Background Evolution via point mutations is a relatively slow process and is unlikely to completely explain the differences between primates and other mammals. By contrast, 45% of the human genome is composed of retroposed elements, many of which were inserted in the primate lineage. A subset of retroposed mRNAs (retrocopies) shows strong evidence of expression in primates, often yielding functional retrogenes. Results To identify and analyze the relatively recently evolved retrogenes, we carried out BLASTZ alignments of all human mRNAs against the human genome and scored a set of features indicative of retroposition. Of over 12,000 putative retrocopy-derived genes that arose mainly in the primate lineage, 726 with strong evidence of transcript expression were examined in detail. These mRNA retroposition events fall into three categories: I) 34 retrocopies and antisense retrocopies that added potential protein coding space and UTRs to existing genes; II) 682 complete retrocopy duplications inserted into new loci; and III) an unexpected set of 13 retrocopies that contributed out-of-frame, or antisense sequences in combination with other types of transposed elements (SINEs, LINEs, LTRs), even unannotated sequence to form potentially novel genes with no homologs outside primates. In addition to their presence in human, several of the gene candidates also had potentially viable ORFs in chimpanzee, orangutan, and rhesus macaque, underscoring their potential of function. Conclusion mRNA-derived retrocopies provide raw material for the evolution of genes in a wide variety of ways, duplicating and amending the protein coding region of existing genes as well as generating the potential for new protein coding space, or non-protein coding RNAs, by unexpected contributions out of frame, in reverse orientation, or from previously non-protein coding sequence.
Affiliation(s)
- Robert Baertsch: Center for Biomolecular Science and Engineering, University of California Santa Cruz, Santa Cruz, California 95064, USA
8
Kong QP, Salas A, Sun C, Fuku N, Tanaka M, Zhong L, Wang CY, Yao YG, Bandelt HJ. Distilling artificial recombinants from large sets of complete mtDNA genomes. PLoS One 2008; 3:e3016. [PMID: 18714389 PMCID: PMC2515346 DOI: 10.1371/journal.pone.0003016] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.7] [Received: 04/23/2008] [Accepted: 07/21/2008] [Indexed: 11/19/2022]
Abstract
Background Large-scale genome sequencing poses enormous problems to the logistics of laboratory work and data handling. When numerous fragments of different genomes are PCR amplified and sequenced in a laboratory, there is a high immanent risk of sample confusion. For genetic markers, such as mitochondrial DNA (mtDNA), which are free of natural recombination, single instances of sample mix-up involving different branches of the mtDNA phylogeny would give rise to reticulate patterns and should therefore be detectable. Methodology/Principal Findings We have developed a strategy for comparing new complete mtDNA genomes, one by one, to a current skeleton of the worldwide mtDNA phylogeny. The mutations distinguishing the reference sequence from a putative recombinant sequence can then be allocated to two or more different branches of this phylogenetic skeleton. Thus, one would search for two (or three) near-matches in the total mtDNA database that together best explain the variation seen in the recombinants. The evolutionary pathway from the mtDNA tree connecting this pair, together with the recombinant, then generates a grid-like median network, from which one can read off the exchanged segments. Conclusions We have applied this procedure to a large collection of complete human mtDNA sequences, where several recombinants could be distilled by our method. All these recombinant sequences were subsequently corrected by de novo experiments, which were fully concordant with the predictions from our data-analytical approach.
Affiliation(s)
- Qing-Peng Kong: State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China; Laboratory for Conservation and Utilization of Bio-resource, Yunnan University, Kunming 650091, China
- Antonio Salas: Unidade de Xenética, Instituto de Medicina Legal, Facultad de Medicina, Universidad de Santiago de Compostela, Galicia, Spain
- Chang Sun: State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China
- Noriyuki Fuku: Department of Genomics for Longevity and Health, Tokyo Metropolitan Institute of Gerontology, Tokyo, Japan
- Masashi Tanaka: Department of Genomics for Longevity and Health, Tokyo Metropolitan Institute of Gerontology, Tokyo, Japan
- Li Zhong: Laboratory for Conservation and Utilization of Bio-resource, Yunnan University, Kunming 650091, China
- Cheng-Ye Wang: State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China
- Yong-Gang Yao: State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China; Key Laboratory of Animal Models and Human Disease Mechanisms, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China
9
Szafranski K, Schindler S, Taudien S, Hiller M, Huse K, Jahn N, Schreiber S, Backofen R, Platzer M. Violating the splicing rules: TG dinucleotides function as alternative 3' splice sites in U2-dependent introns. Genome Biol 2008; 8:R154. [PMID: 17672918 PMCID: PMC2374985 DOI: 10.1186/gb-2007-8-8-r154] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.6] [Received: 03/08/2007] [Revised: 06/14/2007] [Accepted: 08/01/2007] [Indexed: 01/25/2023]
Abstract
TG dinucleotides functioning as alternative 3' splice sites were identified and experimentally verified in 36 human genes. Background Despite some degeneracy of sequence signals that govern splicing of eukaryotic pre-mRNAs, it is an accepted rule that U2-dependent introns exhibit the 3' terminal dinucleotide AG. Intrigued by anecdotal evidence for functional non-AG 3' splice sites, we carried out a human genome-wide screen. Results We identified TG dinucleotides functioning as alternative 3' splice sites in 36 human genes. The TG-derived splice variants were experimentally validated with a success rate of 92%. Interestingly, ratios of alternative splice variants are tissue-specific for several introns. TG splice sites and their flanking intron sequences are substantially conserved between orthologous vertebrate genes, even between human and frog, indicating functional relevance. Remarkably, TG splice sites are exclusively found as alternative 3' splice sites, never as the sole 3' splice site for an intron, and we observed a distance constraint for TG-AG splice site tandems. Conclusion Since TG splice sites are exclusively found as alternative 3' splice sites, the U2 spliceosome apparently accomplishes perfect specificity for 3' AGs at an early splicing step, but may choose 3' TGs during later steps. Given the tiny fraction of TG 3' splice sites compared to the vast number of non-viable TGs, cis-acting sequence signals must significantly contribute to splice site definition. Thus, we consider TG-AG 3' splice site tandems as promising subjects for studies on the mechanisms of 3' splice site selection.
Affiliation(s)
- Karol Szafranski: Genome Analysis, Leibniz Institute for Age Research - Fritz Lipmann Institute, Beutenbergstr., 07745 Jena, Germany
- Stefanie Schindler: Genome Analysis, Leibniz Institute for Age Research - Fritz Lipmann Institute, Beutenbergstr., 07745 Jena, Germany
- Stefan Taudien: Genome Analysis, Leibniz Institute for Age Research - Fritz Lipmann Institute, Beutenbergstr., 07745 Jena, Germany
- Michael Hiller: Institute of Computer Science, Bioinformatics Group, Albert-Ludwigs-University Freiburg, Georges-Koehler-Allee, 79110 Freiburg, Germany
- Klaus Huse: Genome Analysis, Leibniz Institute for Age Research - Fritz Lipmann Institute, Beutenbergstr., 07745 Jena, Germany
- Niels Jahn: Genome Analysis, Leibniz Institute for Age Research - Fritz Lipmann Institute, Beutenbergstr., 07745 Jena, Germany
- Stefan Schreiber: Institute of Clinical Molecular Biology, Christian Albrechts University Kiel, Schittenhelmstr., 24105 Kiel, Germany
- Rolf Backofen: Institute of Computer Science, Bioinformatics Group, Albert-Ludwigs-University Freiburg, Georges-Koehler-Allee, 79110 Freiburg, Germany
- Matthias Platzer: Genome Analysis, Leibniz Institute for Age Research - Fritz Lipmann Institute, Beutenbergstr., 07745 Jena, Germany
10
Yao A, Charlab R, Li P. Systematic identification of pseudogenes through whole genome expression evidence profiling. Nucleic Acids Res 2006; 34:4477-85. [PMID: 16945953 PMCID: PMC1636364 DOI: 10.1093/nar/gkl591] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Received: 10/30/2005] [Revised: 07/28/2006] [Accepted: 07/31/2006] [Indexed: 01/23/2023]
Abstract
The identification of pseudogenes is an integral and significant part of genome annotation because of their abundance and their impact on the experimental analysis of functional genes. Most computational annotation systems are not optimized for systematic pseudogene recognition and often annotate pseudogenes as functional genes; users then propagate these errors to subsequent analyses and interpretations. In order to validate gene annotations and to identify pseudogenes that are potentially mis-annotated, we developed a novel approach based on whole genome profiling of existing transcript and protein sequences. This method has two important features: (i) it detects processed and non-processed pseudogenes equally well and (ii) it can identify transcribed pseudogenes. Applying this method to the human Ensembl gene predictions, we discovered that 2011 (9% of total) Ensembl genes in the categories of known and novel might be pseudogenes based on expression evidence. Of these, 1200 genes are found to have no existing evidence of transcription, and 811 genes are found with transcription evidence but contain significant translation disruption. Approximately 40% of the 2011 identified pseudogenes presented a multi-exon structure, representing non-processed pseudogenes. We have demonstrated the power of whole genome profiling of expression sequences to improve the accuracy of gene annotations.
Affiliation(s)
- Alison Yao: Celera Genomics, 45 West Gude Dr, Rockville, MD 20850, USA; Applied Biosystems Inc, 45 West Gude Dr, Rockville, MD 20850, USA
- Rosane Charlab: Applied Biosystems Inc, 45 West Gude Dr, Rockville, MD 20850, USA
- Peter Li: Applied Biosystems Inc, 45 West Gude Dr, Rockville, MD 20850, USA
11
Suyama M, Harrington E, Bork P, Torrents D. Identification and analysis of genes and pseudogenes within duplicated regions in the human and mouse genomes. PLoS Comput Biol 2006; 2:e76. [PMID: 16846249 PMCID: PMC1484586 DOI: 10.1371/journal.pcbi.0020076] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.1] [Received: 10/25/2005] [Accepted: 05/16/2006] [Indexed: 11/23/2022]
Abstract
The identification and classification of genes and pseudogenes in duplicated regions still constitutes a challenge for standard automated genome annotation procedures. Using an integrated homology and orthology analysis independent of current gene annotation, we have identified 9,484 and 9,017 gene duplicates in human and mouse, respectively. On the basis of the integrity of their coding regions, we have classified them into functional and inactive duplicates, allowing us to define the first consistent and comprehensive collection of 1,811 human and 1,581 mouse unprocessed pseudogenes. Furthermore, of the total of 14,172 human and mouse duplicates predicted to be functional genes, as many as 420 are not included in current reference gene databases and therefore correspond to likely novel mammalian genes. Some of these correspond to partial duplicates with less than half of the length of the original source genes, yet they are conserved and syntenic among different mammalian lineages. The genes and unprocessed pseudogenes obtained here will enable further studies on the mechanisms involved in gene duplication as well as of the fate of duplicated genes. The duplication of genes is considered one of the major sources of biological diversity, as it provides the necessary conditions for the generation of new gene types and functions. Even though, after a gene is duplicated, one of the copies normally undergoes inactivation, it can eventually establish in the genome as a novel gene with new functionality. The identification of the molecular basis of gene duplication and the forces that determine the fate of the resulting copies is essential to understand how genes and, ultimately, organisms evolve. The first step in this direction is the identification of duplicated genes and pseudogenes, which still remains a challenge for standard procedures of automated genome annotation. 
The authors have developed a methodology that comprehensively identifies and classifies these regions, and provide the collections of duplicated genes and pseudogenes found in the human and mouse genomes. Among these, there are 420 previously unidentified potentially functional genes, which include examples of partial duplicates with less than half of the length of the original source genes. Furthermore, they also provide preliminary novel biological insight into the mechanism of gene duplication, which will constitute the starting point for further studies of the fates and evolution of duplicated genes.
Affiliation(s)
- Mikita Suyama: European Molecular Biology Laboratory, Heidelberg, Germany
- Peer Bork (corresponding author): European Molecular Biology Laboratory, Heidelberg, Germany; Max Delbrück Center for Molecular Medicine, Berlin-Buch, Germany
- David Torrents (corresponding author): European Molecular Biology Laboratory, Heidelberg, Germany
12
Thompson J, Gopal S. Genetic algorithm learning as a robust approach to RNA editing site prediction. BMC Bioinformatics 2006; 7:145. [PMID: 16542417 PMCID: PMC1459874 DOI: 10.1186/1471-2105-7-145] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Received: 09/15/2005] [Accepted: 03/16/2006] [Indexed: 11/10/2022]
Abstract
BACKGROUND RNA editing is one of several post-transcriptional modifications that may contribute to organismal complexity in the face of limited gene complement in a genome. One form, known as C --> U editing, appears to exist in a wide range of organisms, but most instances of this form of RNA editing have been discovered serendipitously. With the large amount of genomic and transcriptomic data now available, a computational analysis could provide a more rapid means of identifying novel sites of C --> U RNA editing. Previous efforts have had some success but also some limitations. We present a computational method for identifying C --> U RNA editing sites in genomic sequences that is both robust and generalizable. We evaluate its potential use on the best data set available for these purposes: C --> U editing sites in plant mitochondrial genomes. RESULTS Our method is derived from a machine learning approach known as a genetic algorithm. REGAL (RNA Editing site prediction by Genetic Algorithm Learning) is 87% accurate when tested on three mitochondrial genomes, with an overall sensitivity of 82% and an overall specificity of 91%. REGAL's performance significantly improves on other ab initio approaches to predicting RNA editing sites in this data set. REGAL has a comparable sensitivity and higher specificity than approaches which rely on sequence homology, and it has the advantage that strong sequence conservation is not required for reliable prediction of edit sites. CONCLUSION Our results suggest that ab initio methods can generate robust classifiers of putative edit sites, and we highlight the value of combinatorial approaches as embodied by genetic algorithms. We present REGAL as one approach with the potential to be generalized to other organisms exhibiting C --> U RNA editing.
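The accuracy, sensitivity, and specificity figures quoted in this abstract are standard confusion-matrix ratios. The sketch below shows how they are computed. The counts are hypothetical and are not taken from the paper; they are chosen only so that sensitivity and specificity come out at the reported 82% and 91%. The names `ConfusionCounts` and `example` are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Prediction counts against known edit sites (illustrative only)."""
    tp: int  # edited sites predicted as edited
    fp: int  # unedited sites predicted as edited
    tn: int  # unedited sites predicted as unedited
    fn: int  # edited sites predicted as unedited


def sensitivity(c: ConfusionCounts) -> float:
    """Fraction of true edit sites that were recovered."""
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    """Fraction of unedited sites correctly left unflagged."""
    return c.tn / (c.tn + c.fp)


def accuracy(c: ConfusionCounts) -> float:
    """Fraction of all sites classified correctly."""
    return (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)


# Hypothetical counts (NOT the paper's data): 100 edited and 100 unedited sites.
example = ConfusionCounts(tp=82, fp=9, tn=91, fn=18)

print(f"sensitivity = {sensitivity(example):.3f}")
print(f"specificity = {specificity(example):.3f}")
print(f"accuracy    = {accuracy(example):.3f}")
```

With equal numbers of edited and unedited sites, as assumed here, accuracy is the mean of sensitivity and specificity. With unbalanced classes it is weighted toward the larger class, which is why all three figures are usually reported together.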
Affiliation(s)
- James Thompson: Department of Biological Sciences, Rochester Institute of Technology, Rochester, NY 14623, USA
- Shuba Gopal: Department of Biological Sciences, Rochester Institute of Technology, Rochester, NY 14623, USA
13
Birney E, Andrews D, Caccamo M, Chen Y, Clarke L, Coates G, Cox T, Cunningham F, Curwen V, Cutts T, Down T, Durbin R, Fernandez-Suarez XM, Flicek P, Gräf S, Hammond M, Herrero J, Howe K, Iyer V, Jekosch K, Kähäri A, Kasprzyk A, Keefe D, Kokocinski F, Kulesha E, London D, Longden I, Melsopp C, Meidl P, Overduin B, Parker A, Proctor G, Prlic A, Rae M, Rios D, Redmond S, Schuster M, Sealy I, Searle S, Severin J, Slater G, Smedley D, Smith J, Stabenau A, Stalker J, Trevanion S, Ureta-Vidal A, Vogel J, White S, Woodwark C, Hubbard TJP. Ensembl 2006. Nucleic Acids Res 2006; 34:D556-61. [PMID: 16381931 PMCID: PMC1347495 DOI: 10.1093/nar/gkj133] [Citation(s) in RCA: 323] [Impact Index Per Article: 17.9] [Indexed: 11/24/2022]
Abstract
The Ensembl project provides a comprehensive and integrated source of annotation of large genome sequences. Over the last year the number of genomes available from the Ensembl site has increased from 4 to 19, with the addition of the mammalian genomes of Rhesus macaque and Opossum, the chordate genome of Ciona intestinalis and the import and integration of the yeast genome. The year has also seen extensive improvements to both data analysis and presentation, with the introduction of a redesigned website, the addition of RNA gene and regulatory annotation and substantial improvements to the integration of human genome variation data.
Affiliation(s)
- E Birney
- European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK.
|
14
|
Abstract
The use of standards in gene expression measurements with DNA microarrays is ubiquitous; they just are not yet the kind of standards that have yielded microarray gene expression profiles that can be readily compared across different studies and different laboratories. They also are not yet enabling microarray measurements of the known, verifiable quality needed so they can be used with confidence in genomic medicine in regulated environments.
Affiliation(s)
- Marc Salit
- Chemical Science and Technology Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
|
15
|
Kumar S, Filipski A, Swarna V, Walker A, Hedges SB. Placing confidence limits on the molecular age of the human-chimpanzee divergence. Proc Natl Acad Sci U S A 2005; 102:18842-7. [PMID: 16365310 PMCID: PMC1316887 DOI: 10.1073/pnas.0509585102] [Citation(s) in RCA: 101] [Impact Index Per Article: 5.3] [Indexed: 11/18/2022] Open
Abstract
Molecular clocks have been used to date the divergence of humans and chimpanzees for nearly four decades. Nonetheless, this date and its confidence interval remain to be firmly established. In an effort to generate a genomic view of the human-chimpanzee divergence, we have analyzed 167 nuclear protein-coding genes and built a reliable confidence interval around the calculated time by applying a multifactor bootstrap-resampling approach. Bayesian and maximum likelihood analyses of neutral DNA substitutions show that the human-chimpanzee divergence is close to 20% of the ape-Old World monkey (OWM) divergence. Therefore, the generally accepted range of 23.8-35 million years ago for the ape-OWM divergence yields a range of 4.98-7.02 million years ago for human-chimpanzee divergence. Thus, the older time estimates for the human-chimpanzee divergence, from molecular and paleontological studies, are unlikely to be correct. For a given ape-OWM divergence time, the 95% confidence interval of the human-chimpanzee divergence ranges from -12% to 19% of the estimated time. Computer simulations suggest that the 95% confidence intervals obtained by using a multifactor bootstrap-resampling approach contain the true value with >95% probability, whether deviations from the molecular clock are random or correlated among lineages. Analyses revealed that the use of amino acid sequence differences is not optimal for dating human-chimpanzee divergence and that the inclusion of additional genes is unlikely to narrow the confidence interval significantly. We conclude that tests of hypotheses about the timing of human-chimpanzee divergence demand more precise fossil-based calibrations.
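The general idea of attaching a resampling-based confidence interval to a divergence ratio, and then scaling it by a fossil calibration, can be sketched as follows. This is a plain percentile bootstrap over genes, not the authors' multifactor bootstrap-resampling procedure. The per-gene ratios are made-up illustrative values and the function name `bootstrap_ci` is hypothetical; only the 23.8 and 35 million year calibration bounds come from the abstract.

```python
import random


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample genes with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-gene ratios of human-chimpanzee to ape-OWM divergence.
ratios = [0.17, 0.22, 0.19, 0.24, 0.18, 0.21, 0.20, 0.23, 0.16, 0.21, 0.19, 0.22]
low, high = bootstrap_ci(ratios)

# Scale the ratio interval by each end of the ape-OWM calibration range (Mya).
for calibration in (23.8, 35.0):
    print(f"calibration {calibration} Mya: "
          f"{low * calibration:.2f} to {high * calibration:.2f} Mya")
```

The sketch makes the abstract's final point concrete: the width of the resulting interval in years depends directly on the calibration date, so a tighter fossil calibration narrows the estimate more than adding genes to `ratios` would.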
Affiliation(s)
- Sudhir Kumar
- Center for Evolutionary Functional Genomics, Biodesign Institute, and School of Life Sciences, Arizona State University, Tempe, AZ 85287-5301, USA.
|
16
|
Abstract
Driven by competition, automation, and technology, the genomics community has far exceeded its ambition to sequence the human genome by 2005. By analyzing mammalian genomes, we have shed light on the history of our DNA sequence, determined that alternatively spliced RNAs and retroposed pseudogenes are incredibly abundant, and glimpsed the apparently huge number of non-coding RNAs that play significant roles in gene regulation. Ultimately, genome science is likely to provide comprehensive catalogs of these elements. However, the methods we have been using for most of the last 10 years will not yield even one complete open reading frame (ORF) for every gene, the first plateau on the long climb toward a comprehensive catalog. These strategies (sequencing randomly selected cDNA clones, aligning protein sequences identified in other organisms, sequencing more genomes, and manual curation) will have to be supplemented by large-scale amplification and sequencing of specific predicted mRNAs. The steady improvements in gene prediction that have occurred over the last 10 years have increased the efficacy of this approach and decreased its cost. In this Perspective, I review the state of gene prediction roughly 10 years ago, summarize the progress that has been made since, argue that the primary ORF identification methods we have relied on so far are inadequate, and recommend a path toward completing the Catalog of Protein Coding Genes, Version 1.0.
Affiliation(s)
- Michael R Brent
- Laboratory for Computational Genomics and Department of Computer Science, Washington University, St. Louis, Missouri 63130, USA.
|
17
|
Tiffany-Castiglioni E, Venkatraj V, Qian Y. Genetic polymorphisms and mechanisms of neurotoxicity: overview. Neurotoxicology 2005; 26:641-9. [PMID: 16026840 DOI: 10.1016/j.neuro.2005.05.013] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 12/10/2004] [Revised: 05/30/2005] [Accepted: 05/31/2005] [Indexed: 10/25/2022]
Affiliation(s)
- Evelyn Tiffany-Castiglioni
- Department of Integrative Biosciences, College of Veterinary Medicine and Biomedical Sciences, Texas A&M University, College Station, TX 77845-4458, USA.
|
18
|
Meier SM, Huebner H, Buchholz R. Single-cell-bioreactors as end of miniaturization approaches in biotechnology: progresses with characterised bioreactors and a glance into the future. Bioprocess Biosyst Eng 2005; 28:95-107. [PMID: 16096764 DOI: 10.1007/s00449-005-0003-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/02/2005] [Accepted: 05/06/2005] [Indexed: 11/26/2022]
Abstract
Incidents with single cells and their genesis have not been the major focus of science up to now. This is reflected in the difficulty of monitoring and cultivating small populations of cells in a defined compartment under controlled conditions in vitro. Several approaches to up- and down-scaling have often led to poorly understood results, which might be better elucidated by understanding cellular genesis as a function of the cell's microenvironment. This review of scale-up and scale-down approaches illustrates the technical possibilities and shows their limitations with regard to obtainable data for characterising cellular genesis and the impact of the cellular microenvironment. For example, advances in stem cell research underline the lack of information about the impact of the microenvironment on cellular development. Finally, a proposal is given for future research efforts to overcome this lack of data via a novel bioreactor setup.
Affiliation(s)
- Stephan Michael Meier
- Institute of Bioprocess Engineering, University of Erlangen-Nuremberg, Erlangen, Germany.
|
19
|
Eisenberg E, Adamsky K, Cohen L, Amariglio N, Hirshberg A, Rechavi G, Levanon EY. Identification of RNA editing sites in the SNP database. Nucleic Acids Res 2005; 33:4612-7. [PMID: 16100382 PMCID: PMC1185576 DOI: 10.1093/nar/gki771] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.1] [Indexed: 11/26/2022] Open
Abstract
The relationship between human inherited genomic variations and phenotypic differences has been the focus of much research effort in recent years. These studies benefit from millions of single-nucleotide polymorphism (SNP) records available in public databases, such as dbSNP. The importance of identifying false dbSNP records increases with the growing role played by SNPs in linkage analysis for disease traits. In particular, the emerging understanding of the abundance of DNA and RNA editing calls for a careful distinction between inherited SNPs and somatic DNA and RNA modifications. In order to demonstrate that some of the SNP database records are actually somatic modification, we focus on one type of these modifications, namely A-to-I RNA editing, and present evidence for hundreds of dbSNP records that are actually editing sites. We provide a list of 102 RNA editing sites previously annotated in dbSNP database as SNPs, and experimentally validate seven of these. Interestingly, we show how dbSNP can serve as a starting point to look for new editing sites. Our results, for this particular type of RNA editing, demonstrate the need for a careful analysis of SNP databases in light of the increasing recognition of the significance of somatic sequence modifications.
Affiliation(s)
- Eli Eisenberg
- School of Physics and Astronomy, Raymond and Beverly Sackler Faculty of Exact Sciences, TAU, Tel Aviv 69978, Israel
- Konstantin Adamsky
- Department of Pediatric Hemato-Oncology, Safra Children's Hospital, Sheba Medical Center and Sackler School of Medicine, Tel Aviv University, Tel Aviv 69978, Israel
- Lital Cohen
- Department of Pediatric Hemato-Oncology, Safra Children's Hospital, Sheba Medical Center and Sackler School of Medicine, Tel Aviv University, Tel Aviv 69978, Israel
- Ninette Amariglio
- Department of Pediatric Hemato-Oncology, Safra Children's Hospital, Sheba Medical Center and Sackler School of Medicine, Tel Aviv University, Tel Aviv 69978, Israel
- Abraham Hirshberg
- Department of Oral Pathology, School of Dental Medicine, Tel Aviv University, Tel Aviv 69978, Israel
- Gideon Rechavi
- Department of Pediatric Hemato-Oncology, Safra Children's Hospital, Sheba Medical Center and Sackler School of Medicine, Tel Aviv University, Tel Aviv 69978, Israel
- Erez Y. Levanon
- Department of Pediatric Hemato-Oncology, Safra Children's Hospital, Sheba Medical Center and Sackler School of Medicine, Tel Aviv University, Tel Aviv 69978, Israel
- Compugen Ltd, 72 Pinchas Rosen Street, Tel Aviv 69512, Israel
- To whom correspondence should be addressed. Tel: +972 3 765 8503; Fax: +972 3 765 8555;
|
20
|
Karchin R, Diekhans M, Kelly L, Thomas DJ, Pieper U, Eswar N, Haussler D, Sali A. LS-SNP: large-scale annotation of coding non-synonymous SNPs based on multiple information sources. Bioinformatics 2005; 21:2814-20. [PMID: 15827081 DOI: 10.1093/bioinformatics/bti442] [Citation(s) in RCA: 165] [Impact Index Per Article: 8.7] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION The NCBI dbSNP database lists over 9 million single nucleotide polymorphisms (SNPs) in the human genome, but currently contains limited annotation information. SNPs that result in amino acid residue changes (nsSNPs) are of critical importance in variation between individuals, including disease and drug sensitivity. RESULTS We have developed LS-SNP, a genomic scale software pipeline to annotate nsSNPs. LS-SNP comprehensively maps nsSNPs onto protein sequences, functional pathways and comparative protein structure models, and predicts positions where nsSNPs destabilize proteins, interfere with the formation of domain-domain interfaces, have an effect on protein-ligand binding or severely impact human health. It currently annotates 28,043 validated SNPs that produce amino acid residue substitutions in human proteins from the SwissProt/TrEMBL database. Annotations can be viewed via a web interface either in the context of a genomic region or by selecting sets of SNPs, genes, proteins or pathways. These results are useful for identifying candidate functional SNPs within a gene, haplotype or pathway and in probing molecular mechanisms responsible for functional impacts of nsSNPs. AVAILABILITY http://www.salilab.org/LS-SNP CONTACT: rachelk@salilab.org SUPPLEMENTARY INFORMATION http://salilab.org/LS-SNP/supp-info.pdf.
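The distinction this abstract relies on, between SNPs that change an amino acid residue (nsSNPs) and those that do not, can be illustrated with the standard genetic code. The sketch below is not part of the LS-SNP pipeline; the function name `classify_coding_snp` and its arguments are hypothetical, and it assumes the codon is given on the coding strand with a 0-based position for the variant base.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(codon, position, alt_base):
    """Classify a single-base substitution within one codon.

    codon: reference codon on the coding strand, e.g. "GAG"
    position: 0-based index of the substituted base within the codon
    alt_base: the alternative allele at that position
    Returns (label, reference_residue, alternative_residue).
    """
    alt_codon = codon[:position] + alt_base + codon[position + 1:]
    ref_aa = CODON_TABLE[codon]
    alt_aa = CODON_TABLE[alt_codon]
    label = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return label, ref_aa, alt_aa


print(classify_coding_snp("GAA", 2, "G"))  # third-position change, residue unchanged
print(classify_coding_snp("GAG", 1, "T"))  # second-position change, residue altered
```

A real annotation pipeline would first have to map each SNP from genomic coordinates onto a transcript and reading frame; this sketch starts after that step.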
Affiliation(s)
- Rachel Karchin
- Department of Biopharmaceutical Sciences, University of California at San Francisco, San Francisco, CA 94143, USA.
|
21
|
Castelo R, Reymond A, Wyss C, Câmara F, Parra G, Antonarakis SE, Guigó R, Eyras E. Comparative gene finding in chicken indicates that we are closing in on the set of multi-exonic widely expressed human genes. Nucleic Acids Res 2005; 33:1935-9. [PMID: 15809229 PMCID: PMC1074396 DOI: 10.1093/nar/gki328] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.5] [Indexed: 01/08/2023] Open
Abstract
The recent availability of the chicken genome sequence poses the question of whether there are human protein-coding genes conserved in chicken that are currently not included in the human gene catalog. Here, we show, using comparative gene finding followed by experimental verification of exon pairs by RT-PCR, that the addition to the multi-exonic subset of this catalog could be as little as 0.2%, suggesting that we may be closing in on the human gene set. Our protocol, however, has two shortcomings: (i) the bioinformatic screening of the predicted genes, applied to filter out false positives, cannot handle intronless genes; and (ii) the experimental verification could fail to identify expression at a specific developmental time. This highlights the importance of developing methods that could provide a reliable estimate of the number of these two types of genes.
Affiliation(s)
- Robert Castelo
- Research Unit on Biomedical Informatics, Institut Municipal d'Investigació Mèdica/Universitat Pompeu Fabra, Centre de Regulació Genòmica, E08003 Barcelona, Spain.
|
22
|
Finishing the euchromatic sequence of the human genome. Nature 2004; 431:931-45. [PMID: 15496913 DOI: 10.1038/nature03001] [Citation(s) in RCA: 2827] [Impact Index Per Article: 141.4] [Received: 07/29/2004] [Accepted: 09/07/2004] [Indexed: 12/13/2022]
Abstract
The sequence of the human genome encodes the genetic instructions for human physiology, as well as rich information about human evolution. In 2001, the International Human Genome Sequencing Consortium reported a draft sequence of the euchromatic portion of the human genome. Since then, the international collaboration has worked to convert this draft into a genome sequence with high accuracy and nearly complete coverage. Here, we report the result of this finishing process. The current genome sequence (Build 35) contains 2.85 billion nucleotides interrupted by only 341 gaps. It covers approximately 99% of the euchromatic genome and is accurate to an error rate of approximately 1 event per 100,000 bases. Many of the remaining euchromatic gaps are associated with segmental duplications and will require focused work with new methods. The near-complete sequence, the first for a vertebrate, greatly improves the precision of biological analyses of the human genome including studies of gene number, birth and death. Notably, the human genome seems to encode only 20,000-25,000 protein-coding genes. The genome sequence reported here should serve as a firm foundation for biomedical research in the decades ahead.
|