1
Sidi T, Bahiri-Elitzur S, Tuller T, Kolodny R. Predicting gene sequences with AI to study codon usage patterns. Proc Natl Acad Sci U S A 2025; 122:e2410003121. [PMID: 39739812 DOI: 10.1073/pnas.2410003121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2024] [Accepted: 11/27/2024] [Indexed: 01/02/2025] Open
Abstract
Selective pressure acts on the codon use, optimizing multiple, overlapping signals that are only partially understood. We trained AI models to predict codons given their amino acid sequence in the eukaryotes Saccharomyces cerevisiae and Schizosaccharomyces pombe and the bacteria Escherichia coli and Bacillus subtilis to study the extent to which we can learn patterns in naturally occurring codons to improve predictions. We trained our models on a subset of the proteins and evaluated their predictions on large, separate sets of proteins of varying lengths and expression levels. Our models significantly outperformed naïve frequency-based approaches, demonstrating that there are learnable dependencies in evolutionary-selected codon usage. The prediction accuracy advantage of our models is greater for highly expressed genes and is greater in bacteria than eukaryotes, supporting the hypothesis that there is a monotonic relationship between selective pressure for complex codon patterns and effective population size. In S. cerevisiae and bacteria, our models were more accurate for longer proteins, suggesting that the learned patterns may be related to cotranslational folding. Gene functionality and conservation were also important determinants that affect the performance of our models. Finally, we showed that using information encoded in homologous proteins has only a minor effect on prediction accuracy, perhaps due to complex codon-usage codes in genes undergoing rapid evolution. Our study employing contemporary AI methods offers a unique perspective and a deep-learning-based prediction tool for evolutionary-selected codons. We hope that these can be useful to optimize codon usage in endogenous and heterologous proteins.
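The "naïve frequency-based approaches" that the abstract uses as a comparison point can be illustrated with a minimal sketch. It assumes the baseline simply picks, for each amino acid, the codon seen most often in a training set; the function names (`fit_codon_frequencies`, `predict_most_frequent`, `codon_accuracy`) and the toy sequences are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def split_codons(cds):
    """Split a coding sequence into complete codons, dropping any trailing bases."""
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def fit_codon_frequencies(coding_sequences):
    """Count how often each codon encodes each amino acid in the training genes."""
    counts = defaultdict(Counter)
    for cds in coding_sequences:
        for codon in split_codons(cds.upper()):
            aa = CODON_TABLE.get(codon)
            if aa is not None and aa != "*":  # skip stop and unknown codons
                counts[aa][codon] += 1
    return counts


def predict_most_frequent(protein, counts):
    """Predict one codon per residue: the most frequent codon for that amino acid.

    Raises IndexError for an amino acid that never occurred in training.
    """
    return [counts[aa].most_common(1)[0][0] for aa in protein]


def codon_accuracy(predicted, actual):
    """Fraction of positions where the predicted codon equals the native codon."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


# Toy example (made-up sequences, not data from the study).
training = ["ATGGCTGCTGCCAAATAA"]
counts = fit_codon_frequencies(training)
native = split_codons("ATGGCCAAA")
predicted = predict_most_frequent("MAK", counts)
print(predicted, codon_accuracy(predicted, native))
```

A baseline of this kind ignores sequence context entirely, which is why beating it suggests that the learned models capture dependencies between positions.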
Affiliation(s)
- Tomer Sidi
  - Department of Computer Science, University of Haifa, Haifa 3303221, Israel
- Shir Bahiri-Elitzur
  - Department of Biomedical Engineering, Tel-Aviv University, Tel Aviv 6139001, Israel
- Tamir Tuller
  - Department of Biomedical Engineering, Tel-Aviv University, Tel Aviv 6139001, Israel
  - The Sagol School of Neuroscience, Tel-Aviv University, Tel Aviv 6139001, Israel
- Rachel Kolodny
  - Department of Computer Science, University of Haifa, Haifa 3303221, Israel
2
Cohen S, Bergman S, Lynn N, Tuller T. A tool for CRISPR-Cas9 sgRNA evaluation based on computational models of gene expression. Genome Med 2024; 16:152. [PMID: 39716183 DOI: 10.1186/s13073-024-01420-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/22/2023] [Accepted: 12/02/2024] [Indexed: 12/25/2024] Open
Abstract
BACKGROUND CRISPR is widely used to silence genes by inducing mutations expected to nullify their expression. While numerous computational tools have been developed to design single-guide RNAs (sgRNAs) with high cutting efficiency and minimal off-target effects, only a few tools focus specifically on predicting gene knockouts following CRISPR. These tools consider factors like conservation, amino acid composition, and frameshift likelihood. However, they neglect the impact of CRISPR on gene expression, which can dramatically affect the success of CRISPR-induced gene silencing attempts. Furthermore, information regarding gene expression can be useful even when the objective is not to silence a gene. Therefore, a tool that considers gene expression when predicting CRISPR outcomes is lacking. RESULTS We developed EXPosition, the first computational tool that combines models predicting gene knockouts after CRISPR with models that forecast gene expression, offering more accurate predictions of gene knockout outcomes. EXPosition leverages deep-learning models to predict key steps in gene expression: transcription, splicing, and translation initiation. We showed our tool performs better at predicting gene knockout than existing tools across 6 datasets, 4 cell types and ~207k sgRNAs. We also validated our gene expression models using the ClinVar dataset by showing enrichment of pathogenic mutations in high-scoring mutations according to our models. CONCLUSIONS We believe EXPosition will enhance both the efficiency and accuracy of genome editing projects, by directly predicting CRISPR's effect on various aspects of gene expression. EXPosition is available at http://www.cs.tau.ac.il/~tamirtul/EXPosition . The source code is available at https://github.com/shaicoh3n/EXPosition .
Affiliation(s)
- Shai Cohen
  - Department of Biomedical Engineering, Tel Aviv University, Tel-Aviv, 6997801, Israel
- Shaked Bergman
  - Department of Biomedical Engineering, Tel Aviv University, Tel-Aviv, 6997801, Israel
- Nicolas Lynn
  - Department of Biomedical Engineering, Tel Aviv University, Tel-Aviv, 6997801, Israel
- Tamir Tuller
  - Department of Biomedical Engineering, Tel Aviv University, Tel-Aviv, 6997801, Israel
  - Sagol School of Neuroscience, Tel Aviv University, Tel-Aviv, 6997801, Israel
3
Ly J, Xiang K, Su KC, Sissoko GB, Bartel DP, Cheeseman IM. Nuclear release of eIF1 restricts start-codon selection during mitosis. Nature 2024; 635:490-498. [PMID: 39443796 PMCID: PMC11605796 DOI: 10.1038/s41586-024-08088-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/03/2024] [Accepted: 09/19/2024] [Indexed: 10/25/2024]
Abstract
Regulated start-codon selection has the potential to reshape the proteome through the differential production of upstream open reading frames, canonical proteins, and alternative translational isoforms [1-3]. However, conditions under which start-codon selection is altered remain poorly defined. Here, using transcriptome-wide translation-initiation-site profiling [4], we reveal a global increase in the stringency of start-codon selection during mammalian mitosis. Low-efficiency initiation sites are preferentially repressed in mitosis, resulting in pervasive changes in the translation of thousands of start sites and their corresponding protein products. This enhanced stringency of start-codon selection during mitosis results from increased association between the 40S ribosome and the key regulator of start-codon selection, eIF1. We find that increased eIF1-40S ribosome interaction during mitosis is mediated by the release of a nuclear pool of eIF1 upon nuclear envelope breakdown. Selectively depleting the nuclear pool of eIF1 eliminates the change to translational stringency during mitosis, resulting in altered synthesis of thousands of protein isoforms. In addition, preventing mitotic translational rewiring results in substantially increased cell death and decreased mitotic slippage in cells that experience a mitotic delay induced by anti-mitotic chemotherapies. Thus, cells globally control stringency of translation initiation, which has critical roles during the mammalian cell cycle in preserving mitotic cell physiology.
Affiliation(s)
- Jimmy Ly
  - Whitehead Institute for Biomedical Research, Cambridge, MA, USA
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA
- Kehui Xiang
  - Whitehead Institute for Biomedical Research, Cambridge, MA, USA
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Howard Hughes Medical Institute, Cambridge, MA, USA
- Kuan-Chung Su
  - Whitehead Institute for Biomedical Research, Cambridge, MA, USA
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA
- Gunter B Sissoko
  - Whitehead Institute for Biomedical Research, Cambridge, MA, USA
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA
- David P Bartel
  - Whitehead Institute for Biomedical Research, Cambridge, MA, USA
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Howard Hughes Medical Institute, Cambridge, MA, USA
- Iain M Cheeseman
  - Whitehead Institute for Biomedical Research, Cambridge, MA, USA
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA
4
Chen Y, Sheng G, Wang G. CapsNet-TIS: Predicting translation initiation site based on multi-feature fusion and improved capsule network. Gene 2024; 924:148598. [PMID: 38782224 DOI: 10.1016/j.gene.2024.148598] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2024] [Revised: 04/22/2024] [Accepted: 05/20/2024] [Indexed: 05/25/2024]
Abstract
Genes are the basic units of protein synthesis in organisms, and accurately identifying the translation initiation site (TIS) of genes is crucial for understanding the regulation, transcription, and translation processes of genes. However, existing models neither adequately extract the feature information in TIS sequences nor adequately capture the complex hierarchical relationships among features. Therefore, a novel predictor named CapsNet-TIS is proposed in this paper. CapsNet-TIS first extracts the TIS sequence information using four encoding methods: one-hot encoding, physical structure property (PSP) encoding, nucleotide chemical property (NCP) encoding, and nucleotide density (ND) encoding. Next, multi-scale convolutional neural networks perform feature fusion of the encoded features to make the feature representation more comprehensive. Finally, the fused features are classified using a capsule network as the main network of the classification model, to capture the complex hierarchical relationships among the features. Moreover, we improve the capsule network by introducing a residual block, channel attention, and a BiLSTM to enhance the model's feature extraction and sequence data modeling capabilities. The performance of CapsNet-TIS is evaluated using TIS datasets from four species (human, mouse, bovine, and fruit fly), and the effectiveness of each part is demonstrated by ablation experiments. Compared with models proposed by other researchers, CapsNet-TIS shows superior performance.
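Three of the four encodings named in the abstract have widely used textbook definitions, sketched below under stated assumptions. The one-hot column order (A, C, G, T), the NCP triples (ring structure, functional group, hydrogen bonding), and the prefix-based nucleotide density follow common convention in the sequence-classification literature and may differ from the paper's exact choices. PSP encoding is omitted because its parameters are not given here, and the function name `encode_sequence` is hypothetical.

```python
# One-hot encoding; the A, C, G, T column order is an assumption.
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}

# Nucleotide chemical property (NCP) encoding, as commonly defined:
# (purine ring = 1, amino group = 1, weak hydrogen bonding = 1).
NCP = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "T": (0, 0, 1)}


def encode_sequence(seq):
    """Return one 8-dimensional feature row per nucleotide.

    Columns 0-3: one-hot, columns 4-6: NCP, column 7: nucleotide density (ND),
    i.e. the fraction of the prefix ending at this position that is made up of
    the nucleotide found at this position.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA input as well
    seen = {"A": 0, "C": 0, "G": 0, "T": 0}
    rows = []
    for position, base in enumerate(seq, start=1):
        seen[base] += 1  # raises KeyError on ambiguous bases such as N
        density = seen[base] / position
        rows.append(list(ONE_HOT[base]) + list(NCP[base]) + [density])
    return rows


for row in encode_sequence("AACA"):
    print(row)
```

Concatenating the encodings per position gives a fixed-width matrix that a convolutional front end can consume directly; how CapsNet-TIS actually fuses them is described only at the level of the abstract.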
Affiliation(s)
- Yu Chen
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
- Guojun Sheng
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
- Gang Wang
  - College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
5
Fan X, Chang T, Chen C, Hafner M, Wang Z. Analysis of RNA translation with a deep learning architecture provides new insight into translation control. bioRxiv 2024:2023.07.08.548206. [PMID: 39005319 PMCID: PMC11244891 DOI: 10.1101/2023.07.08.548206] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/16/2024]
Abstract
Accurate annotation of coding regions in RNAs is essential for understanding gene translation. We developed a deep neural network to directly predict and analyze translation initiation and termination sites from RNA sequences. Trained with human transcripts, our model learned hidden rules of translation control and achieved near-perfect prediction of canonical translation sites across the entire human transcriptome. Surprisingly, this model revealed a new role of codon usage in regulating translation termination, which was experimentally validated. We also identified thousands of new open reading frames in mRNAs or lncRNAs, some of which were confirmed experimentally. The model trained with human mRNAs achieved high prediction accuracy of canonical translation sites in all eukaryotes and good prediction in polycistronic transcripts from prokaryotes or RNA viruses, suggesting a high degree of conservation in translation control. Collectively, we present a general and efficient deep learning model for RNA translation, generating new insights into the complexity of translation regulation.
Affiliation(s)
- Xiaojuan Fan
  - Bio-med Big Data Center, CAS Key Laboratory of Computational Biology, CAS Center for Excellence in Molecular Cell Science, Shanghai Institute of Nutrition and Health
  - RNA Molecular Biology Laboratory, National Institute of Arthritis and Musculoskeletal and Skin Disease, Bethesda, MD, USA
- Tiangen Chang
  - Laboratory of Cancer Data Science, National Cancer Institute, Bethesda, MD, USA
- Chuyun Chen
  - Bio-med Big Data Center, CAS Key Laboratory of Computational Biology, CAS Center for Excellence in Molecular Cell Science, Shanghai Institute of Nutrition and Health
  - University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
- Markus Hafner
  - RNA Molecular Biology Laboratory, National Institute of Arthritis and Musculoskeletal and Skin Disease, Bethesda, MD, USA
- Zefeng Wang
  - Bio-med Big Data Center, CAS Key Laboratory of Computational Biology, CAS Center for Excellence in Molecular Cell Science, Shanghai Institute of Nutrition and Health
  - University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
6
Ly J, Xiang K, Su KC, Sissoko GB, Bartel DP, Cheeseman IM. Nuclear release of eIF1 globally increases stringency of start-codon selection to preserve mitotic arrest physiology. bioRxiv 2024:2024.04.06.588385. [PMID: 38617206 PMCID: PMC11014515 DOI: 10.1101/2024.04.06.588385] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/16/2024]
Abstract
Regulated start-codon selection has the potential to reshape the proteome through the differential production of uORFs, canonical proteins, and alternative translational isoforms. However, conditions under which start-codon selection is altered remain poorly defined. Here, using transcriptome-wide translation initiation site profiling, we reveal a global increase in the stringency of start-codon selection during mammalian mitosis. Low-efficiency initiation sites are preferentially repressed in mitosis, resulting in pervasive changes in the translation of thousands of start sites and their corresponding protein products. This increased stringency of start-codon selection during mitosis results from increased interactions between the key regulator of start-codon selection, eIF1, and the 40S ribosome. We find that increased eIF1-40S ribosome interactions during mitosis are mediated by the release of a nuclear pool of eIF1 upon nuclear envelope breakdown. Selectively depleting the nuclear pool of eIF1 eliminates the changes to translational stringency during mitosis, resulting in altered mitotic proteome composition. In addition, preventing mitotic translational rewiring results in substantially increased cell death and decreased mitotic slippage following treatment with anti-mitotic chemotherapeutics. Thus, cells globally control translation initiation stringency with critical roles during the mammalian cell cycle to preserve mitotic cell physiology.
7
Lee H, Ozbulak U, Park H, Depuydt S, De Neve W, Vankerschaver J. Assessing the reliability of point mutation as data augmentation for deep learning with genomic data. BMC Bioinformatics 2024; 25:170. [PMID: 38689247 PMCID: PMC11059627 DOI: 10.1186/s12859-024-05787-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2024] [Accepted: 04/15/2024] [Indexed: 05/02/2024] Open
Abstract
BACKGROUND Deep neural networks (DNNs) have the potential to revolutionize our understanding and treatment of genetic diseases. An inherent limitation of deep neural networks, however, is their high demand for data during training. To overcome this challenge, other fields, such as computer vision, use various data augmentation techniques to artificially increase the available training data for DNNs. Unfortunately, most data augmentation techniques used in other domains do not transfer well to genomic data. RESULTS Most genomic data possesses peculiar properties and data augmentations may significantly alter the intrinsic properties of the data. In this work, we propose a novel data augmentation technique for genomic data inspired by biology: point mutations. By employing point mutations as substitutes for codons, we demonstrate that our newly proposed data augmentation technique enhances the performance of DNNs across various genomic tasks that involve coding regions, such as translation initiation and splice site detection. CONCLUSION Silent and missense mutations are found to positively influence effectiveness, while nonsense mutations and random mutations in non-coding regions generally lead to degradation. Overall, point mutation-based augmentations in genomic datasets present valuable opportunities for improving the accuracy and reliability of predictive models for DNA sequences.
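A silent point-mutation augmentation of the kind the abstract describes might look like the sketch below. It is an illustration under assumptions, not the authors' implementation: each codon is replaced, with probability `rate`, by a synonymous codon that differs at exactly one base, so the encoded protein is unchanged. The function names, the default rate, and the example sequence are hypothetical.

```python
import random

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Group codons by the amino acid (or stop signal) they encode.
SYNONYMS = {}
for _codon, _aa in CODON_TABLE.items():
    SYNONYMS.setdefault(_aa, []).append(_codon)


def translate(cds):
    """Translate complete codons of an uppercase DNA coding sequence."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def silent_point_mutations(codon):
    """Synonymous codons that differ from `codon` at exactly one position."""
    return [
        other
        for other in SYNONYMS[CODON_TABLE[codon]]
        if sum(x != y for x, y in zip(codon, other)) == 1
    ]


def augment_silent(cds, rate=0.1, rng=None):
    """Return a copy of `cds` with silent point mutations applied codon by codon.

    Expects uppercase A/C/G/T input; trailing bases beyond the last complete
    codon are copied unchanged.
    """
    rng = rng or random.Random()
    usable = len(cds) - len(cds) % 3
    out = []
    for i in range(0, usable, 3):
        codon = cds[i:i + 3]
        options = silent_point_mutations(codon)
        if options and rng.random() < rate:
            codon = rng.choice(options)
        out.append(codon)
    return "".join(out) + cds[usable:]


original = "ATGGCTAAATAA"
augmented = augment_silent(original, rate=0.5, rng=random.Random(0))
print(original, augmented, translate(original) == translate(augmented))
```

Missense or nonsense variants would need a different sampling rule (any single-base change, filtered by its effect on translation), which is where the abstract reports the outcomes diverging.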
Affiliation(s)
- Utku Ozbulak
  - Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea
- Homin Park
  - Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea
  - IDLab, Department of Electronics and Information Systems, Ghent University, Ghent, Belgium
- Stephen Depuydt
  - Erasmus Brussels University of Applied Sciences and Arts, Brussels, Belgium
- Wesley De Neve
  - Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea
  - IDLab, Department of Electronics and Information Systems, Ghent University, Ghent, Belgium
- Joris Vankerschaver
  - Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent, Belgium
8
Gu LL, Yang RQ, Wang ZY, Jiang D, Fang M. Ensemble learning for integrative prediction of genetic values with genomic variants. BMC Bioinformatics 2024; 25:120. [PMID: 38515026 PMCID: PMC10956256 DOI: 10.1186/s12859-024-05720-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2022] [Accepted: 02/26/2024] [Indexed: 03/23/2024] Open
Abstract
BACKGROUND Whole-genome variants offer sufficient information for genetic prediction of human disease risk and for prediction of animal and plant breeding values. Many sophisticated statistical methods have been developed to enhance predictive ability. However, each method has its own advantages and disadvantages, and so far no single method consistently outperforms the others. RESULTS We herein propose an Ensemble Learning method for Prediction of Genetic Values (ELPGV), which combines predictions from several basic methods, such as GBLUP, BayesA, BayesB and BayesCπ, to produce more accurate predictions. We validated ELPGV with a variety of well-known datasets and a series of simulated datasets. All showed that ELPGV achieved significantly higher predictive ability than any of the basic methods; for instance, the p-values for the comparison of ELPGV with the basic methods ranged from 4.853E-118 to 9.640E-20 for the WTCCC dataset. CONCLUSIONS ELPGV integrates the merits of each method to produce significantly higher predictive ability than any of the basic methods, and it is simple to implement and fast to run, without using genotype data. ELPGV is promising for wide application in genetic prediction.
Affiliation(s)
- Lin-Lin Gu
  - Key Laboratory of Healthy Mariculture for the East China Sea, Ministry of Agriculture and Rural Affairs and Fisheries College, Jimei University, Xiamen, People's Republic of China
- Run-Qing Yang
  - Research Center for Aquatic Biotechnology, Chinese Academy of Fishery Sciences, Beijing, People's Republic of China
- Zhi-Yong Wang
  - Key Laboratory of Healthy Mariculture for the East China Sea, Ministry of Agriculture and Rural Affairs and Fisheries College, Jimei University, Xiamen, People's Republic of China
- Dan Jiang
  - Key Laboratory of Healthy Mariculture for the East China Sea, Ministry of Agriculture and Rural Affairs and Fisheries College, Jimei University, Xiamen, People's Republic of China
- Ming Fang
  - Key Laboratory of Healthy Mariculture for the East China Sea, Ministry of Agriculture and Rural Affairs and Fisheries College, Jimei University, Xiamen, People's Republic of China
  - Life Science College, Heilongjiang Bayi Agricultural University, Daqing, People's Republic of China
9
Wu TY, Li YR, Chang KJ, Fang JC, Urano D, Liu MJ. Modeling alternative translation initiation sites in plants reveals evolutionarily conserved cis-regulatory codes in eukaryotes. Genome Res 2024; 34:272-285. [PMID: 38479836 PMCID: PMC10984385 DOI: 10.1101/gr.278100.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2023] [Accepted: 02/15/2024] [Indexed: 03/22/2024]
Abstract
mRNA translation relies on identifying translation initiation sites (TISs) in mRNAs. Alternative TISs are prevalent across plant transcriptomes, but the mechanisms for their recognition are unclear. Using ribosome profiling and machine learning, we developed models for predicting alternative TISs in the tomato (Solanum lycopersicum). Distinct feature sets were predictive of AUG and nonAUG TISs in 5' untranslated regions and coding sequences, including a novel CU-rich sequence that promoted plant TIS activity, a translational enhancer found across dicots and monocots, and humans and viruses. Our results elucidate the mechanistic and evolutionary basis of TIS recognition, whereby cis-regulatory RNA signatures affect start site selection. The TIS prediction model provides global estimates of TISs to discover neglected protein-coding genes across plant genomes. The prevalence of cis-regulatory signatures across plant species, humans, and viruses suggests their broad and critical roles in reprogramming the translational landscape.
Affiliation(s)
- Ting-Ying Wu
  - Institute of Plant and Microbial Biology, Academia Sinica, Taipei 11529, Taiwan
- Ya-Ru Li
  - Biotechnology Center in Southern Taiwan, Academia Sinica, Tainan 711, Taiwan
- Kai-Jyun Chang
  - Biotechnology Center in Southern Taiwan, Academia Sinica, Tainan 711, Taiwan
  - Institute of Tropical Plant Sciences, National Cheng Kung University, Tainan 701, Taiwan
- Jhen-Cheng Fang
  - Biotechnology Center in Southern Taiwan, Academia Sinica, Tainan 711, Taiwan
- Daisuke Urano
  - Temasek Life Sciences Laboratory, Singapore 117604, Singapore
  - Department of Biological Sciences, National University of Singapore, Singapore 117558, Singapore
- Ming-Jung Liu
  - Biotechnology Center in Southern Taiwan, Academia Sinica, Tainan 711, Taiwan
  - Institute of Tropical Plant Sciences, National Cheng Kung University, Tainan 701, Taiwan
  - Agricultural Biotechnology Research Center, Academia Sinica, Taipei 115, Taiwan
10
Lynn N, Tuller T. Detecting and understanding meaningful cancerous mutations based on computational models of mRNA splicing. NPJ Syst Biol Appl 2024; 10:25. [PMID: 38453965 PMCID: PMC10920900 DOI: 10.1038/s41540-024-00351-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2023] [Accepted: 02/22/2024] [Indexed: 03/09/2024] Open
Abstract
Cancer research has long relied on non-silent mutations. Yet, it has become overwhelmingly clear that silent mutations can affect gene expression and cancer cell fitness. One fundamental mechanism that apparently silent mutations can severely disrupt is alternative splicing. Here we introduce Oncosplice, a tool that scores mutations based on models of proteomes generated using aberrant splicing predictions. Oncosplice leverages a highly accurate neural network that predicts splice sites within arbitrary mRNA sequences, a greedy transcript constructor that considers alternate arrangements of splicing blueprints, and an algorithm that grades the functional divergence between proteins based on evolutionary conservation. By applying this tool to 12M somatic mutations we identify 8K deleterious variants that are significantly depleted within the healthy population; we demonstrate the tool's ability to identify clinically validated pathogenic variants with a positive predictive value of 94%; we show strong enrichment of predicted deleterious mutations across pan-cancer drivers. We also achieve improved patient survival estimation using a proposed set of novel cancer-involved genes. Ultimately, this pipeline enables accelerated insight-gathering of sequence-specific consequences for a class of understudied mutations and provides an efficient way of filtering through massive variant datasets - functionalities with immediate experimental and clinical applications.
Affiliation(s)
- Nicolas Lynn
  - Department of Biomedical Engineering, the Engineering Faculty, Tel Aviv University, Tel-Aviv, 69978, Israel
- Tamir Tuller
  - Department of Biomedical Engineering, the Engineering Faculty, Tel Aviv University, Tel-Aviv, 69978, Israel
11
Tournayre J, Polonais V, Wawrzyniak I, Akossi RF, Parisot N, Lerat E, Delbac F, Souvignet P, Reichstadt M, Peyretaillade E. MicroAnnot: A Dedicated Workflow for Accurate Microsporidian Genome Annotation. Int J Mol Sci 2024; 25:880. [PMID: 38255958 PMCID: PMC10815200 DOI: 10.3390/ijms25020880] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2023] [Revised: 12/29/2023] [Accepted: 01/04/2024] [Indexed: 01/24/2024] Open
Abstract
With nearly 1700 species, Microsporidia represent a group of obligate intracellular eukaryotes with veterinary, economic and medical impacts. To help understand the biological functions of these microorganisms, complete genome sequencing is routinely used. Nevertheless, the proper prediction of their gene catalogue is challenging due to their taxon-specific evolutionary features. As innovative genome annotation strategies are needed to obtain a representative snapshot of the overall lifestyle of these parasites, the MicroAnnot tool, a dedicated workflow for microsporidian sequence annotation using data from curated databases of accurately annotated microsporidian genes, has been developed. Furthermore, specific modules have been implemented to perform small gene (<300 bp) and transposable element identification. Finally, functional annotation was performed using the signature-based InterProScan software. MicroAnnot's accuracy has been verified by the re-annotation of four microsporidian genomes for which structural annotation had previously been validated. With its comparative approach and transcriptional signal identification method, MicroAnnot provides an accurate prediction of translation initiation sites, an efficient identification of transposable elements, as well as high specificity and sensitivity for microsporidian genes, including those under 300 bp.
Affiliation(s)
- Jérémy Tournayre
  - INRAE, UMR Herbivores, Université Clermont Auvergne, VetAgro Sup, 63122 Saint-Genès-Champanelle, France
- Valérie Polonais
  - LMGE, CNRS, Université Clermont Auvergne, 63000 Clermont-Ferrand, France
- Ivan Wawrzyniak
  - LMGE, CNRS, Université Clermont Auvergne, 63000 Clermont-Ferrand, France
- Reginald Florian Akossi
  - LMGE, CNRS, Université Clermont Auvergne, 63000 Clermont-Ferrand, France
- Nicolas Parisot
  - UMR 203, BF2I, INRAE, INSA Lyon, Université de Lyon, 69621 Villeurbanne, France
- Emmanuelle Lerat
  - VAS, CNRS, UMR5558, LBBE, Université Claude Bernard Lyon 1, 69622 Villeurbanne, France
- Frédéric Delbac
  - LMGE, CNRS, Université Clermont Auvergne, 63000 Clermont-Ferrand, France
- Pierre Souvignet
  - INRAE, UMR Herbivores, Université Clermont Auvergne, VetAgro Sup, 63122 Saint-Genès-Champanelle, France
- Matthieu Reichstadt
  - INRAE, UMR Herbivores, Université Clermont Auvergne, VetAgro Sup, 63122 Saint-Genès-Champanelle, France
- Eric Peyretaillade
  - LMGE, CNRS, Université Clermont Auvergne, 63000 Clermont-Ferrand, France
12
Novikova PV, Bhanu Busi S, Probst AJ, May P, Wilmes P. Functional prediction of proteins from the human gut archaeome. ISME Communications 2024; 4:ycad014. [PMID: 38486809 PMCID: PMC10939349 DOI: 10.1093/ismeco/ycad014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2023] [Revised: 12/16/2023] [Accepted: 12/19/2023] [Indexed: 03/17/2024]
Abstract
The human gastrointestinal tract contains diverse microbial communities, including archaea. Among them, Methanobrevibacter smithii represents a highly active and clinically relevant methanogenic archaeon, being involved in gastrointestinal disorders, such as inflammatory bowel disease and obesity. Herein, we present an integrated approach using sequence and structure information to improve the annotation of M. smithii proteins using advanced protein structure prediction and annotation tools, such as AlphaFold2, trRosetta, ProFunc, and DeepFRI. Of an initial set of 873 481 archaeal proteins, we found 707 754 proteins exclusively present in the human gut. Having analysed archaeal proteins together with 87 282 994 bacterial proteins, we identified unique archaeal proteins and archaeal-bacterial homologs. We then predicted and characterized functional domains and structures of 73 unique and homologous archaeal protein clusters linked to the human gut and M. smithii. We refined annotations based on the predicted structures, extending existing sequence similarity-based annotations. We identified gut-specific archaeal proteins that may be involved in defense mechanisms, virulence, adhesion, and the degradation of toxic substances. Interestingly, we identified potential glycosyltransferases that could be associated with N-linked and O-glycosylation. Additionally, we found preliminary evidence for interdomain horizontal gene transfer between Clostridia species and M. smithii, which includes sporulation Stage V proteins AE and AD. Our study broadens the understanding of archaeal biology, particularly M. smithii, and highlights the importance of considering both sequence and structure for the prediction of protein function.
Collapse
Affiliation(s)
- Polina V Novikova
  - Systems Ecology, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette L-4362, Luxembourg
- Susheel Bhanu Busi
  - Systems Ecology, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette L-4362, Luxembourg
  - UK Centre for Ecology and Hydrology, Wallingford, OX10 8BB, United Kingdom
- Alexander J Probst
  - Environmental Metagenomics, Department of Chemistry, Research Center One Health Ruhr of the University Alliance Ruhr, for Environmental Microbiology and Biotechnology, University Duisburg-Essen, Duisburg 47057, Germany
- Patrick May
  - Bioinformatics Core, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette L-4362, Luxembourg
- Paul Wilmes
  - Systems Ecology, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette L-4362, Luxembourg
  - Department of Life Sciences and Medicine, Faculty of Science, Technology and Medicine, University of Luxembourg, Esch-sur-Alzette L-4362, Luxembourg
|
13
|
Zheng W, Fong JHC, Wan YK, Chu AHY, Huang Y, Wong ASL, Ho JWK. Discovery of regulatory motifs in 5' untranslated regions using interpretable multi-task learning models. Cell Syst 2023; 14:1103-1112.e6. [PMID: 38016465 DOI: 10.1016/j.cels.2023.10.011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2023] [Revised: 09/18/2023] [Accepted: 10/31/2023] [Indexed: 11/30/2023]
Abstract
The sequence in the 5' untranslated regions (UTRs) is known to affect mRNA translation rates. However, the underlying regulatory grammar remains elusive. Here, we propose MTtrans, a multi-task translation rate predictor capable of learning common sequence patterns from datasets across various experimental techniques. The core premise is that common motifs are more likely to be genuinely involved in translation control. MTtrans outperforms existing methods in both accuracy and the ability to capture transferable motifs across species, highlighting its strength in identifying evolutionarily conserved sequence motifs. Our independent fluorescence-activated cell sorting coupled with deep sequencing (FACS-seq) experiment validates the impact of most motifs identified by MTtrans. Additionally, we introduce "GRU-rewiring," a technique to interpret the hidden states of the recurrent units. Gated recurrent unit (GRU)-rewiring allows us to identify regulatory element-enriched positions and examine the local effects of 5' UTR mutations. MTtrans is a powerful tool for deciphering the translation regulatory motifs.
Affiliation(s)
- Weizhong Zheng
  - School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- John H C Fong
  - School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- Yuk Kei Wan
  - School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- Athena H Y Chu
  - School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China; Centre for Oncology and Immunology, Hong Kong Science Park, Hong Kong SAR, China
- Yuanhua Huang
  - School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China; Department of Statistics and Actuarial Science, The University of Hong Kong, Hong Kong SAR, China; Center for Translational Stem Cell Biology, Hong Kong Science and Technology Park, Hong Kong SAR, China
- Alan S L Wong
  - School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China; Centre for Oncology and Immunology, Hong Kong Science Park, Hong Kong SAR, China; Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong SAR, China
- Joshua W K Ho
  - School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China; Laboratory of Data Discovery for Health (D24H) Limited, Hong Kong Science Park, Hong Kong SAR, China
|
14
|
Mohammadi H, Thirunarayan K, Chen L. CVII: Enhancing Interpretability in Intelligent Sensor Systems via Computer Vision Interpretability Index. Sensors (Basel) 2023; 23:9893. [PMID: 38139738 PMCID: PMC10747164 DOI: 10.3390/s23249893] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/16/2023] [Revised: 12/09/2023] [Accepted: 12/14/2023] [Indexed: 12/24/2023]
Abstract
In the realm of intelligent sensor systems, the dependence on Artificial Intelligence (AI) applications has heightened the importance of interpretability. This is particularly critical for opaque models such as Deep Neural Networks (DNN), as understanding their decisions is essential, not only for ethical and regulatory compliance, but also for fostering trust in AI-driven outcomes. This paper introduces the novel concept of a Computer Vision Interpretability Index (CVII). The CVII framework is designed to emulate human cognitive processes, specifically in tasks related to vision. It addresses the intricate challenge of quantifying interpretability, a task that is inherently subjective and varies across domains. The CVII is rigorously evaluated using a range of computer vision models applied to the COCO (Common Objects in Context) dataset, a widely recognized benchmark in the field. The findings established a robust correlation between image interpretability, model selection, and CVII scores. This research makes a substantial contribution to enhancing interpretability for human comprehension, as well as within intelligent sensor applications. By promoting transparency and reliability in AI-driven decision-making, the CVII framework empowers its stakeholders to effectively harness the full potential of AI technologies.
Affiliation(s)
- Krishnaprasad Thirunarayan
  - Department of Computer Science and Engineering, Wright State University, Dayton, OH 45435, USA; (H.M.); (L.C.)
|
15
|
He S, Gao B, Sabnis R, Sun Q. Nucleic Transformer: Classifying DNA Sequences with Self-Attention and Convolutions. ACS Synth Biol 2023; 12:3205-3214. [PMID: 37916871 PMCID: PMC10863451 DOI: 10.1021/acssynbio.3c00154] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2023] [Revised: 10/04/2023] [Accepted: 10/06/2023] [Indexed: 11/03/2023]
Abstract
Much work has been done to apply machine learning and deep learning to genomics tasks, but these applications usually require extensive domain knowledge, and the resulting models provide very limited interpretability. Here, we present the Nucleic Transformer, a conceptually simple but effective and interpretable model architecture that excels in the classification of DNA sequences. The Nucleic Transformer employs self-attention and convolutions on nucleic acid sequences, leveraging two prominent deep learning strategies commonly used in computer vision and natural language analysis. We demonstrate that the Nucleic Transformer can be trained without much domain knowledge to achieve high performance in Escherichia coli promoter classification, viral genome identification, enhancer classification, and chromatin profile predictions.
Affiliation(s)
- Shujun He
  - Department of Chemical Engineering, Texas A&M University, College Station, Texas 77840, United States
- Baizhen Gao
  - Department of Chemical Engineering, Texas A&M University, College Station, Texas 77840, United States
- Rushant Sabnis
  - Department of Chemical Engineering, Texas A&M University, College Station, Texas 77840, United States
- Qing Sun
  - Department of Chemical Engineering, Texas A&M University, College Station, Texas 77840, United States
|
16
|
Zhang J, Lang M, Zhou Y, Zhang Y. Predicting RNA structures and functions by artificial intelligence. Trends Genet 2023; 40:S0168-9525(23)00229-9. [PMID: 39492264 DOI: 10.1016/j.tig.2023.10.001] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 05/25/2023] [Revised: 08/22/2023] [Accepted: 10/03/2023] [Indexed: 11/05/2024]
Abstract
RNA functions by interacting with its intended targets structurally. However, due to the dynamic nature of RNA molecules, RNA structures are difficult to determine experimentally or predict computationally. Artificial intelligence (AI) has revolutionized many biomedical fields and has been progressively utilized to deduce RNA structures, target binding, and associated functionality. Integrating structural and target binding information could also help improve the robustness of AI-based RNA function prediction and RNA design. Given the rapid development of deep learning (DL) algorithms, AI will provide an unprecedented opportunity to elucidate the sequence-structure-function relation of RNAs.
Affiliation(s)
- Jun Zhang
  - National Engineering Laboratory for Big Data System Computing Technology, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong, 518060, China
- Mei Lang
  - Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen, Guangdong, 518106, China
- Yaoqi Zhou
  - Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen, Guangdong, 518106, China
- Yang Zhang
  - School of Science, Harbin Institute of Technology, Shenzhen, Guangdong, 518055, China
|
17
|
Fang JC, Liu MJ. Translation initiation at AUG and non-AUG triplets in plants. Plant Sci 2023; 335:111822. [PMID: 37574140 DOI: 10.1016/j.plantsci.2023.111822] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2022] [Revised: 07/22/2023] [Accepted: 08/07/2023] [Indexed: 08/15/2023]
Abstract
In plants and other eukaryotes, precise selection of translation initiation site (TIS) on mRNAs shapes the proteome in response to cellular events or environmental cues. The canonical translation of mRNAs initiates at a 5' proximal AUG codon in a favorable context. However, the coding and non-coding regions of plant genomes contain numerous unannotated alternative AUG and non-AUG TISs. Determining how and why these unexpected and prevalent TISs are activated in plants has emerged as an exciting research area. In this review, we focus on the selection of plant TISs and highlight studies that revealed previously unannotated TISs used in vivo via comparative genomics and genome-wide profiling of ribosome positioning and protein N-terminal ends. The biological signatures of non-AUG TIS-initiated open reading frames (ORFs) in plants are also discussed. We describe what is understood about cis-regulatory RNA elements and trans-acting eukaryotic initiation factors (eIFs) in the site selection for translation initiation by featuring the findings in plants along with supporting findings in non-plant species. The prevalent, unannotated TISs provide a hidden reservoir of ORFs that likely help reshape plant proteomes in response to developmental or environmental cues. These findings underscore the importance of understanding the mechanistic basis of TIS selection to functionally annotate plant genomes, especially for crops with large genomes.
Affiliation(s)
- Jhen-Cheng Fang
  - Biotechnology Center in Southern Taiwan, Academia Sinica, Tainan 711, Taiwan
- Ming-Jung Liu
  - Biotechnology Center in Southern Taiwan, Academia Sinica, Tainan 711, Taiwan; Agricultural Biotechnology Research Center, Academia Sinica, Taipei 115, Taiwan
|
18
|
Nicolle R, Altin N, Siquier-Pernet K, Salignac S, Blanc P, Munnich A, Bole-Feysot C, Malan V, Caron B, Nitschké P, Desguerre I, Boddaert N, Rio M, Rausell A, Cantagrel V. A non-coding variant in the Kozak sequence of RARS2 strongly decreases protein levels and causes pontocerebellar hypoplasia. BMC Med Genomics 2023; 16:143. [PMID: 37344844 DOI: 10.1186/s12920-023-01582-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2022] [Accepted: 06/16/2023] [Indexed: 06/23/2023] Open
Abstract
Bi-allelic variants in the mitochondrial arginyl-transfer RNA synthetase (RARS2) gene have been implicated in early-onset encephalopathies classified as pontocerebellar hypoplasia (PCH) type 6 and in epileptic encephalopathy. A variant (NM_020320.3:c.-2A>G) in the promoter and 5'UTR of the RARS2 gene has been previously identified in a family with PCH. Only a mild impact of this variant on the mRNA level has been detected. As RARS2 is non-dosage-sensitive, this observation is not conclusive regarding the pathogenicity of the variant. We report and describe here a new patient with the same variant in the RARS2 gene, in the homozygous state. This patient presents with a clinical phenotype consistent with PCH6, although in the absence of lactic acidosis. In agreement with the previous study, we measured RARS2 mRNA levels in the patient's fibroblasts and detected partially preserved gene expression compared to control. Importantly, this variant is located in the Kozak sequence that controls translation initiation. Therefore, we investigated the impact on protein translation using a bioinformatic approach and western blotting. We show here that this variant, in addition to its effect on transcription, also disrupts the consensus Kozak sequence and has a major impact on RARS2 protein translation. Through the identification of this additional case and the characterization of the molecular consequences, we clarified the involvement of this Kozak variant in PCH and its effect on protein synthesis. This work also points to the current limitation in the pathogenicity prediction of variants located in the translation initiation region.
Affiliation(s)
- Romain Nicolle
  - Developmental Brain Disorders Laboratory, Université Paris Cité, INSERM UMR1163, Imagine Institute, 75015, Paris, France
  - Clinical Bioinformatics Laboratory, Université Paris Cité, INSERM UMR 1163, Imagine Institute, Paris, 75015, France
- Nami Altin
  - Developmental Brain Disorders Laboratory, Université Paris Cité, INSERM UMR1163, Imagine Institute, 75015, Paris, France
- Karine Siquier-Pernet
  - Developmental Brain Disorders Laboratory, Université Paris Cité, INSERM UMR1163, Imagine Institute, 75015, Paris, France
- Sherlina Salignac
  - Developmental Brain Disorders Laboratory, Université Paris Cité, INSERM UMR1163, Imagine Institute, 75015, Paris, France
- Pierre Blanc
  - Developmental Brain Disorders Laboratory, Université Paris Cité, INSERM UMR1163, Imagine Institute, 75015, Paris, France
  - Fédération de Génétique et Médecine Génomique, Service de Médecine Génomique des Maladies Rares, AP-HP, Necker Hospital for Sick Children, Paris, 75015, France
- Arnold Munnich
  - Fédération de Génétique et Médecine Génomique, Service de Médecine Génomique des Maladies Rares, AP-HP, Necker Hospital for Sick Children, Paris, 75015, France
- Christine Bole-Feysot
  - Genomics Platform, Université Paris Cité, INSERM UMR 1163, Imagine Institute, Paris, 75015, France
- Valérie Malan
  - Developmental Brain Disorders Laboratory, Université Paris Cité, INSERM UMR1163, Imagine Institute, 75015, Paris, France
  - Fédération de Génétique et Médecine Génomique, Service de Médecine Génomique des Maladies Rares, AP-HP, Necker Hospital for Sick Children, Paris, 75015, France
- Barthélémy Caron
  - Clinical Bioinformatics Laboratory, Université Paris Cité, INSERM UMR 1163, Imagine Institute, Paris, 75015, France
- Patrick Nitschké
  - Bioinformatics Core Facility, Université Paris Cité, INSERM UMR 1163, Imagine Institute, 75015, Paris, France
- Isabelle Desguerre
  - Département de Neurologie Pédiatrique, AP-HP, Necker Hospital for Sick Children, 75015, Paris, France
- Nathalie Boddaert
  - Département de Radiologie Pédiatrique, AP-HP, Necker Hospital for Sick Children and Université Paris Cité, INSERM UMR 1163 and INSERM U1299, Imagine Institute, Paris, 75015, France
- Marlène Rio
  - Developmental Brain Disorders Laboratory, Université Paris Cité, INSERM UMR1163, Imagine Institute, 75015, Paris, France
  - Fédération de Génétique et Médecine Génomique, Service de Médecine Génomique des Maladies Rares, AP-HP, Necker Hospital for Sick Children, Paris, 75015, France
- Antonio Rausell
  - Clinical Bioinformatics Laboratory, Université Paris Cité, INSERM UMR 1163, Imagine Institute, Paris, 75015, France
  - Fédération de Génétique et Médecine Génomique, Service de Médecine Génomique des Maladies Rares, AP-HP, Necker Hospital for Sick Children, Paris, 75015, France
- Vincent Cantagrel
  - Developmental Brain Disorders Laboratory, Université Paris Cité, INSERM UMR1163, Imagine Institute, 75015, Paris, France
|
19
|
Bedran G, Gasser HC, Weke K, Wang T, Bedran D, Laird A, Battail C, Zanzotto FM, Pesquita C, Axelson H, Rajan A, Harrison DJ, Palkowski A, Pawlik M, Parys M, O'Neill JR, Brennan PM, Symeonides SN, Goodlett DR, Litchfield K, Fahraeus R, Hupp TR, Kote S, Alfaro JA. The Immunopeptidome from a Genomic Perspective: Establishing the Noncanonical Landscape of MHC Class I-Associated Peptides. Cancer Immunol Res 2023; 11:747-762. [PMID: 36961404 PMCID: PMC10236148 DOI: 10.1158/2326-6066.cir-22-0621] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/01/2022] [Revised: 11/25/2022] [Accepted: 03/16/2023] [Indexed: 03/25/2023]
Abstract
Tumor antigens can emerge through multiple mechanisms, including translation of noncoding genomic regions. This noncanonical category of tumor antigens has recently gained attention; however, our understanding of how they recur within and between cancer types is still in its infancy. Therefore, we developed a proteogenomic pipeline based on deep learning de novo mass spectrometry (MS) to enable the discovery of noncanonical MHC class I-associated peptides (ncMAP) from noncoding regions. Considering that the emergence of tumor antigens can also involve posttranslational modifications (PTM), we included an open search component in our pipeline. Leveraging the wealth of MS-based immunopeptidomics, we analyzed data from 26 MHC class I immunopeptidomic studies across 11 different cancer types. We validated the de novo identified ncMAPs, along with the most abundant PTMs, using spectral matching and controlled their FDR to 1%. The noncanonical presentation appeared to be 5 times enriched for the A03 HLA supertype, with a projected population coverage of 55%. The data reveal an atlas of 8,601 ncMAPs with varying levels of cancer selectivity and suggest 17 cancer-selective ncMAPs as attractive therapeutic targets according to a stringent cutoff. In summary, the combination of the open-source pipeline and the atlas of ncMAPs reported herein could facilitate the identification and screening of ncMAPs as targets for T-cell therapies or vaccine development.
Affiliation(s)
- Georges Bedran
  - International Centre for Cancer Vaccine Science, University of Gdansk, Gdansk, Poland
- Kenneth Weke
  - International Centre for Cancer Vaccine Science, University of Gdansk, Gdansk, Poland
- Tongjie Wang
  - School of Informatics, University of Edinburgh, Edinburgh, United Kingdom
- Dominika Bedran
  - International Centre for Cancer Vaccine Science, University of Gdansk, Gdansk, Poland
- Alexander Laird
  - Urology Department, Western General Hospital, NHS Lothian, Edinburgh, United Kingdom
  - Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, United Kingdom
- Christophe Battail
  - CEA, Grenoble Alpes University, INSERM, IRIG, Biosciences and Bioengineering for Health Laboratory (BGE) - UA13 INSERM-CEA-UGA, Grenoble, France
- Catia Pesquita
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisboa, Portugal
- Håkan Axelson
  - Division of Translational Cancer Research, Department of Laboratory Medicine, Lund University, Lund, Sweden
- Ajitha Rajan
  - School of Informatics, University of Edinburgh, Edinburgh, United Kingdom
- David J. Harrison
  - School of Medicine, University of St Andrews, St Andrews, United Kingdom
- Aleksander Palkowski
  - International Centre for Cancer Vaccine Science, University of Gdansk, Gdansk, Poland
- Maciej Pawlik
  - Academic Computer Centre CYFRONET, AGH University of Science and Technology, Cracow, Poland
- Maciej Parys
  - Royal (Dick) School of Veterinary Studies and The Roslin Institute, University of Edinburgh, Edinburgh, United Kingdom
- J. Robert O'Neill
  - Cambridge Oesophagogastric Centre, Cambridge University Hospitals NHS Foundation Trust, Cambridge, United Kingdom
- Paul M. Brennan
  - Translational Neurosurgery, Centre for Clinical Brain Sciences, University of Edinburgh, Edinburgh, United Kingdom
- Stefan N. Symeonides
  - Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, United Kingdom
- David R. Goodlett
  - International Centre for Cancer Vaccine Science, University of Gdansk, Gdansk, Poland
  - Department of Biochemistry and Microbiology, University of Victoria, Victoria, Canada
  - University of Victoria Genome BC Proteome Centre, Victoria, Canada
- Kevin Litchfield
  - Cancer Research UK Lung Cancer Centre of Excellence, University College London Cancer Institute, London, United Kingdom
  - Tumour Immunogenomics and Immunosurveillance Laboratory, University College London Cancer Institute, London, United Kingdom
- Robin Fahraeus
  - International Centre for Cancer Vaccine Science, University of Gdansk, Gdansk, Poland
  - Inserm UMRS1131, Institut de Génétique Moléculaire, Université Paris 7, Paris, France
- Ted R. Hupp
  - International Centre for Cancer Vaccine Science, University of Gdansk, Gdansk, Poland
  - Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, United Kingdom
- Sachin Kote
  - International Centre for Cancer Vaccine Science, University of Gdansk, Gdansk, Poland
- Javier A. Alfaro
  - International Centre for Cancer Vaccine Science, University of Gdansk, Gdansk, Poland
  - School of Informatics, University of Edinburgh, Edinburgh, United Kingdom
  - Department of Biochemistry and Microbiology, University of Victoria, Victoria, Canada
|
20
|
Barbero-Aparicio JA, Olivares-Gil A, Díez-Pastor JF, García-Osorio C. Deep learning and support vector machines for transcription start site identification. PeerJ Comput Sci 2023; 9:e1340. [PMID: 37346545 PMCID: PMC10280436 DOI: 10.7717/peerj-cs.1340] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2022] [Accepted: 03/21/2023] [Indexed: 06/23/2023]
Abstract
Recognizing transcription start sites is key to gene identification. Several approaches have been employed in related problems such as detecting translation initiation sites or promoters, many of the most recent ones based on machine learning. Deep learning methods have been proven to be exceptionally effective for this task, but their use in transcription start site identification has not yet been explored in depth. Also, the very few existing works do not compare their methods to support vector machines (SVMs), the most established technique in this area of study, nor provide the curated dataset used in the study. The small number of published papers on this specific problem could be explained by this lack of datasets. Given that both support vector machines and deep neural networks have been applied in related problems with remarkable results, we compared their performance in transcription start site predictions, concluding that SVMs are computationally much slower, and that deep learning methods, especially long short-term memory neural networks (LSTMs), are better suited to working with sequences than SVMs. For such a purpose, we used the reference human genome GRCh38. Additionally, we studied two different aspects related to data processing: the proper way to generate training examples and the imbalanced nature of the data. Furthermore, the generalization performance of the models studied was also tested using the mouse genome, where the LSTM neural network stood out from the rest of the algorithms. To sum up, this article provides an analysis of the best architecture choices in transcription start site identification, as well as a method to generate transcription start site datasets including negative instances on any species available in Ensembl. We found that deep learning methods are better suited than SVMs to solve this problem, being more efficient and better adapted to long sequences and large amounts of data. We also create a transcription start site (TSS) dataset large enough to be used in deep learning experiments.
Affiliation(s)
- Alicia Olivares-Gil
  - Departamento de Ingeniería Informática, Universidad de Burgos, Burgos, Spain
- José F. Díez-Pastor
  - Departamento de Ingeniería Informática, Universidad de Burgos, Burgos, Spain
- César García-Osorio
  - Departamento de Ingeniería Informática, Universidad de Burgos, Burgos, Spain
|
21
|
Clauwaert J, McVey Z, Gupta R, Menschaert G. TIS Transformer: remapping the human proteome using deep learning. NAR Genom Bioinform 2023; 5:lqad021. [PMID: 36879896 PMCID: PMC9985340 DOI: 10.1093/nargab/lqad021] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 10/27/2022] [Revised: 01/20/2023] [Accepted: 02/14/2023] [Indexed: 03/07/2023] Open
Abstract
The correct mapping of the proteome is an important step towards advancing our understanding of biological systems and cellular mechanisms. Methods that provide better mappings can fuel important processes such as drug discovery and disease understanding. Currently, true determination of translation initiation sites is primarily achieved by in vivo experiments. Here, we propose TIS Transformer, a deep learning model for the determination of translation start sites solely utilizing the information embedded in the transcript nucleotide sequence. The method is built upon deep learning techniques first designed for natural language processing. We prove this approach to be best suited for learning the semantics of translation, outperforming previous approaches by a large margin. We demonstrate that limitations in the model performance are primarily due to the presence of low-quality annotations against which the model is evaluated. Advantages of the method are its ability to detect key features of the translation process and multiple coding sequences on a transcript. These include micropeptides encoded by short Open Reading Frames, either alongside a canonical coding sequence or within long non-coding RNAs. To demonstrate the use of our methods, we applied TIS Transformer to remap the full human proteome.
Affiliation(s)
- Jim Clauwaert
  - Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Oost-Vlaanderen 9000, Belgium
- Zahra McVey
  - Novo Nordisk Research Centre Oxford, Novo Nordisk Ltd., Crawley, South East England, RH6 0PA, UK
- Ramneek Gupta
  - Novo Nordisk Research Centre Oxford, Novo Nordisk Ltd., Crawley, South East England, RH6 0PA, UK
- Gerben Menschaert
  - Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Oost-Vlaanderen 9000, Belgium
|
22
|
A dynamical stochastic model of yeast translation across the cell cycle. Heliyon 2023; 9:e13101. [PMID: 36793957 PMCID: PMC9922973 DOI: 10.1016/j.heliyon.2023.e13101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2022] [Revised: 01/04/2023] [Accepted: 01/16/2023] [Indexed: 01/27/2023] Open
Abstract
Translation is a central step in gene expression; however, its quantitative and time-resolved regulation is poorly understood. We developed a discrete, stochastic model for protein translation in S. cerevisiae in a whole-transcriptome, single-cell context. A "base case" scenario representing an average cell highlights translation initiation rates as the main co-translational regulatory parameters. Codon usage bias emerges as a secondary regulatory mechanism through ribosome stalling. Demand for anticodons with low abundance is shown to cause above-average ribosome dwelling times. Codon usage bias correlates strongly both with protein synthesis rates and elongation rates. Applying the model to a time-resolved transcriptome estimated by combining data from FISH and RNA-Seq experiments, it could be shown that increased total transcript abundance during the cell cycle decreases translation efficiency at the single-transcript level. Translation efficiency grouped by gene function shows highest values for ribosomal and glycolytic genes. Ribosomal proteins peak in S phase while glycolytic proteins rank highest in later cell cycle phases.
|
23
|
He S, Gao B, Sabnis R, Sun Q. RNAdegformer: accurate prediction of mRNA degradation at nucleotide resolution with deep learning. Brief Bioinform 2023; 24:bbac581. [PMID: 36633966 PMCID: PMC9851316 DOI: 10.1093/bib/bbac581] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2022] [Revised: 11/14/2022] [Accepted: 11/28/2022] [Indexed: 01/13/2023] Open
Abstract
Messenger RNA-based therapeutics have shown tremendous potential, as demonstrated by the rapid development of messenger RNA-based vaccines for COVID-19. Nevertheless, distribution of mRNA vaccines worldwide has been hampered by mRNA's inherent thermal instability due to in-line hydrolysis, a chemical degradation reaction. Therefore, predicting and understanding RNA degradation is a crucial and urgent task. Here we present RNAdegformer, an effective and interpretable model architecture that excels in predicting RNA degradation. RNAdegformer processes RNA sequences with self-attention and convolutions, two deep learning techniques that have proved dominant in the fields of computer vision and natural language processing, while utilizing biophysical features of RNA. We demonstrate that RNAdegformer outperforms previous best methods at predicting degradation properties at nucleotide resolution for COVID-19 mRNA vaccines. RNAdegformer predictions also exhibit improved correlation with RNA in vitro half-life compared with previous best methods. Additionally, we showcase how direct visualization of self-attention maps assists informed decision-making. Further, our model reveals important features in determining mRNA degradation rates via leave-one-feature-out analysis.
Affiliation(s)
- Shujun He
- Department of Chemical Engineering, Texas A&M University, 100 Spence St, 77843, Texas, United States
| | - Baizhen Gao
- Department of Chemical Engineering, Texas A&M University, 100 Spence St, 77843, Texas, United States
| | - Rushant Sabnis
- Department of Chemical Engineering, Texas A&M University, 100 Spence St, 77843, Texas, United States
| | - Qing Sun
- Department of Chemical Engineering, Texas A&M University, 100 Spence St, 77843, Texas, United States
| |
|
24
|
Barbero-Aparicio JA, Cuesta-Lopez S, García-Osorio CI, Pérez-Rodríguez J, García-Pedrajas N. Nonlinear physics opens a new paradigm for accurate transcription start site prediction. BMC Bioinformatics 2022; 23:565. [PMID: 36585618 PMCID: PMC9801560 DOI: 10.1186/s12859-022-05129-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2022] [Accepted: 12/27/2022] [Indexed: 12/31/2022] Open
Abstract
There is evidence that DNA breathing (spontaneous opening of the DNA strands) plays a relevant role in the interactions of DNA with other molecules, and in particular in the transcription process. Therefore, having physical models that can predict these openings is of interest. However, this source of information has not previously been used for either transcription start site (TSS) or promoter prediction. In this article, one such model is used as an additional information source that, when used by a machine learning (ML) model, improves the results of current methods for the prediction of TSSs. In addition, we provide evidence for the validity of the physical model, as it is able by itself to predict TSSs with high accuracy. This opens an exciting avenue of research at the intersection of statistical mechanics and ML, where ML models in bioinformatics can be improved using physical models of DNA as feature extractors.
Affiliation(s)
- José Antonio Barbero-Aparicio
- Departamento de Informática, Universidad de Burgos, Avda. de Cantabria s/n, 09006 Burgos, Spain (GRID: grid.23520.36; ISNI: 0000 0000 8569 1592)
| | - Santiago Cuesta-Lopez
- Universidad de Burgos, Hospital del Rey, s/n, 09001 Burgos, Spain (GRID: grid.23520.36; ISNI: 0000 0000 8569 1592); ICAMCyL Foundation, Internacional Center for Advanced Materials and Raw Materials of Castilla y León, León Technology Park, main building, first floor, offices 106-108, C/Julia Morros s/n, Armunia, 24009 León, Spain
| | - César Ignacio García-Osorio
- Departamento de Informática, Universidad de Burgos, Avda. de Cantabria s/n, 09006 Burgos, Spain (GRID: grid.23520.36; ISNI: 0000 0000 8569 1592)
| | - Javier Pérez-Rodríguez
- Departamento de Métodos Cuantitativos, Universidad de Loyola Andalucía, Escritor Castilla Aguayo, 4, 14004 Córdoba, Spain (GRID: grid.449008.1; ISNI: 0000 0004 1795 4150)
| | - Nicolás García-Pedrajas
- Department of Computing and Numerical Analysis, University of Córdoba, Edificio Albert Einstein, Campus de Rabanales, 14071 Córdoba, Spain (GRID: grid.411901.c; ISNI: 0000 0001 2183 9102)
| |
|
25
|
Jiang P, Gao F, Liu S, Zhang S, Zhang X, Xia Z, Zhang W, Jiang T, Zhu JL, Zhang Z, Shu Q, Snyder M, Li J. Longitudinally tracking personal physiomes for precision management of childhood epilepsy. PLOS DIGITAL HEALTH 2022; 1:e0000161. [PMID: 36812648 PMCID: PMC9931296 DOI: 10.1371/journal.pdig.0000161] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2021] [Accepted: 11/13/2022] [Indexed: 12/24/2022]
Abstract
Our current understanding of human physiology and activities is largely derived from sparse and discrete individual clinical measurements. To achieve precise, proactive, and effective health management of an individual, longitudinal and dense tracking of personal physiomes and activities is required, which is only feasible by utilizing wearable biosensors. As a pilot study, we implemented a cloud computing infrastructure to integrate wearable sensors, mobile computing, digital signal processing, and machine learning to improve early detection of seizure onsets in children. We recruited 99 children diagnosed with epilepsy and longitudinally tracked them at single-second resolution using a wearable wristband, and prospectively acquired more than one billion data points. This unique dataset offered us an opportunity to quantify physiological dynamics (e.g., heart rate, stress response) across age groups and to identify physiological irregularities upon epilepsy onset. The high-dimensional personal physiome and activity profiles displayed a clustering pattern anchored by patient age groups. These signatory patterns included strong age and sex-specific effects on varying circadian rhythms and stress responses across major childhood developmental stages. For each patient, we further compared the physiological and activity profiles associated with seizure onsets with the personal baseline and developed a machine learning framework to accurately capture these onset moments. The performance of this framework was further replicated in another independent patient cohort. We next referenced our predictions with the electroencephalogram (EEG) signals on selected patients and demonstrated that our approach could detect subtle seizures not recognized by humans and could detect seizures prior to clinical onset. Our work demonstrated the feasibility of a real-time mobile infrastructure in a clinical setting, which has the potential to be valuable in caring for epileptic patients. Extension of such a system has the potential to be leveraged as a health management device or longitudinal phenotyping tool in clinical cohort studies.
Affiliation(s)
- Peifang Jiang
- National Clinical Research Center for Child Health, Children’s Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Feng Gao
- National Clinical Research Center for Child Health, Children’s Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Sixing Liu
- SensOmics, Inc. Burlingame, California, United States of America
| | - Sai Zhang
- SensOmics, Inc. Burlingame, California, United States of America
| | - Xicheng Zhang
- National Clinical Research Center for Child Health, Children’s Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Department of Genetics, Center for Genomics and Personalized Medicine, Stanford University School of Medicine, Stanford, California, United States of America
| | - Zhezhi Xia
- National Clinical Research Center for Child Health, Children’s Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Weiqin Zhang
- National Clinical Research Center for Child Health, Children’s Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Tiejia Jiang
- National Clinical Research Center for Child Health, Children’s Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Jason L. Zhu
- SensOmics, Inc. Burlingame, California, United States of America
| | - Zhaolei Zhang
- SensOmics, Inc. Burlingame, California, United States of America
- Donnelly Centre, Department of Computer Science and Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
- * E-mail: (ZZ); (QS); (MS); (JL)
| | - Qiang Shu
- National Clinical Research Center for Child Health, Children’s Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Michael Snyder
- SensOmics, Inc. Burlingame, California, United States of America
| | - Jingjing Li
- SensOmics, Inc. Burlingame, California, United States of America
| |
|
26
|
Gleason AC, Ghadge G, Sonobe Y, Roos RP. Kozak Similarity Score Algorithm Identifies Alternative Translation Initiation Codons Implicated in Cancers. Int J Mol Sci 2022; 23:ijms231810564. [PMID: 36142475 PMCID: PMC9506484 DOI: 10.3390/ijms231810564] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 08/23/2022] [Revised: 09/05/2022] [Accepted: 09/08/2022] [Indexed: 11/16/2022] Open
Abstract
Ribosome profiling and mass spectrometry have identified canonical and noncanonical translation initiation codons (TICs) that are upstream of the main translation initiation site and used to translate oncogenic proteins. There have previously been conflicting reports about the patterns of nucleotides that surround noncanonical TICs. Here, we use a Kozak Similarity Score algorithm to find that nearly all of these TICs have flanking nucleotides closely matching the Kozak sequence. Remarkably, the nucleotides flanking alternative noncanonical TICs are frequently closer to the Kozak sequence than the nucleotides flanking TICs used to translate the gene’s main protein. Of note, the 5′ untranslated region (5′UTR) of cancer-associated genes with an upstream TIC tends to be significantly longer than the same region in genes not associated with cancer. The presence of a longer-than-typical 5′UTR increases the likelihood of ribosome binding to upstream noncanonical TICs, and may be a distinguishing feature of a number of genes overexpressed in cancer. Noncanonical TICs that are located in the 5′UTR, although thought by some to be disadvantageous and suppressed by evolution, may translate oncogenic proteins because of their flanking nucleotides.
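The abstract does not spell out how the Kozak Similarity Score is computed, so the following is only a minimal sketch of the general idea of scoring flanking nucleotides against a Kozak-like consensus. The consensus positions (purine at -3, G at +4, GCC-like context upstream), the weights, and the function name are all assumptions chosen for illustration, not the published parameters.

```python
# Illustrative only: consensus and weights are assumptions, not the
# published Kozak Similarity Score parameters.
# Positions are relative to the first base of the initiation codon (+1);
# there is no position 0, so -1 is the base directly upstream of the codon.
CONSENSUS = {-6: "G", -5: "C", -4: "C", -3: "AG", -2: "C", -1: "C", 4: "G"}
WEIGHTS = {-6: 1, -5: 1, -4: 1, -3: 3, -2: 1, -1: 1, 4: 3}


def kozak_similarity(seq: str, start: int) -> float:
    """Weighted fraction of consensus positions matched around seq[start:start+3].

    `start` is the 0-based index of the first base of the candidate codon.
    Positions that fall outside the sequence count as mismatches.
    """
    seq = seq.upper().replace("U", "T")
    total = sum(WEIGHTS.values())
    score = 0
    for pos, allowed in CONSENSUS.items():
        idx = start + pos if pos < 0 else start + pos - 1
        if 0 <= idx < len(seq) and seq[idx] in allowed:
            score += WEIGHTS[pos]
    return score / total


print(kozak_similarity("GCCACCATGG", 6))  # 1.0, every scored position matches
print(kozak_similarity("TTTTTTCTGT", 6))  # 0.0, no scored position matches
```

Because the score depends only on the flanking bases, the same function can be applied to ATG and to near-cognate codons such as CTG, which is what makes a comparison between canonical and noncanonical sites possible.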
|
27
|
Liu Q, Fang H, Wang X, Wang M, Li S, Coin LJM, Li F, Song J. DeepGenGrep: a general deep learning-based predictor for multiple genomic signals and regions. Bioinformatics 2022; 38:4053-4061. [PMID: 35799358 DOI: 10.1093/bioinformatics/btac454] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2022] [Revised: 04/11/2022] [Accepted: 07/06/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION: Accurate annotation of different genomic signals and regions (GSRs) from DNA sequences is fundamentally important for understanding gene structure, regulation and function. Numerous efforts have been made to develop machine learning-based predictors for in silico identification of GSRs. However, it remains a great challenge to identify GSRs as the performance of most existing approaches is unsatisfactory. As such, it is highly desirable to develop more accurate computational methods for GSR prediction. RESULTS: In this study, we propose a general deep learning framework termed DeepGenGrep, a general predictor for the systematic identification of multiple different GSRs from genomic DNA sequences. DeepGenGrep leverages the power of hybrid neural networks comprising a three-layer convolutional neural network and a two-layer long short-term memory to effectively learn useful feature representations from sequences. Benchmarking experiments demonstrate that DeepGenGrep outperforms several state-of-the-art approaches on identifying polyadenylation signals, translation initiation sites and splice sites across four eukaryotic species including Homo sapiens, Mus musculus, Bos taurus and Drosophila melanogaster. Overall, DeepGenGrep represents a useful tool for the high-throughput and cost-effective identification of potential GSRs in eukaryotic genomes. AVAILABILITY AND IMPLEMENTATION: The webserver and source code are freely available at http://bigdata.biocie.cn/deepgengrep/home and Github (https://github.com/wx-cie/DeepGenGrep/). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Quanzhong Liu
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling 712100, China
| | - Honglin Fang
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling 712100, China
| | - Xiao Wang
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling 712100, China
| | - Miao Wang
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling 712100, China
| | - Shuqin Li
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling 712100, China
| | - Lachlan J M Coin
- Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, VIC 3000, Australia
| | - Fuyi Li
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling 712100, China; Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, VIC 3000, Australia
| | - Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
| |
|
28
|
Abstract
The tremendous amount of biological sequence data available, combined with the recent methodological breakthrough in deep learning in domains such as computer vision or natural language processing, is leading today to the transformation of bioinformatics through the emergence of deep genomics, the application of deep learning to genomic sequences. We review here the new applications that the use of deep learning enables in the field, focusing on three aspects: the functional annotation of genomes, the sequence determinants of the genome functions and the possibility to write synthetic genomic sequences.
|
29
|
Machine learning predicts translation initiation sites in neurologic diseases with nucleotide repeat expansions. PLoS One 2022; 17:e0256411. [PMID: 35648796 PMCID: PMC9159584 DOI: 10.1371/journal.pone.0256411] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 08/03/2021] [Accepted: 05/16/2022] [Indexed: 11/19/2022] Open
Abstract
A number of neurologic diseases associated with expanded nucleotide repeats, including an inherited form of amyotrophic lateral sclerosis, have an unconventional form of translation called repeat-associated non-AUG (RAN) translation. It has been speculated that the repeat regions in the RNA fold into secondary structures in a length-dependent manner, promoting RAN translation. Repeat protein products are translated, accumulate, and may contribute to disease pathogenesis. Nucleotides that flank the repeat region, especially ones closest to the initiation site, are believed to enhance translation initiation. A machine learning model has been published to help identify ATG and near-cognate translation initiation sites; however, this model has diminished predictive power due to its extensive feature selection and limited training data. Here, we overcome this limitation and increase prediction accuracy by the following: a) capture the effect of nucleotides most critical for translation initiation via feature reduction, b) implement an alternative machine learning algorithm better suited for limited data, c) build comprehensive and balanced training data (via sampling without replacement) that includes previously unavailable sequences, and d) split ATG and near-cognate translation initiation codon data to train two separate models. We also design a supplementary scoring system to provide an additional prognostic assessment of model predictions. The resultant models have high performance, with ~85-88% accuracy, exceeding that of the previously published model by >18%. The models presented here are used to identify translation initiation sites in genes associated with a number of neurologic repeat expansion disorders. The results confirm a number of sites of translation initiation upstream of the expanded repeats that have been found experimentally, and predict sites that are not yet established.
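Steps (c) and (d) of this abstract, balanced training data built by sampling without replacement and separate data sets for ATG and near-cognate codons, can be sketched roughly as follows. The record layout (`codon` and `label` keys), the choice to downsample the larger class, and the function names are assumptions for illustration; the abstract does not describe the actual procedure in this detail.

```python
import random


def split_by_codon(examples):
    """Separate examples into ATG and near-cognate groups (step d)."""
    atg = [e for e in examples if e["codon"] == "ATG"]
    near_cognate = [e for e in examples if e["codon"] != "ATG"]
    return atg, near_cognate


def balance(examples, seed=0):
    """Downsample the larger class so positives and negatives are equal (step c).

    random.Random.sample draws without replacement, so no example is
    duplicated in the balanced set.
    """
    positives = [e for e in examples if e["label"] == 1]
    negatives = [e for e in examples if e["label"] == 0]
    n = min(len(positives), len(negatives))
    rng = random.Random(seed)
    return rng.sample(positives, n) + rng.sample(negatives, n)


# Hypothetical toy records: label 1 = true initiation site, 0 = non-site.
data = (
    [{"codon": "ATG", "label": 1}] * 5
    + [{"codon": "ATG", "label": 0}] * 2
    + [{"codon": "CTG", "label": 1}] * 1
    + [{"codon": "CTG", "label": 0}] * 4
)
atg, near_cognate = split_by_codon(data)
atg_train = balance(atg)                    # 2 positives + 2 negatives
near_cognate_train = balance(near_cognate)  # 1 positive + 1 negative
```

Each balanced set would then feed its own model, so that the rarer near-cognate sites are not swamped by ATG examples during training.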
|
30
|
Jo T, Nho K, Bice P, Saykin AJ. Deep learning-based identification of genetic variants: application to Alzheimer's disease classification. Brief Bioinform 2022; 23:bbac022. [PMID: 35183061 PMCID: PMC8921609 DOI: 10.1093/bib/bbac022] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Received: 09/17/2021] [Revised: 01/13/2022] [Accepted: 01/17/2022] [Indexed: 01/29/2023] Open
Abstract
Deep learning is a promising tool that uses nonlinear transformations to extract features from high-dimensional data. Deep learning is challenging in genome-wide association studies (GWAS) with high-dimensional genomic data. Here we propose a novel three-step approach (SWAT-CNN) for identification of genetic variants using deep learning to identify phenotype-related single nucleotide polymorphisms (SNPs) that can be applied to develop accurate disease classification models. In the first step, we divided the whole genome into nonoverlapping fragments of an optimal size and then ran a convolutional neural network (CNN) on each fragment to select phenotype-associated fragments. In the second step, using a Sliding Window Association Test (SWAT), we ran the CNN on the selected fragments to calculate phenotype influence scores (PIS) and identify phenotype-associated SNPs based on PIS. In the third step, we ran the CNN on all identified SNPs to develop a classification model. We tested our approach using GWAS data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) (N = 981; cognitively normal older adults (CN) = 650 and AD = 331). Our approach identified the well-known APOE region as the most significant genetic locus for AD. Our classification model achieved an area under the curve (AUC) of 0.82, which was comparable with traditional machine learning approaches, random forest and XGBoost. SWAT-CNN, a novel deep learning-based genome-wide approach, identified AD-associated SNPs and a classification model for AD and may hold promise for a range of biomedical applications.
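The indexing behind the first two steps, nonoverlapping fragments followed by sliding windows inside the selected fragments, can be sketched as below. Only the windowing is shown; the CNN and the phenotype influence score are not reproduced, and the function names, fragment size, and window width are hypothetical values chosen for illustration.

```python
def fragments(n_snps, size):
    """Step 1: nonoverlapping (start, end) index ranges covering n_snps SNPs.

    `end` is exclusive; the last fragment may be shorter than `size`.
    """
    return [(s, min(s + size, n_snps)) for s in range(0, n_snps, size)]


def sliding_windows(start, end, width, step=1):
    """Step 2: overlapping windows of `width` SNPs inside one selected fragment."""
    return [(s, s + width) for s in range(start, end - width + 1, step)]


print(fragments(10, 4))          # [(0, 4), (4, 8), (8, 10)]
print(sliding_windows(0, 5, 3))  # [(0, 3), (1, 4), (2, 5)]
```

In a full pipeline, a model would be scored on each window and the per-window results combined into a per-SNP score; how SWAT-CNN does this is described in the paper rather than in the abstract.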
Affiliation(s)
- Taeho Jo
- Department of Radiology and Imaging Sciences, Center for Neuroimaging, Indiana University School of Medicine, Indianapolis, IN, USA
- Indiana Alzheimer’s Disease Research Center, Indiana University School of Medicine, Indianapolis, IN, USA
- Indiana University Network Science Institute, Bloomington, IN, USA
| | - Kwangsik Nho
- Department of Radiology and Imaging Sciences, Center for Neuroimaging, Indiana University School of Medicine, Indianapolis, IN, USA
- Indiana Alzheimer’s Disease Research Center, Indiana University School of Medicine, Indianapolis, IN, USA
- Indiana University Network Science Institute, Bloomington, IN, USA
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN, USA
| | - Paula Bice
- Department of Radiology and Imaging Sciences, Center for Neuroimaging, Indiana University School of Medicine, Indianapolis, IN, USA
- Indiana Alzheimer’s Disease Research Center, Indiana University School of Medicine, Indianapolis, IN, USA
| | - Andrew J Saykin
- Department of Radiology and Imaging Sciences, Center for Neuroimaging, Indiana University School of Medicine, Indianapolis, IN, USA
- Indiana Alzheimer’s Disease Research Center, Indiana University School of Medicine, Indianapolis, IN, USA
- Indiana University Network Science Institute, Bloomington, IN, USA
- Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, USA
| | | |
|
31
|
Jankovic B, Gojobori T. From shallow to deep: some lessons learned from application of machine learning for recognition of functional genomic elements in human genome. Hum Genomics 2022; 16:7. [PMID: 35180894 PMCID: PMC8855580 DOI: 10.1186/s40246-022-00376-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/26/2021] [Accepted: 01/02/2022] [Indexed: 11/25/2022] Open
Abstract
Identification of genomic signals as indicators for functional genomic elements is one of the areas that received early and widespread application of machine learning methods. With time, the methods applied grew in variety and generally exhibited a tendency to improve their ability to identify some major genomic and transcriptomic signals. The evolution of machine learning in genomics followed a similar path to applications of machine learning in other fields. These were impacted in a major way by three dominant developments, namely an enormous increase in availability and quality of data, a significant increase in computational power available to machine learning applications, and finally, new machine learning paradigms, of which deep learning is the most well-known example. It is not easy in general to distinguish factors leading to improvements in results of applications of machine learning. This is even more so in the field of genomics, where the advent of next-generation sequencing and the increased ability to perform functional analysis of raw data have had a major effect on the applicability of machine learning in OMICS fields. In this paper, we survey the results from a subset of published work on the application of machine learning to the recognition of genomic signals and regions in the human genome and summarize some lessons learnt from this endeavor. There is no doubt that significant progress has been made in terms of both accuracy and reliability of models. Questions remain, however, as to whether the progress has been sufficient and what these developments bring to the field of genomics in general and human genomics in particular. Improving usability, interpretability and accuracy of models remains an important open challenge for current and future research in the application of machine learning and, more generally, of artificial intelligence methods in genomics.
Affiliation(s)
- Boris Jankovic
- Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| | - Takashi Gojobori
- Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia; Division of Biological and Environmental Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| |
|
32
|
Guo Y, Zhou D, Cao J, Nie R, Ruan X, Liu Y. Gated residual neural networks with self-normalization for translation initiation site recognition. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2021.107783] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 12/22/2022]
|
33
|
Willems P, Ndah E, Jonckheere V, Van Breusegem F, Van Damme P. To New Beginnings: Riboproteogenomics Discovery of N-Terminal Proteoforms in Arabidopsis Thaliana. FRONTIERS IN PLANT SCIENCE 2022; 12:778804. [PMID: 35069635 PMCID: PMC8770321 DOI: 10.3389/fpls.2021.778804] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 09/17/2021] [Accepted: 11/18/2021] [Indexed: 06/14/2023]
Abstract
Alternative translation initiation is a widespread event in biology that can shape multiple protein forms or proteoforms from a single gene. However, the respective contribution of alternative translation to protein complexity remains largely enigmatic. By complementary ribosome profiling and N-terminal proteomics (i.e., riboproteogenomics), we provide clear-cut evidence for ~90 N-terminal proteoform pairs shaped by (alternative) translation initiation in Arabidopsis thaliana. Next to several cases additionally confirmed by directed mutagenesis, identified alternative protein N-termini follow the enzymatic rules of co-translational N-terminal protein acetylation and initiator methionine removal. In contrast to other eukaryotic models, N-terminal acetylation in plants cannot generally be considered as a proxy of translation initiation because of its posttranslational occurrence on mature proteolytic neo-termini (N-termini) localized in the chloroplast stroma. Quantification of N-terminal acetylation revealed differing co- vs. posttranslational N-terminal acetylation patterns. Intriguingly, our data additionally hint at alternative translation initiation serving as a common mechanism to supply protein copies in multiple cellular compartments, as alternative translation sites are often in close proximity to cleavage sites of N-terminal transit sequences of nuclear-encoded chloroplastic and mitochondrial proteins. Overall, riboproteogenomics screening enables the identification of (differentially localized) N-terminal proteoforms raised upon alternative translation.
Affiliation(s)
- Patrick Willems
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
- Vlaams Instituut voor Biotechnologie (VIB)-Center for Plant Systems Biology, Ghent, Belgium
| | - Elvis Ndah
- integrative Riboproteogenomics, Interactomics and Proteomics Unit, Laboratory of Microbiology, Department of Biochemistry and Microbiology, Ghent University, Ghent, Belgium
| | - Veronique Jonckheere
- integrative Riboproteogenomics, Interactomics and Proteomics Unit, Laboratory of Microbiology, Department of Biochemistry and Microbiology, Ghent University, Ghent, Belgium
| | - Frank Van Breusegem
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
- Vlaams Instituut voor Biotechnologie (VIB)-Center for Plant Systems Biology, Ghent, Belgium
| | - Petra Van Damme
- integrative Riboproteogenomics, Interactomics and Proteomics Unit, Laboratory of Microbiology, Department of Biochemistry and Microbiology, Ghent University, Ghent, Belgium
| |
|
34
|
Kang YJ, Li JY, Ke L, Jiang S, Yang DC, Hou M, Gao G. Quantitative model suggests both intrinsic and contextual features contribute to the transcript coding ability determination in cells. Brief Bioinform 2021; 23:6445106. [PMID: 34849565 DOI: 10.1093/bib/bbab483] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2021] [Revised: 10/18/2021] [Accepted: 10/23/2021] [Indexed: 11/13/2022] Open
Abstract
Gene transcription and protein translation are two key steps of the 'central dogma.' It is still a major challenge to quantitatively deconvolute factors contributing to the coding ability of transcripts in mammals. Here, we propose ribosome calculator (RiboCalc) for quantitatively modeling the coding ability of RNAs in the human genome. In addition to predicting the experimentally confirmed coding abundance with high accuracy from sequence and transcription features, RiboCalc provides interpretable parameters with biological information. Large-scale analysis further revealed a number of transcripts whose coding ability varies across distinct cell types (i.e. context-dependent coding transcripts), suggesting that, contrary to conventional wisdom, a transcript's coding ability should be modeled as a continuous spectrum with a context-dependent nature.
Affiliation(s)
- Yu-Jian Kang
- Biomedical Pioneering Innovation Center (BIOPIC), Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing, 100871, China
| | - Jing-Yi Li
- Biomedical Pioneering Innovation Center (BIOPIC), Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing, 100871, China
| | - Lan Ke
- Biomedical Pioneering Innovation Center (BIOPIC), Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing, 100871, China
| | - Shuai Jiang
- Biomedical Pioneering Innovation Center (BIOPIC), Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing, 100871, China
| | - De-Chang Yang
- Biomedical Pioneering Innovation Center (BIOPIC), Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing, 100871, China
| | - Mei Hou
- Biomedical Pioneering Innovation Center (BIOPIC), Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing, 100871, China
| | - Ge Gao
- Biomedical Pioneering Innovation Center (BIOPIC), Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing, 100871, China
| |
|
35
|
Zhang Y, Tino P, Leonardis A, Tang K. A Survey on Neural Network Interpretability. IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE 2021. [DOI: 10.1109/tetci.2021.3100641] [Citation(s) in RCA: 53] [Impact Index Per Article: 13.3] [Indexed: 11/08/2022]
|
36
|
Ichihara K, Matsumoto A, Nishida H, Kito Y, Shimizu H, Shichino Y, Iwasaki S, Imami K, Ishihama Y, Nakayama KI. Combinatorial analysis of translation dynamics reveals eIF2 dependence of translation initiation at near-cognate codons. Nucleic Acids Res 2021; 49:7298-7317. [PMID: 34226921 PMCID: PMC8287933 DOI: 10.1093/nar/gkab549] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 04/08/2021] [Revised: 06/08/2021] [Accepted: 06/11/2021] [Indexed: 02/05/2023] Open
Abstract
Although ribosome-profiling and translation initiation sequencing (TI-seq) analyses have identified many noncanonical initiation codons, the precise detection of translation initiation sites (TISs) remains a challenge, mainly because of experimental artifacts of such analyses. Here, we describe a new method, TISCA (TIS detection by translation Complex Analysis), for the accurate identification of TISs. TISCA proved to be more reliable for TIS detection compared with existing tools, and it identified a substantial number of near-cognate codons in Kozak-like sequence contexts. Analysis of proteomics data revealed the presence of methionine at the NH2-terminus of most proteins derived from near-cognate initiation codons. Although eukaryotic initiation factor 2 (eIF2), eIF2A and eIF2D have previously been shown to contribute to translation initiation at near-cognate codons, we found that most noncanonical initiation events are most probably dependent on eIF2, consistent with the initial amino acid being methionine. Comprehensive identification of TISs by TISCA should facilitate characterization of the mechanism of noncanonical initiation.
Affiliation(s)
- Kazuya Ichihara
- Department of Molecular and Cellular Biology, Medical Institute of Bioregulation, Kyushu University, 3-1-1 Maidashi, Higashi-ku, Fukuoka, Fukuoka 812-8582, Japan
| | - Akinobu Matsumoto
- Department of Molecular and Cellular Biology, Medical Institute of Bioregulation, Kyushu University, 3-1-1 Maidashi, Higashi-ku, Fukuoka, Fukuoka 812-8582, Japan
| | - Hiroshi Nishida
- Department of Molecular and Cellular Bioanalysis, Graduate School of Pharmaceutical Sciences, Kyoto University, Sakyo-ku, Kyoto 606-8501, Japan
| | - Yuki Kito
- Department of Molecular and Cellular Biology, Medical Institute of Bioregulation, Kyushu University, 3-1-1 Maidashi, Higashi-ku, Fukuoka, Fukuoka 812-8582, Japan
| | - Hideyuki Shimizu
- Department of Molecular and Cellular Biology, Medical Institute of Bioregulation, Kyushu University, 3-1-1 Maidashi, Higashi-ku, Fukuoka, Fukuoka 812-8582, Japan
| | - Yuichi Shichino
- RNA Systems Biochemistry Laboratory, RIKEN Cluster for Pioneering Research, Wako, Saitama 351-0198, Japan
| | - Shintaro Iwasaki
- RNA Systems Biochemistry Laboratory, RIKEN Cluster for Pioneering Research, Wako, Saitama 351-0198, Japan.,Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba 277-8561, Japan.,AMED-CREST, Japan Agency for Medical Research and Development, Wako, Saitama 351-0198, Japan
| | - Koshi Imami
- Department of Molecular and Cellular Bioanalysis, Graduate School of Pharmaceutical Sciences, Kyoto University, Sakyo-ku, Kyoto 606-8501, Japan
| | - Yasushi Ishihama
- Department of Molecular and Cellular Bioanalysis, Graduate School of Pharmaceutical Sciences, Kyoto University, Sakyo-ku, Kyoto 606-8501, Japan
| | - Keiichi I Nakayama
- Department of Molecular and Cellular Biology, Medical Institute of Bioregulation, Kyushu University, 3-1-1 Maidashi, Higashi-ku, Fukuoka, Fukuoka 812-8582, Japan
| |
Collapse
|
37
|
Sinha T, Panigrahi C, Das D, Chandra Panda A. Circular RNA translation, a path to hidden proteome. Wiley Interdisciplinary Reviews-RNA 2021; 13:e1685. [PMID: 34342387 DOI: 10.1002/wrna.1685] [Citation(s) in RCA: 71] [Impact Index Per Article: 17.8] [Received: 05/15/2021] [Revised: 07/07/2021] [Accepted: 07/08/2021] [Indexed: 11/06/2022]
Abstract
Functional proteins in the cell are translated from the messenger RNA (mRNA) molecules, constituting less than 5% of the cellular transcriptome. The majority of the RNA molecules in the cell are noncoding RNAs, including rRNA, tRNA, snRNA, piRNA, lncRNA, microRNA, and poorly characterized circular RNAs (circRNAs). Recent studies established that circRNAs regulate gene expression by associating with RNA-binding proteins and microRNAs. With the growing understanding of circRNA functions, a subset of circRNAs has been reported to translate into proteins. Interestingly, the presence of Open Reading Frames (ORFs), N6-methyladenosine (m6A) modifications, and internal ribosomal entry sites (IRES) in the circRNA sequences indicate their coding potential through the cap-independent translation initiation mechanism. The purpose of this review is to highlight the mechanism of circRNA translation and the importance of circRNA-encoded proteins (circ-proteins) in cellular physiology and pathology. Here, we discuss the computational and molecular methods currently utilized to systematically identify translatable circRNAs and the functional characterization of the circ-proteins. We foresee that the ongoing and future studies on circRNA translation will uncover the hidden proteome and their therapeutic implications in human health. This article is categorized under: RNA Methods > RNA Analyses in Cells; Regulatory RNAs/RNAi/Riboswitches > Regulatory RNAs; Translation > Mechanisms.
Affiliation(s)
- Tanvi Sinha
  - Institute of Life Sciences, Nalco Square, Bhubaneswar, Odisha, India
- Chirag Panigrahi
  - Institute of Life Sciences, Nalco Square, Bhubaneswar, Odisha, India
- Debojyoti Das
  - Institute of Life Sciences, Nalco Square, Bhubaneswar, Odisha, India
  - School of Biotechnology, KIIT University, Bhubaneswar, Odisha, India
|
38
|
Zrimec J, Buric F, Kokina M, Garcia V, Zelezniak A. Learning the Regulatory Code of Gene Expression. Front Mol Biosci 2021; 8:673363. [PMID: 34179082 PMCID: PMC8223075 DOI: 10.3389/fmolb.2021.673363] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 02/27/2021] [Accepted: 05/24/2021] [Indexed: 11/13/2022] Open
Abstract
Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding, chromatin states as well as mRNA and protein levels. Deep neural networks automatically learn informative sequence representations and interpreting them enables us to improve our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology.
Affiliation(s)
- Jan Zrimec
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Filip Buric
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Mariia Kokina
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
  - Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby, Denmark
- Victor Garcia
  - School of Life Sciences and Facility Management, Zurich University of Applied Sciences, Wädenswil, Switzerland
- Aleksej Zelezniak
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
  - Science for Life Laboratory, Stockholm, Sweden
|
39
|
Takata A, Hamanaka K, Matsumoto N. Refinement of the clinical variant interpretation framework by statistical evidence and machine learning. Med 2021; 2:611-632.e9. [PMID: 35590234 DOI: 10.1016/j.medj.2021.02.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/27/2020] [Revised: 09/28/2020] [Accepted: 02/16/2021] [Indexed: 12/29/2022]
Abstract
BACKGROUND: Although the American College of Medical Genetics and Genomics/Association for Molecular Pathology (ACMG/AMP) guidelines for variant interpretation are used widely in clinical genetics, there is room for improvement of these knowledge-based guidelines. METHODS: Statistical assessment of average deleteriousness of start-lost, stop-lost, and in-frame insertion and deletion (indel) variants and extraction of deleterious subsets was performed, being informed by proportions of rare variants in the general population of the Genome Aggregation Database (gnomAD). A machine learning-based model scoring the pathogenicity of start-lost variants (the PoStaL model) was constructed by predicting possible translation initiation sites on transcripts by deep learning and training a random forest on known pathogenic and likely benign variants. FINDINGS: The proportion of rare variants was highest in stop-lost variants, followed by in-frame indels and start-lost variants, suggesting that the criteria in the ACMG/AMP guidelines assigning PVS (pathogenic very strong) to start-lost variants and PM (pathogenic moderate) to stop-lost and in-frame indel variants would not be appropriate. Regarding deleterious subsets, stop-lost variants introducing extensions of more than 30 amino acids and in-frame indels computationally predicted to be damaging are enriched for rare and known pathogenic variants. For start-lost variants, we developed the PoStaL model, which outperforms existing tools. We also provide comprehensive lists of the PoStaL scores for start-lost variants and the length of extended amino acids by stop-lost variants. CONCLUSIONS: Our study could contribute to refinement of the ACMG/AMP guidelines, provides resources for future investigation, and provides an example of how to improve knowledge-based frameworks by data-driven approaches.
FUNDING: The study was supported by grants from the Japan Agency for Medical Research and Development (AMED) and the Japan Society for the Promotion of Science (JSPS).
Affiliation(s)
- Atsushi Takata
  - Department of Human Genetics, Yokohama City University Graduate School of Medicine, 3-9 Fukuura, Kanazawa-ku, Yokohama, Kanagawa 236-0004, Japan
  - Laboratory for Molecular Pathology of Psychiatric Disorders, RIKEN Center for Brain Science, 2-1 Hirosawa, Wako, Saitama 351-0198, Japan
  - Laboratory for Molecular Dynamics of Mental Disorders, RIKEN Center for Brain Science, 2-1 Hirosawa, Wako, Saitama 351-0198, Japan
- Kohei Hamanaka
  - Department of Human Genetics, Yokohama City University Graduate School of Medicine, 3-9 Fukuura, Kanazawa-ku, Yokohama, Kanagawa 236-0004, Japan
- Naomichi Matsumoto
  - Department of Human Genetics, Yokohama City University Graduate School of Medicine, 3-9 Fukuura, Kanazawa-ku, Yokohama, Kanagawa 236-0004, Japan
|
40
|
Karollus A, Avsec Ž, Gagneur J. Predicting mean ribosome load for 5'UTR of any length using deep learning. PLoS Comput Biol 2021; 17:e1008982. [PMID: 33970899 PMCID: PMC8136849 DOI: 10.1371/journal.pcbi.1008982] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 07/14/2020] [Revised: 05/20/2021] [Accepted: 04/19/2021] [Indexed: 01/07/2023] Open
Abstract
The 5’ untranslated region plays a key role in regulating mRNA translation and consequently protein abundance. Therefore, accurate modeling of 5’UTR regulatory sequences shall provide insights into translational control mechanisms and help interpret genetic variants. Recently, a model was trained on a massively parallel reporter assay to predict mean ribosome load (MRL)—a proxy for translation rate—directly from 5’UTR sequence with a high degree of accuracy. However, this model is restricted to sequence lengths investigated in the reporter assay and therefore cannot be applied to the majority of human sequences without a substantial loss of information. Here, we introduced frame pooling, a novel neural network operation that enabled the development of an MRL prediction model for 5’UTRs of any length. Our model shows state-of-the-art performance on fixed length randomized sequences, while offering better generalization performance on longer sequences and on a variety of translation-related genome-wide datasets. Variant interpretation is demonstrated on a 5’UTR variant of the gene HBB associated with beta-thalassemia. Frame pooling could find applications in other bioinformatics predictive tasks. Moreover, our model, released open source, could help pinpoint pathogenic genetic variants. The human genome carries a complex code. It consists of genes, which provide blueprints to assemble proteins, and regulatory elements, which control when, where, and how often particular genes are transcribed and translated into protein. To read the genome correctly and specifically to find the causes of inherited diseases, we need to be able to find and interpret these regulatory elements. Here, we focus on particular regions of the genome, the so-called 5’ untranslated regions, which play an important role in determining how often a transcribed gene is translated into protein. 
We develop deep learning models which can quantitatively interpret regulatory elements in human 5’ untranslated regions and use this information to predict a proxy of the translation efficiency. Our model generalizes a previous model to 5’ untranslated regions of any length, just as they are encountered in natural human genes. Because this model requires only the sequence as input, it can give estimates for the impact of mutations in the sequence, even if these particular mutations are very rare or entirely novel. Such estimates could help pinpoint mutations that disrupt the normal functioning of gene regulation, which could be used to better diagnose patients suffering from rare genetic disorders.
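The idea behind frame pooling, as named in the abstract above, can be illustrated with a minimal sketch. It assumes (this is a simplification not stated in the abstract) that per-position feature values are grouped by reading frame, taken here as the distance from the 3' end of the 5'UTR modulo 3, and that each group is summarised by its maximum and its mean. The function name `frame_pool` and the toy inputs are hypothetical; this is not the authors' released code.

```python
from typing import List, Sequence


def frame_pool(features: Sequence[float]) -> List[float]:
    """Pool a variable-length feature track into a fixed-length vector.

    Positions are assigned to one of three reading frames by their
    distance from the 3' end (assumed here to abut the start codon).
    Each frame contributes its max and its mean, so the output always
    has 6 values regardless of input length.
    """
    pooled: List[float] = []
    n = len(features)
    for frame in range(3):
        # Collect values whose distance from the 3' end falls in this frame.
        values = [features[i] for i in range(n) if (n - 1 - i) % 3 == frame]
        if values:
            pooled.append(max(values))
            pooled.append(sum(values) / len(values))
        else:
            # Frames with no positions (very short inputs) are zero-filled.
            pooled.extend([0.0, 0.0])
    return pooled


# Two hypothetical feature tracks of different lengths.
short_track = [0.1, 0.9, 0.3, 0.4]
long_track = [0.1, 0.9, 0.3, 0.4, 0.8, 0.2, 0.5]
print(len(frame_pool(short_track)), len(frame_pool(long_track)))
```

Because the pooled vector has the same size for both inputs, a downstream dense layer could in principle accept 5'UTRs of any length, which is the property the abstract emphasises.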
Affiliation(s)
- Alexander Karollus
  - Department of Informatics, Technical University of Munich, Garching, Germany
- Žiga Avsec
  - Department of Informatics, Technical University of Munich, Garching, Germany
  - Graduate School of Quantitative Biosciences (QBM), Ludwig-Maximilians-Universität München, Munich, Germany
- Julien Gagneur
  - Department of Informatics, Technical University of Munich, Garching, Germany
  - Institute of Human Genetics, Technical University of Munich, Munich, Germany
  - Institute of Computational Biology, Helmholtz Zentrum München, Neuherberg, Germany
|
41
|
A machine learning-based framework for modeling transcription elongation. Proc Natl Acad Sci U S A 2021; 118:2007450118. [PMID: 33526657 DOI: 10.1073/pnas.2007450118] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 12/15/2022] Open
Abstract
RNA polymerase II (Pol II) generally pauses at certain positions along gene bodies, thereby interrupting the transcription elongation process, which is often coupled with various important biological functions, such as precursor mRNA splicing and gene expression regulation. Characterizing the transcriptional elongation dynamics can thus help us understand many essential biological processes in eukaryotic cells. However, experimentally measuring Pol II elongation rates is generally time and resource consuming. We developed PEPMAN (polymerase II elongation pausing modeling through attention-based deep neural network), a deep learning-based model that accurately predicts Pol II pausing sites based on the native elongating transcript sequencing (NET-seq) data. Through fully taking advantage of the attention mechanism, PEPMAN is able to decipher important sequence features underlying Pol II pausing. More importantly, we demonstrated that the analyses of the PEPMAN-predicted results around various types of alternative splicing sites can provide useful clues into understanding the cotranscriptional splicing events. In addition, associating the PEPMAN prediction results with different epigenetic features can help reveal important factors related to the transcription elongation process. All these results demonstrated that PEPMAN can provide a useful and effective tool for modeling transcription elongation and understanding the related biological factors from available high-throughput sequencing data.
|
42
|
Abstract
We have used the Nanopore long-read sequencing platform to demonstrate how amazingly complex the human adenovirus type 2 (Ad2) transcriptome is with a flexible splicing machinery producing a range of novel mRNAs both from the early and late transcription units. In total we report more than 900 alternatively spliced mRNAs produced from the Ad2 transcriptome whereof more than 850 are novel mRNAs. A surprising finding was that more than 50% of all E1A transcripts extended upstream of the previously defined transcriptional start site. The novel start sites mapped close to the inverted terminal repeat (ITR) and within the E1A enhancer region. We speculate that novel promoters or enhancer driven transcription, so-called eRNA transcription, is responsible for producing these novel mRNAs. Their existence was verified by a peptide in the Ad2 proteome that was unique for the E1A ITR mRNA. Although we show a high complexity of alternative splicing from most early and late regions, the E3 region was by far the most complex when expressed at late times of infection. More than 400 alternatively spliced mRNAs were observed in this region alone. These mRNAs included extended L4 mRNAs containing E3 and L5 sequences and readthrough mRNAs combining E3 and L5 sequences. Our findings demonstrate that the virus has a remarkable capacity to produce novel exon combinations, which will offer the virus an evolutionary advantage to change the gene expression repertoire and protein production in an evolving environment. IMPORTANCE: Work in the adenovirus system led to the groundbreaking discovery of RNA splicing and alternative RNA splicing in 1977. These mechanisms are essential in mammalian evolution by increasing the coding capacity of a genome. Here, we have used a long-read sequencing technology to characterize the complexity of human adenovirus pre-mRNA splicing in detail.
It is mindboggling that the viral genome, which only houses around 36,000 bp, not being much larger than a single cellular gene, generates more than 900 alternatively spliced mRNAs. Recently, adenoviruses have been used as the backbone in several promising SARS-CoV-2 vaccines. Further improvement of adenovirus-based vaccines demands that the virus can be tamed into an innocent carrier of foreign genes. This requires a full understanding of the components that govern adenovirus replication and gene expression.
|
43
|
Wei C, Zhang J, Yuan X, He Z, Liu G, Wu J. NeuroTIS: Enhancing the prediction of translation initiation sites in mRNA sequences via a hybrid dependency network and deep learning framework. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2020.106459] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 12/20/2022]
|
44
|
Du Z, Xiao X, Uversky VN. DeepA-RBPBS: A hybrid convolution and recurrent neural network combined with attention mechanism for predicting RBP binding site. J Biomol Struct Dyn 2020; 40:4250-4258. [PMID: 33272122 DOI: 10.1080/07391102.2020.1854861] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 10/22/2022]
Abstract
It's important to infer the binding site of RNA-binding proteins (RBP) for understanding the interaction between RBP and its RNA targets and deciphering the mechanisms of transcriptional regulation. However, experimental detection of RBP binding sites is still time-intensive and expensive. Algorithms based on machine learning can speed up detection of RBP binding sites. In this article, we propose a new deep learning method, DeepA-RBPBS, which can use RNA sequences and structural features to predict RBP binding sites. DeepA-RBPBS uses CNN and BiGRU to extract sequences and structural features without long-term dependence issues. It also utilizes an attention mechanism to enhance the contribution of key features. The comparison shows that the performance of DeepA-RBPBS is better than that of the state-of-the-art predictors. In the testing on 31 datasets of CLIP-seq experiments over 19 proteins, MCC (AUC) is 8% (5%) higher than those of the latest method based on deep learning, iDeepS. We also apply DeepA-RBPBS to the target RNA data of RBPs related to diabetes (LIN28, RBFOX2, FTO, IGF2BP2, CELF1 and HuR). The results show that DeepA-RBPBS correctly predicted 41,693 samples, where iDeepS predicted 31,381 samples. Communicated by Ramaswamy H. Sarma.
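For readers unfamiliar with the MCC figure quoted in the abstract above, the following minimal sketch computes the Matthews correlation coefficient from a binary confusion matrix using its standard definition. The function name `mcc` and the example counts are hypothetical and are not taken from the paper.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for binary classification.

    Ranges from -1 (total disagreement) through 0 (chance level)
    to +1 (perfect prediction).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # By convention, return 0 when any marginal count is zero.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for a binding-site classifier on a balanced test set.
print(round(mcc(tp=80, tn=70, fp=30, fn=20), 3))
```

Unlike plain accuracy, MCC uses all four cells of the confusion matrix, which is why it is often reported alongside AUC for imbalanced binding-site datasets.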
Affiliation(s)
- Zhihua Du
  - Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, P.R. China
- Xiangdong Xiao
  - Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, P.R. China
- Vladimir N Uversky
  - Department of Molecular Medicine, Morsani College of Medicine, University of South Florida, Tampa, FL, USA
  - USF Health Byrd Alzheimer's Research Institute, Morsani College of Medicine, University of South Florida, Tampa, FL, USA
  - Laboratory of New Methods in Biology, Institute for Biological Instrumentation, Russian Academy of Sciences, Moscow, Russia
|
45
|
Angenent-Mari NM, Garruss AS, Soenksen LR, Church G, Collins JJ. A deep learning approach to programmable RNA switches. Nat Commun 2020; 11:5057. [PMID: 33028812 PMCID: PMC7541447 DOI: 10.1038/s41467-020-18677-1] [Citation(s) in RCA: 73] [Impact Index Per Article: 14.6] [Received: 12/03/2019] [Accepted: 07/31/2020] [Indexed: 12/21/2022] Open
Abstract
Engineered RNA elements are programmable tools capable of detecting small molecules, proteins, and nucleic acids. Predicting the behavior of these synthetic biology components remains a challenge, a situation that could be addressed through enhanced pattern recognition from deep learning. Here, we investigate Deep Neural Networks (DNN) to predict toehold switch function as a canonical riboswitch model in synthetic biology. To facilitate DNN training, we synthesize and characterize in vivo a dataset of 91,534 toehold switches spanning 23 viral genomes and 906 human transcription factors. DNNs trained on nucleotide sequences outperform (R² = 0.43-0.70) previous state-of-the-art thermodynamic and kinetic models (R² = 0.04-0.15) and allow for human-understandable attention-visualizations (VIS4Map) to identify success and failure modes. This work shows that deep learning approaches can be used for functionality predictions and insight generation in RNA synthetic biology.
Affiliation(s)
- Nicolaas M Angenent-Mari
  - Department of Biological Engineering, Massachusetts Institute of Technology (MIT), Cambridge, MA, 02139, USA
  - Institute for Medical Engineering and Science (IMES), MIT, Cambridge, MA, 02139, USA
  - Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA, 02115, USA
- Alexander S Garruss
  - Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA, 02115, USA
  - Program in Bioinformatics and Integrative Genomics, Harvard University, Cambridge, MA, 02138, USA
  - Department of Genetics, Harvard Medical School, Boston, MA, 02115, USA
- Luis R Soenksen
  - Department of Biological Engineering, Massachusetts Institute of Technology (MIT), Cambridge, MA, 02139, USA
  - Institute for Medical Engineering and Science (IMES), MIT, Cambridge, MA, 02139, USA
  - Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA, 02115, USA
  - Department of Mechanical Engineering, MIT, Cambridge, MA, 02139, USA
- George Church
  - Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA, 02115, USA
  - Department of Genetics, Harvard Medical School, Boston, MA, 02115, USA
  - Harvard-MIT Program in Health Sciences and Technology, Cambridge, MA, 02139, USA
- James J Collins
  - Department of Biological Engineering, Massachusetts Institute of Technology (MIT), Cambridge, MA, 02139, USA
  - Institute for Medical Engineering and Science (IMES), MIT, Cambridge, MA, 02139, USA
  - Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA, 02115, USA
  - Department of Mechanical Engineering, MIT, Cambridge, MA, 02139, USA
  - Harvard-MIT Program in Health Sciences and Technology, Cambridge, MA, 02139, USA
|
46
|
Kwon MS, Lee BT, Lee SY, Kim HU. Modeling regulatory networks using machine learning for systems metabolic engineering. Curr Opin Biotechnol 2020; 65:163-170. [DOI: 10.1016/j.copbio.2020.02.014] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 01/20/2020] [Revised: 02/23/2020] [Accepted: 02/26/2020] [Indexed: 12/18/2022]
|
47
|
Goel N, Singh S, Aseri TC. Global sequence features based translation initiation site prediction in human genomic sequences. Heliyon 2020; 6:e04825. [PMID: 32964155 PMCID: PMC7490824 DOI: 10.1016/j.heliyon.2020.e04825] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 01/20/2019] [Revised: 05/25/2020] [Accepted: 08/26/2020] [Indexed: 11/26/2022] Open
Abstract
Gene prediction has been increasingly important in genome annotation due to advancements in sequencing technology. Genome annotation further helps in determining the structure and function of these genes. Translation initiation site prediction (TIS) in human genomic sequences is one of the fundamental and essential steps in gene prediction. Thus, accurate prediction of TIS in these sequences is highly desirable. Although many computational methods were developed for this problem, none of them focused on finding these sites in human genomic sequences. In this paper, a new TIS prediction method is proposed by incorporating global sequence based features. Support vector machine is used to assess the prediction power of these features. The proposed method achieved accuracy of above 90% when tested for genomic as well as cDNA sequences. The experimental results indicate that the method works well for both genomic and cDNA sequences. The method can be integrated into gene prediction system in future.
Affiliation(s)
- Neelam Goel
  - Department of Information Technology, University Institute of Engineering and Technology, Sector-25, Panjab University, Chandigarh 160014, India
- Shailendra Singh
  - Department of Computer Science and Engineering, Punjab Engineering College (Deemed to be University), Sector-12, Chandigarh 160012, India
- Trilok Chand Aseri
  - Department of Computer Science and Engineering, Punjab Engineering College (Deemed to be University), Sector-12, Chandigarh 160012, India
|
48
|
Computational discovery and modeling of novel gene expression rules encoded in the mRNA. Biochem Soc Trans 2020; 48:1519-1528. [PMID: 32662820 DOI: 10.1042/bst20191048] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 03/18/2020] [Revised: 06/15/2020] [Accepted: 06/17/2020] [Indexed: 11/17/2022]
Abstract
The transcript is populated with numerous overlapping codes that regulate all steps of gene expression. Deciphering these codes is very challenging due to the large number of variables involved, the non-modular nature of the codes, biases and limitations in current experimental approaches, our limited knowledge in gene expression regulation across the tree of life, and other factors. In recent years, it has been shown that computational modeling and algorithms can significantly accelerate the discovery of novel gene expression codes. Here, we briefly summarize the latest developments and different approaches in the field.
|
49
|
|
50
|
Gupta A, Bansal M. RNA-mediated translation regulation in viral genomes: computational advances in the recognition of sequences and structures. Brief Bioinform 2020; 21:1151-1163. [PMID: 31204430 PMCID: PMC7109810 DOI: 10.1093/bib/bbz054] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 10/25/2018] [Revised: 03/24/2019] [Accepted: 04/15/2019] [Indexed: 12/30/2022] Open
Abstract
RNA structures are widely distributed across all life forms. The global conformation of these structures is defined by a variety of constituent structural units such as helices, hairpin loops, kissing-loop motifs and pseudoknots, which often behave in a modular way. Their ubiquitous distribution is associated with a variety of functions in biological processes. The location of these structures in the genomes of RNA viruses is often coordinated with specific processes in the viral life cycle, where the presence of the structure acts as a checkpoint for deciding the eventual fate of the process. These structures have been found to adopt complex conformations and exert their effects by interacting with ribosomes, multiple host translation factors and small RNA molecules like miRNA. A number of such RNA structures have also been shown to regulate translation in viruses at the level of initiation, elongation or termination. The role of various computational studies in the preliminary identification of such sequences and/or structures and subsequent functional analysis has not been fully appreciated. This review aims to summarize the processes in which viral RNA structures have been found to play an active role in translational regulation, their global conformational features and the bioinformatics/computational tools available for the identification and prediction of these structures.
Affiliation(s)
- Asmita Gupta
  - Molecular Biophysics Unit, Indian Institute of Science, Bangalore, India
- Manju Bansal
  - Molecular Biophysics Unit, Indian Institute of Science, Bangalore, India
|