1
|
Mozaffari S, Arrías PN, Clementel D, Piovesan D, Ferrari C, Tosatto SCE, Monzon AM. STRPsearch: fast detection of structured tandem repeat proteins. Bioinformatics 2024; 40:btae690. [PMID: 39558588 PMCID: PMC11645253 DOI: 10.1093/bioinformatics/btae690] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2024] [Revised: 09/02/2024] [Accepted: 11/14/2024] [Indexed: 11/20/2024] Open
Abstract
MOTIVATION Structured Tandem Repeats Proteins (STRPs) constitute a subclass of tandem repeats characterized by repetitive structural motifs. These proteins exhibit distinct secondary structures that form repetitive tertiary arrangements, often resulting in large molecular assemblies. Despite highly variable sequences, STRPs can perform important and diverse biological functions, maintaining a consistent structure with a variable number of repeat units. With the advent of protein structure prediction methods, millions of 3D models of proteins are now publicly available. However, automatic detection of STRPs remains challenging with current state-of-the-art tools due to their lack of accuracy and long execution times, hindering their application on large datasets. In most cases, manual curation remains the most accurate method for detecting and classifying STRPs, making it impracticable to annotate millions of structures. RESULTS We introduce STRPsearch, a novel tool for the rapid identification, classification, and mapping of STRPs. Leveraging manually curated entries from RepeatsDB as the known conformational space of STRPs, STRPsearch uses the latest advances in structural alignment for a fast and accurate detection of repeated structural motifs in proteins, followed by an innovative approach to map units and insertions through the generation of TM-score profiles. STRPsearch is highly scalable, efficiently processing large datasets, and can be applied to both experimental structures and predicted models. In addition, it demonstrates superior performance compared to existing tools, offering researchers a reliable and comprehensive solution for STRP analysis across diverse proteomes. AVAILABILITY AND IMPLEMENTATION STRPsearch is coded in Python. All scripts and associated documentation are available from: https://github.com/BioComputingUP/STRPsearch.
Collapse
Affiliation(s)
- Soroush Mozaffari
- Department of Biomedical Sciences, University of Padova, Padova 35121, Italy
| | - Paula Nazarena Arrías
- Department of Biomedical Sciences, University of Padova, Padova 35121, Italy
- Department of Protein Science, KTH Royal Institute of Technology, Stockholm SE-10691, Sweden
| | - Damiano Clementel
- Department of Biomedical Sciences, University of Padova, Padova 35121, Italy
| | - Damiano Piovesan
- Department of Biomedical Sciences, University of Padova, Padova 35121, Italy
| | - Carlo Ferrari
- Department of Information Engineering, University of Padua, Padova 35121, Italy
| | - Silvio C E Tosatto
- Department of Biomedical Sciences, University of Padova, Padova 35121, Italy
- Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (CNR-IBIOM), Bari 70126, Italy
| | | |
Collapse
|
2
|
English AC, Dolzhenko E, Ziaei Jam H, McKenzie SK, Olson ND, De Coster W, Park J, Gu B, Wagner J, Eberle MA, Gymrek M, Chaisson MJP, Zook JM, Sedlazeck FJ. Analysis and benchmarking of small and large genomic variants across tandem repeats. Nat Biotechnol 2024:10.1038/s41587-024-02225-z. [PMID: 38671154 DOI: 10.1038/s41587-024-02225-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2023] [Accepted: 03/28/2024] [Indexed: 04/28/2024]
Abstract
Tandem repeats (TRs) are highly polymorphic in the human genome, have thousands of associated molecular traits and are linked to over 60 disease phenotypes. However, they are often excluded from at-scale studies because of challenges with variant calling and representation, as well as a lack of a genome-wide standard. Here, to promote the development of TR methods, we created a catalog of TR regions and explored TR properties across 86 haplotype-resolved long-read human assemblies. We curated variants from the Genome in a Bottle (GIAB) HG002 individual to create a TR dataset to benchmark existing and future TR analysis methods. We also present an improved variant comparison method that handles variants greater than 4 bp in length and varying allelic representation. The 8.1% of the genome covered by the TR catalog holds ~24.9% of variants per individual, including 124,728 small and 17,988 large variants for the GIAB HG002 'truth-set' TR benchmark. We demonstrate the utility of this pipeline across short-read and long-read technologies.
Collapse
Affiliation(s)
- Adam C English
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA.
| | | | - Helyaneh Ziaei Jam
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, USA
| | | | - Nathan D Olson
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Wouter De Coster
- Applied and Translational Neurogenomics Group, VIB Center for Molecular Neurology, VIB, Antwerp, Belgium
- Applied and Translational Neurogenomics Group, Department of Biomedical Sciences, University of Antwerp, Antwerp, Belgium
| | - Jonghun Park
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, USA
| | - Bida Gu
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| | - Justin Wagner
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | | | - Melissa Gymrek
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, USA
- Department of Medicine, University of California, San Diego, La Jolla, CA, USA
| | - Mark J P Chaisson
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| | - Justin M Zook
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA.
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA.
- Department of Computer Science, Rice University, Houston, TX, USA.
| |
Collapse
|
3
|
Verbiest MA, Lundström O, Xia F, Baudis M, Bilgin Sonay T, Anisimova M. Short tandem repeat mutations regulate gene expression in colorectal cancer. Sci Rep 2024; 14:3331. [PMID: 38336885 PMCID: PMC10858039 DOI: 10.1038/s41598-024-53739-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2023] [Accepted: 02/04/2024] [Indexed: 02/12/2024] Open
Abstract
Short tandem repeat (STR) mutations are prevalent in colorectal cancer (CRC), especially in tumours with the microsatellite instability (MSI) phenotype. While STR length variations are known to regulate gene expression under physiological conditions, the functional impact of STR mutations in CRC remains unclear. Here, we integrate STR mutation data with clinical information and gene expression data to study the gene regulatory effects of STR mutations in CRC. We confirm that STR mutability in CRC highly depends on the MSI status, repeat unit size, and repeat length. Furthermore, we present a set of 1244 putative expression STRs (eSTRs) for which the STR length is associated with gene expression levels in CRC tumours. The length of 73 eSTRs is associated with expression levels of cancer-related genes, nine of which are CRC-specific genes. We show that linear models describing eSTR-gene expression relationships allow for predictions of gene expression changes in response to eSTR mutations. Moreover, we found an increased mutability of eSTRs in MSI tumours. Our evidence of gene regulatory roles for eSTRs in CRC highlights a mostly overlooked way through which tumours may modulate their phenotypes. Future extensions of these findings could uncover new STR-based targets in the treatment of cancer.
Collapse
Affiliation(s)
- Max A Verbiest
- Institute of Computational Life Sciences, Zurich University of Applied Sciences, Wädenswil, Switzerland.
- Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland.
- Swiss Institute of Bioinformatics, Lausanne, Switzerland.
| | - Oxana Lundström
- Institute of Computational Life Sciences, Zurich University of Applied Sciences, Wädenswil, Switzerland
- Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
| | - Feifei Xia
- Institute of Computational Life Sciences, Zurich University of Applied Sciences, Wädenswil, Switzerland
- Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Michael Baudis
- Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Tugce Bilgin Sonay
- Institute of Computational Life Sciences, Zurich University of Applied Sciences, Wädenswil, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Institute of Ecology, Evolution and Environmental Biology, Columbia University, New York, USA
| | - Maria Anisimova
- Institute of Computational Life Sciences, Zurich University of Applied Sciences, Wädenswil, Switzerland.
- Swiss Institute of Bioinformatics, Lausanne, Switzerland.
| |
Collapse
|
4
|
Panoyan MA, Wendt FR. The role of tandem repeat expansions in brain disorders. Emerg Top Life Sci 2023; 7:249-263. [PMID: 37401564 DOI: 10.1042/etls20230022] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/14/2023] [Revised: 06/05/2023] [Accepted: 06/19/2023] [Indexed: 07/05/2023]
Abstract
The human genome contains numerous genetic polymorphisms contributing to different health and disease outcomes. Tandem repeat (TR) loci are highly polymorphic yet under-investigated in large genomic studies, which has prompted research efforts to identify novel variations and gain a deeper understanding of their role in human biology and disease outcomes. We summarize the current understanding of TRs and their implications for human health and disease, including an overview of the challenges encountered when conducting TR analyses and potential solutions to overcome these challenges. By shedding light on these issues, this article aims to contribute to a better understanding of the impact of TRs on the development of new disease treatments.
Collapse
Affiliation(s)
- Mary Anne Panoyan
- Department of Anthropology, University of Toronto, Mississauga, ON, Canada
| | - Frank R Wendt
- Department of Anthropology, University of Toronto, Mississauga, ON, Canada
- Biostatistics Division, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
- Forensic Science Program, University of Toronto, Mississauga, ON, Canada
| |
Collapse
|
5
|
Lundström OS, Adriaan Verbiest M, Xia F, Jam HZ, Zlobec I, Anisimova M, Gymrek M. WebSTR: A Population-wide Database of Short Tandem Repeat Variation in Humans. J Mol Biol 2023; 435:168260. [PMID: 37678708 DOI: 10.1016/j.jmb.2023.168260] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2023] [Revised: 08/29/2023] [Accepted: 08/29/2023] [Indexed: 09/09/2023]
Abstract
Short tandem repeats (STRs) are consecutive repetitions of one to six nucleotide motifs. They are hypervariable due to the high prevalence of repeat unit insertions or deletions primarily caused by polymerase slippage during replication. Genetic variation at STRs has been shown to influence a range of traits in humans, including gene expression, cancer risk, and autism. Until recently STRs have been poorly studied since they pose significant challenges to bioinformatics analyses. Moreover, genome-wide analysis of STR variation in population-scale cohorts requires large amounts of data and computational resources. However, the recent advent of genome-wide analysis tools has resulted in multiple large genome-wide datasets of STR variation spanning nearly two million genomic loci in thousands of individuals from diverse populations. Here we present WebSTR, a database of genetic variation and other characteristics of genome-wide STRs across human populations. WebSTR is based on reference panels of more than 1.7 million human STRs created with state of the art repeat annotation methods and can easily be extended to include additional cohorts or species. It currently contains data based on STR genotypes for individuals from the 1000 Genomes Project, H3Africa, the Genotype-Tissue Expression (GTEx) Project and colorectal cancer patients from the TCGA dataset. WebSTR is implemented as a relational database with programmatic access available through an API and a web portal for browsing data. The web portal is publicly available at https://webstr.ucsd.edu.
Collapse
Affiliation(s)
- Oxana Sachenkova Lundström
- Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden; Vildly AB, Kalmar, Sweden; Institute of Computational Life Sciences, School of Life Sciences and Facility Management, Zürich University of Applied Sciences (ZHAW), Waedenswil, Switzerland. https://twitter.com/merenlin
| | - Max Adriaan Verbiest
- Institute of Computational Life Sciences, School of Life Sciences and Facility Management, Zürich University of Applied Sciences (ZHAW), Waedenswil, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland; Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
| | - Feifei Xia
- Institute of Computational Life Sciences, School of Life Sciences and Facility Management, Zürich University of Applied Sciences (ZHAW), Waedenswil, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland; Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland. https://twitter.com/Feifeix97
| | - Helyaneh Ziaei Jam
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
| | - Inti Zlobec
- Institute of Tissue Medicine and Pathology, University of Bern, Switzerland
| | - Maria Anisimova
- Institute of Computational Life Sciences, School of Life Sciences and Facility Management, Zürich University of Applied Sciences (ZHAW), Waedenswil, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland.
| | - Melissa Gymrek
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA; Department of Medicine, University of California San Diego, La Jolla, CA, USA.
| |
Collapse
|
6
|
Orlov YL, Orlova NG. Bioinformatics tools for the sequence complexity estimates. Biophys Rev 2023; 15:1367-1378. [PMID: 37974990 PMCID: PMC10643780 DOI: 10.1007/s12551-023-01140-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2023] [Accepted: 09/01/2023] [Indexed: 11/19/2023] Open
Abstract
We review current methods and bioinformatics tools for the text complexity estimates (information and entropy measures). The search DNA regions with extreme statistical characteristics such as low complexity regions are important for biophysical models of chromosome function and gene transcription regulation in genome scale. We discuss the complexity profiling for segmentation and delineation of genome sequences, search for genome repeats and transposable elements, and applications to next-generation sequencing reads. We review the complexity methods and new applications fields: analysis of mutation hotspots loci, analysis of short sequencing reads with quality control, and alignment-free genome comparisons. The algorithms implementing various numerical measures of text complexity estimates including combinatorial and linguistic measures have been developed before genome sequencing era. The series of tools to estimate sequence complexity use compression approaches, mainly by modification of Lempel-Ziv compression. Most of the tools are available online providing large-scale service for whole genome analysis. Novel machine learning applications for classification of complete genome sequences also include sequence compression and complexity algorithms. We present comparison of the complexity methods on the different sequence sets, the applications for gene transcription regulatory regions analysis. Furthermore, we discuss approaches and application of sequence complexity for proteins. The complexity measures for amino acid sequences could be calculated by the same entropy and compression-based algorithms. But the functional and evolutionary roles of low complexity regions in protein have specific features differing from DNA. The tools for protein sequence complexity aimed for protein structural constraints. It was shown that low complexity regions in protein sequences are conservative in evolution and have important biological and structural functions. Finally, we summarize recent findings in large scale genome complexity comparison and applications for coronavirus genome analysis.
Collapse
Affiliation(s)
- Yuriy L. Orlov
- The Digital Health Institute, I.M. Sechenov First Moscow State Medical University of the Russian Ministry of Health (Sechenov University), Moscow, 119991 Russia
- Institute of Cytology and Genetics SB RAS, 630090 Novosibirsk, Russia
- Agrarian and Technological Institute, Peoples’ Friendship University of Russia, 117198 Moscow, Russia
| | - Nina G. Orlova
- Department of Mathematics, Financial University under the Government of the Russian Federation, Moscow, 125167 Russia
| |
Collapse
|
7
|
Apsley AT, Domico ER, Verbiest MA, Brogan CA, Buck ER, Burich AJ, Cardone KM, Stone WJ, Anisimova M, Vandenbergh DJ. A novel hypervariable variable number tandem repeat in the dopamine transporter gene ( SLC6A3). Life Sci Alliance 2023; 6:e202201677. [PMID: 36754567 PMCID: PMC9909461 DOI: 10.26508/lsa.202201677] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2022] [Revised: 01/25/2023] [Accepted: 01/26/2023] [Indexed: 02/10/2023] Open
Abstract
The dopamine transporter gene, SLC6A3, has received substantial attention in genetic association studies of various phenotypes. Although some variable number tandem repeats (VNTRs) present in SLC6A3 have been tested in genetic association studies, results have not been consistent. VNTRs in SLC6A3 that have not been examined genetically were characterized. The Tandem Repeat Annotation Library was used to characterize the VNTRs of 64 unrelated long-read haplotype-phased SLC6A3 sequences. Sequence similarity of each repeat unit of the five VNTRs is reported, along with the correlations of SNP-SNP, SNP-VNTR, and VNTR-VNTR alleles across the gene. One of these VNTRs is a novel hyper-VNTR (hyVNTR) in intron 8 of SLC6A3, which contains a range of 3.4-133.4 repeat copies and has a consensus sequence length of 38 bp, with 82% G+C content. The 38-base repeat was predicted to form G-quadruplexes in silico and was confirmed by circular dichroism spectroscopy. In addition, this hyVNTR contains multiple putative binding sites for PRDM9, which, in combination with low levels of linkage disequilibrium around the hyVNTR, suggests it might be a recombination hotspot.
Collapse
Affiliation(s)
- Abner T Apsley
- Department of Biobehavioral Health, The Pennsylvania State University, State College, PA, USA
- The Molecular, Cellular and Integrative Biosciences Program, The Pennsylvania State University, State College, PA, USA
| | - Emma R Domico
- Department of Biobehavioral Health, The Pennsylvania State University, State College, PA, USA
| | - Max A Verbiest
- Institute of Computational Life Science, School of Life Sciences and Facility Management, Zürich University of Applied Sciences, Wädenswil, Switzerland
- Department of Molecular Life Sciences, Faculty of Science, University of Zurich, Zurich, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Carly A Brogan
- Department of Biobehavioral Health, The Pennsylvania State University, State College, PA, USA
| | - Evan R Buck
- Department of Biobehavioral Health, The Pennsylvania State University, State College, PA, USA
| | - Andrew J Burich
- Department of Information Science and Technologies - Applied Data Sciences, The Pennsylvania State University, State College, PA, USA
| | - Kathleen M Cardone
- Department of Biobehavioral Health, The Pennsylvania State University, State College, PA, USA
| | - Wesley J Stone
- Department of Biobehavioral Health, The Pennsylvania State University, State College, PA, USA
| | - Maria Anisimova
- Institute of Computational Life Science, School of Life Sciences and Facility Management, Zürich University of Applied Sciences, Wädenswil, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - David J Vandenbergh
- Department of Biobehavioral Health, The Pennsylvania State University, State College, PA, USA
- The Molecular, Cellular and Integrative Biosciences Program, The Pennsylvania State University, State College, PA, USA
- Institute of the Neurosciences, The Pennsylvania State University, State College, PA, USA
- The Bioinformatics and Genomics Program, The Pennsylvania State University, State College, PA, USA
| |
Collapse
|