1
Novakovsky G, Saraswat M, Fornes O, Mostafavi S, Wasserman WW. Biologically relevant transfer learning improves transcription factor binding prediction. Genome Biol 2021; 22:280. [PMID: 34579793 PMCID: PMC8474956 DOI: 10.1186/s13059-021-02499-5] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 12/21/2020] [Accepted: 09/15/2021] [Indexed: 12/27/2022] [Open Access]
Abstract
BACKGROUND: Deep learning has proven to be a powerful technique for transcription factor (TF) binding prediction but requires large training datasets. Transfer learning can reduce the amount of data required for deep learning, while improving overall model performance, compared to training a separate model for each new task.
RESULTS: We assess a transfer learning strategy for TF binding prediction consisting of a pre-training step, wherein we train a multi-task model with multiple TFs, and a fine-tuning step, wherein we initialize single-task models for individual TFs with the weights learned by the multi-task model, after which the single-task models are trained at a lower learning rate. We corroborate that transfer learning improves model performance, especially if in the pre-training step the multi-task model is trained with biologically relevant TFs. We show the effectiveness of transfer learning for TFs with ~500 ChIP-seq peak regions. Using model interpretation techniques, we demonstrate that the features learned in the pre-training step are refined in the fine-tuning step to resemble the binding motif of the target TF (i.e., the recipient of transfer learning in the fine-tuning step). Moreover, pre-training with biologically relevant TFs allows single-task models in the fine-tuning step to learn useful features other than the motif of the target TF.
CONCLUSIONS: Our results confirm that transfer learning is a powerful technique for TF binding prediction.
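The two-step strategy described in this abstract (joint pre-training on several TFs, then single-task fine-tuning from the pre-trained weights at a lower learning rate) can be illustrated with a deliberately tiny toy. This is a minimal sketch under loud assumptions, not the authors' model: a single logistic unit on base-composition features stands in for the deep network, the "multi-task" part is reduced to shared weights with one bias per TF, and all sequences, TF names, learning rates, and epoch counts are hypothetical.

```python
import math

def features(seq):
    """Fraction of each base in seq (a crude stand-in for learned filters)."""
    return [seq.count(b) / len(seq) for b in "ACGT"]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, data):
    """Mean binary cross-entropy of sigmoid(w.x + b) over (sequence, label) pairs."""
    total = 0.0
    for seq, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, features(seq))) + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)

def train(w, biases, tasks, lr, epochs):
    """Full-batch gradient descent; w is shared, biases[t] belongs to task t."""
    w, biases = list(w), dict(biases)
    n = sum(len(d) for d in tasks.values())
    for _ in range(epochs):
        gw = [0.0] * len(w)
        gb = {t: 0.0 for t in tasks}
        for t, data in tasks.items():
            for seq, y in data:
                x = features(seq)
                err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + biases[t]) - y
                for i, xi in enumerate(x):
                    gw[i] += err * xi / n
                gb[t] += err / n
        w = [wi - lr * gi for wi, gi in zip(w, gw)]
        biases = {t: biases[t] - lr * gb[t] for t in tasks}
    return w, biases

# Hypothetical toy data: label 1 = bound, 0 = unbound.
pretrain_tasks = {
    "TF_A": [("GGGCGGGC", 1), ("CCGCGGCC", 1), ("ATATTTAA", 0), ("TTAATATA", 0)],
    "TF_B": [("GCGCGCGG", 1), ("AATTAATT", 0)],
}
target = {"TF_target": [("CGGCCGCG", 1), ("TATAAATT", 0)]}

# Step 1: pre-train shared weights jointly on several TFs.
w_pre, b_pre = train([0.0] * 4, {t: 0.0 for t in pretrain_tasks},
                     pretrain_tasks, lr=0.5, epochs=200)

# Step 2: fine-tune a single-task model that starts from the pre-trained
# shared weights (fresh bias), using a lower learning rate.
w_ft, b_ft = train(w_pre, {"TF_target": 0.0}, target, lr=0.05, epochs=50)
```

In this toy, the pre-trained weights already separate the target TF's examples before any fine-tuning, and the low-rate fine-tuning step only nudges them further, which mirrors the idea that pre-training on related TFs supplies a useful starting point.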
Affiliation(s)
- Gherman Novakovsky
  - Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, Vancouver, BC, V5Z 4H4, Canada
  - Department of Medical Genetics, University of British Columbia, Vancouver, BC, V6H 3N1, Canada
- Manu Saraswat
  - Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, Vancouver, BC, V5Z 4H4, Canada
  - Department of Medical Genetics, University of British Columbia, Vancouver, BC, V6H 3N1, Canada
- Oriol Fornes
  - Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, Vancouver, BC, V5Z 4H4, Canada
  - Department of Medical Genetics, University of British Columbia, Vancouver, BC, V6H 3N1, Canada
- Sara Mostafavi
  - Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, Vancouver, BC, V5Z 4H4, Canada
  - Department of Medical Genetics, University of British Columbia, Vancouver, BC, V6H 3N1, Canada
  - Department of Statistics, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
  - Canadian Institute for Advanced Research, CIFAR AI Chair, and Child and Brain Development, Toronto, ON, M5G 1M1, Canada
- Wyeth W Wasserman
  - Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, Vancouver, BC, V5Z 4H4, Canada
  - Department of Medical Genetics, University of British Columbia, Vancouver, BC, V6H 3N1, Canada
2
Pranckeviciene E, Hosid S, Liang N, Ioshikhes I. Nucleosome positioning sequence patterns as packing or regulatory. PLoS Comput Biol 2020; 16:e1007365. [PMID: 31986131 PMCID: PMC7004410 DOI: 10.1371/journal.pcbi.1007365] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 08/28/2019] [Revised: 02/06/2020] [Accepted: 12/06/2019] [Indexed: 11/19/2022] [Open Access]
Abstract
Nucleosome positioning DNA sequence patterns (NPS), usually distributions of particular dinucleotides or other sequence elements in nucleosomal DNA, at least partially determine chromatin structure and the arrangement of nucleosomes, which in turn affect gene expression. Statistically, NPS are defined as oscillations of dinucleotide occurrence with a periodicity of about 10 base pairs (bp), which reflects the double helix period. We compared the nucleosomal DNA patterns in mouse, human, and yeast and observed a few distinctive patterns that can be termed packing and regulatory, referring to distinct modes of chromatin function. For the first time, the NPS patterns in nucleus accumbens cells (NAC) of the mouse brain were characterized and compared with the patterns in human CD4+ and apoptotic lymphocyte cells and the well-studied patterns in yeast. The NPS patterns in human CD4+ cells and mouse brain cells had a very high positive correlation. However, there was no correlation between them and the patterns in human apoptotic lymphocyte cells and yeast, although the latter two were highly correlated with each other. By their dinucleotide arrangements, the analyzed NPS patterns were classified into stable canonical WW/SS (W = A or T and S = C or G) and less stable RR/YY (R = A or G and Y = C or T) patterns and anti-patterns. In the anti-patterns, the positioning of the dinucleotides is flipped compared with that in the regular patterns. Stable canonical WW/SS patterns and anti-patterns are ubiquitously observed in many organisms, and they closely resembled each other in yeast and human apoptotic cells. Less stable RR/YY patterns had a higher positive correlation between mouse and normal human cells. Our analysis and evidence from the scientific literature lead to the idea that the various distinct patterns in nucleosomal DNA can be related to the two roles of chromatin: packing (WW/SS) and regulatory (RR/YY and "anti").
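The dinucleotide classes named in this abstract follow directly from the base definitions it gives (W = A or T, S = C or G, R = A or G, Y = C or T). As a minimal sketch, assuming a set of equal-length, pre-aligned nucleosomal sequences, the snippet below classifies a dinucleotide and computes a positional occurrence profile for one class. The function names and the toy sequences are hypothetical and are not taken from the paper.

```python
# Base groups as defined in the abstract.
GROUPS = {
    "WW": set("AT"),  # weak: A or T
    "SS": set("CG"),  # strong: C or G
    "RR": set("AG"),  # purines
    "YY": set("CT"),  # pyrimidines
}

def dinucleotide_classes(dinuc):
    """Return the set of class names (WW, SS, RR, YY) a dinucleotide belongs to."""
    return {name for name, bases in GROUPS.items() if set(dinuc) <= bases}

def class_profile(aligned_seqs, name):
    """Fraction of sequences with a dinucleotide of class `name` starting at each position."""
    length = min(len(s) for s in aligned_seqs)
    profile = []
    for i in range(length - 1):
        hits = sum(name in dinucleotide_classes(s[i:i + 2]) for s in aligned_seqs)
        profile.append(hits / len(aligned_seqs))
    return profile

# Toy usage: two short aligned sequences (hypothetical).
ww_profile = class_profile(["AATC", "GATT"], "WW")
```

Note that the classes overlap: AA, for example, is both WW and RR, so a profile for one class is not independent of the others.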
Affiliation(s)
- Erinija Pranckeviciene
  - Department of Biochemistry, Microbiology and Immunology, Faculty of Medicine, University of Ottawa, Ottawa, Ontario, Canada
  - Department of Human and Medical Genetics, Biomedical Science Institute, Faculty of Medicine, Vilnius University, Vilnius, Lithuania
  - * E-mail: (EP); (II)
- Sergey Hosid
  - Department of Biochemistry, Microbiology and Immunology, Faculty of Medicine, University of Ottawa, Ottawa, Ontario, Canada
- Nathan Liang
  - Department of Biochemistry, Microbiology and Immunology, Faculty of Medicine, University of Ottawa, Ottawa, Ontario, Canada
- Ilya Ioshikhes
  - Department of Biochemistry, Microbiology and Immunology, Faculty of Medicine, University of Ottawa, Ottawa, Ontario, Canada
  - Ottawa Institute of Systems Biology (OISB), Ottawa, Ontario, Canada
  - * E-mail: (EP); (II)
3
Hosid S, Ioshikhes I. Apoptotic lymphocytes of H. sapiens lose nucleosomes in GC-rich promoters. PLoS Comput Biol 2014; 10:e1003760. [PMID: 25077608 PMCID: PMC4117428 DOI: 10.1371/journal.pcbi.1003760] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 01/23/2014] [Accepted: 06/16/2014] [Indexed: 11/18/2022] [Open Access]
Abstract
We analyzed two sets of human CD4+ nucleosomal DNA sequenced directly by the Illumina (Solexa) high-throughput sequencing method. The first set has ∼40 M sequences and was produced from normal CD4+ T lymphocytes by micrococcal nuclease. The second set has ∼44 M sequences and was obtained from peripheral blood lymphocytes by apoptotic nucleases. The two nucleosome sets showed similar AA/TT, GG/CC, and RR/YY (R is purine, Y is pyrimidine) dinucleotide positioning patterns with periods of 10–10.4 bp. Peaks of the GG/CC and AA/TT patterns were shifted by 5 bp from each other. Two types of promoters were identified in H. sapiens: AT-rich and GC-rich. AT-rich promoters in apoptotic cells had +1 nucleosomes shifted 50–60 bp downstream from those in normal lymphocytes. GC-rich promoters in apoptotic cells lost 80% of nucleosomes around transcription start sites as well as in total DNA. Nucleosome positioning was predicted by combinations of the {AA, TT}, {GG, CC}, {WW, SS} and {RR, YY} patterns. We found that the combination of {AA, TT} and {GG, CC} provides the best results and successfully mapped 33% of nucleosomes 147 bp long with a precision of ±15 bp (only 31/147, or 21%, is expected).
We analyzed nucleosomal DNA of normal and apoptotic human CD4+ T lymphocytes. The dinucleotide positioning patterns of AA/TT, GG/CC, WW/SS (W is adenine or thymine, S is guanine or cytosine) and RR/YY (R is purine, Y is pyrimidine) in nucleosome sequences are similar in both cell conditions and have a period of 10–10.4 bp. We successfully mapped 33% of nucleosomes with a precision of ±15 bp by a combination of the {AA, TT}, {GG, CC}, {WW, SS} and {RR, YY} patterns. We identified two types of promoters in H. sapiens: AT-rich and GC-rich. AT-rich promoters keep nucleosomes around the transcription start site, whereas GC-rich promoters lose 80% of nucleosomes in the same region during apoptosis.
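The chance baseline quoted in this abstract (31/147, about 21%) can be reproduced with simple arithmetic. The sketch below assumes, as one plausible reading of the abstract, that the 31 comes from the 2 × 15 + 1 positions that lie within ±15 bp of the true location, out of the 147 positions spanned by one nucleosome. The function name is hypothetical.

```python
def chance_hit_rate(precision_bp, nucleosome_bp=147):
    """Fraction of positions within +/- precision_bp of the true one.

    Assumes a uniformly random guess over nucleosome_bp candidate positions;
    this is an interpretation of the abstract, not a stated derivation.
    """
    window = 2 * precision_bp + 1  # positions counted as a hit
    return window / nucleosome_bp

expected = chance_hit_rate(15)   # 31 / 147
observed = 0.33                  # mapped fraction reported in the abstract
enrichment = observed / expected # how far above chance the mapping is
```

Under this reading, the reported 33% mapping rate is roughly one and a half times the chance expectation.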
Affiliation(s)
- Sergey Hosid
  - Department of Biochemistry, Microbiology and Immunology, University of Ottawa, Ottawa, Ontario, Canada
  - Ottawa Institute of Systems Biology, University of Ottawa, Ottawa, Ontario, Canada
- Ilya Ioshikhes
  - Department of Biochemistry, Microbiology and Immunology, University of Ottawa, Ottawa, Ontario, Canada
  - Ottawa Institute of Systems Biology, University of Ottawa, Ottawa, Ontario, Canada
4
Salih B, Tripathi V, Trifonov EN. Visible periodicity of strong nucleosome DNA sequences. J Biomol Struct Dyn 2013; 33:1-9. [PMID: 24266748 DOI: 10.1080/07391102.2013.855143] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Indexed: 10/26/2022]
Abstract
Fifteen years ago, Lowary and Widom assembled nucleosomes on synthetic random sequence DNA molecules, selected the strongest nucleosomes and discovered that the TA dinucleotides in these strong nucleosome sequences often appear at 10–11 bases from one another or at distances which are multiples of this period. We repeated this experiment computationally, on large ensembles of natural genomic sequences, by selecting the strongest nucleosomes, i.e., those in which the distances between like-named dinucleotides are multiples of 10.4 bases, the structural and sequence period of nucleosome DNA. The analysis confirmed the periodicity of TA dinucleotides in the strong nucleosomes, and also revealed other periodic sequence elements, notably the classical AA and TT dinucleotides. The matrices of DNA bendability and their simple linear forms (nucleosome positioning motifs) are calculated from the strong nucleosome DNA sequences. The motifs are in full accord with nucleosome positioning sequences derived earlier, thus confirming that the new technique does detect strong nucleosomes. Species- and isochore-specific variations of the matrices and of the positioning motifs are demonstrated. The strong nucleosome DNA sequences manifest the strongest nucleosome positioning sequence signals observed so far, showing the dinucleotide periodicities in directly observable rather than hidden form.
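The selection criterion in this abstract rests on distances between like-named dinucleotides being close to multiples of 10.4 bases. As a minimal sketch, and not the authors' scoring procedure, the snippet below tabulates pairwise distances between occurrences of one dinucleotide and measures how far a distance is from the nearest multiple of the period. The function names, the `max_distance` cutoff, and the toy sequence are assumptions made for illustration.

```python
from collections import Counter

def positions(seq, dinuc):
    """0-based start positions of every occurrence of dinuc in seq."""
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == dinuc]

def distance_counts(seq, dinuc, max_distance=60):
    """Count pairwise distances (up to max_distance) between occurrences of dinuc."""
    pos = positions(seq, dinuc)
    counts = Counter()
    for a in range(len(pos)):
        for b in range(a + 1, len(pos)):
            d = pos[b] - pos[a]
            if d <= max_distance:
                counts[d] += 1
    return counts

def offset_from_period(distance, period=10.4):
    """Absolute deviation of distance from the nearest multiple of period."""
    k = round(distance / period)
    return abs(distance - k * period)

# Toy sequence (hypothetical): TA steps at positions 0, 10 and 21.
toy = "TA" + "G" * 8 + "TA" + "G" * 9 + "TA"
ta_distances = distance_counts(toy, "TA")
```

Because 10.4 is not an integer, real distances alternate between 10 and 11 bases, which is consistent with the 10–11 base spacing the abstract attributes to Lowary and Widom.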
Affiliation(s)
- Bilal Salih
  - Genome Diversity Center, Institute of Evolution, University of Haifa, Mount Carmel, Haifa 31905, Israel
5
Bettecken T, Frenkel ZM, Altmüller J, Nürnberg P, Trifonov EN. Apoptotic cleavage of DNA in human lymphocyte chromatin shows high sequence specificity. J Biomol Struct Dyn 2012; 30:211-6. [PMID: 22702732 DOI: 10.1080/07391102.2012.677772] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 10/28/2022]
Abstract
Apoptotic digestion of human lymphocyte chromatin results in the appearance of large amounts of nucleosome size DNA fragments. Sequencing of these fragments and analysis of the distribution of bases around the apoptotic nucleases' cutting sites revealed a rather strong consensus sequence, not observed earlier. The consensus TAAAgTAcTTTA is characterized by complementary symmetry, resembling prokaryotic restriction sites. This consensus also possesses three TA dinucleotide steps, separated by five bases (corresponding to a half-period of the DNA double helix), suggesting strong bending of the DNA at the cut sites which is perhaps required for cutting.
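Both properties claimed for the consensus in this abstract can be checked directly on the string TAAAgTAcTTTA: complementary symmetry (the sequence equals its own reverse complement) and three TA steps spaced five bases apart. The helper names below are hypothetical; the consensus itself is taken from the abstract.

```python
# Translation table for Watson-Crick complements of uppercase DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence (case is normalized to upper)."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def is_self_complementary(seq):
    """True if the sequence reads the same on both strands."""
    return seq.upper() == reverse_complement(seq)

def step_positions(seq, dinuc="TA"):
    """0-based start positions of a dinucleotide step in seq."""
    s = seq.upper()
    return [i for i in range(len(s) - 1) if s[i:i + 2] == dinuc]

consensus = "TAAAgTAcTTTA"
symmetric = is_self_complementary(consensus)
ta_steps = step_positions(consensus)
spacings = [b - a for a, b in zip(ta_steps, ta_steps[1:])]
```

The TA steps start at positions 0, 5 and 10, so consecutive steps are five bases apart, about half of the roughly 10-base helical period mentioned in the abstract.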
Affiliation(s)
- Thomas Bettecken
  - CAGT-Center for Applied Genotyping, Max Planck Institute of Psychiatry, Kraepelinstr. 2-10, D-80804, Munich, Germany