1
Wu Q, Li Y, Wang Q, Zhao X, Sun D, Liu B. Identification of DNA motif pairs on paired sequences based on composite heterogeneous graph. Front Genet 2024; 15:1424085. [PMID: 38952710 PMCID: PMC11215013 DOI: 10.3389/fgene.2024.1424085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/27/2024] [Accepted: 05/22/2024] [Indexed: 07/03/2024] Open
Abstract
Motivation Interactions between DNA motifs (DNA motif pairs) influence gene expression through partnership or competition during gene regulation. Potential chromatin interactions between different DNA motifs have been implicated in various diseases. However, current methods for identifying DNA motif pairs rely on the recognition of single DNA motifs or on probabilities, which may yield locally optimal solutions and can be sensitive to the choice of initial values. A method for precisely identifying DNA motif pairs is still lacking. Results Here, we propose a novel computational method for predicting DNA Motif Pairs based on a Composite Heterogeneous Graph (MPCHG). This approach leverages a composite heterogeneous graph model to identify DNA motif pairs on paired sequences. Compared with existing methods, MPCHG greatly improves the accuracy of motif prediction. Furthermore, the predicted DNA motifs show higher DNase accessibility than the background sequences. Notably, the two DNA motifs forming a pair exhibit functional consistency. Importantly, the interacting TF pairs obtained from predicted DNA motif pairs were significantly enriched with known interacting TF pairs, suggesting their potential contribution to chromatin interactions. Collectively, we believe that these identified DNA motif pairs hold substantial implications for revealing gene transcriptional regulation under long-range chromatin interactions.
Affiliation(s)
- Qiuqin Wu
  - School of Mathematics, Shandong University, Jinan, China
- Yang Li
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, United States
- Qi Wang
  - School of Mathematics, Shandong University, Jinan, China
- Xiaoyu Zhao
  - School of Mathematics, Shandong University, Jinan, China
- Duanchen Sun
  - School of Mathematics, Shandong University, Jinan, China
- Bingqiang Liu
  - School of Mathematics, Shandong University, Jinan, China
2
Chen N, Yu J, Liu Z, Meng L, Li X, Wong KC. Discovering DNA shape motifs with multiple DNA shape features: generalization, methods, and validation. Nucleic Acids Res 2024; 52:4137-4150. [PMID: 38572749 PMCID: PMC11077088 DOI: 10.1093/nar/gkae210] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2023] [Revised: 03/06/2024] [Accepted: 03/12/2024] [Indexed: 04/05/2024] Open
Abstract
DNA motifs are crucial patterns in gene regulation. DNA-binding proteins (DBPs), including transcription factors, can bind to specific DNA motifs to regulate gene expression and other cellular activities. Past studies suggest that DNA shape features could be subtly involved in DNA-DBP interactions. Therefore, shape motif annotations based on intrinsic DNA topology can deepen the understanding of DNA-DBP binding. Nevertheless, high-throughput tools for DNA shape motif discovery that incorporate multiple features together remain insufficient. To address this, we propose a series of methods to discover non-redundant DNA shape motifs, generalized to multiple motifs across multiple shape features. Specifically, an existing Gibbs sampling method is generalized to multiple DNA motif discovery with multiple shape features. Meanwhile, an expectation-maximization (EM) method and a hybrid method coupling EM with Gibbs sampling are proposed and developed with promising performance, convergence capability, and efficiency. The discovered DNA shape motif instances reveal insights into low-signal ChIP-seq peak summits, complementing existing sequence motif discovery work. Additionally, our modelling captures the potential interplay across multiple DNA shape features. We provide a valuable platform of tools for DNA shape motif discovery. An R package is built for open accessibility and long-lasting impact: https://zenodo.org/doi/10.5281/zenodo.10558980.
Affiliation(s)
- Nanjun Chen
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Jixiang Yu
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Zhe Liu
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Lingkuan Meng
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Xiangtao Li
  - School of Artificial Intelligence, Jilin University, Changchun City, Jilin Province, China
- Ka-Chun Wong
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
  - Hong Kong Institute of Data Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
  - Shenzhen Research Institute, City University of Hong Kong, Shenzhen, China
3
Liu Z, Wong HM, Chen X, Lin J, Zhang S, Yan S, Wang F, Li X, Wong KC. MotifHub: Detection of trans-acting DNA motif group with probabilistic modeling algorithm. Comput Biol Med 2024; 168:107753. [PMID: 38039889 DOI: 10.1016/j.compbiomed.2023.107753] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/06/2023] [Revised: 10/30/2023] [Accepted: 11/20/2023] [Indexed: 12/03/2023]
Abstract
BACKGROUND Trans-acting factors, a group of proteins that can directly or indirectly recognize or bind to the 8-12 bp core sequence of cis-acting elements and regulate the transcription efficiency of target genes, are of special importance in transcription regulation. Progressive development in high-throughput chromatin capture technology (e.g., Hi-C) enables the identification of chromatin-interacting sequence groups in which trans-acting DNA motif groups can be discovered. The difficulty of the problem lies in the combinatorial nature of DNA sequence pattern matching and its underlying sequence pattern search space. METHOD Here, we propose MotifHub for trans-acting DNA motif group discovery on grouped sequences. Specifically, the main approach is to develop probabilistic modeling that accommodates the stochastic nature of DNA motif patterns. RESULTS Based on the modeling, we develop global sampling techniques based on EM and Gibbs sampling to address the global optimization challenge of model fitting with latent variables. The results show that our proposed approaches achieve promising performance with linear time complexities. CONCLUSION MotifHub is a novel algorithm that considers the identification of both DNA co-binding motif groups and trans-acting TFs. Our study paves the way for identifying hub TFs of stem cell development (OCT4 and SOX2) and determining potential therapeutic targets of prostate cancer (FOXA1 and MYC). To ensure scientific reproducibility and long-term impact, its matrix-algebra-optimized source code is released at http://bioinfo.cs.cityu.edu.hk/MotifHub.
Affiliation(s)
- Zhe Liu
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Kowloon, Hong Kong, China
- Hiu-Man Wong
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Kowloon, Hong Kong, China
- Xingjian Chen
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Kowloon, Hong Kong, China
- Jiecong Lin
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Kowloon, Hong Kong, China
- Shixiong Zhang
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Kowloon, Hong Kong, China
- Shankai Yan
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Kowloon, Hong Kong, China
- Fuzhou Wang
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Kowloon, Hong Kong, China
- Xiangtao Li
  - School of Artificial Intelligence, Jilin University, Jilin, China
- Ka-Chun Wong
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Kowloon, Hong Kong, China
4
Wright DJ, Hall NAL, Irish N, Man AL, Glynn W, Mould A, Angeles ADL, Angiolini E, Swarbreck D, Gharbi K, Tunbridge EM, Haerty W. Long read sequencing reveals novel isoforms and insights into splicing regulation during cell state changes. BMC Genomics 2022; 23:42. [PMID: 35012468 PMCID: PMC8744310 DOI: 10.1186/s12864-021-08261-2] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 05/11/2021] [Accepted: 12/15/2021] [Indexed: 12/31/2022] Open
Abstract
BACKGROUND Alternative splicing is a key mechanism underlying cellular differentiation and a driver of complexity in mammalian neuronal tissues. However, it remains unclear which isoforms are differentially used or expressed and how this affects cellular differentiation. Long read sequencing allows full-length transcript recovery and quantification, enabling transcript-level analysis of alternative splicing processes and how these change with cell state. Here, we utilise Oxford Nanopore Technologies sequencing to produce a custom annotation of a well-studied human neuroblastoma cell line, SH-SY5Y, and to characterise isoform expression and usage across differentiation. RESULTS We identify many previously unannotated features, including a novel transcript of the voltage-gated calcium channel subunit gene, CACNA2D2. We show differential expression and usage of transcripts during differentiation, identifying candidates for future research into state change regulation. CONCLUSIONS Our work highlights the potential of long read sequencing to uncover previously unknown transcript diversity and mechanisms influencing alternative splicing.
Affiliation(s)
- David J Wright
  - Earlham Institute, Norwich Research Park, Norfolk, NR4 7UZ, UK
- Nicola A L Hall
  - Department of Psychiatry, Medical Sciences Division, University of Oxford, Oxfordshire, OX3 3JX, UK
  - Oxford Health, NHS Foundation Trust, Oxford, Oxfordshire, OX3 7JX, UK
- Naomi Irish
  - Earlham Institute, Norwich Research Park, Norfolk, NR4 7UZ, UK
- Angela L Man
  - Earlham Institute, Norwich Research Park, Norfolk, NR4 7UZ, UK
- Will Glynn
  - Earlham Institute, Norwich Research Park, Norfolk, NR4 7UZ, UK
- Arne Mould
  - Department of Psychiatry, Medical Sciences Division, University of Oxford, Oxfordshire, OX3 3JX, UK
  - Oxford Health, NHS Foundation Trust, Oxford, Oxfordshire, OX3 7JX, UK
- Alejandro De Los Angeles
  - Department of Psychiatry, Medical Sciences Division, University of Oxford, Oxfordshire, OX3 3JX, UK
  - Oxford Health, NHS Foundation Trust, Oxford, Oxfordshire, OX3 7JX, UK
- Emily Angiolini
  - Earlham Institute, Norwich Research Park, Norfolk, NR4 7UZ, UK
- David Swarbreck
  - Earlham Institute, Norwich Research Park, Norfolk, NR4 7UZ, UK
- Karim Gharbi
  - Earlham Institute, Norwich Research Park, Norfolk, NR4 7UZ, UK
- Elizabeth M Tunbridge
  - Department of Psychiatry, Medical Sciences Division, University of Oxford, Oxfordshire, OX3 3JX, UK
  - Oxford Health, NHS Foundation Trust, Oxford, Oxfordshire, OX3 7JX, UK
- Wilfried Haerty
  - Earlham Institute, Norwich Research Park, Norfolk, NR4 7UZ, UK
5
Li R, Li L, Xu Y, Yang J. Machine learning meets omics: applications and perspectives. Brief Bioinform 2021; 23:6425809. [PMID: 34791021 DOI: 10.1093/bib/bbab460] [Citation(s) in RCA: 44] [Impact Index Per Article: 14.7] [Received: 08/23/2021] [Revised: 09/29/2021] [Accepted: 10/07/2021] [Indexed: 02/07/2023] Open
Abstract
The innovation of biotechnologies has allowed the accumulation of omics data at an alarming rate, thus introducing the era of 'big data'. Extracting inherent valuable knowledge from various omics data remains a daunting problem in bioinformatics. Better solutions often require more innovative methods for efficient handling and effective results. Recent advancements in integrated analysis and computational modeling of multi-omics data have helped address such needs in an increasingly harmonious manner. The development and application of machine learning have largely advanced our insights into biology and biomedicine and greatly promoted the development of therapeutic strategies, especially for precision medicine. Here, we present a comprehensive survey and discussion of what has happened, is happening and will happen when machine learning meets omics. Specifically, we describe how artificial intelligence can be applied to omics studies and review recent advancements at the interface between machine learning and an ever-widening range of omics, including genomics, transcriptomics, proteomics, metabolomics, radiomics, as well as those at single-cell resolution. We also discuss and provide a synthesis of ideas, new insights, current challenges and perspectives of machine learning in omics.
Affiliation(s)
- Rufeng Li
  - Department of Cell Biology and Genetics, School of Basic Medical Sciences, Xi'an Jiaotong University Health Science Center, Xi'an 710061, P. R. China
- Lixin Li
  - Department of Cell Biology and Genetics, School of Basic Medical Sciences, Xi'an Jiaotong University Health Science Center, Xi'an 710061, P. R. China
- Yungang Xu
  - School of Electronics and Information, Northwestern Polytechnical University, Xi'an, 710129, China
- Juan Yang
  - Department of Cell Biology and Genetics, School of Basic Medical Sciences, Xi'an Jiaotong University Health Science Center, Xi'an 710061, P. R. China
  - Key Laboratory of Environment and Genes Related to Diseases (Xi'an Jiaotong University), Ministry of Education of China, Xi'an 710061, P. R. China
6
Xiao P, Cai X, Rajasekaran S. EMS3: An Improved Algorithm for Finding Edit-Distance Based Motifs. IEEE/ACM Trans Comput Biol Bioinform 2021; 18:27-37. [PMID: 32931433 DOI: 10.1109/tcbb.2020.3024222] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
Discovering patterns in biological sequences is a crucial step in extracting useful information from them. Motifs can be viewed as patterns that occur exactly or with minor changes across some or all of the biological sequences. Motif search has numerous applications, including the identification of transcription factors and their binding sites, composite regulatory patterns, similarity among families of proteins, etc. The general problem of motif search is intractable. One of the most studied models of motif search proposed in the literature is Edit-distance based Motif Search (EMS). In EMS, the goal is to find all the patterns of length l that occur with an edit-distance of at most d in each of the input sequences. Existing EMS algorithms do not scale well on challenging instances and large datasets. In this paper, the current state-of-the-art EMS solver is advanced by exploiting the idea of dimension reduction. A novel idea to reduce the cardinality of the alphabet is proposed. The algorithm we propose, EMS3, is an exact algorithm; that is, it finds all the motifs present in the input sequences. EMS3 can also be viewed as a divide-and-conquer algorithm. In this paper, we provide theoretical analyses to establish the efficiency of EMS3. Extensive experiments on standard benchmark datasets (synthetic and real-world) show that the proposed algorithm outperforms the existing state-of-the-art algorithm (EMS2).
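The EMS problem statement in this abstract can be illustrated with a brute-force sketch. This is not EMS3 (the abstract does not detail its alphabet-reduction or divide-and-conquer steps); it simply enumerates every candidate of length l and keeps those that lie within edit distance d of some substring of every input sequence. The function names and the DNA alphabet default are assumptions made for illustration.

```python
from itertools import product


def min_edit_distance_to_substring(pattern: str, text: str) -> int:
    """Smallest edit distance between `pattern` and any substring of `text`."""
    # prev[j] = best distance between the pattern prefix processed so far and
    # some substring of `text` ending at position j. Row 0 is all zeros
    # because the empty prefix matches the empty substring anywhere.
    prev = [0] * (len(text) + 1)
    for i, p in enumerate(pattern, start=1):
        curr = [i] + [0] * len(text)
        for j, t in enumerate(text, start=1):
            curr[j] = min(
                prev[j - 1] + (p != t),  # match or substitution
                prev[j] + 1,             # pattern character left unmatched
                curr[j - 1] + 1,         # text character left unmatched
            )
        prev = curr
    return min(prev)


def naive_ems(sequences, l, d, alphabet="ACGT"):
    """Return all length-l strings within edit distance d of every sequence."""
    motifs = []
    for letters in product(alphabet, repeat=l):  # |alphabet|**l candidates
        candidate = "".join(letters)
        if all(min_edit_distance_to_substring(candidate, s) <= d
               for s in sequences):
            motifs.append(candidate)
    return motifs


if __name__ == "__main__":
    print(naive_ems(["ACGT", "TACGA"], l=3, d=0))
```

The enumeration grows exponentially with l, which is consistent with the abstract's remark that existing solvers struggle on challenging instances; the sketch is only useful for very small l.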
7
Lanchantin J, Qi Y. Graph convolutional networks for epigenetic state prediction using both sequence and 3D genome data. Bioinformatics 2020; 36:i659-i667. [PMID: 33381816 DOI: 10.1093/bioinformatics/btaa793] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Predictive models of DNA chromatin profile (i.e. epigenetic state), such as transcription factor binding, are essential for understanding regulatory processes and developing gene therapies. It is known that the 3D genome, or spatial structure of DNA, is highly influential in the chromatin profile. Deep neural networks have achieved state of the art performance on chromatin profile prediction by using short windows of DNA sequences independently. These methods, however, ignore the long-range dependencies when predicting the chromatin profiles because modeling the 3D genome is challenging. RESULTS In this work, we introduce ChromeGCN, a graph convolutional network for chromatin profile prediction by fusing both local sequence and long-range 3D genome information. By incorporating the 3D genome, we relax the independent and identically distributed assumption of local windows for a better representation of DNA. ChromeGCN explicitly incorporates known long-range interactions into the modeling, allowing us to identify and interpret those important long-range dependencies in influencing chromatin profiles. We show experimentally that by fusing sequential and 3D genome data using ChromeGCN, we get a significant improvement over the state-of-the-art deep learning methods as indicated by three metrics. Importantly, we show that ChromeGCN is particularly useful for identifying epigenetic effects in those DNA windows that have a high degree of interactions with other DNA windows. AVAILABILITY AND IMPLEMENTATION https://github.com/QData/ChromeGCN. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jack Lanchantin
  - Department of Computer Science, University of Virginia, Charlottesville, VA 22903, USA
- Yanjun Qi
  - Department of Computer Science, University of Virginia, Charlottesville, VA 22903, USA
8
Wong KC, Lin J, Li X, Lin Q, Liang C, Song YQ. Heterodimeric DNA motif synthesis and validations. Nucleic Acids Res 2019; 47:1628-1636. [PMID: 30590725 PMCID: PMC6393289 DOI: 10.1093/nar/gky1297] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 08/14/2018] [Revised: 12/04/2018] [Accepted: 12/19/2018] [Indexed: 02/06/2023] Open
Abstract
Bound by transcription factors, DNA motifs (i.e. transcription factor binding sites) are prevalent and important for gene regulation in different tissues at different developmental stages of eukaryotes. Although considerable effort has been devoted to elucidating monomeric DNA motif patterns, our knowledge of heterodimeric DNA motifs is still far from complete. Therefore, we propose a computational approach to synthesize a heterodimeric DNA motif from two monomeric DNA motifs. The approach is divided sequentially into two components (Phases A and B). In Phase A, we develop inference models of how two monomeric DNA motifs can be oriented and overlapped with each other at the nucleotide level. In Phase B, given the two oriented monomeric DNA motifs, we further develop DNA-binding family-specific input-output hidden Markov models (IOHMMs) to synthesize a heterodimeric DNA motif. To validate the approach, we execute and cross-validate it with 618 experimentally verified heterodimeric DNA motifs across 49 DNA-binding family combinations. We observe that our approach can even "rescue" an existing heterodimeric DNA motif pattern (i.e. HOXB2_EOMES) previously published in Nature. Lastly, we apply the proposed approach to infer previously uncharacterized heterodimeric motifs. Their motif instances are supported by DNase accessibility, gene ontology, protein-protein interactions, in vivo ChIP-seq peaks, and even structural data from PDB. A public web-server is built for open accessibility and scientific impact. Its address is as follows: http://motif.cs.cityu.edu.hk/custom/MotifKirin.
Affiliation(s)
- Ka-Chun Wong
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Jiecong Lin
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Xiangtao Li
  - Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Qiuzhen Lin
  - College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
- Cheng Liang
  - School of Information Science and Engineering, Shandong Normal University, Jinan, China
- You-Qiang Song
  - School of Biomedical Sciences, University of Hong Kong, Pokfulam, Hong Kong SAR
9
Sun CX, Yang Y, Wang H, Wang WH. A Clustering Approach for Motif Discovery in ChIP-Seq Dataset. Entropy (Basel) 2019; 21:E802. [PMID: 33267515 PMCID: PMC7515331 DOI: 10.3390/e21080802] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 06/06/2019] [Revised: 08/04/2019] [Accepted: 08/15/2019] [Indexed: 12/25/2022]
Abstract
Chromatin immunoprecipitation combined with next-generation sequencing (ChIP-Seq) technology has enabled the identification of transcription factor binding sites (TFBSs) on a genome-wide scale. To effectively and efficiently discover TFBSs in the thousand or more DNA sequences generated by a ChIP-Seq data set, we propose a new algorithm named AP-ChIP. First, we set two thresholds based on probabilistic analysis to construct and further filter the cluster subsets. Then, we use Affinity Propagation (AP) clustering on the candidate cluster subsets to find the potential motifs. Experimental results on simulated data show that the AP-ChIP algorithm is able to make an almost accurate prediction of TFBSs in a reasonable time. Also, the validity of the AP-ChIP algorithm is tested on a real ChIP-Seq data set.
Affiliation(s)
- Chun-xiao Sun
  - College of Science, Northwest A&F University, Yangling 712100, China
- Yu Yang
  - School of Computer Science, Pingdingshan University, Pingdingshan 467000, China
  - School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
- Hua Wang
  - College of Software, Nankai University, Tianjin 300071, China
  - Department of Mathematical Sciences, Georgia Southern University, Statesboro, GA 30460, USA
- Wen-hu Wang
  - School of Computer Science, Pingdingshan University, Pingdingshan 467000, China
10
Blackburn J, Wong T, Madala BS, Barker C, Hardwick SA, Reis ALM, Deveson IW, Mercer TR. Use of synthetic DNA spike-in controls (sequins) for human genome sequencing. Nat Protoc 2019; 14:2119-2151. [PMID: 31217595 DOI: 10.1038/s41596-019-0175-1] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 06/30/2017] [Accepted: 04/03/2019] [Indexed: 02/08/2023]
Abstract
Next-generation sequencing (NGS) has been widely adopted to identify genetic variants and investigate their association with disease. However, the analysis of sequencing data remains challenging because of the complexity of human genetic variation and confounding errors introduced during library preparation, sequencing and analysis. We have developed a set of synthetic DNA spike-ins-termed 'sequins' (sequencing spike-ins)-that are directly added to DNA samples before library preparation. Sequins can be used to measure technical biases and to act as internal quantitative and qualitative controls throughout the sequencing workflow. This step-by-step protocol explains the use of sequins for both whole-genome and targeted sequencing of the human genome. This includes instructions regarding the dilution and addition of sequins to human DNA samples, followed by the bioinformatic steps required to separate sequin- and sample-derived sequencing reads and to evaluate the diagnostic performance of the assay. These practical guidelines are accompanied by a broader discussion of the conceptual and statistical principles that underpin the design of sequin standards. This protocol is suitable for users with standard laboratory and bioinformatic experience. The laboratory steps require ~1-4 d and the bioinformatic steps (which can be performed with the provided example data files) take an additional day.
Affiliation(s)
- James Blackburn
  - Genomics and Epigenetics Division, Garvan Institute of Medical Research, Sydney, Australia
  - St Vincent's Clinical School, Faculty of Medicine, UNSW Australia, Sydney, Australia
- Ted Wong
  - Genomics and Epigenetics Division, Garvan Institute of Medical Research, Sydney, Australia
- Bindu Swapna Madala
  - Genomics and Epigenetics Division, Garvan Institute of Medical Research, Sydney, Australia
- Chris Barker
  - Genomics and Epigenetics Division, Garvan Institute of Medical Research, Sydney, Australia
- Simon A Hardwick
  - Genomics and Epigenetics Division, Garvan Institute of Medical Research, Sydney, Australia
  - St Vincent's Clinical School, Faculty of Medicine, UNSW Australia, Sydney, Australia
- Andre L M Reis
  - Genomics and Epigenetics Division, Garvan Institute of Medical Research, Sydney, Australia
  - St Vincent's Clinical School, Faculty of Medicine, UNSW Australia, Sydney, Australia
- Ira W Deveson
  - Genomics and Epigenetics Division, Garvan Institute of Medical Research, Sydney, Australia
  - St Vincent's Clinical School, Faculty of Medicine, UNSW Australia, Sydney, Australia
- Tim R Mercer
  - Genomics and Epigenetics Division, Garvan Institute of Medical Research, Sydney, Australia
  - St Vincent's Clinical School, Faculty of Medicine, UNSW Australia, Sydney, Australia
  - Altius Institute for Biomedical Sciences, Seattle, WA, USA
11
Chahal G, Tyagi S, Ramialison M. Navigating the non-coding genome in heart development and Congenital Heart Disease. Differentiation 2019; 107:11-23. [PMID: 31102825 DOI: 10.1016/j.diff.2019.05.001] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 10/03/2018] [Revised: 01/14/2019] [Accepted: 05/06/2019] [Indexed: 12/12/2022]
Abstract
Congenital Heart Disease (CHD) is characterised by a wide range of cardiac defects, from mild to life-threatening, which occur in babies worldwide. To date, there is no cure for CHD; however, progress in surgery has reduced its mortality, allowing children affected by CHD to reach adulthood. In an effort to understand its genetic basis, several studies involving whole-genome sequencing (WGS) of patients with CHD have been undertaken and have generated a great wealth of information. The majority of putative causative mutations identified in WGS studies fall into the non-coding part of the genome. Unfortunately, because the function of these non-coding mutations is poorly understood, it is challenging to establish a causal link between a non-coding mutation and the disease. Thus, here we review the state-of-the-art approaches to interpret non-coding mutations in the context of CHD and address the following questions: Which non-coding sequences are important for cardiac function? Which technologies are used to identify them? Which resources are available to analyse them? What mutations are expected in these non-coding sequences? Learning from developmental processes, what is their expected role in CHD?
Affiliation(s)
- Gulrez Chahal
  - Australian Regenerative Medicine Institute (ARMI), 15 Innovation Walk, Monash University, Wellington Road, Clayton, 3800, VIC, Australia
  - Systems Biology Institute (SBI), Wellington Road, Clayton, 3800, VIC, Australia
- Sonika Tyagi
  - School of Biological Sciences, Monash University, Wellington Road, Clayton, 3800, VIC, Australia
  - Australian Genome Research Facility, 305 Grattan Street, Melbourne, VIC, 3000, Australia
- Mirana Ramialison
  - Australian Regenerative Medicine Institute (ARMI), 15 Innovation Walk, Monash University, Wellington Road, Clayton, 3800, VIC, Australia
  - Systems Biology Institute (SBI), Wellington Road, Clayton, 3800, VIC, Australia
12
Machine learning technology in the application of genome analysis: A systematic review. Gene 2019; 705:149-156. [PMID: 31026571 DOI: 10.1016/j.gene.2019.04.062] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 01/24/2019] [Revised: 04/17/2019] [Accepted: 04/22/2019] [Indexed: 01/17/2023]
Abstract
Machine learning (ML) is a powerful technique for tackling many problems in data mining and predictive analytics. We believe that ML has considerable potential in the field of bioinformatics, since high-throughput technology is producing ever-increasing amounts of biological data. In this review, we summarize in detail the major ML algorithms and the conditions that require attention when applying these algorithms to genomic problems, and we provide a list of examples from different perspectives and current data analysis challenges.
13
Chiral DNA sequences as commutable controls for clinical genomics. Nat Commun 2019; 10:1342. [PMID: 30902988 PMCID: PMC6430799 DOI: 10.1038/s41467-019-09272-0] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 09/20/2018] [Accepted: 02/15/2019] [Indexed: 12/14/2022] Open
Abstract
Chirality is a property describing any object that is inequivalent to its mirror image. Due to its 5′–3′ directionality, a DNA sequence is distinct from a mirrored sequence arranged in reverse nucleotide-order, and is therefore chiral. A given sequence and its opposing chiral partner sequence share many properties, such as nucleotide composition and sequence entropy. Here we demonstrate that chiral DNA sequence pairs also perform equivalently during molecular and bioinformatic techniques that underpin genetic analysis, including PCR amplification, hybridization, whole-genome, target-enriched and nanopore sequencing, sequence alignment and variant detection. Given these shared properties, synthetic DNA sequences mirroring clinically relevant or analytically challenging regions of the human genome are ideal controls for clinical genomics. The addition of synthetic chiral sequences (sequins) to patient tumor samples can prevent false-positive and false-negative mutation detection to improve diagnosis. Accordingly, we propose that sequins can fulfill the need for commutable internal controls in precision medicine. Any DNA sequence can be represented by a chiral partner sequence – an exact copy arranged in reverse nucleotide order. Here, the authors show that chiral DNA sequence pairs share important properties and show the utility of synthetic chiral sequences (sequins) as controls for clinical genomics.
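The notion of a chiral partner described in this abstract can be sketched in a few lines: the partner is the same sequence in reverse nucleotide order (not the reverse complement), so nucleotide composition is unchanged. The function names are hypothetical, and reading "sequence entropy" as Shannon entropy over single-nucleotide frequencies is an assumption; the abstract does not define it.

```python
from collections import Counter
from math import log2


def chiral_partner(seq: str) -> str:
    """Return the sequence in reverse nucleotide order (no complementing)."""
    return seq[::-1]


def composition(seq: str) -> Counter:
    """Count how often each nucleotide occurs."""
    return Counter(seq)


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits) of the single-nucleotide frequencies.

    Assumption: this is one plausible reading of 'sequence entropy'.
    """
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in Counter(seq).values())


if __name__ == "__main__":
    original = "AACGT"
    mirrored = chiral_partner(original)
    print(mirrored)
    print(composition(original) == composition(mirrored))
    print(shannon_entropy(original), shannon_entropy(mirrored))
```

Because reversal only permutes positions, any statistic that depends solely on nucleotide counts is shared by both sequences; the entropies may differ in the last floating-point digits, since the terms are summed in a different order.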
|
14
|
Gohardani SA, Bagherian M, Vaziri H. A multi-objective imperialist competitive algorithm (MOICA) for finding motifs in DNA sequences. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2019; 16:1575-1596. [PMID: 30947433 DOI: 10.3934/mbe.2019075] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
The motif discovery problem (MDP) is a well-known problem in biology that seeks transcription factor binding sites (TFBSs) in DNA sequences. On the one hand, biological knowledge of motif sites is limited; on the other hand, the problem is NP-hard. Thus, no efficient procedure is capable of finding motifs in every dataset. Some algorithms use exhaustive search, which is very time-consuming for large-scale datasets. Metaheuristic procedures, by contrast, appear to be a good choice for quickly finding a motif that has at least some acceptable biological properties. Most previous methods model the problem as a single-objective optimization problem; however, modeling it with multiple objectives improves the quality of the obtained motifs. Some multi-objective optimization models for MDP have tried to maximize three objectives simultaneously: motif length, support, and similarity. In this study, a multi-objective Imperialist Competitive Algorithm (ICA) is adopted for this problem as an approximation algorithm. ICA explores the solution space more broadly and so avoids becoming trapped in local optima. It therefore promises to obtain good solutions in a reasonable time. Experimental results show that our method produces good solutions compared with well-known algorithms in the literature, according to computational and biological indicators.
Affiliation(s)
- Mehri Bagherian
- Department of Applied Mathematics, Faculty of Mathematical Science, University of Guilan, Rasht, Iran
- Hamidreza Vaziri
- Department of Biology, Faculty of Science, University of Guilan, Rasht, Iran
|
15
|
Yu SP, Liang C, Xiao Q, Li GH, Ding PJ, Luo JW. GLNMDA: a novel method for miRNA-disease association prediction based on global linear neighborhoods. RNA Biol 2018; 15:1215-1227. [PMID: 30244645 PMCID: PMC6284594 DOI: 10.1080/15476286.2018.1521210] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 05/21/2018] [Revised: 08/22/2018] [Accepted: 08/24/2018] [Indexed: 01/11/2023] Open
Abstract
Recently, a growing number of studies have shown that miRNAs are involved in the development and progression of various complex diseases. Consequently, predicting potential miRNA-disease associations makes an important contribution to understanding the pathogenesis of diseases, developing new drugs, and designing individualized diagnostic and therapeutic approaches for different human diseases. Nonetheless, the inherent noise and incompleteness of existing biological datasets have limited the prediction accuracy of current computational models. To address this issue, we propose a novel method for miRNA-disease association prediction based on global linear neighborhoods (GLNMDA). Specifically, our method obtains a new miRNA/disease similarity matrix by linearly reconstructing each miRNA/disease according to the known, experimentally verified miRNA-disease associations. We then adopt label propagation to infer the potential associations between miRNAs and diseases. As a result, GLNMDA achieved reliable performance in the frameworks of both local and global LOOCV (AUCs of 0.867 and 0.929, respectively) and 5-fold cross validation (average AUC of 0.926). Case studies on five common human diseases further confirmed the utility of our method in discovering latent miRNA-disease pairs. Taken together, GLNMDA could serve as a reliable computational tool for miRNA-disease association prediction.
Affiliation(s)
- Sheng-Peng Yu
- School of Information Science and Engineering, Shandong Normal University, Jinan, China
- Cheng Liang
- School of Information Science and Engineering, Shandong Normal University, Jinan, China
- Qiu Xiao
- College of Information Science and Engineering, Hunan Normal University, Changsha, China
- Guang-Hui Li
- School of Information Engineering, East China Jiaotong University, Nanchang, China
- Ping-Jian Ding
- College of Information Science and Engineering, Hunan University, Changsha, China
- Jia-Wei Luo
- College of Information Science and Engineering, Hunan University, Changsha, China
|
16
|
Su H, Liu M, Sun S, Peng Z, Yang J. Improving the prediction of protein–nucleic acids binding residues via multiple sequence profiles and the consensus of complementary methods. Bioinformatics 2018; 35:930-936. [DOI: 10.1093/bioinformatics/bty756] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.8] [Received: 03/02/2018] [Revised: 08/02/2018] [Accepted: 08/28/2018] [Indexed: 12/31/2022] Open
Affiliation(s)
- Hong Su
- School of Mathematical Sciences, Nankai University, Tianjin, China
- Mengchen Liu
- School of Mathematical Sciences, Nankai University, Tianjin, China
- Saisai Sun
- School of Mathematical Sciences, Nankai University, Tianjin, China
- Zhenling Peng
- Center for Applied Mathematics, Tianjin University, Tianjin, China
- Jianyi Yang
- School of Mathematical Sciences, Nankai University, Tianjin, China
|
17
|
Hardwick SA, Chen WY, Wong T, Kanakamedala BS, Deveson IW, Ongley SE, Santini NS, Marcellin E, Smith MA, Nielsen LK, Lovelock CE, Neilan BA, Mercer TR. Synthetic microbe communities provide internal reference standards for metagenome sequencing and analysis. Nat Commun 2018; 9:3096. [PMID: 30082706 PMCID: PMC6078961 DOI: 10.1038/s41467-018-05555-0] [Citation(s) in RCA: 62] [Impact Index Per Article: 10.3] [Received: 01/21/2018] [Accepted: 06/20/2018] [Indexed: 12/12/2022] Open
Abstract
The complexity of microbial communities, combined with technical biases in next-generation sequencing, poses a challenge to metagenomic analysis. Here, we develop a set of internal DNA standards, termed “sequins” (sequencing spike-ins), that together constitute a synthetic community of artificial microbial genomes. Sequins are added to environmental DNA samples prior to library preparation, and undergo concurrent sequencing with the accompanying sample. We validate the performance of sequins by comparison to mock microbial communities, and demonstrate their use in the analysis of real metagenome samples. We show how sequins can be used to measure fold change differences in the size and structure of accompanying microbial communities, and perform quantitative normalization between samples. We further illustrate how sequins can be used to benchmark and optimize new methods, including nanopore long-read sequencing technology. We provide metagenome sequins, along with associated data sets, protocols, and an accompanying software toolkit, as reference standards to aid in metagenomic studies. Complex microbial communities pose a challenge to metagenomic analysis. Here the authors develop ‘sequins’, internal DNA standards that represent a synthetic community of artificial genomes.
Affiliation(s)
- Simon A Hardwick
- Garvan Institute of Medical Research, Sydney, 2010, NSW, Australia; St. Vincent's Clinical School, Faculty of Medicine, UNSW Sydney, Sydney, 2052, NSW, Australia
- Wendy Y Chen
- Garvan Institute of Medical Research, Sydney, 2010, NSW, Australia; St. Vincent's Clinical School, Faculty of Medicine, UNSW Sydney, Sydney, 2052, NSW, Australia
- Ted Wong
- Garvan Institute of Medical Research, Sydney, 2010, NSW, Australia
- Ira W Deveson
- Garvan Institute of Medical Research, Sydney, 2010, NSW, Australia; St. Vincent's Clinical School, Faculty of Medicine, UNSW Sydney, Sydney, 2052, NSW, Australia
- Sarah E Ongley
- School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Sydney, 2052, NSW, Australia; School of Environmental and Life Sciences, The University of Newcastle, Callaghan, 2308, NSW, Australia
- Nadia S Santini
- Centre for Marine Bioinnovation UNSW Sydney, Sydney, 2052, NSW, Australia; Instituto de Ecologia, Universidad Nacional Autonoma de Mexico, Mexico City, 04500, Mexico
- Esteban Marcellin
- Australian Institute for Bioengineering and Nanotechnology, The University of Queensland, Brisbane, 4072, Queensland, Australia
- Martin A Smith
- Garvan Institute of Medical Research, Sydney, 2010, NSW, Australia; St. Vincent's Clinical School, Faculty of Medicine, UNSW Sydney, Sydney, 2052, NSW, Australia
- Lars K Nielsen
- Australian Institute for Bioengineering and Nanotechnology, The University of Queensland, Brisbane, 4072, Queensland, Australia
- Catherine E Lovelock
- School of Biological Sciences, The University of Queensland, Brisbane, 4072, QLD, Australia
- Brett A Neilan
- School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Sydney, 2052, NSW, Australia; School of Environmental and Life Sciences, The University of Newcastle, Callaghan, 2308, NSW, Australia
- Tim R Mercer
- Garvan Institute of Medical Research, Sydney, 2010, NSW, Australia; St. Vincent's Clinical School, Faculty of Medicine, UNSW Sydney, Sydney, 2052, NSW, Australia; Altius Institute for Biomedical Sciences, Seattle, 98121, WA, USA
|
18
|
Yu Q, Wei D, Huo H. SamSelect: a sample sequence selection algorithm for quorum planted motif search on large DNA datasets. BMC Bioinformatics 2018; 19:228. [PMID: 29914360 PMCID: PMC6006848 DOI: 10.1186/s12859-018-2242-y] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 02/02/2018] [Accepted: 06/12/2018] [Indexed: 01/08/2023] Open
Abstract
BACKGROUND Given a set of t n-length DNA sequences, q satisfying 0 < q ≤ 1, and l and d satisfying 0 ≤ d < l < n, the quorum planted motif search (qPMS) finds l-length strings that occur in at least qt input sequences with up to d mismatches and is mainly used to locate transcription factor binding sites in DNA sequences. Existing qPMS algorithms have been able to efficiently process small standard datasets (e.g., t = 20 and n = 600), but they are too time consuming to process large DNA datasets, such as ChIP-seq datasets that contain thousands of sequences or more. RESULTS We analyze the effects of t and q on the time performance of qPMS algorithms and find that a large t or a small q causes a longer computation time. Based on this information, we improve the time performance of existing qPMS algorithms by selecting a sample sequence set D' with a small t and a large q from the large input dataset D and then executing qPMS algorithms on D'. A sample sequence selection algorithm named SamSelect is proposed. The experimental results on both simulated and real data show (1) that SamSelect can select D' efficiently and (2) that the qPMS algorithms executed on D' can find implanted or real motifs in a significantly shorter time than when executed on D. CONCLUSIONS We improve the ability of existing qPMS algorithms to process large DNA datasets from the perspective of selecting high-quality sample sequence sets so that the qPMS algorithms can find motifs in a short time in the selected sample sequence set D', rather than take an unfeasibly long time to search the original sequence set D. Our motif discovery method is an approximate algorithm.
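The qPMS problem statement in this abstract can be made concrete with a brute-force Python sketch. It is not SamSelect and not any published qPMS algorithm: it simply enumerates all 4^l candidate strings, so it is only usable for very small l. The function names and the toy sequences are assumptions made for illustration.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs(motif: str, seq: str, d: int) -> bool:
    """True if some window of seq matches motif with at most d mismatches."""
    l = len(motif)
    return any(hamming(motif, seq[i:i + l]) <= d
               for i in range(len(seq) - l + 1))


def qpms_bruteforce(seqs, l, d, q):
    """Return every l-length string over ACGT that occurs, with up to d
    mismatches, in at least q*t of the t input sequences."""
    t = len(seqs)
    found = []
    for letters in product("ACGT", repeat=l):  # 4**l candidates
        motif = "".join(letters)
        support = sum(occurs(motif, s, d) for s in seqs)
        if support >= q * t:
            found.append(motif)
    return found


# Toy input (hypothetical): "ACG" appears exactly in three of four sequences.
seqs = ["ACGTAC", "TTACGA", "GGACGT", "CCCCCC"]
result = qpms_bruteforce(seqs, l=3, d=0, q=0.75)
```

Because the cost grows with both the number of sequences t and the 4^l candidates, exhaustive search of this kind quickly becomes impractical, which is the setting in which running a qPMS algorithm on a smaller sample set D' becomes attractive.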
Affiliation(s)
- Qiang Yu
- School of Computer Science and Technology, Xidian University, Xi’an, 710071 China
- Dingbang Wei
- School of Computer Science and Technology, Xidian University, Xi’an, 710071 China
- Hongwei Huo
- School of Computer Science and Technology, Xidian University, Xi’an, 710071 China
|