1
Yang M, Zhang S, Zheng Z, Zhang P, Liang Y, Tang S. Employing bimodal representations to predict DNA bendability within a self-supervised pre-trained framework. Nucleic Acids Res 2024; 52:e33. PMID: 38375921; PMCID: PMC11014357; DOI: 10.1093/nar/gkae099.
Abstract
The bendability of genomic DNA, which measures the DNA looping rate, is crucial for numerous biological processes. Recently, an advanced high-throughput technique known as 'loop-seq' has made it possible to measure the inherent cyclizability of DNA fragments. However, quantifying the bendability of DNA at large scale is costly, laborious, and time-consuming. To close the gap between rapidly evolving large language models and expanding genomic sequence information, and to elucidate the impact of DNA bendability on critical regulatory sequence motifs such as super-enhancers in the human genome, we introduce an innovative computational model, named MIXBend, that forecasts DNA bendability using both nucleotide sequences and physicochemical properties. In MIXBend, the pre-trained language model DNABERT and a convolutional neural network with an attention mechanism are used to construct sequence- and physicochemical-based extractors for the sophisticated refinement of DNA sequence representations. These bimodal DNA representations are then fed to a k-mer sequence-physicochemistry matching module that minimizes the semantic gap between the two modalities. Lastly, a self-attention fusion layer is employed to predict DNA bendability. The experimental results validate MIXBend's superior performance relative to other state-of-the-art methods. Additionally, MIXBend reveals both novel and known motifs in yeast, and it discovers significant bendability fluctuations within super-enhancer regions and transcription factor binding sites in the human genome.
Affiliation(s)
- Minghao Yang
- Bioscience and Biomedical Engineering Thrust, System Hub, Hong Kong University of Science and Technology (Guangzhou), Guangzhou 511466, China
- Shichen Zhang
- Bioscience and Biomedical Engineering Thrust, System Hub, Hong Kong University of Science and Technology (Guangzhou), Guangzhou 511466, China
- Zhihang Zheng
- Bioscience and Biomedical Engineering Thrust, System Hub, Hong Kong University of Science and Technology (Guangzhou), Guangzhou 511466, China
- Pengfei Zhang
- Bioscience and Biomedical Engineering Thrust, System Hub, Hong Kong University of Science and Technology (Guangzhou), Guangzhou 511466, China
- Yan Liang
- School of Artificial Intelligence, South China Normal University, Foshan 528225, China
- Shaojun Tang
- Bioscience and Biomedical Engineering Thrust, System Hub, Hong Kong University of Science and Technology (Guangzhou), Guangzhou 511466, China
- Division of Life Science, Hong Kong University of Science and Technology, Hong Kong SAR 999077, China
2
Zheng H, Marçais G, Kingsford C. Creating and Using Minimizer Sketches in Computational Genomics. J Comput Biol 2023; 30:1251-1276. PMID: 37646787; PMCID: PMC11082048; DOI: 10.1089/cmb.2023.0094.
Abstract
Processing large data sets has become an essential part of computational genomics. The greatly increased availability of sequence data from multiple sources has fueled breakthroughs in genomics and related fields, but it has also created computational challenges in processing large sequencing experiments. The minimizer sketch is a popular sequence-sketching method that underlies core steps in computational genomics such as read mapping, sequence assembly, and k-mer counting. In most applications, minimizer sketches are constructed using one of a few classical approaches. More recently, effort has gone into building minimizer sketches with properties that improve on the classical constructions. In this survey, we review the history of the minimizer sketch, the theory developed around the concept, and the plethora of applications that take advantage of such sketches. We aim to provide readers with a comprehensive picture of the research landscape involving minimizer sketches, in anticipation of a better fusion of theory and application in the future.
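The classical construction this abstract refers to can be sketched in a few lines: for every window of w consecutive k-mers, keep the smallest k-mer under some order. The sketch below uses plain lexicographic order for clarity; real tools typically use a random hash order instead, and the function name and parameters are illustrative.

```python
def minimizer_sketch(seq, k=3, w=4):
    """Return the set of (position, k-mer) minimizers of seq
    under the classical (w, k)-minimizer scheme."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    sketch = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # argmin under lexicographic order; ties go to the leftmost occurrence
        pos = start + min(range(w), key=lambda j: window[j])
        sketch.add((pos, kmers[pos]))
    return sketch

print(sorted(minimizer_sketch("ACGTACGTGA")))  # adjacent windows share minimizers
```

Because consecutive windows overlap in w-1 k-mers, they frequently select the same minimizer, which is what makes the sketch much smaller than the full k-mer set.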
Affiliation(s)
- Hongyu Zheng
- Computer Science Department, Princeton University, Princeton, New Jersey, USA
- Guillaume Marçais
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Carl Kingsford
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
3
Rajput V, Mulay P. Fact Finding Instructor-based Clustering Technique for BP Estimation using Human Speech Signals. Comput Methods Biomech Biomed Engin 2023:1-16. PMID: 37929760; DOI: 10.1080/10255842.2023.2273203.
Abstract
Blood Pressure (BP) is an essential indicator of cardiovascular function. Regular BP monitoring is required for proper healthcare maintenance, avoiding the serious risks associated with both high and low BP. Several methods have been devised for BP estimation, but estimation accuracy remains a challenge. Hence, this research introduces an efficient BP estimation technique that applies a Fact Finding Instructor (FFI) based clustering method to patients' speech signals. The FFI optimization algorithm integrates the mannerism of a fact finder, who identifies the suspect in a criminal offense, with that of a knowledgeable instructor who makes the trainee more efficient; the detection and arrest of the suspect comprise two phases, a fact-finding phase and a chasing phase. Initially, the speech signal is collected from the database and pre-processed to remove noise and artifacts. Feature extraction then minimizes the computational overhead and generates a feature vector. The BP clustering is performed with both the k-means clustering algorithm and the proposed FFI optimization algorithm. The FFI optimization algorithm converges quickly thanks to the fact-finding phase, accurately locates the suspect, and clusters patients' BP classes based on the speech-signal features. The clusters formed by the FFI optimization algorithm are combined with those from k-means: the outputs of the two clustering operations are multiplied together to estimate BP under three criteria, Low BP, Normal, and High BP.
The performance of the proposed method is evaluated using metrics such as the Davies-Bouldin score, Homogeneity score, Completeness score, Jaccard similarity score, Silhouette score, and Dunn's index, which achieved improvement rates of 0.98, 0.96, 0.96, 0.98, 0.95, and 0.98 at a training percentage of 90, respectively, over the existing Teaching Learning Based Optimization (TLBO) clustering technique.
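The core pipeline in this abstract (weight the features, reduce each examination to a scalar value, then cluster values into Low / Normal / High BP) can be sketched as below. The attribute weights, the toy readings, and the plain 1-D k-means are all illustrative stand-ins; the paper's FFI optimizer itself is not reproduced here.

```python
def risk_value(features, weights):
    """Weighted sum of examination features, reduced to one scalar."""
    return sum(f * w for f, w in zip(features, weights))

def kmeans_1d(values, centroids, iters=20):
    """Plain 1-D k-means; returns the final centroids and cluster labels."""
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))
                  for v in values]
        for c in range(len(centroids)):
            members = [v for v, l in zip(values, labels) if l == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, labels

weights = [0.6, 0.4]                       # e.g. from feature-importance scores
exams = [(80, 50), (120, 80), (180, 110)]  # toy (systolic, diastolic) readings
risks = [risk_value(e, weights) for e in exams]
centroids, labels = kmeans_1d(risks, centroids=[60.0, 100.0, 150.0])
print(labels)  # one cluster index (Low / Normal / High) per examination
```

In the paper, a second clustering (the FFI-based one) produces its own assignments, and the two outputs are multiplied to give the final BP class.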
Affiliation(s)
- Vaishali Rajput
- Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India
- Vishwakarma Institute of Technology, Pune, India
- Preeti Mulay
- Vishwakarma Institute of Technology, Pune, India
4
Greenberg G, Ravi AN, Shomorony I. LexicHash: sequence similarity estimation via lexicographic comparison of hashes. Bioinformatics 2023; 39:btad652. PMID: 37878809; PMCID: PMC10628434; DOI: 10.1093/bioinformatics/btad652.
Abstract
MOTIVATION Pairwise sequence alignment is a heavy computational burden, particularly in the context of third-generation sequencing technologies. This issue is commonly addressed by approximately estimating sequence similarities using a hash-based method such as MinHash. In MinHash, all k-mers in a read are hashed and the minimum hash value, the min-hash, is stored. Pairwise similarities can then be estimated by counting the number of min-hash matches between a pair of reads, across many distinct hash functions. The choice of the parameter k controls an important tradeoff in the task of identifying alignments: larger k-values give greater confidence in the identification of alignments (high precision) but can lead to many missing alignments (low recall), particularly in the presence of significant noise. RESULTS In this work, we introduce LexicHash, a new similarity estimation method that is effectively independent of the choice of k and attains the high precision of large-k and the high sensitivity of small-k MinHash. LexicHash is a variant of MinHash with a carefully designed hash function. When estimating the similarity between two reads, instead of simply checking whether min-hashes match (as in standard MinHash), one checks how "lexicographically similar" the LexicHash min-hashes are. In our experiments on 40 PacBio datasets, the area under the precision-recall curves obtained by LexicHash had an average improvement of 20.9% over MinHash. Additionally, the LexicHash framework lends itself naturally to an efficient search of the largest alignments, yielding an O(n) time algorithm, and circumventing the seemingly fundamental O(n2) scaling associated with pairwise similarity search. AVAILABILITY AND IMPLEMENTATION LexicHash is available on GitHub at https://github.com/gcgreenberg/LexicHash.
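The baseline MinHash scheme this abstract builds on can be sketched directly: hash all k-mers of each read under several hash functions, keep the minimum per function, and estimate similarity as the fraction of matching min-hashes. Salted SHA-1 stands in for the hash family here; LexicHash's lexicographic comparison of hashes is not reproduced.

```python
import hashlib

def min_hashes(read, k=4, n_hashes=64):
    """One min-hash per salted hash function, over the read's k-mer set."""
    kmers = {read[i:i + k] for i in range(len(read) - k + 1)}
    return [min(hashlib.sha1(f"{salt}:{km}".encode()).hexdigest() for km in kmers)
            for salt in range(n_hashes)]

def minhash_similarity(a, b, k=4, n_hashes=64):
    """Fraction of hash functions whose min-hashes match between two reads."""
    ha = min_hashes(a, k, n_hashes)
    hb = min_hashes(b, k, n_hashes)
    return sum(x == y for x, y in zip(ha, hb)) / n_hashes

print(minhash_similarity("ACGTACGTACGT", "ACGTACGTACGT"))  # identical reads -> 1.0
```

LexicHash replaces the exact-match check on min-hashes with a measure of how long a prefix two min-hashes share, which is what removes the hard dependence on a single choice of k.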
Affiliation(s)
- Grant Greenberg
- Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, United States
- Aditya Narayan Ravi
- Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, United States
- Ilan Shomorony
- Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, United States
5
Nakshathram S, Duraisamy R. Protein remote homology recognition using local and global structural sequence alignment. J Intell Fuzzy Syst 2022. DOI: 10.3233/jifs-213522.
Abstract
Protein Remote Homology and fold Recognition (PRHR) is a crucial task in predicting protein patterns. To achieve this task, Sequence-Order Frequency Matrix-Sampling and Deep learning with Smith-Waterman (SOFM-SDSW) was designed using large-scale Protein Sequences (PSs), which makes determining the high-dimensional attributes time-consuming. It was also ineffective because SW performs only local alignment, which cannot find all the matches between the PSs. Hence, in this manuscript, a rapid semi-global alignment algorithm called SOFM-SD-GlobalSW (SOFM-SDGSW) is proposed that supports affine-gap scoring and uses sequence similarity to align the PSs. The major aim of this paper is to enhance the alignment of the SW algorithm both locally and globally for PRHR. In this algorithm, Maximal Exact Matches (MEMs) are first obtained by bit-level parallelism rather than by aligning individual characters. After that, a subgroup of MEMs is selected to determine the global Alignment Score (AS) using a new adaptive programming scheme, while the SW local alignment scheme determines the local AS. The local and global ASs are then combined to produce a final AS, which is used to train a Support Vector Machine (SVM) classifier to recognize remote homology and folds. Finally, the test results reveal that the SOFM-SDGSW algorithm attains an ROC of 0.97, 0.941, and 0.938 on the SCOP 1.53, SCOP 1.67, and Superfamily databases, respectively, and an ROC50 of 0.819, 0.846, and 0.86, respectively, compared to conventional PRHR algorithms.
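The idea of combining a local and a global alignment score, as this abstract describes, can be illustrated with the two standard dynamic programs below. This is only a sketch: the paper's MEM-based semi-global scheme and its affine-gap scoring are not reproduced (linear gap costs are used for brevity), and the additive combination at the end is illustrative.

```python
def sw_local(a, b, match=2, mismatch=-1, gap=-1):
    """Best Smith-Waterman local alignment score (linear gap cost)."""
    rows = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            rows[i][j] = max(0, rows[i - 1][j - 1] + s,
                             rows[i - 1][j] + gap, rows[i][j - 1] + gap)
            best = max(best, rows[i][j])
    return best

def nw_global(a, b, match=2, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score (linear gap cost)."""
    rows = [[gap * j for j in range(len(b) + 1)]]
    for i in range(1, len(a) + 1):
        rows.append([gap * i] + [0] * len(b))
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            rows[i][j] = max(rows[i - 1][j - 1] + s,
                             rows[i - 1][j] + gap, rows[i][j - 1] + gap)
    return rows[-1][-1]

a, b = "HEAGAWGHEE", "PAWHEAE"
final_score = sw_local(a, b) + nw_global(a, b)  # illustrative combined AS
```

The local score rewards the best-matching region regardless of context, while the global score penalizes end-to-end differences; feeding a combined score to a downstream classifier lets it use both signals.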
6
Qi P, Wang F, Huang Y, Yang X. Integrating functional data analysis with case-based reasoning for hypertension prognosis and diagnosis based on real-world electronic health records. BMC Med Inform Decis Mak 2022; 22:149. PMID: 35659217; PMCID: PMC9169301; DOI: 10.1186/s12911-022-01894-7.
Abstract
BACKGROUND Hypertension is the fifth leading chronic-disease cause of death worldwide. Early prognosis and diagnosis are critical in the hypertension care process. Inspired by human reasoning, case-based reasoning (CBR) is an empirical knowledge-reasoning method for the early detection of and intervention in hypertension that reuses only electronic health records. However, traditional similarity calculation methods often ignore the internal characteristics and latent information of medical examination data. METHODS In this paper, we first calculate the weights of the input attributes with a random forest algorithm. The hypertension risk value of each medical examination can then be evaluated from the input data and the attribute weights. By fitting the risk values into a hypertension risk curve, we calculate the similarity between different community residents and obtain the most similar case. Finally, the diagnosis and treatment protocol for the new case can be given. RESULTS The experimental data come from medical examinations in Tianqiao Community (Tongling City, Anhui Province, China) from 2012 to 2021, covering 4143 community residents and 43,676 medical examination records. We first discuss the effect of the influence factor and the decay factor on the similarity calculation, then evaluate the performance of the proposed FDA-CBR algorithm against the GRA-CBR and CS-CBR algorithms. The experimental results demonstrate that the proposed algorithm is highly efficient and accurate. CONCLUSIONS The experimental results show that the proposed FDA-CBR algorithm effectively describes the variation tendency of the risk value and always finds the most similar case. The accuracy of FDA-CBR is higher than that of GRA-CBR and CS-CBR by 9.94% and 16.41%, respectively.
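The retrieval step this abstract describes (reduce each resident's examination history to a risk curve, compare curves, reuse the protocol of the closest past case) can be sketched as follows. The similarity measure below (inverse mean absolute difference) and the case identifiers are illustrative stand-ins; the paper's influence and decay factors are not reproduced.

```python
def curve_similarity(curve_a, curve_b):
    """Similarity in (0, 1]; higher means more similar risk curves."""
    diffs = [abs(x - y) for x, y in zip(curve_a, curve_b)]
    return 1.0 / (1.0 + sum(diffs) / len(diffs))

def retrieve_most_similar(query_curve, case_base):
    """Return the id of the stored case whose risk curve best matches."""
    return max(case_base,
               key=lambda cid: curve_similarity(query_curve, case_base[cid]))

# Hypothetical case base: one risk-value sequence per past resident.
case_base = {"res_01": [0.2, 0.3, 0.35], "res_02": [0.6, 0.7, 0.8]}
print(retrieve_most_similar([0.58, 0.72, 0.79], case_base))  # -> res_02
```

Once the most similar case is retrieved, its stored diagnosis and treatment protocol is adapted and reused for the new resident, which is the standard retrieve-reuse step of CBR.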
Affiliation(s)
- Ping Qi
- Department of Mathematics and Computer Science, Tongling University, Tongling, 244061, China
- Fucheng Wang
- Department of Mathematics and Computer Science, Tongling University, Tongling, 244061, China
- Yong Huang
- School of Public Health, Anhui Medical University, Hefei, 230032, China
- Xiaoling Yang
- Tianqiao Community Health Service Station, Tongling Municipal Hospital, Tongling, 244061, China
7
Abu‐Hashem M, Gutub A. Efficient computation of Hash Hirschberg protein alignment utilizing hyper threading multi‐core sharing technology. CAAI Trans Intell Technol 2021. DOI: 10.1049/cit2.12070.
Affiliation(s)
- Muhannad Abu‐Hashem
- Department of Geomatics, Faculty of Architecture and Planning, King Abdulaziz University, Jeddah, Saudi Arabia
- Adnan Gutub
- Department of Computer Engineering, College of Computer & Information Systems, Umm Al‐Qura University, Makkah, Saudi Arabia