1
Ray D, Parrinello M. Data-driven classification of ligand unbinding pathways. Proc Natl Acad Sci U S A 2024; 121:e2313542121. [PMID: 38412121 PMCID: PMC10927508 DOI: 10.1073/pnas.2313542121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2023] [Accepted: 01/26/2024] [Indexed: 02/29/2024] Open
Abstract
Studying the pathways of ligand-receptor binding is essential to understand the mechanism of target recognition by small molecules. The binding free energy and kinetics of protein-ligand complexes can be computed using molecular dynamics (MD) simulations, often in quantitative agreement with experiments. However, only a qualitative picture of the ligand binding/unbinding paths can be obtained through a conventional analysis of the MD trajectories. Besides, the higher degree of manual effort involved in analyzing pathways limits its applicability in large-scale drug discovery. Here, we address this limitation by introducing an automated approach for analyzing molecular transition paths with a particular focus on protein-ligand dissociation. Our method is based on the dynamic time-warping algorithm, originally designed for speech recognition. We accurately classified molecular trajectories using a very generic descriptor set of contacts or distances. Our approach outperforms manual classification by distinguishing between parallel dissociation channels, within the pathways identified by visual inspection. Most notably, we could compute exit-path-specific ligand-dissociation kinetics. The unbinding timescale along the fastest path agrees with the experimental residence time, providing a physical interpretation to our entirely data-driven protocol. In combination with appropriate enhanced sampling algorithms, this technique can be used for the initial exploration of ligand-dissociation pathways as well as for calculating path-specific thermodynamic and kinetic properties.
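The dynamic time-warping step named in the abstract can be illustrated with a minimal sketch. This is the textbook DTW recurrence, not the authors' implementation. It assumes each trajectory is reduced to one scalar descriptor per frame (for example a single hypothetical ligand-pocket distance), whereas the paper uses a set of contacts or distances.

```python
from math import inf


def dtw_distance(a, b):
    """Textbook dynamic time warping between two 1D sequences.

    Uses the absolute difference as local cost and allows the three
    standard moves (match, insertion, deletion). Runs in O(len(a) * len(b)).
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats its sample
                cost[i][j - 1],      # b advances, a repeats its sample
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Hypothetical descriptor traces: the second is a time-stretched copy of the first.
fast_exit = [1.0, 2.0, 3.0]
slow_exit = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
print(dtw_distance(fast_exit, slow_exit))
```

Because the warping absorbs differences in speed, two trajectories that follow the same route at different paces receive a small distance. That property is what makes a pairwise DTW matrix usable as input to a clustering step.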
Affiliation(s)
- Dhiman Ray
  - Simulations Research Line, Italian Institute of Technology, Via Enrico Melen 83, Genova GE 16152, Italy
- Michele Parrinello
  - Simulations Research Line, Italian Institute of Technology, Via Enrico Melen 83, Genova GE 16152, Italy
2
Barriga Rubio RH, Otero M. Stochastic modeling of Dalbulus maidis, vector of maize diseases. Theor Popul Biol 2023; 154:51-66. [PMID: 37669715 DOI: 10.1016/j.tpb.2023.08.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2022] [Revised: 08/01/2023] [Accepted: 08/18/2023] [Indexed: 09/07/2023]
Abstract
We developed a simple linear stochastic model for Dalbulus maidis dependent exclusively on temperature, whose parameters were determined from published field and laboratory studies performed at different temperatures. This model takes into account the principal stages and events of the life cycle of this pest, which is a vector of maize diseases. We implemented the effect of distributed delays, or the Linear Chain Trick (LCT), considering a fixed number of sub-stages for the egg and nymph stages of Dalbulus maidis in order to accurately represent what is observed in nature. A sensitivity analysis shows that the speed of the dynamics is sensitive to changes in the development rates, but not to the longevity of each stage or the fecundity, which almost exclusively affect insect abundance. We used our model to study its predictive and explanatory capacity considering a published experiment as a case study. Although the simulation results show a behavior qualitatively equivalent to that observed in the experimental results, it is not possible to explain accurately the magnitude of, or the times at which, the maximum abundances of second-generation nymphs and adults are reached. Therefore, we evaluated three possible scenarios for the insect that allow us to glimpse some of the advantages of having a computational model in order to find out what processes, taken into account in the model, may explain the differences observed between published experimental results and model results. The three proposed scenarios, based on variations in the parameterized rates of the model, can satisfactorily explain the experimental observations. We observed that in order to better simulate the experimental results it is not necessary to modify fecundity or mortality rates. However, it is necessary to accelerate the average development rates of our model by 20 to 40%, compatible with extreme values of the rates close to the upper edges of the confidence bands of our parameterization rate curves, according to insects with faster development rates already reported in literature.
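The Linear Chain Trick mentioned in the abstract can be illustrated with a short sketch. It is not the authors' model. It only shows the generic idea that splitting one life stage into k sub-stages, each left at rate k times the overall development rate, keeps the mean stage duration fixed while narrowing its spread (an Erlang instead of an exponential distribution). The rate of 0.1 per day and the sub-stage counts are hypothetical values chosen for illustration.

```python
import random


def sample_stage_duration(rate, n_substages, rng):
    """Time to traverse n_substages in series, each left at rate n_substages * rate."""
    return sum(rng.expovariate(n_substages * rate) for _ in range(n_substages))


def summarize(rate, n_substages, n_samples=20000, seed=0):
    """Return the sample mean and variance of the simulated stage duration."""
    rng = random.Random(seed)
    times = [sample_stage_duration(rate, n_substages, rng) for _ in range(n_samples)]
    mean = sum(times) / n_samples
    var = sum((t - mean) ** 2 for t in times) / (n_samples - 1)
    return mean, var


# Hypothetical development rate of 0.1 per day, i.e. a mean stage duration of 10 days.
for k in (1, 10):
    mean, var = summarize(rate=0.1, n_substages=k)
    print(f"sub-stages={k:2d}  mean={mean:.2f} days  variance={var:.2f}")
```

With one sub-stage the durations are exponentially distributed, so many individuals mature almost immediately. With more sub-stages the durations concentrate around the mean, which is closer to what is seen in insect development.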
Affiliation(s)
- R H Barriga Rubio
  - Departamento de Física, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Buenos Aires, Argentina
- M Otero
  - Departamento de Física, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Buenos Aires, Argentina
  - Instituto de Física de Buenos Aires (IFIBA), FCEN-UBA and CONICET, Buenos Aires, Argentina
3
Chan NB, Li W, Aung T, Bazuaye E, Montero RM. Machine Learning-Based Time in Patterns for Blood Glucose Fluctuation Pattern Recognition in Type 1 Diabetes Management: Development and Validation Study. JMIR AI 2023; 2:e45450. [PMID: 38875568 PMCID: PMC11041419 DOI: 10.2196/45450] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/01/2023] [Revised: 02/15/2023] [Accepted: 02/24/2023] [Indexed: 06/16/2024]
Abstract
BACKGROUND Continuous glucose monitoring (CGM) for diabetes combines noninvasive glucose biosensors, continuous monitoring, cloud computing, and analytics to connect and simulate a hospital setting in a person's home. CGM systems inspired analytics methods to measure glycemic variability (GV), but existing GV analytics methods disregard glucose trends and patterns; hence, they fail to capture entire temporal patterns and do not provide granular insights about glucose fluctuations. OBJECTIVE This study aimed to propose a machine learning-based framework for blood glucose fluctuation pattern recognition, which enables a more comprehensive representation of GV profiles that could present detailed fluctuation information, be easily understood by clinicians, and provide insights about patient groups based on time in blood fluctuation patterns. METHODS Overall, 1.5 million measurements from 126 patients in the United Kingdom with type 1 diabetes mellitus (T1DM) were collected, and prevalent blood fluctuation patterns were extracted using dynamic time warping. The patterns were further validated in 225 patients in the United States with T1DM. Hierarchical clustering was then applied on time in patterns to form 4 clusters of patients. Patient groups were compared using statistical analysis. RESULTS In total, 6 patterns depicting distinctive glucose levels and trends were identified and validated, based on which 4 GV profiles of patients with T1DM were found. They were significantly different in terms of glycemic statuses such as diabetes duration (P=.04), glycated hemoglobin level (P<.001), and time in range (P<.001) and thus had different management needs. CONCLUSIONS The proposed method can analytically extract existing blood fluctuation patterns from CGM data. Thus, time in patterns can capture a rich view of patients' GV profile. 
Its conceptual resemblance with time in range, along with rich blood fluctuation details, makes it more scalable, accessible, and informative to clinicians.
Affiliation(s)
- Nicholas Berin Chan
  - Informatics Research Centre, Henley Business School, University of Reading, Reading, United Kingdom
- Weizi Li
  - Informatics Research Centre, Henley Business School, University of Reading, Reading, United Kingdom
- Theingi Aung
  - Royal Berkshire NHS Foundation Trust, Reading, United Kingdom
- Eghosa Bazuaye
  - Royal Berkshire NHS Foundation Trust, Reading, United Kingdom
4
Ao C, Jiao S, Wang Y, Yu L, Zou Q. Biological Sequence Classification: A Review on Data and General Methods. RESEARCH (WASHINGTON, D.C.) 2022; 2022:0011. [PMID: 39285948 PMCID: PMC11404319 DOI: 10.34133/research.0011] [Citation(s) in RCA: 32] [Impact Index Per Article: 16.0] [Received: 07/25/2022] [Accepted: 10/25/2022] [Indexed: 09/19/2024]
Abstract
With the rapid development of biotechnology, the number of biological sequences has grown exponentially. The continuous expansion of biological sequence data promotes the application of machine learning in biological sequences to construct predictive models for mining biological sequence information. There are many branches of biological sequence classification research. In this review, we mainly focus on the function and modification classification of biological sequences based on machine learning. Sequence-based prediction and analysis are the basic tasks to understand the biological functions of DNA, RNA, proteins, and peptides. However, there are hundreds of classification models developed for biological sequences, and the quite varied specific methods seem dizzying at first glance. Here, we aim to establish a long-term support website (http://lab.malab.cn/~acy/BioseqData/home.html), which provides readers with detailed information on the classification method and download links to relevant datasets. We briefly introduce the steps to build an effective model framework for biological sequence data. In addition, a brief introduction to single-cell sequencing data analysis methods and applications in biology is also included. Finally, we discuss the current challenges and future perspectives of biological sequence classification research.
Affiliation(s)
- Chunyan Ao
  - School of Computer Science and Technology, Xidian University, Xi'an, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Shihu Jiao
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Yansu Wang
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Liang Yu
  - School of Computer Science and Technology, Xidian University, Xi'an, China
- Quan Zou
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
5
Power spectrum and dynamic time warping for DNA sequences classification. EVOLVING SYSTEMS 2020. [DOI: 10.1007/s12530-019-09306-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
6
Ranjard L, Wong TKF, Rodrigo AG. Effective machine-learning assembly for next-generation amplicon sequencing with very low coverage. BMC Bioinformatics 2019; 20:654. [PMID: 31829137 PMCID: PMC6907241 DOI: 10.1186/s12859-019-3287-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 05/03/2019] [Accepted: 11/20/2019] [Indexed: 01/20/2023] Open
Abstract
BACKGROUND In short-read DNA sequencing experiments, the read coverage is a key parameter to successfully assemble the reads and reconstruct the sequence of the input DNA. When coverage is very low, the original sequence reconstruction from the reads can be difficult because of the occurrence of uncovered gaps. Reference guided assembly can then improve these assemblies. However, when the available reference is phylogenetically distant from the sequencing reads, the mapping rate of the reads can be extremely low. Some recent improvements in read mapping approaches aim at modifying the reference according to the reads dynamically. Such approaches can significantly improve the alignment rate of the reads onto distant references but the processing of insertions and deletions remains challenging. RESULTS Here, we introduce a new algorithm to update the reference sequence according to previously aligned reads. Substitutions, insertions and deletions are performed in the reference sequence dynamically. We evaluate this approach to assemble a western-grey kangaroo mitochondrial amplicon. Our results show that more reads can be aligned and that this method produces assemblies of length comparable to the truth while limiting error rate when classic approaches fail to recover the correct length. Finally, we discuss how the core algorithm of this method could be improved and combined with other approaches to analyse larger genomic sequences. CONCLUSIONS We introduced an algorithm to perform dynamic alignment of reads on a distant reference. We showed that such approach can improve the reconstruction of an amplicon compared to classically used bioinformatic pipelines. Although not portable to genomic scale in the current form, we suggested several improvements to be investigated to make this method more flexible and allow dynamic alignment to be used for large genome assemblies.
Affiliation(s)
- Louis Ranjard
  - The Research School of Biology, The Australian National University, Canberra, Australia
- Thomas K. F. Wong
  - The Research School of Biology, The Australian National University, Canberra, Australia
- Allen G. Rodrigo
  - The Research School of Biology, The Australian National University, Canberra, Australia
7
Alakus TB, Das B, Turkoglu I. DNA encoding with entropy based numerical mapping technique for phylogenetic analysis. 2019 INTERNATIONAL ARTIFICIAL INTELLIGENCE AND DATA PROCESSING SYMPOSIUM (IDAP) 2019. [DOI: 10.1109/idap.2019.8875937] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/01/2023]
8
Bi JH, Tong YF, Qiu ZW, Yang XF, Minna J, Gazdar AF, Song K. ClickGene: an open cloud-based platform for big pan-cancer data genome-wide association study, visualization and exploration. BioData Min 2019; 12:12. [PMID: 31391866 PMCID: PMC6595587 DOI: 10.1186/s13040-019-0202-3] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 12/20/2018] [Accepted: 06/17/2019] [Indexed: 12/15/2022] Open
Abstract
A tremendous amount of whole-genome sequencing data has been provided by large consortium projects such as TCGA (The Cancer Genome Atlas) and COSMIC, which creates incredible opportunities for functional gene research and for uncovering cancer-associated mechanisms. While the existing web servers are valuable and widely used, many whole-genome analysis functions urgently needed by experimental biologists are still not adequately addressed. A cloud-based platform named CG (ClickGene) was therefore developed for do-it-yourself analysis of users' private in-house data or public genome data without any software installation or system configuration. The CG platform provides key interactive and customized functions including Bee-swarm plot, linear regression analyses, Mountain plot, Directional Manhattan plot, Deflection plot and Volcano plot. Using these tools, global profiling or individual gene distributions for expression and copy number variation (CNV) analyses can be generated with mouse clicks alone. The easy accessibility of such comprehensive pan-cancer genome analysis greatly facilitates data mining in wide research areas, such as the therapeutic discovery process. It therefore fills the gap between big cancer genomics data and the delivery of integrated knowledge to end-users, helping unleash the value of current data resources. More importantly, unlike other R-based web platforms, the CG platform was developed with Dubbo, a cloud distributed service governance framework for global transfer of 'big data' streams. CG runs on an independent cloud server, which ensures its steady global accessibility. More than 2 years of operation have shown that end-users can create advanced plots for hundreds of whole-genome datasets within seconds, anytime and anywhere. CG is available at http://www.clickgenome.org/.
Affiliation(s)
- Jia-Hao Bi
  - School of Chemical Engineering and Technology, Tianjin University, Tianjin, 300072 China
- Yi-Fan Tong
  - School of Chemical Engineering and Technology, Tianjin University, Tianjin, 300072 China
- Zhe-Wei Qiu
  - School of Chemical Engineering and Technology, Tianjin University, Tianjin, 300072 China
- Xing-Feng Yang
  - School of Computer Software, Tianjin University, Tianjin, 300072 China
- John Minna
  - Hamon Center for Therapeutic Oncology, University of Texas Southwestern Medical Center, Dallas, TX 75390 USA
  - Department of Pharmacology, University of Texas Southwestern Medical Center, Dallas, TX 75390 USA
  - Department of Internal Medicine, University of Texas Southwestern Medical Center, Dallas, TX 75390 USA
- Adi F Gazdar
  - Hamon Center for Therapeutic Oncology, University of Texas Southwestern Medical Center, Dallas, TX 75390 USA
  - Department of Pathology, University of Texas Southwestern Medical Center, Dallas, TX 75390 USA
- Kai Song
  - School of Chemical Engineering and Technology, Tianjin University, Tianjin, 300072 China
  - Hamon Center for Therapeutic Oncology, University of Texas Southwestern Medical Center, Dallas, TX 75390 USA
9
Randhawa GS, Hill KA, Kari L. ML-DSP: Machine Learning with Digital Signal Processing for ultrafast, accurate, and scalable genome classification at all taxonomic levels. BMC Genomics 2019; 20:267. [PMID: 30943897 PMCID: PMC6448311 DOI: 10.1186/s12864-019-5571-y] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 09/04/2018] [Accepted: 02/27/2019] [Indexed: 11/11/2022] Open
Abstract
Background Although software tools abound for the comparison, analysis, identification, and classification of genomic sequences, taxonomic classification remains challenging due to the magnitude of the datasets and the intrinsic problems associated with classification. The need exists for an approach and software tool that addresses the limitations of existing alignment-based methods, as well as the challenges of recently proposed alignment-free methods. Results We propose a novel combination of supervised Machine Learning with Digital Signal Processing, resulting in ML-DSP: an alignment-free software tool for ultrafast, accurate, and scalable genome classification at all taxonomic levels. We test ML-DSP by classifying 7396 full mitochondrial genomes at various taxonomic levels, from kingdom to genus, with an average classification accuracy of >97%. A quantitative comparison with state-of-the-art classification software tools is performed, on two small benchmark datasets and one large 4322 vertebrate mtDNA genomes dataset. Our results show that ML-DSP overwhelmingly outperforms the alignment-based software MEGA7 (alignment with MUSCLE or CLUSTALW) in terms of processing time, while having comparable classification accuracies for small datasets and superior accuracies for the large dataset. Compared with the alignment-free software FFP (Feature Frequency Profile), ML-DSP has significantly better classification accuracy, and is overall faster. We also provide preliminary experiments indicating the potential of ML-DSP to be used for other datasets, by classifying 4271 complete dengue virus genomes into subtypes with 100% accuracy, and 4,710 bacterial genomes into phyla with 95.5% accuracy. Lastly, our analysis shows that the “Purine/Pyrimidine”, “Just-A” and “Real” numerical representations of DNA sequences outperform ten other such numerical representations used in the Digital Signal Processing literature for DNA classification purposes. 
Conclusions Due to its superior classification accuracy, speed, and scalability to large datasets, ML-DSP is highly relevant in the classification of newly discovered organisms, in distinguishing genomic signatures and identifying their mechanistic determinants, and in evaluating genome integrity.
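The general idea of treating a DNA sequence as a digital signal can be sketched as follows. This is not the ML-DSP code. The sketch assumes a "Purine/Pyrimidine" encoding with A and G mapped to -1 and C and T mapped to +1, a plain discrete Fourier transform, and a Pearson-correlation-based distance between magnitude spectra. None of these specifics are stated in the abstract, and the sequences are made up.

```python
import cmath
import math

# Assumed encoding for illustration: purines (A, G) -> -1, pyrimidines (C, T) -> +1.
PP = {"A": -1.0, "G": -1.0, "C": 1.0, "T": 1.0}


def magnitude_spectrum(seq):
    """Magnitudes of the discrete Fourier transform of the encoded sequence (naive O(n^2))."""
    x = [PP[base] for base in seq.upper()]
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]


def pearson_distance(u, v):
    """Map the Pearson correlation r of two equal-length vectors to (1 - r) / 2, in [0, 1].

    Assumes neither vector is constant; otherwise the correlation is undefined.
    """
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return (1.0 - cov / (su * sv)) / 2.0


# Made-up sequences of equal length; the second is a circular shift of the first.
s1 = magnitude_spectrum("ACGTTGCA")
s2 = magnitude_spectrum("CGTTGCAA")
print(pearson_distance(s1, s2))
```

A matrix of such pairwise distances is the kind of alignment-free feature that a supervised classifier can be trained on. Real genomes differ in length, so a practical pipeline would also need a length-normalization step, which this sketch leaves out.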
Affiliation(s)
- Gurjit S Randhawa
  - Department of Computer Science, University of Western Ontario, London, ON, Canada
- Kathleen A Hill
  - Department of Biology, University of Western Ontario, London, ON, Canada
- Lila Kari
  - School of Computer Science, University of Waterloo, Waterloo, ON, Canada
10
Skutkova H, Maderankova D, Sedlar K, Jugas R, Vitek M. A degeneration-reducing criterion for optimal digital mapping of genetic codes. Comput Struct Biotechnol J 2019; 17:406-414. [PMID: 30984363 PMCID: PMC6444178 DOI: 10.1016/j.csbj.2019.03.007] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 09/17/2018] [Revised: 02/07/2019] [Accepted: 03/15/2019] [Indexed: 01/08/2023] Open
Abstract
Bioinformatics may seem to be a scientific field processing primarily large string datasets, as nucleotides and amino acids are represented with dedicated characters. On the other hand, many computational tasks that bioinformatics faces are mathematical problems understandable as operations with digits. In fact, many computational tasks are solved this way in the background. One of the most widely used digital representations is the mapping of nucleotides and amino acids to the integers 0–3 and 0–20, respectively. The limitation of this mapping occurs when the digital signal of nucleotides has to be translated into a digital signal of amino acids, as the genetic code is degenerate. This causes non-monotonicities in the mapping function. Although a map for reducing this undesirable effect has already been proposed, it is defined theoretically and for standard genetic codes only. In this study, we derived a novel optimal criterion for reducing the influence of degeneration by utilizing a large dataset of real sequences with various genetic codes. As a result, we proposed a new robust global optimal map suitable for any genetic code as well as specialized optimal maps for particular genetic codes. Optimization of 1D numerical representation for DNA to protein translation. Reducing genetic code degeneracy in numerical representation of DNA sequences. More robust numerical conversion used for genomic-proteomic analysis.
Affiliation(s)
- Helena Skutkova
  - Department of Biomedical Engineering, Brno University of Technology, Technicka 12, 616 00 Brno, Czech Republic
- Denisa Maderankova
  - Department of Biomedical Engineering, Brno University of Technology, Technicka 12, 616 00 Brno, Czech Republic
- Karel Sedlar
  - Department of Biomedical Engineering, Brno University of Technology, Technicka 12, 616 00 Brno, Czech Republic
- Robin Jugas
  - Department of Biomedical Engineering, Brno University of Technology, Technicka 12, 616 00 Brno, Czech Republic
- Martin Vitek
  - Department of Biomedical Engineering, Brno University of Technology, Technicka 12, 616 00 Brno, Czech Republic
11
Maderankova D, Jugas R, Sedlar K, Vitek M, Skutkova H. Rapid Bacterial Species Delineation Based on Parameters Derived From Genome Numerical Representations. Comput Struct Biotechnol J 2019; 17:118-126. [PMID: 30728919 PMCID: PMC6352304 DOI: 10.1016/j.csbj.2018.12.006] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 08/30/2018] [Revised: 12/07/2018] [Accepted: 12/20/2018] [Indexed: 01/29/2023] Open
Abstract
Species delineation based on bacterial genomes is an essential part of the research of prokaryotes. In silico genome-to-genome comparison methods are computationally demanding, but much less tedious and error-prone than the wet-lab methods. In this paper, we present a novel method for the delineation of bacterial genomes based on genomic signal processing. The proposed method uses numerical representations of whole bacterial genomes, phase signal and cumulated phase signal, from which four parameters are derived for each genome. The parameters characterize a genome and their calculation is independent of the other genomes comprising a delineation dataset. The delineation itself is processed as a calculation of the parameters' average similarity. The method was statistically verified on 1826 bacterial genomes. A similarity threshold of 96% was set based on the receiver operating characteristic curve that featured sensitivity of 99.78% and specificity of 97.25%. Additionally, comparative analysis on another 33 bacterial genomes was conducted using standard delineation tools, as these tools were not able to process the dataset of 1826 genomes on a desktop computer. The proposed method achieved comparable or better delineation results in comparison with the standard tools. Besides the excellent delineation results, another great advantage of the method is its small computational demands, which enable the delineation of thousands of genomes on a desktop computer. The calculation of the parameters takes tens of minutes for thousands of genomes. Moreover, they can be calculated in advance by creating a database, meaning the delineation itself is then completed in a matter of seconds.
Affiliation(s)
- Denisa Maderankova
  - Department of Biomedical Engineering, Faculty of Electrical Engineering and Communication, Brno University of Technology, Technicka 12, 61600 Brno, Czech Republic
12
Retrieval of Similar Evolution Patterns from Satellite Image Time Series. APPLIED SCIENCES-BASEL 2018. [DOI: 10.3390/app8122435] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Indexed: 11/17/2022]
Abstract
Technological evolution in the remote sensing domain has allowed the acquisition of large archives of satellite image time series (SITS) for Earth Observation. In this context, the need to interpret Earth Observation image time series is continuously increasing and the extraction of information from these archives has become difficult without adequate tools. In this paper, we propose a fast and effective two-step technique for the retrieval of spatio-temporal patterns that are similar to a given query. The method is based on a query-by-example procedure whose inputs are evolution patterns provided by the end-user and outputs are other similar spatio-temporal patterns. The comparison between the temporal sequences and the queries is performed using the Dynamic Time Warping alignment method, whereas the separation between similar and non-similar patterns is determined via Expectation-Maximization. The experiments, which are assessed on both short and long SITS, prove the effectiveness of the proposed SITS retrieval method for different application scenarios. For the short SITS, we considered two application scenarios, namely the construction of two accumulation lakes and flooding caused by heavy rain. For the long SITS, we used a database formed of 88 Landsat images, and we showed that the proposed method is able to retrieve similar patterns of land cover and land use.
13
Rifaioglu AS, Doğan T, Saraç ÖS, Ersahin T, Saidi R, Atalay MV, Martin MJ, Cetin-Atalay R. Large-scale automated function prediction of protein sequences and an experimental case study validation on PTEN transcript variants. Proteins 2017; 86:135-151. [PMID: 29098713 DOI: 10.1002/prot.25416] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 09/09/2017] [Revised: 10/24/2017] [Accepted: 11/01/2017] [Indexed: 12/24/2022]
Abstract
Recent advances in computing power and machine learning empower functional annotation of protein sequences and their transcript variations. Here, we present an automated prediction system UniGOPred, for GO annotations and a database of GO term predictions for proteomes of several organisms in UniProt Knowledgebase (UniProtKB). UniGOPred provides function predictions for 514 molecular function (MF), 2909 biological process (BP), and 438 cellular component (CC) GO terms for each protein sequence. UniGOPred covers nearly the whole functionality spectrum in Gene Ontology system and it can predict both generic and specific GO terms. UniGOPred was run on CAFA2 challenge target protein sequences and it is categorized within the top 10 best performing methods for the molecular function category. In addition, the performance of UniGOPred is higher compared to the baseline BLAST classifier in all categories of GO. UniGOPred predictions are compared with UniProtKB/TrEMBL database annotations as well. Furthermore, the proposed tool's ability to predict negatively associated GO terms that defines the functions that a protein does not possess, is discussed. UniGOPred annotations were also validated by case studies on PTEN protein variants experimentally and on CHD8 protein variants with literature. UniGOPred protein functional annotation system is available as an open access tool at http://cansyl.metu.edu.tr/UniGOPred.html.
Affiliation(s)
- Ahmet Sureyya Rifaioglu
  - Department of Computer Engineering, Middle East Technical University, Ankara, 06800, Turkey
  - Department of Computer Engineering, İskenderun Technical University, Hatay, 31200, Turkey
- Tunca Doğan
  - Protein Function Development Team, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, CB10 1SD, United Kingdom
  - CanSyL, Graduate School of Informatics, Middle East Technical University, Ankara, 06800, Turkey
- Ömer Sinan Saraç
  - Department of Computer Engineering, Istanbul Technical University, İstanbul, 34467, Turkey
- Tulin Ersahin
  - CanSyL, Graduate School of Informatics, Middle East Technical University, Ankara, 06800, Turkey
- Rabie Saidi
  - Protein Function Development Team, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, CB10 1SD, United Kingdom
- Mehmet Volkan Atalay
  - Department of Computer Engineering, Middle East Technical University, Ankara, 06800, Turkey
- Maria Jesus Martin
  - Protein Function Development Team, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, CB10 1SD, United Kingdom
- Rengul Cetin-Atalay
  - CanSyL, Graduate School of Informatics, Middle East Technical University, Ankara, 06800, Turkey
14
Zhang G, Dai M, Yang L, Li W, Li H, Xu C, Shi X, Dong X, Fu F. Fast detection and data compensation for electrodes disconnection in long-term monitoring of dynamic brain electrical impedance tomography. Biomed Eng Online 2017; 16:7. [PMID: 28086909 PMCID: PMC5234124 DOI: 10.1186/s12938-016-0294-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 08/28/2016] [Accepted: 12/04/2016] [Indexed: 11/18/2022] Open
Abstract
Background: Electrode disconnection is a common occurrence during long-term monitoring of brain electrical impedance tomography (EIT) in clinical settings. The data acquisition system then suffers considerable data loss, which results in image reconstruction failure. The aim of this study was to (1) detect disconnected electrodes and (2) account for invalid data. Methods: A weighted correlation coefficient was calculated for each electrode based on the measurement differences between well-connected and disconnected electrodes. Disconnected electrodes were identified by filtering out abnormal coefficients with discrete wavelet transforms. Further, previously valid measurements were used to establish a grey model, and the invalid frames after electrode disconnection were substituted with the data estimated by the grey model. The proposed approach was evaluated on a resistor phantom and with eight patients in clinical settings. Results: The proposed method detected 1 or 2 disconnected electrodes with an accuracy of 100%, and 3 and 4 disconnected electrodes with accuracies of 92% and 84%, respectively. The time cost of electrode detection was within 0.018 s. Further, the proposed method was able to compensate at least 60 subsequent frames of data and restore normal image reconstruction within 0.4 s, with a mean relative error smaller than 0.01%. Conclusions: We proposed a two-step approach to detect multiple disconnected electrodes and to compensate the invalid frames of data after disconnection. Our method detects more disconnected electrodes with higher accuracy than methods proposed in previous studies. Further, our method provides estimates during the faulty measurement period until the medical staff reconnect the electrodes. This work should improve the clinical practicability of dynamic brain EIT and contribute to its further promotion.
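The abstract above does not say which grey model was used. As a rough illustration of the compensation step, the sketch below assumes the classic GM(1,1) form, fitted to the last valid measurements of one channel and extrapolated over the invalid frames. The function name `gm11_forecast` and the choice of GM(1,1) are assumptions, not the authors' implementation.

```python
import numpy as np

def gm11_forecast(x0, steps):
    """Fit a GM(1,1) grey model to the series x0 and forecast `steps` further values.

    Assumption: x0 is a short, non-negative, smoothly varying series, which is
    the setting GM(1,1) is normally applied to.
    """
    x0 = np.asarray(x0, dtype=float)
    n = len(x0)
    x1 = np.cumsum(x0)                       # accumulated generating operation
    z1 = 0.5 * (x1[1:] + x1[:-1])            # background (mean) values
    B = np.column_stack((-z1, np.ones(n - 1)))
    # Least-squares estimate of a (development coefficient) and b (grey input)
    a, b = np.linalg.lstsq(B, x0[1:], rcond=None)[0]
    if abs(a) < 1e-12:                       # degenerate case: constant series
        return np.full(steps, b)

    def x1_hat(k):
        # Time response of the whitened equation, with a 0-based index k
        return (x0[0] - b / a) * np.exp(-a * k) + b / a

    k = np.arange(n, n + steps)              # indices of the frames to estimate
    return x1_hat(k) - x1_hat(k - 1)         # inverse accumulation


# Usage: estimate 3 frames following 8 valid samples of one channel
valid = 2.0 * 1.05 ** np.arange(8)
print(gm11_forecast(valid, 3))
```

In this reading, each measurement channel affected by the disconnected electrode would get its own short model, refitted whenever new valid frames arrive.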
Affiliation(s)
- Ge Zhang
  - Department of Biomedical Engineering, Fourth Military Medical University, Xi'an, China
- Meng Dai
  - Department of Biomedical Engineering, Fourth Military Medical University, Xi'an, China
- Lin Yang
  - Department of Biomedical Engineering, Fourth Military Medical University, Xi'an, China
- Weichen Li
  - Department of Biomedical Engineering, Fourth Military Medical University, Xi'an, China
- Haoting Li
  - Department of Biomedical Engineering, Fourth Military Medical University, Xi'an, China
- Canhua Xu
  - Department of Biomedical Engineering, Fourth Military Medical University, Xi'an, China
- Xuetao Shi
  - Department of Biomedical Engineering, Fourth Military Medical University, Xi'an, China
- Xiuzhen Dong
  - Department of Biomedical Engineering, Fourth Military Medical University, Xi'an, China.
- Feng Fu
  - Department of Biomedical Engineering, Fourth Military Medical University, Xi'an, China.
|
15
|
Hou W, Pan Q, Peng Q, He M. A new method to analyze protein sequence similarity using Dynamic Time Warping. Genomics 2016; 109:123-130. [PMID: 27974244 PMCID: PMC7125777 DOI: 10.1016/j.ygeno.2016.12.002] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 08/12/2016] [Revised: 12/06/2016] [Accepted: 12/10/2016] [Indexed: 12/05/2022]
Abstract
Sequence similarity analysis is one of the major topics in bioinformatics. It helps researchers to reveal the evolutionary relationships of different species. In this paper, we outline a new method to analyze the similarity of proteins by Discrete Fourier Transform (DFT) and Dynamic Time Warping (DTW). The original symbol sequences are converted to numerical sequences according to their physicochemical properties. We obtain the power spectra of the sequences from the DFT and extend the spectra to the same length to calculate the distance between different sequences by DTW. Our method is tested on different datasets and the results are compared with those of other software algorithms. In the comparison we find that our scheme can correct some misclassifications that appear in other software. The comparison shows that our approach is reasonable and effective. We propose a novel method to extract the features of the sequences based on the physicochemical properties of proteins. We apply the Discrete Fourier Transform (DFT) and Dynamic Time Warping (DTW) to analyze the similarity of proteins. Different datasets are used to demonstrate our model's effectiveness.
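The pipeline described above (numeric encoding, DFT power spectrum, length extension, DTW distance) can be sketched as follows. This is a minimal illustration under stated assumptions. The Kyte-Doolittle hydropathy scale stands in for the unspecified physicochemical properties, linear interpolation stands in for the unspecified spectrum-extension step, and all function names are hypothetical.

```python
import numpy as np

# Assumption: Kyte-Doolittle hydropathy as the numeric encoding
# (the abstract does not name the property actually used).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def power_spectrum(seq):
    """Encode a protein sequence numerically and return its DFT power spectrum."""
    x = np.array([HYDROPATHY[r] for r in seq.upper()])
    return np.abs(np.fft.fft(x)) ** 2

def stretch(spec, length):
    """Extend a spectrum to `length` points by linear interpolation (assumption)."""
    old = np.arange(len(spec))
    new = np.linspace(0, len(spec) - 1, length)
    return np.interp(new, old, spec)

def dtw_distance(a, b):
    """Classic dynamic time warping distance with absolute-difference cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping steps
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def protein_distance(seq_a, seq_b):
    """DTW distance between length-matched power spectra of two sequences."""
    pa, pb = power_spectrum(seq_a), power_spectrum(seq_b)
    length = max(len(pa), len(pb))
    return dtw_distance(stretch(pa, length), stretch(pb, length))


print(protein_distance("MKTAYIAKQR", "MKTAYIAKQR"))   # identical sequences: 0.0
print(protein_distance("MKTAYIAKQR", "MKVLAAGIVG"))
```

A pairwise distance matrix built with `protein_distance` could then feed any standard clustering or tree-building routine. Which one the authors used is not stated in the abstract.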
Affiliation(s)
- Wenbing Hou
  - School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, PR China
- Qiuhui Pan
  - School of Innovation and Entrepreneurship, Dalian University of Technology, Dalian 116024, PR China; School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, PR China
- Qianying Peng
  - Department of Academics, Dalian Naval Academy, Dalian 116001, PR China
- Mingfeng He
  - School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, PR China.
|
16
|
Progressive alignment of genomic signals by multiple dynamic time warping. J Theor Biol 2015; 385:20-30. [PMID: 26300069 DOI: 10.1016/j.jtbi.2015.08.007] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Received: 11/17/2014] [Revised: 07/21/2015] [Accepted: 08/03/2015] [Indexed: 11/22/2022]
Abstract
This paper applies the progressive alignment principle to the positional adjustment of a set of genomic signals of different lengths. The new method for multiple alignment of signals, based on dynamic time warping, is tested for evaluating the similarity of genes of different lengths in phylogenetic studies. Two sets of phylogenetic markers were used to demonstrate its effectiveness in evaluating intraspecies and interspecies genetic variability. Part of the proposed method is a modification of the pairwise alignment of two signals by dynamic time warping that uses correlation in a sliding window. The correlation-based dynamic time warping allows more accurate alignment, dependent on local homologies in the sequences, without the need for a scoring matrix or evolutionary models, because the mutual similarities of residues are included in the numerical code of the signals.
|
17
|
Sedlar K, Skutkova H, Vitek M, Provaznik I. Set of rules for genomic signal downsampling. Comput Biol Med 2015; 69:308-14. [PMID: 26078051 DOI: 10.1016/j.compbiomed.2015.05.022] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 11/10/2014] [Revised: 05/25/2015] [Accepted: 05/26/2015] [Indexed: 12/14/2022]
Abstract
Comparison and classification of organisms based on molecular data is an important task in computational biology, since at least parts of the DNA sequences of many organisms are available. Unfortunately, methods for comparison are computationally very demanding and suitable only for short sequences. In this paper, we focus on the redundancy of the genetic information stored in DNA sequences. We propose rules for downsampling DNA signals of cumulated phase. Depending on the length of the original sequence, we are able to significantly reduce the amount of data with only a slight loss of the original information. The dyadic wavelet transform was chosen for fast downsampling with minimum influence on the signal shape that carries the biological information. We demonstrated the usability of these new short signals by measuring the percentage deviation of pairs of original and downsampled signals while maintaining the spectral power of the signals. Minimal loss of biological information was shown by measuring the Robinson-Foulds distance between pairs of phylogenetic trees reconstructed from the original and downsampled signals. The preservation of inter-species and intra-species information makes these signals suitable for fast sequence identification as well as for more detailed phylogeny reconstruction.
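To make the two ingredients named above concrete, the sketch below builds a cumulated-phase signal from a DNA string and halves its length repeatedly with a Haar low-pass step. Both choices are assumptions made for illustration. The nucleotide-to-phase mapping follows a commonly used complex representation (A = 1+j, C = -1-j, G = -1+j, T = 1-j), and the Haar filter stands in for whichever dyadic wavelet the authors selected. The downsampling rules themselves are not given in the abstract and are not reproduced here.

```python
import numpy as np

# Assumed phase of each nucleotide in the complex plane:
# A = 1+j, C = -1-j, G = -1+j, T = 1-j
PHASE = {
    "A": np.pi / 4,
    "C": -3 * np.pi / 4,
    "G": 3 * np.pi / 4,
    "T": -np.pi / 4,
}

def cumulated_phase(seq):
    """Cumulative sum of nucleotide phases along the sequence."""
    return np.cumsum([PHASE[base] for base in seq.upper()])

def haar_downsample(signal, levels=1):
    """Halve the signal length `levels` times by averaging adjacent pairs.

    This is the Haar approximation branch, scaled as a plain average so that
    the amplitude of the signal is preserved (a choice made for this sketch).
    """
    s = np.asarray(signal, dtype=float)
    for _ in range(levels):
        if len(s) % 2:                 # pad odd-length signals with the last sample
            s = np.append(s, s[-1])
        s = 0.5 * (s[0::2] + s[1::2])
    return s


signal = cumulated_phase("ATGCGCGATATTAGCGCGTA")
short = haar_downsample(signal, levels=2)
print(len(signal), len(short))         # 20 5
```

Each level halves the number of samples, so the number of levels would be chosen from the original sequence length, which is the kind of decision the proposed rules address.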
Affiliation(s)
- Karel Sedlar
  - Department of Biomedical Engineering, Brno University of Technology, Technicka 12, 616 00 Brno, Czech Republic.
- Helena Skutkova
  - Department of Biomedical Engineering, Brno University of Technology, Technicka 12, 616 00 Brno, Czech Republic.
- Martin Vitek
  - International Clinical Research Center - Center of Biomedical Engineering, St. Anne's University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic.
- Ivo Provaznik
  - Department of Biomedical Engineering, Brno University of Technology, Technicka 12, 616 00 Brno, Czech Republic; International Clinical Research Center - Center of Biomedical Engineering, St. Anne's University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic.
|
18
|
Relationship of Bacteria Using Comparison of Whole Genome Sequences in Frequency Domain. ACTA ACUST UNITED AC 2014. [DOI: 10.1007/978-3-319-06593-9_35] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 04/21/2023]
|