1
Wong AKC, Zhou PY, Lee AES. Theory and rationale of interpretable all-in-one pattern discovery and disentanglement system. NPJ Digit Med 2023; 6:92. [PMID: 37217691 DOI: 10.1038/s41746-023-00816-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2022] [Accepted: 04/04/2023] [Indexed: 05/24/2023] Open
Abstract
In machine learning (ML), association patterns in the data, paths in decision trees, and weights between layers of the neural network are often entangled due to multiple underlying causes, thus masking the pattern-to-source relation, weakening prediction, and defying explanation. This paper presents a revolutionary ML paradigm: pattern discovery and disentanglement (PDD) that disentangles associations and provides an all-in-one knowledge system capable of (a) disentangling patterns to associate with distinct primary sources; (b) discovering rare/imbalanced groups, detecting anomalies and rectifying discrepancies to improve class association, pattern and entity clustering; and (c) organizing knowledge for statistically supported interpretability for causal exploration. Results from case studies have validated such capabilities. The explainable knowledge reveals pattern-source relations on entities, and underlying factors for causal inference, and clinical study and practice; thus, addressing the major concern of interpretability, trust, and reliability when applying ML to healthcare, which is a step towards closing the AI chasm.
Affiliation(s)
- Andrew K C Wong: Systems Design Engineering, University of Waterloo, Waterloo, ON, Canada
- Pei-Yuan Zhou: Systems Design Engineering, University of Waterloo, Waterloo, ON, Canada
- Annie E-S Lee: Computer Science Department, University of Toronto, Toronto, ON, Canada
2
Athanasopoulou K, Adamopoulos PG, Scorilas A. Structural characterization and expression analysis of novel MAPK1 transcript variants with the development of a multiplexed targeted nanopore sequencing approach. Int J Biochem Cell Biol 2022; 150:106272. [PMID: 35878809 DOI: 10.1016/j.biocel.2022.106272] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2022] [Revised: 06/17/2022] [Accepted: 07/21/2022] [Indexed: 11/27/2022]
Abstract
Mitogen-activated protein kinases (MAPKs) represent a protein family firmly involved in many signaling cascades, regulating a vast spectrum of stimulated cellular processes. Studies have shown that alternatively spliced isoforms of MAPKs play a crucial role in determining the desired cell fate in response to specific stimulations. Although the implication of most MAPKs transcript variants in the MAPK signaling cascades has been clarified, the transcriptional profile of a pivotal member, MAPK1, has not been investigated for the existence of additional isoforms. In the current study we developed and implemented targeted long-read and short-read sequencing approaches to identify novel MAPK1 splice variants. The combination of nanopore sequencing and NGS enabled the implementation of a long-read polishing pipeline using error-rate correction algorithms, which empowered the high accuracy of the results and increased the sequencing efficiency. The utilized multiplexing option in the nanopore sequencing approach allowed not only the identification of novel MAPK1 mRNAs, but also elucidated their expression profile in multiple human malignancies and non-cancerous cell lines. Our study highlights for the first time the existence of ten previously undescribed MAPK1 mRNAs (MAPK1 v.3 - v.12) and evaluates their relative expression levels in comparison to the main MAPK1 v.1. The optimization and employment of qPCR assays revealed that MAPK1 v.3 - v.12 can be quantified in a wide spectrum of human cell lines with notable specificity. Finally, our findings suggest that the novel protein-coding mRNAs are highly expected to participate in the regulation of MAPK pathways, demonstrating differential localizations and functionalities.
Affiliation(s)
- Konstantina Athanasopoulou: Department of Biochemistry and Molecular Biology, National and Kapodistrian University of Athens, Athens, Greece
- Panagiotis G Adamopoulos: Department of Biochemistry and Molecular Biology, National and Kapodistrian University of Athens, Athens, Greece
- Andreas Scorilas: Department of Biochemistry and Molecular Biology, National and Kapodistrian University of Athens, Athens, Greece
3
Annie Lee ES, Zhou P, Wong AKC. WeMine Aligned Pattern Clustering System for Biosequence Pattern Analysis. Bioinformatics 2021. [DOI: 10.36255/exonpublications.bioinformatics.2021.ch8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022] Open
4
Pattern Discovery and Disentanglement for Aligned Pattern Cluster Analysis and Protein Binding Complexes Detection. Bioinformatics 2021. [DOI: 10.36255/exonpublications.bioinformatics.2021.ch10] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] Open
5
Sarkar A, Murugan TS. Analysis on dual algorithms for optimal cluster head selection in wireless sensor network. Evolutionary Intelligence 2021. [DOI: 10.1007/s12065-020-00546-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/28/2022]
6
Pattern discovery and disentanglement on relational datasets. Sci Rep 2021; 11:5688. [PMID: 33707478 PMCID: PMC7952710 DOI: 10.1038/s41598-021-84869-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2020] [Accepted: 02/11/2021] [Indexed: 11/09/2022] Open
Abstract
Machine Learning has made impressive advances in many applications akin to human cognition for discernment. However, success has been limited in the areas of relational datasets, particularly for data with low volume, imbalanced groups, and mislabeled cases, with outputs that typically lack transparency and interpretability. The difficulties arise from the subtle overlapping and entanglement of functional and statistical relations at the source level. Hence, we have developed the Pattern Discovery and Disentanglement System (PDD), which is able to discover explicit patterns from data of various sizes and with imbalanced groups, and to screen out anomalies. We present herein four case studies on biomedical datasets to substantiate the efficacy of PDD. It improves prediction accuracy and facilitates transparent interpretation of discovered knowledge in an explicit representation framework, the PDD Knowledge Base, that links the sources, the patterns, and individual patients. Hence, PDD promises broad and ground-breaking applications in genomic and biomedical machine learning.
7
Quadri GJ, Rosen P. Modeling the Influence of Visual Density on Cluster Perception in Scatterplots Using Topology. IEEE Transactions on Visualization and Computer Graphics 2021; 27:1829-1839. [PMID: 33048695 DOI: 10.1109/tvcg.2020.3030365] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
Scatterplots are used for a variety of visual analytics tasks, including cluster identification, and the visual encodings used on a scatterplot play a deciding role in the level of visual separation of clusters. For visualization designers, optimizing the visual encodings is crucial to maximizing the clarity of data. This requires accurately modeling human perception of cluster separation, which remains challenging. We present a multi-stage user study focusing on four factors (distribution size of clusters, number of points, size of points, and opacity of points) that influence cluster identification in scatterplots. From these parameters, we have constructed two models, a distance-based model and a density-based model, using the merge tree data structure from Topological Data Analysis. Our analysis demonstrates that these factors play an important role in the number of clusters perceived, and it verifies that the distance-based and density-based models can reasonably estimate the number of clusters a user observes. Finally, we demonstrate how these models can be used to optimize visual encodings on real-world data.
8
Zhou PY, Sze-To A, Wong AKC. Discovery and disentanglement of aligned residue associations from aligned pattern clusters to reveal subgroup characteristics. BMC Med Genomics 2018; 11:103. [PMID: 30453949 PMCID: PMC6245498 DOI: 10.1186/s12920-018-0417-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/23/2022] Open
Abstract
Background: A protein family has similar and diverse functions locally conserved. An aligned pattern cluster (APC) can reflect the conserved functionality. Discovering aligned residue associations (ARAs) in APCs can reveal subtle inner working characteristics of conserved regions of protein families. However, ARAs corresponding to different functionalities/subgroups/classes could be entangled because of subtle multiple entwined factors.
Methods: To discover and disentangle patterns from mixed-mode datasets, such as APCs when the residues are replaced by their fundamental biochemical properties list, this paper presents a novel method, Extended Aligned Residual Association Discovery and Disentanglement (E-ARADD). E-ARADD discretizes the numerical dataset to transform the mixed-mode dataset into an event-value dataset, constructs an ARA Frequency Matrix and then converts it into an adjusted Statistical Residual (SR) Vector Space (SRV) capturing statistical deviation from randomness. By applying Principal Component (PC) Decomposition on SRV, PCs ranked by their variance are obtained. Finally, the disentangled ARAs are discovered when the projections on a PC are re-projected to a vector space with the same basis vectors as the SRV.
Results: Experiments on synthetic, cytochrome c and class A scavenger data have shown that E-ARADD can a) disentangle the entwined ARAs in APCs (with residues or biochemical properties), b) reveal subtle AR clusters relating to classes, subtle subgroups or specific functionalities.
Conclusions: E-ARADD can discover and disentangle ARs and ARAs entangled in functionality and location of protein families to reveal functional subgroups and subgroup characteristics of biological conserved regions. Experimental results on synthetic data provide proof-of-concept validation of the successful disentanglement that reveals class-associated ARAs with or without class labels as input. Experiments on cytochrome c data proved the efficacy of E-ARADD in handling both types of residue data. Our novel methodology is not only able to discover and disentangle ARs and ARAs in specific statistical/functional (PCs and RSRVs) spaces, but also their locations in the protein family functional domains. The success of E-ARADD shows its great potential for proteomic research, drug discovery and precision and personalized genetic medicine.
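The pipeline described in the Methods above (frequency matrix, adjusted statistical residuals, principal component decomposition, re-projection) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `adjusted_residuals` and `disentangle` are hypothetical, the residual formula is the standard adjusted standardized residual for a contingency table (an assumption about what "adjusted Statistical Residual" means here), and the count table is made-up toy data.

```python
import numpy as np

def adjusted_residuals(freq):
    """Adjusted standardized residuals of a 2-D co-occurrence count table."""
    freq = np.asarray(freq, dtype=float)
    total = freq.sum()
    row = freq.sum(axis=1, keepdims=True)   # row marginals, shape (r, 1)
    col = freq.sum(axis=0, keepdims=True)   # column marginals, shape (1, c)
    expected = row @ col / total            # counts expected under independence
    # Variance term of the adjusted residual (assumes no row or column holds all counts).
    var = expected * (1 - row / total) * (1 - col / total)
    return (freq - expected) / np.sqrt(var)

def disentangle(sr, k):
    """Project the residual matrix onto its k-th principal component and map it
    back to the original basis (the 're-projection' step, as assumed here)."""
    centered = sr - sr.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    pc = vt[k]                      # unit-length direction of the k-th PC
    scores = centered @ pc          # coordinates of each row along that PC
    return np.outer(scores, pc)     # rank-one matrix in the original basis

# Toy usage: hypothetical co-occurrence counts between three attribute values.
table = [[10, 2, 4], [3, 12, 5], [6, 1, 9]]
sr = adjusted_residuals(table)
first_component = disentangle(sr, 0)
```

Each call to `disentangle` isolates the associations carried by one principal component, so large entries in `first_component` point to value pairs whose deviation from independence is explained by that component alone.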
Affiliation(s)
- Pei-Yuan Zhou: Systems Design Engineering, University of Waterloo, Waterloo, ON, Canada
- Antonio Sze-To: Systems Design Engineering, University of Waterloo, Waterloo, ON, Canada
- Andrew K C Wong: Systems Design Engineering, University of Waterloo, Waterloo, ON, Canada
9
Wong AKC, Sze-To HY, Johanning GL. Pattern to Knowledge: Deep Knowledge-Directed Machine Learning for Residue-Residue Interaction Prediction. Sci Rep 2018; 8:14841. [PMID: 30287904 PMCID: PMC6172270 DOI: 10.1038/s41598-018-32834-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 05/31/2018] [Accepted: 09/17/2018] [Indexed: 11/21/2022] Open
Abstract
Residue-residue close contact (R2R-C) data procured from three-dimensional protein-protein interaction (PPI) experiments is currently used for predicting residue-residue interaction (R2R-I) in PPI. However, due to complex physicochemical environments, R2R-I incidences, facilitated by multiple factors, are usually entangled in the source environment and masked in the acquired data. Here we present a novel method, P2K (Pattern to Knowledge), to disentangle R2R-I patterns and render more succinct discriminative information expressed in different specific R2R-I statistical/functional spaces. Since such knowledge is not visible in the data acquired, we refer to it as deep knowledge. Leveraging the deep knowledge discovered to construct machine learning models for sequence-based R2R-I prediction, without trial-and-error combination of the features over external knowledge of sequences, our R2R-I predictor was validated for its effectiveness under stringent leave-one-complex-out-alone cross-validation in a benchmark dataset, and was surprisingly demonstrated to perform better than an existing sequence-based R2R-I predictor by 28% (p: 1.9E-08). P2K is accessible via our web server on https://p2k.uwaterloo.ca.
Affiliation(s)
- Andrew K C Wong: Department of Systems Design Engineering, University of Waterloo, 200 University Avenue West, Waterloo, N2L 3G1, Ontario, Canada
- Ho Yin Sze-To: Department of Systems Design Engineering, University of Waterloo, 200 University Avenue West, Waterloo, N2L 3G1, Ontario, Canada
- Gary L Johanning: Biosciences Division, SRI International, 333 Ravenswood Ave, Menlo Park, CA, USA
10
Sze-To A, Wong AKC. Discovering Patterns From Sequences Using Pattern-Directed Aligned Pattern Clustering. IEEE Trans Nanobioscience 2018; 17:209-218. [PMID: 29994222 DOI: 10.1109/tnb.2018.2845741] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 11/10/2022]
Abstract
Functional region identification is of fundamental importance for protein sequence analysis. Such knowledge provides better scientific understanding and could assist drug discovery. To date, domain annotation is one approach, but it needs to leverage existing databases. For de novo discovery, motif discovery locates and aligns locally homologous sub-sequences to obtain a position-weight matrix (PWM), which is a fixed-length representation model, whereas protein functional region size varies. It thus requires a computationally expensive exhaustive search to obtain a PWM with a width in the optimal range. This paper presents a new method known as pattern-directed aligned pattern clustering (PD-APCn) to discover and align patterns in conserved protein functional regions. It adopts an aligned pattern cluster (APC) with patterns of variable length and strong support to direct the incremental APC expansion. It allows substitution and frame-shift mutations until a robust termination condition is reached. The concept of breakpoint gap is introduced to identify spots of mutations, such as substitution and frame shifts. Experiments on synthetic data sets with different sizes and noise levels showed that PD-APCn outperforms MEME with much higher recall and F-measure and a computational speed 665 times faster than MEME. When applied to Cytochrome C and Ubiquitin families, it found all key binding sites within the APCs.
11
Zhou PY, Lee ESA, Sze-To A, Wong AKC. Revealing Subtle Functional Subgroups in Class A Scavenger Receptors by Pattern Discovery and Disentanglement of Aligned Pattern Clusters. Proteomes 2018; 6:proteomes6010010. [PMID: 29419792 PMCID: PMC5874769 DOI: 10.3390/proteomes6010010] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 11/24/2017] [Revised: 02/01/2018] [Accepted: 02/01/2018] [Indexed: 11/16/2022] Open
Abstract
A protein family has similar and diverse functions locally conserved as aligned sequence segments. Further discovering their association patterns could reveal subtle family subgroup characteristics. Since aligned residue associations (ARAs) in Aligned Pattern Clusters (APCs) are complex and intertwined due to entangled function, factors, and variance in the source environment, we have recently developed a novel method, Aligned Residue Association Discovery and Disentanglement (ARADD), to solve this problem. ARADD first obtains from an APC an ARA Frequency Matrix and converts it to an adjusted statistical residual vector space (SRV). It then disentangles the SRV into Principal Components (PCs) and re-projects their vectors to an SRV to reveal succinct orthogonal AR groups. In this study, we applied ARADD to class A scavenger receptors (SR-A), a subclass of a diverse protein family binding to modified lipoproteins with diverse biological functionalities not explicitly known. Our experimental results demonstrated that ARADD can unveil subtle subgroups in sequence segments with diverse functionality and highly variable sequence lengths. We also demonstrated that the ARAs captured in a Position Weight Matrix or an APC were entangled in biological function and domain location but disentangled by ARADD to reveal different subclasses without knowing their actual occurrence positions.
Affiliation(s)
- Pei-Yuan Zhou: VaryWave Technology Co., Ltd., 538A, Core Building 2, Hong Kong Science Park, Shatin, NT, Hong Kong
- En-Shiun Annie Lee: VerticalScope Inc., 111 Peter Street, Suite 900, Toronto, ON M5V 2H1, Canada
- Antonio Sze-To: Systems Design Engineering, 5th, 6th Floor, 200 University Avenue West, University of Waterloo, Waterloo, ON N2L 3G1, Canada
- Andrew K C Wong: Systems Design Engineering, 5th, 6th Floor, 200 University Avenue West, University of Waterloo, Waterloo, ON N2L 3G1, Canada
12
Lee ESA, Sze-To HYA, Wong MH, Leung KS, Lau TCK, Wong AKC. Discovering Protein-DNA Binding Cores by Aligned Pattern Clustering. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2017; 14:254-263. [PMID: 26336137 DOI: 10.1109/tcbb.2015.2474376] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 06/05/2023]
Abstract
Understanding binding cores is of fundamental importance in deciphering Protein-DNA (TF-TFBS) binding and gene regulation. Because experiments are expensive, discovering binding cores and their variations directly from sequence data is promising. Although existing computational methods have produced satisfactory results, they are one-to-one mappings with no site-specific information on residue/nucleotide variations, where these variations in binding cores may impact binding specificity. This study presents a new representation for modeling binding cores by incorporating variations and an algorithm to discover them from only sequence data. Our algorithm takes protein and DNA sequences from TRANSFAC (a Protein-DNA Binding Database) as input; discovers from both sets of sequences conserved regions in Aligned Pattern Clusters (APCs); associates them as Protein-DNA Co-Occurring APCs; ranks the Protein-DNA Co-Occurring APCs according to their co-occurrence, and among the top ones, finds three-dimensional structures to support each binding core candidate. If successful, candidates are verified as binding cores. Otherwise, homology modeling is applied to their close matches in PDB to attain new chemically feasible binding cores. Our algorithm obtains binding cores with higher precision and much faster runtime (≥ 1,600x) than that of its contemporaries, discovering candidates that do not co-occur as one-to-one associated patterns in the raw data. Availability: http://www.pami.uwaterloo.ca/~ealee/files/tcbbPnDna2015/Release.zip.
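The association and ranking step in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: each APC is represented by the set of binding-record IDs in which it occurs, and the score is the raw number of shared records. The abstract does not give the paper's actual co-occurrence measure, and the function name `rank_cooccurrence` and the toy IDs are hypothetical.

```python
from itertools import product

def rank_cooccurrence(protein_apcs, dna_apcs):
    """Rank (protein APC, DNA APC) pairs by the number of shared binding records.

    Both arguments map an APC name to the set of record IDs it occurs in.
    The shared-record count is an assumed stand-in for the paper's score.
    """
    pairs = []
    for (p_name, p_ids), (d_name, d_ids) in product(protein_apcs.items(), dna_apcs.items()):
        shared = len(p_ids & d_ids)
        if shared:  # keep only pairs that co-occur at least once
            pairs.append((p_name, d_name, shared))
    return sorted(pairs, key=lambda t: t[2], reverse=True)

# Toy usage with made-up record IDs.
protein = {"P1": {1, 2, 3}, "P2": {3, 4}}
dna = {"D1": {1, 2}, "D2": {4, 5}}
ranking = rank_cooccurrence(protein, dna)
```

In the workflow the abstract describes, the top-ranked pairs would then be checked against three-dimensional structures; that verification step is outside this sketch.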
13
Guo D, Yuan E, Hu X, Wu X. Co-occurrence pattern mining based on a biological approximation scoring matrix. Pattern Anal Appl 2017. [DOI: 10.1007/s10044-017-0609-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/20/2022]
14
Sze-To A, Fung S, Lee ESA, Wong AK. Prediction of Protein–Protein Interaction via co-occurring Aligned Pattern Clusters. Methods 2016; 110:26-34. [DOI: 10.1016/j.ymeth.2016.07.018] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 04/16/2016] [Revised: 06/25/2016] [Accepted: 07/26/2016] [Indexed: 10/21/2022] Open
15
Zhang J, Wang Y, Zhang C, Shi Y. Mining Contiguous Sequential Generators in Biological Sequences. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2016; 13:855-867. [PMID: 26529774 DOI: 10.1109/tcbb.2015.2495132] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 06/05/2023]
Abstract
The discovery of conserved sequential patterns in biological sequences is essential to unveiling common shared functions. Mining sequential generators as well as mining closed sequential patterns can contribute to a more concise result set than mining all sequential patterns, especially in the analysis of big data in bioinformatics. Previous studies have also presented convincing arguments that the generator is preferable to the closed pattern in inductive inference and classification. However, classic sequential generator mining algorithms, because they do not consider the contiguous constraint along with the lower-closed one, still spawn a large number of inefficient and redundant patterns, too many for effective use. Driven by some extensive applications of patterns with the contiguous feature, we propose ConSgen, an efficient algorithm for discovering contiguous sequential generators. It adopts the n-gram model, called shingles, to generate potential frequent subsequences and leverages several pruning techniques to prune the unpromising parts of the search space. Then, the contiguous sequential generators are identified by using the equivalence class-based lower-closure checking scheme. Our experiments on both DNA and protein data sets demonstrate the compactness, efficiency, and scalability of ConSgen.
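The shingle-based candidate generation mentioned in this abstract can be illustrated with a short sketch. It covers only the first idea, counting contiguous n-grams by the number of sequences that contain them and keeping the frequent ones. The pruning techniques and the lower-closure check that identify actual generators are not reproduced, and the function name `frequent_shingles` and its parameters are hypothetical.

```python
from collections import defaultdict

def frequent_shingles(sequences, n, min_support):
    """Return contiguous n-grams that occur in at least min_support sequences.

    Support is counted per sequence: repeats inside one sequence count once.
    """
    support = defaultdict(set)
    for idx, seq in enumerate(sequences):
        for i in range(len(seq) - n + 1):
            support[seq[i:i + n]].add(idx)   # remember which sequences hold this shingle
    return {gram: len(ids) for gram, ids in support.items() if len(ids) >= min_support}

# Toy usage on three short DNA strings.
result = frequent_shingles(["ACGTAC", "TACGG", "GGACG"], n=3, min_support=2)
# result == {"ACG": 3, "TAC": 2}
```

A full miner would grow these shingles to longer contiguous patterns and discard any pattern that has a shorter contiguous sub-pattern with the same support, which is the role the abstract assigns to the lower-closure checking scheme.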
16
Lee ESA, Whelan FJ, Bowdish DME, Wong AKC. Partitioning and correlating subgroup characteristics from Aligned Pattern Clusters. Bioinformatics 2016; 32:2427-34. [PMID: 27153647 DOI: 10.1093/bioinformatics/btw211] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 10/26/2015] [Accepted: 04/13/2016] [Indexed: 12/20/2022] Open
Abstract
Motivation: Evolutionarily conserved amino acids within proteins characterize functional or structural regions. Conversely, less conserved amino acids within these regions are generally areas of evolutionary divergence. A priori knowledge of biological function and species can help interpret the amino acid differences between sequences. However, this information is often erroneous or unavailable, hampering discovery with supervised algorithms. Also, most of the current unsupervised methods depend on full sequence similarity, which becomes inaccurate when proteins diverge (e.g. inversions, deletions, insertions). Due to these and other shortcomings, we developed a novel unsupervised algorithm which discovers highly conserved regions and uses two types of information measures: (i) data measures computed from input sequences; and (ii) class measures computed using a priori class groupings in order to reveal subgroups (i.e. classes) or functional characteristics.
Results: Using known and putative sequences of two proteins belonging to a relatively uncharacterized protein family we were able to group evolutionarily related sequences and identify conserved regions, which are strong homologous association patterns called Aligned Pattern Clusters, within individual proteins and across the members of this family. An initial synthetic demonstration and in silico results reveal that (i) the data measures are unbiased and (ii) our class measures can accurately rank the quality of the evolutionarily relevant groupings. Furthermore, combining our data and class measures allowed us to interpret the results by inferring regions of biological importance within the binding domain of these proteins. Compared to popular supervised methods, our algorithm has a superior runtime and comparable accuracy.
Availability and implementation: The dataset and results are available at www.pami.uwaterloo.ca/~ealee/files/classification2015
Contact: akcwong@uwaterloo.ca
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- En-Shiun Annie Lee: Department of Systems Design Engineering, University of Waterloo, Waterloo, ON, Canada
- Dawn M E Bowdish: Department of Pathology and Molecular Medicine, McMaster University, Hamilton, ON, Canada
- Andrew K C Wong: Department of Systems Design Engineering, University of Waterloo, Waterloo, ON, Canada