1
Pirsch AM, Austin RR, Martin L, Pieczkiewicz D, Monsen KA. Using data visualization to characterize whole-person health of public health nurses. Public Health Nurs 2023; 40:612-620. [PMID: 37424148 DOI: 10.1111/phn.13224] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/10/2023] [Revised: 06/20/2023] [Accepted: 06/21/2023] [Indexed: 07/11/2023]
Abstract
OBJECTIVE To characterize patterns in whole-person health of public health nurses (PHNs). DESIGN AND SAMPLE Survey of a convenience sample of PHNs (n = 132) in 2022. PHNs self-identified as female (96.2%) and white (86.4%), were aged 25-44 (54.5%) or 45-64 (40.2%), and had bachelor's degrees (65.9%) and incomes of $50-75,000 (30.3%) or $75-100,000/year (29.5%). MEASUREMENTS Simplified Omaha System Terms (SOST) within the MyStrengths+MyHealth assessment of whole-person health (strengths, challenges, and needs) across Environmental, Psychosocial, Physiological, and Health-related Behaviors domains. RESULTS PHNs had more strengths than challenges and more challenges than needs. Four patterns were discovered: (1) an inverse relationship between strengths and challenges/needs; (2) many strengths; (3) high needs in Income; (4) fewest strengths in Sleeping, Emotions, Nutrition, and Exercise. PHNs with Income as a strength (n = 79) had more strengths (t = 5.570, p < .001) and fewer challenges (t = -5.270, p < .001) and needs (t = -3.659, p < .001) than others (n = 53). CONCLUSIONS PHNs had many strengths compared to previous research with other samples, despite concerning patterns of challenges and needs. Most PHN whole-person health patterns aligned with previous literature. Further research is needed to validate and extend these findings toward improving PHN health.
Affiliation(s)
- Anna M Pirsch
- School of Nursing, University of Minnesota, Minneapolis, Minnesota, USA
- Robin R Austin
- School of Nursing, University of Minnesota, Minneapolis, Minnesota, USA
- Lisa Martin
- School of Nursing, University of Minnesota, Minneapolis, Minnesota, USA
- David Pieczkiewicz
- Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA
- Karen A Monsen
- School of Nursing, University of Minnesota, Minneapolis, Minnesota, USA
2
Obradovic P, Kovačević V, Li X, Milosavljevic A. An Information-Theoretic Bound on p-Values for Detecting Communities Shared between Weighted Labeled Graphs. Entropy (Basel) 2022; 24:1329. [PMID: 37420347 DOI: 10.3390/e24101329] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2022] [Revised: 09/12/2022] [Accepted: 09/17/2022] [Indexed: 07/09/2023]
Abstract
Extraction of subsets of highly connected nodes ("communities" or modules) is a standard step in the analysis of complex social and biological networks. We here consider the problem of finding a relatively small set of nodes in two labeled weighted graphs that is highly connected in both. While many scoring functions and algorithms tackle the problem, the typically high computational cost of permutation testing required to establish the p-value for the observed pattern presents a major practical obstacle. To address this problem, we here extend the recently proposed CTD ("Connect the Dots") approach to establish information-theoretic upper bounds on the p-values and lower bounds on the size and connectedness of communities that are detectable. This is an innovation on the applicability of CTD, broadening its use to pairs of graphs.
Affiliation(s)
- Predrag Obradovic
- School of Electrical Engineering, University of Belgrade, 11000 Belgrade, Serbia
- Vladimir Kovačević
- School of Electrical Engineering, University of Belgrade, 11000 Belgrade, Serbia
- Xiqi Li
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Aleksandar Milosavljevic
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Quantitative and Computational Biosciences Program, Baylor College of Medicine, Houston, TX 77030, USA
3
Li K, Deng H, Morrison J, Habre R, Franklin M, Chiang YY, Sward K, Gilliland FD, Ambite JL, Eckel SP. W-TSS: A Wavelet-Based Algorithm for Discovering Time Series Shapelets. Sensors (Basel) 2021; 21:s21175801. [PMID: 34502692 PMCID: PMC8434226 DOI: 10.3390/s21175801] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/23/2021] [Revised: 08/24/2021] [Accepted: 08/24/2021] [Indexed: 11/16/2022]
Abstract
Many approaches to time series classification rely on machine learning methods. However, there is growing interest in going beyond black box prediction models to understand discriminatory features of the time series and their associations with outcomes. One promising method is time-series shapelets (TSS), which identifies maximally discriminative subsequences of time series. For example, in environmental health applications TSS could be used to identify short-term patterns in exposure time series (shapelets) associated with adverse health outcomes. Identification of candidate shapelets in TSS is computationally intensive. The original TSS algorithm used exhaustive search. Subsequent algorithms introduced efficiencies by trimming/aggregating the set of candidates or training candidates from initialized values, but these approaches have limitations. In this paper, we introduce Wavelet-TSS (W-TSS), a novel intelligent method for identifying candidate shapelets in TSS using wavelet transformation discovery. We tested W-TSS on two datasets: (1) a synthetic example used in previous TSS studies and (2) a panel study relating exposures from residential air pollution sensors to symptoms in participants with asthma. Compared to previous TSS algorithms, W-TSS was more computationally efficient, more accurate, and able to discover more discriminative shapelets. W-TSS does not require pre-specification of shapelet length.
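For readers unfamiliar with shapelets, the following minimal Python sketch shows the quantity that shapelet methods in general evaluate: the smallest distance between a candidate subsequence and any same-length window of a time series. It is not the W-TSS algorithm and includes no wavelet step; the function name `shapelet_distance` and the choice of plain Euclidean distance are assumptions made for illustration.

```python
import math


def shapelet_distance(series, shapelet):
    """Return the minimum Euclidean distance between `shapelet` and any
    contiguous window of `series` with the same length.

    Hypothetical helper for illustration; not taken from the paper.
    """
    m = len(shapelet)
    if m == 0 or m > len(series):
        raise ValueError("shapelet must be non-empty and no longer than the series")
    best = math.inf
    # Slide the shapelet across every window of the series.
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        dist = math.sqrt(sum((w - s) ** 2 for w, s in zip(window, shapelet)))
        best = min(best, dist)
    return best


if __name__ == "__main__":
    exposure = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
    peak = [1.0, 2.0, 1.0]
    print(shapelet_distance(exposure, peak))
```

A small distance means the series contains a close match to the candidate shape somewhere along its length; a classifier can then split series on that distance. The exhaustive candidate search mentioned in the abstract would call a function like this for every candidate subsequence, which is why candidate selection dominates the cost.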
Affiliation(s)
- Kenan Li
- Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA 90032, USA
- Spatial Sciences Institute, University of Southern California, Los Angeles, CA 90089, USA
- Huiyu Deng
- Applied AI and Data Science, City of Hope National Medical Center, Duarte, CA 91010, USA
- John Morrison
- Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA 90032, USA
- Rima Habre
- Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA 90032, USA
- Meredith Franklin
- Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA 90032, USA
- Yao-Yi Chiang
- Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN 55455, USA
- Katherine Sward
- Department of Biomedical Informatics, University of Utah, Salt Lake City, UT 84108, USA
- Frank D. Gilliland
- Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA 90032, USA
- José Luis Ambite
- Department of Computer Science, University of Southern California, Los Angeles, CA 90089, USA
- Sandrah P. Eckel
- Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA 90032, USA
4
Barfar A, Padmanabhan B. Pattern discovery, validation, and online experiments: a methodology for discovering television shows for public health announcements. J Am Med Inform Assoc 2021; 28:1374-1382. [PMID: 33677589 DOI: 10.1093/jamia/ocab008] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/12/2020] [Revised: 11/29/2020] [Accepted: 01/14/2021] [Indexed: 11/12/2022]
Abstract
OBJECTIVE Public Health Announcements (PHAs) on television are a means of raising awareness about risk behaviors and chronic conditions. PHAs' scarce airtime puts stress on their target audience reach. We seek to help health campaigns select television shows for their PHAs about smoking, binge drinking, drug overdose, obesity, diabetes, STDs, and other conditions using available statistics. MATERIALS AND METHODS Using Nielsen's TV viewership database for the entire US panel, we presented a novel show discovery methodology for PHAs that combined (i) pattern discovery from high-dimensional data, (ii) nonparametric tests for validation, and (iii) online experiments on Facebook. RESULTS The nonparametric tests verified the robustness of the discovered associations between the popularity of certain shows and health conditions. Findings from fifty (independent) online experiments (where our awareness messages were seen by nearly 1.5 million American adults) empirically demonstrated the value of the methodology. DISCUSSION For 2016, the methodology identified several shows whose popularities were genuinely associated with certain health conditions, opening up the possibility of health agencies embracing both big data and large-scale experimentation to address an old problem in a new way. CONCLUSION Policy makers can repeatedly apply the methodology as new data streams in, with perhaps different feature sets, pattern discovery techniques, and online experiments running over longer periods. The comparatively lower initial investment in the methodology can pay off by identifying several shows for a potentially national television campaign. As a by-product, the initial investment also results in awareness messages that might reach millions of individuals.
Affiliation(s)
- Arash Barfar
- Department of Information Systems, College of Business, University of Nevada, Reno, Nevada, USA
- Balaji Padmanabhan
- School of Information Systems and Management, Muma College of Business, University of South Florida, Tampa, Florida, USA
5
Lu H, Chen XI, Shi J, Vaidya J, Atluri V, Hong Y, Huang W. Algorithms and Applications to Weighted Rank-one Binary Matrix Factorization. ACM Trans Manag Inf Syst 2020; 11. [PMID: 33251040 DOI: 10.1145/3386599] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 10/24/2022]
Abstract
Many applications use data that are better represented in binary matrix form, such as click-stream data, market basket data, document-term data, user-permission data in access control, and others. Matrix factorization methods have been widely used tools for the analysis of high-dimensional data, as they automatically extract sparse and meaningful features from data vectors. However, existing matrix factorization methods do not work well for binary data. One crucial limitation is interpretability, as many matrix factorization methods decompose an input matrix into matrices with fractional or even negative components, which are hard to interpret in many real settings. Some matrix factorization methods, like binary matrix factorization, do limit decomposed matrices to binary values. However, these models are not flexible enough to accommodate some data analysis tasks, like trading off summary size with quality and discriminating different types of approximation errors. To address those issues, this article presents weighted rank-one binary matrix factorization, which approximates a binary matrix by the product of two binary vectors, with parameters controlling different types of approximation errors. By systematically running weighted rank-one binary matrix factorization, one can effectively perform various binary data analysis tasks, like compression, clustering, and pattern discovery. Theoretical properties of weighted rank-one binary matrix factorization are investigated and its connections to problems in other research domains are examined. As weighted rank-one binary matrix factorization in general is NP-hard, efficient and effective algorithms are presented. Extensive studies on applications of weighted rank-one binary matrix factorization are also conducted.
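To make the idea of a weighted rank-one approximation concrete, the sketch below scores one candidate pair of binary vectors against a binary matrix, counting the two kinds of error separately: ones the approximation misses and zeros it wrongly covers. The function name, the parameter names `w_missed` and `w_extra`, and the exact form of the cost are assumptions for illustration; the article's actual objective and algorithms may differ.

```python
def weighted_rank_one_error(matrix, row_vec, col_vec, w_missed=1.0, w_extra=1.0):
    """Weighted cost of approximating a 0/1 `matrix` by the outer product
    of the binary vectors `row_vec` and `col_vec`.

    missed: cells that are 1 in the matrix but 0 in the approximation.
    extra:  cells that are 0 in the matrix but 1 in the approximation.
    Hypothetical cost function, not the article's exact formulation.
    """
    missed = 0
    extra = 0
    for i, row in enumerate(matrix):
        for j, cell in enumerate(row):
            approx = row_vec[i] & col_vec[j]  # entry (i, j) of the outer product
            if cell == 1 and approx == 0:
                missed += 1
            elif cell == 0 and approx == 1:
                extra += 1
    return w_missed * missed + w_extra * extra


if __name__ == "__main__":
    data = [
        [1, 1, 0],
        [1, 1, 0],
        [0, 1, 1],
    ]
    # A tight block (rows 0-1, columns 0-1) versus a looser one that adds row 2.
    print(weighted_rank_one_error(data, [1, 1, 0], [1, 1, 0]))
    print(weighted_rank_one_error(data, [1, 1, 1], [1, 1, 0], w_extra=3.0))
```

Raising `w_extra` relative to `w_missed` penalizes over-covering, which favors smaller, exact blocks; lowering it tolerates noisier but larger blocks. Searching over all binary vector pairs for the minimum cost is the part the abstract describes as NP-hard in general, so this sketch only evaluates a given candidate.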
Affiliation(s)
- Yuan Hong
- Illinois Institute of Technology, USA
- Wei Huang
- Southern University of Science & Technology; Xi'an Jiaotong University, China
6
Gan Y, Li N, Xin Y, Zou G. TriPCE: A Novel Tri-Clustering Algorithm for Identifying Pan-Cancer Epigenetic Patterns. Front Genet 2020; 10:1298. [PMID: 32010182 PMCID: PMC6974616 DOI: 10.3389/fgene.2019.01298] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 08/24/2019] [Accepted: 11/25/2019] [Indexed: 11/20/2022]
Abstract
Epigenetic alteration is a fundamental characteristic of nearly all human cancers. Tumor cells not only harbor genetic alterations, but also are regulated by diverse epigenetic modifications. Identification of epigenetic similarities across different cancer types is beneficial for the discovery of treatments that can be extended to different cancers. Nowadays, abundant epigenetic modification profiles have provided a great opportunity to achieve this goal. Here, we proposed a new approach TriPCE, introducing tri-clustering strategy to integrative pan-cancer epigenomic analysis. The method is able to identify coherent patterns of various epigenetic modifications across different cancer types. To validate its capability, we applied the proposed TriPCE to analyze six important epigenetic marks among seven cancer types, and identified significant cross-cancer epigenetic similarities. These results suggest that specific epigenetic patterns indeed exist among these investigated cancers. Furthermore, the gene functional analysis performed on the associated gene sets demonstrates strong relevance with cancer development and reveals consistent risk tendency among these investigated cancer types.
Affiliation(s)
- Yanglan Gan
- School of Computer Science and Technology, Donghua University, Shanghai, China
- Ning Li
- School of Computer Science and Technology, Donghua University, Shanghai, China
- Yongchang Xin
- School of Computer Science and Technology, Donghua University, Shanghai, China
- Guobing Zou
- School of Computer Engineering and Science, Shanghai University, Shanghai, China
7
Zhou PY, Lee ESA, Sze-To A, Wong AKC. Revealing Subtle Functional Subgroups in Class A Scavenger Receptors by Pattern Discovery and Disentanglement of Aligned Pattern Clusters. Proteomes 2018; 6:proteomes6010010. [PMID: 29419792 PMCID: PMC5874769 DOI: 10.3390/proteomes6010010] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 11/24/2017] [Revised: 02/01/2018] [Accepted: 02/01/2018] [Indexed: 11/16/2022]
Abstract
A protein family has similar and diverse functions locally conserved as aligned sequence segments. Further discovering their association patterns could reveal subtle family subgroup characteristics. Since aligned residue associations (ARAs) in Aligned Pattern Clusters (APCs) are complex and intertwined due to entangled function, factors, and variance in the source environment, we have recently developed a novel method, Aligned Residue Association Discovery and Disentanglement (ARADD), to solve this problem. ARADD first obtains from an APC an ARA Frequency Matrix and converts it to an adjusted statistical residual vector space (SRV). It then disentangles the SRV into Principal Components (PCs) and re-projects their vectors to an SRV to reveal succinct orthogonal AR groups. In this study, we applied ARADD to class A scavenger receptors (SR-A), a subclass of a diverse protein family binding to modified lipoproteins with diverse biological functionalities not explicitly known. Our experimental results demonstrated that ARADD can unveil subtle subgroups in sequence segments with diverse functionality and highly variable sequence lengths. We also demonstrated that the ARAs captured in a Position Weight Matrix or an APC were entangled in biological function and domain location but disentangled by ARADD to reveal different subclasses without knowing their actual occurrence positions.
Affiliation(s)
- Pei-Yuan Zhou
- VaryWave Technology Co., Ltd., 538A, Core Building 2, Hong Kong Science Park, Shatin, NT, Hong Kong.
- En-Shiun Annie Lee
- VerticalScope Inc., 111 Peter Street, Suite 900, Toronto, ON M5V 2H1, Canada.
- Antonio Sze-To
- Systems Design Engineering, 5th, 6th Floor, 200 University Avenue West, University of Waterloo, Waterloo, ON N2L 3G1, Canada.
- Andrew K C Wong
- Systems Design Engineering, 5th, 6th Floor, 200 University Avenue West, University of Waterloo, Waterloo, ON N2L 3G1, Canada.
8
Elgendi M. Eventogram: A Visual Representation of Main Events in Biomedical Signals. Bioengineering (Basel) 2016; 3:bioengineering3040022. [PMID: 28952583 PMCID: PMC5597265 DOI: 10.3390/bioengineering3040022] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 08/19/2016] [Revised: 09/15/2016] [Accepted: 09/18/2016] [Indexed: 11/06/2022]
Abstract
Biomedical signals carry valuable physiological information, but many researchers have difficulty interpreting and analyzing long-term, one-dimensional, quasi-periodic biomedical signals. Traditionally, biomedical signals are analyzed and visualized using periodogram, spectrogram, and wavelet methods. However, these methods do not offer an informative visualization of main events within the processed signal. This paper attempts to provide an event-related framework to overcome the drawbacks of the traditional visualization methods and describe the main events within the biomedical signal in terms of duration and morphology. Electrocardiogram and photoplethysmogram signals are used in the analysis to demonstrate the differences between the traditional visualization methods, and their performance is compared against the proposed method, referred to as the “eventogram” in this paper. The proposed method is based on two event-related moving averages that visualize the main time-domain events in the processed biomedical signals. The traditional visualization methods were unable to find dominant events in processed signals, while the eventogram was able to visualize dominant events in signals in terms of duration and morphology. Moreover, eventogram-based detection algorithms succeeded in detecting main events in different biomedical signals with a sensitivity and positive predictivity >95%. The output of the eventogram captured unique patterns and signatures of physiological events, which could be used to visualize and identify abnormal waveforms in any quasi-periodic signal.
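The abstract states only that the eventogram rests on two event-related moving averages. One plausible reading, sketched below in Python, is to compare a short moving average (roughly the event duration) against a longer one (roughly one cycle) and mark stretches where the short average exceeds the long one as candidate events. That comparison rule, the function names, and the window sizes are all assumptions for illustration and are not confirmed by the text above.

```python
def moving_average(signal, window):
    """Centered moving average; near the edges it averages the samples available."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def candidate_event_blocks(signal, short_window, long_window):
    """Return (start, end) index pairs where the short moving average
    is strictly above the long one.

    Assumed decision rule for illustration; not taken from the paper.
    """
    short_ma = moving_average(signal, short_window)
    long_ma = moving_average(signal, long_window)
    blocks = []
    start = None
    for i, (s, l) in enumerate(zip(short_ma, long_ma)):
        if s > l and start is None:
            start = i  # a block opens
        elif s <= l and start is not None:
            blocks.append((start, i - 1))  # the block closed on the previous sample
            start = None
    if start is not None:
        blocks.append((start, len(signal) - 1))
    return blocks


if __name__ == "__main__":
    pulse = [0] * 10 + [1] * 3 + [0] * 10
    print(candidate_event_blocks(pulse, short_window=3, long_window=11))
```

Sweeping `short_window` and `long_window` over a grid and recording how many blocks each pair yields would give a two-dimensional picture of which durations dominate the signal, which is in the spirit of the visualization the abstract describes.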
Affiliation(s)
- Mohamed Elgendi
- Department of Obstetrics & Gynecology, University of British Columbia, Vancouver, BC V6Z 2K5, Canada.
- Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC V6T 1Z4, Canada.
9
Durston KK, Chiu DKY, Wong AKC, Li GCL. Statistical discovery of site inter-dependencies in sub-molecular hierarchical protein structuring. EURASIP J Bioinform Syst Biol 2012; 2012:8. [PMID: 22793672 PMCID: PMC3524763 DOI: 10.1186/1687-4153-2012-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 11/02/2011] [Accepted: 05/29/2012] [Indexed: 11/10/2022]
Abstract
BACKGROUND Much progress has been made in understanding the 3D structure of proteins using methods such as NMR and X-ray crystallography. The resulting 3D structures are extremely informative, but do not always reveal which sites and residues within the structure are of special importance. Recently, there are indications that multiple-residue, sub-domain structural relationships within the larger 3D consensus structure of a protein can be inferred from the analysis of the multiple sequence alignment data of a protein family. These intra-dependent clusters of associated sites are used to indicate hierarchical inter-residue relationships within the 3D structure. To reveal the patterns of associations among individual amino acids or sub-domain components within the structure, we apply a k-modes attribute (aligned site) clustering algorithm to the ubiquitin and transthyretin families in order to discover associations among groups of sites within the multiple sequence alignment. We then observe what these associations imply within the 3D structure of these two protein families. RESULTS The k-modes site clustering algorithm we developed maximizes the intra-group interdependencies based on a normalized mutual information measure. The clusters formed correspond to sub-structural components or binding and interface locations. Applying this data-directed method to the ubiquitin and transthyretin protein family multiple sequence alignments as a test bed, we located numerous interesting associations of interdependent sites. These clusters were then arranged into cluster tree diagrams which revealed four structural sub-domains within the single domain structure of ubiquitin and a single large sub-domain within transthyretin associated with the interface among transthyretin monomers. In addition, several clusters of mutually interdependent sites were discovered for each protein family, each of which appears to play an important role in the molecular structure and/or function. CONCLUSIONS Our results demonstrate that the method we present here using a k-modes site clustering algorithm based on interdependency evaluation among sites obtained from a sequence alignment of homologous proteins can provide significant insights into the complex, hierarchical inter-residue structural relationships within the 3D structure of a protein family.
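The interdependency between two aligned sites can be illustrated with a small Python sketch that computes mutual information between two alignment columns and normalizes it. The abstract does not specify the normalization; dividing by the joint entropy, as done here, is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(counts, total):
    """Shannon entropy in bits of a distribution given as raw counts."""
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def normalized_mutual_information(col_a, col_b):
    """Mutual information between two alignment columns divided by their
    joint entropy, giving a value in [0, 1].

    The normalization is an assumption for illustration, not the paper's stated measure.
    """
    if len(col_a) != len(col_b):
        raise ValueError("columns must cover the same sequences")
    n = len(col_a)
    h_a = entropy(Counter(col_a).values(), n)
    h_b = entropy(Counter(col_b).values(), n)
    h_ab = entropy(Counter(zip(col_a, col_b)).values(), n)
    if h_ab == 0:
        return 0.0  # both columns are fully conserved, so there is no covariation to measure
    return (h_a + h_b - h_ab) / h_ab


if __name__ == "__main__":
    # Each string is one alignment column read down four sequences.
    print(normalized_mutual_information("AAGG", "CCTT"))  # residues change together
    print(normalized_mutual_information("AAGG", "CTCT"))  # residues vary independently
```

A site clustering procedure such as the k-modes approach named in the abstract would evaluate a measure like this for pairs of sites and group together the sites that score highly with a common representative site.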
Affiliation(s)
- Kirk K Durston
- School of Computer Science, University of Guelph, 50 Stone Road East, Guelph, ON, N1G 2W1, Canada
- David KY Chiu
- School of Computer Science, University of Guelph, 50 Stone Road East, Guelph, ON, N1G 2W1, Canada
- Andrew KC Wong
- Department of System Design Engineering, University of Waterloo, 200 University Ave. W, Waterloo, ON, N2L 3G1, Canada
- Gary CL Li
- Department of System Design Engineering, University of Waterloo, 200 University Ave. W, Waterloo, ON, N2L 3G1, Canada
10
Höhl M, Rigoutsos I, Ragan MA. Pattern-based phylogenetic distance estimation and tree reconstruction. Evol Bioinform Online 2007; 2:359-75. [PMID: 19455227 PMCID: PMC2674673] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
Abstract
We have developed an alignment-free method that calculates phylogenetic distances using a maximum-likelihood approach for a model of sequence change on patterns that are discovered in unaligned sequences. To evaluate the phylogenetic accuracy of our method, and to conduct a comprehensive comparison of existing alignment-free methods (freely available as Python package decaf + py at http://www.bioinformatics.org.au), we have created a data set of reference trees covering a wide range of phylogenetic distances. Amino acid sequences were evolved along the trees and input to the tested methods; from their calculated distances we inferred trees whose topologies we compared to the reference trees. We find our pattern-based method statistically superior to all other tested alignment-free methods. We also demonstrate the general advantage of alignment-free methods over an approach based on automated alignments when sequences violate the assumption of collinearity. Similarly, we compare methods on empirical data from an existing alignment benchmark set that we used to derive reference distances and trees. Our pattern-based approach yields distances that show a linear relationship to reference distances over a substantially longer range than other alignment-free methods. The pattern-based approach outperforms alignment-free methods and its phylogenetic accuracy is statistically indistinguishable from alignment-based distances.
Affiliation(s)
- Michael Höhl
- Institute for Molecular Bioscience, The University of Queensland, Brisbane QLD 4072, Australia, Australian Research Council Centre in Bioinformatics
- Isidore Rigoutsos
- Australian Research Council Centre in Bioinformatics, Bioinformatics and Pattern Discovery Group, IBM Thomas J Watson Research Center, Yorktown Heights, NY 10598, U.S.A
- Mark A. Ragan
- Institute for Molecular Bioscience, The University of Queensland, Brisbane QLD 4072, Australia, Australian Research Council Centre in Bioinformatics. Correspondence: M.A. Ragan. Tel: +61-7-3346-2616; Fax: +61-7-3346-2101
11
Putonti C, Pettitt B, Reid J, Fofanov Y. PIDA: A new algorithm for pattern identification. Online J Bioinform 2007; 8:30-40. [PMID: 19834570 PMCID: PMC2761635] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/28/2023]
Abstract
Algorithms for motif identification in sequence space have predominantly been focused on recognizing patterns of a fixed length containing regions of perfect conservation with possible regions of unconstrained sequence. Such motifs can be found in everything from proteins with distinct active sites to non-coding RNAs with specific structural elements that are necessary to maintain functionality. In the event that an insertion/deletion has occurred within an unconstrained portion of the pattern, it is possible that the pattern retains its functionality. In such a case the length of the pattern is now variable and may be overlooked when utilizing existing motif detection methods. The Pattern Island Detection Algorithm (PIDA) presented here has been developed to recognize patterns that have occurrences of varying length within sequences of any size alphabet. PIDA works by identifying all regions of perfect conservation (for lengths longer than a user-specified threshold), and then builds those conservation "islands" into fixed-length patterns. Next the algorithm modifies these fixed-length patterns by identifying additional (and different) islands that can be incorporated into each pattern through insertions/deletions within the "water" separating the islands. To provide some benchmarks for this analysis, PIDA was used to search for patterns within randomly generated sequences as well as sequences known to contain conserved patterns. For each of the patterns found, the statistical significance is calculated based upon the pattern's likelihood to appear by chance, thus providing a means to determine those patterns which are likely to have a functional role. The PIDA approach to motif finding is designed to perform best when searching for patterns of variable length although it is also able to identify patterns of a fixed length. PIDA has been created to be as generally applicable as possible since there are a variety of sequence problems of this type. The algorithm was implemented in C++ and is freely available upon request from the authors.
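The notion of conservation "islands" separated by "water" can be illustrated with a short Python sketch. It covers only a simplified version of the first step described above and assumes the pattern occurrences are already lined up as equal-length strings, which the abstract does not state; it does not attempt the insertion/deletion handling or the significance calculation. The function name and the choice to keep islands of at least `min_length` columns are assumptions, and this is not the authors' C++ implementation.

```python
def conserved_islands(sequences, min_length):
    """Find maximal runs of columns where every sequence has the same symbol.

    Returns a list of (start_index, island_text) for runs of at least
    `min_length` columns. Assumes equal-length, pre-aligned sequences
    (an illustrative simplification).
    """
    if not sequences:
        return []
    width = len(sequences[0])
    if any(len(s) != width for s in sequences):
        raise ValueError("sequences must all have the same length")
    islands = []
    start = None
    for col in range(width):
        conserved = len({s[col] for s in sequences}) == 1
        if conserved and start is None:
            start = col  # an island begins
        elif not conserved and start is not None:
            if col - start >= min_length:
                islands.append((start, sequences[0][start:col]))
            start = None  # back in the "water"
    # Close an island that runs to the end of the sequences.
    if start is not None and width - start >= min_length:
        islands.append((start, sequences[0][start:]))
    return islands


if __name__ == "__main__":
    occurrences = ["ACGTTAGC", "ACGATAGC", "ACGCTAGC"]
    print(conserved_islands(occurrences, min_length=3))
```

In this example the unconstrained column between the two islands plays the role of the "water"; the step the abstract describes next would allow that gap to grow or shrink between occurrences.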
Affiliation(s)
- C Putonti
- Department of Computer Science, University of Houston, Houston, Texas, USA