1
Solovieva E, Sakai H. PSReliP: an integrated pipeline for analysis and visualization of population structure and relatedness based on genome-wide genetic variant data. BMC Bioinformatics 2023; 24:135. [PMID: 37020193 PMCID: PMC10074814 DOI: 10.1186/s12859-023-05169-4] [Received: 09/27/2022] [Accepted: 02/02/2023] [Indexed: 04/07/2023]
Abstract
BACKGROUND Population structure and cryptic relatedness between individuals (samples) are two major factors affecting false positives in genome-wide association studies (GWAS). In addition, population stratification and genetic relatedness in genomic selection in animal and plant breeding can affect prediction accuracy. The methods commonly used for solving these problems are principal component analysis (to adjust for population stratification) and marker-based kinship estimates (to correct for the confounding effects of genetic relatedness). Currently, many tools and software are available that analyze genetic variation among individuals to determine population structure and genetic relationships. However, none of these tools or pipelines perform such analyses in a single workflow and visualize all the various results in a single interactive web application.
RESULTS We developed PSReliP, a standalone, freely available pipeline for the analysis and visualization of population structure and relatedness between individuals in a user-specified genetic variant dataset. The analysis stage of PSReliP is responsible for executing all steps of data filtering and analysis and contains an ordered sequence of commands from PLINK, a whole-genome association analysis toolset, along with in-house shell scripts and Perl programs that support data pipelining. The visualization stage is provided by Shiny apps, an R-based interactive web application. In this study, we describe the characteristics and features of PSReliP and demonstrate how it can be applied to real genome-wide genetic variant data.
CONCLUSIONS The PSReliP pipeline allows users to quickly analyze genetic variants such as single nucleotide polymorphisms and small insertions or deletions at the genome level to estimate population structure and cryptic relatedness using PLINK software and to visualize the analysis results in interactive tables, plots, and charts using Shiny technology. The analysis and assessment of population stratification and genetic relatedness can aid in choosing an appropriate approach for the statistical analysis of GWAS data and predictions in genomic selection. The various outputs from PLINK can be used for further downstream analysis. The code and manual for PSReliP are available at https://github.com/solelena/PSReliP .
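To make the idea of a marker-based kinship estimate concrete, the following is a minimal sketch of one widely used estimator, a genomic relationship matrix built from centered genotype counts (VanRaden's first method). It is an illustration only: the abstract does not say which estimator PSReliP or PLINK computes, and the function names and the 0/1/2 genotype coding are assumptions made for this example.

```python
from typing import List


def allele_frequencies(genotypes: List[List[int]]) -> List[float]:
    """Frequency of the counted allele at each marker.

    Assumes genotypes[i][j] is the allele count (0, 1 or 2) of
    sample i at marker j, with no missing values.
    """
    n_samples = len(genotypes)
    n_markers = len(genotypes[0])
    return [
        sum(row[j] for row in genotypes) / (2 * n_samples)
        for j in range(n_markers)
    ]


def genomic_relationship_matrix(genotypes: List[List[int]]) -> List[List[float]]:
    """Sketch of a marker-based relationship matrix (VanRaden, method 1).

    G = Z Z^T / (2 * sum_j p_j (1 - p_j)), where Z holds genotype
    counts centered by twice the allele frequency of each marker.
    """
    p = allele_frequencies(genotypes)
    scale = 2 * sum(pj * (1 - pj) for pj in p)
    if scale == 0:
        raise ValueError("all markers are monomorphic; kinship is undefined")
    centered = [[g - 2 * pj for g, pj in zip(row, p)] for row in genotypes]
    n = len(genotypes)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[k])) / scale
         for k in range(n)]
        for i in range(n)
    ]


# Hypothetical toy data: four samples genotyped at four markers.
toy_genotypes = [
    [0, 1, 2, 0],
    [0, 1, 2, 0],
    [2, 1, 0, 2],
    [1, 2, 1, 1],
]
toy_grm = genomic_relationship_matrix(toy_genotypes)
```

In this toy set the first two samples have identical genotypes, so their off-diagonal entry equals their diagonal entries, while samples with opposite genotypes receive a negative value. Real pipelines also handle missing calls and filter rare variants first, which this sketch omits.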
Affiliation(s)
- Elena Solovieva
  - Research Center for Advanced Analysis, National Agriculture and Food Research Organization, Tsukuba, Ibaraki, Japan
- Hiroaki Sakai
  - Research Center for Advanced Analysis, National Agriculture and Food Research Organization, Tsukuba, Ibaraki, Japan
2
Abstract
The analysis of population structure has many applications in medical and population genetic research. Such analysis is used to provide clear insight into the underlying genetic population substructure and is a crucial prerequisite for any analysis of genetic data. The analysis involves grouping individuals into subpopulations based on shared genetic variations. The most widely used markers to study the variation of DNA sequences between populations are single nucleotide polymorphisms. Data preprocessing is a necessary step to assess the quality of the data and to determine which markers or individuals can reasonably be included in the analysis. After preprocessing, several methods can be utilized to uncover population substructure, which can be categorized into two broad approaches: parametric and nonparametric. Parametric approaches use statistical models to infer population structure and assign individuals to subpopulations. However, these approaches suffer from many drawbacks that make them impractical for large datasets. In contrast, nonparametric approaches do not suffer from these drawbacks, making them more viable than parametric approaches for analyzing large datasets. Consequently, nonparametric approaches are increasingly used to reveal population substructure. Thus, this paper reviews and discusses the nonparametric approaches that are available for population structure analysis, along with some implications for resolving the associated challenges.
Affiliation(s)
- Luluah Alhusain
  - College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
- Alaaeldin M Hafez
  - College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
3
Abstract
Background Clustering plays a crucial role in several application domains, such as bioinformatics. In bioinformatics, clustering has been extensively used as an approach for detecting interesting patterns in genetic data. One application is population structure analysis, which aims to group individuals into subpopulations based on shared genetic variations, such as single nucleotide polymorphisms. Advances in DNA sequencing technology have made it possible to obtain exceptionally large genetic datasets. Genetic data usually contain hundreds of thousands of genetic markers genotyped for thousands of individuals, making an efficient means for handling such data desirable.
Results Random Forests (RFs) has emerged as an efficient algorithm capable of handling high-dimensional data. RFs provides a proximity measure that can capture different levels of co-occurring relationships between variables. RFs has been widely considered a supervised learning method, although it can be converted into an unsupervised learning method. Therefore, an RF-derived proximity measure combined with a clustering technique may be well suited for determining the underlying structure of unlabeled data. This paper proposes RFcluE, a cluster ensemble approach for determining the underlying structure of genetic data based on RFs. The approach comprises a cluster ensemble framework to combine multiple runs of RF clustering. Experiments were conducted on a high-dimensional, real genetic dataset to evaluate the proposed approach. The experiments included an examination of the impact of parameter changes, a comparison of RFcluE performance against other clustering methods, and an assessment of the relationship between the diversity and quality of the ensemble and its effect on RFcluE performance.
Conclusions This paper proposes RFcluE, a cluster ensemble approach based on RF clustering, to address the problem of population structure analysis and demonstrates the effectiveness of the approach. The paper also illustrates that applying a cluster ensemble approach, combining multiple RF clusterings, produces more robust and higher-quality results as a consequence of feeding the ensemble with diverse views of high-dimensional genetic data obtained through bagging and random subspace selection, the two key features of the RF algorithm.
Affiliation(s)
- Luluah Alhusain
  - College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
- Alaaeldin M Hafez
  - College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
4
Du Y, Zhou H, Wang F, Liang S, Cheng L, Du X, Pang F, Tian J, Zhao J, Kan B, Xu J, Li J, Zhang F. Multilocus sequence typing-based analysis of Moraxella catarrhalis population structure reveals clonal spreading of drug-resistant strains isolated from childhood pneumonia. Infect Genet Evol 2017; 56:117-124. [PMID: 29155241 DOI: 10.1016/j.meegid.2017.11.018] [Received: 08/03/2017] [Revised: 10/05/2017] [Accepted: 11/15/2017] [Indexed: 10/18/2022]
Abstract
This work characterized the drug resistance and population structure of Moraxella catarrhalis strains isolated from children less than three years old with pneumonia. Forty-four independent M. catarrhalis strains were analyzed using broth dilution antimicrobial susceptibility testing and multilocus sequence typing (MLST). The highest non-susceptibility rate was observed for amoxicillin (AMX), which reached 95.5%, followed by clindamycin (CLI) (n=33; 75.0%), azithromycin (AZM) (61.4%), cefaclor (CEC) (25.0%), trimethoprim-sulfamethoxazole (SXT) (15.9%), cefuroxime (CXM) (4.5%), tetracycline (TE) (2.3%), and doxycycline (DOX) (2.3%). No strain showed non-susceptibility to the other six antimicrobials tested. Using MLST, the 44 M. catarrhalis strains were divided into 33 sequence types (STs). Based on their allelic profiles, the 33 STs were divided into one clonal complex (CC), CC363, and 28 singletons. CC363 contained five STs, and ST363 was the founder ST. CC363 contained 63.6%, 33.3%, and 40.7% of the CEC non-susceptible, CLI non-susceptible, and AZM non-susceptible strains, respectively. The proportions of CEC non-susceptible, CLI non-susceptible, and AZM non-susceptible strains in CC363 were higher than those among the singletons; these differences were significant for CEC (p=0.002) and AZM (p=0.011). Furthermore, CC363 contained more AMX-CLI-AZM co-non-susceptible and AMX-CEC-CLI-AZM co-non-susceptible strains than the singletons (p=0.007 and p<0.001, respectively). CC363 is a drug-resistant clone of clinical M. catarrhalis strains in China. Expansion of this clone under the selective pressure of antibiotics should be noted, and long-term monitoring should be established.
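The step of dividing STs into a clonal complex and singletons based on allelic profiles can be sketched as follows. The abstract does not state which grouping rule the authors applied, so this example assumes a common eBURST-style convention: two STs are linked when their profiles differ at no more than one locus, and linked STs are merged transitively. The function names, the `max_diff` parameter, and the profiles are hypothetical.

```python
from typing import Dict, List, Tuple


def locus_differences(a: Tuple[int, ...], b: Tuple[int, ...]) -> int:
    """Number of loci at which two allelic profiles carry different alleles."""
    return sum(1 for x, y in zip(a, b) if x != y)


def group_clonal_complexes(
    profiles: Dict[str, Tuple[int, ...]], max_diff: int = 1
) -> List[List[str]]:
    """Group STs whose profiles are connected by chains of near-identical STs.

    Assumption: STs differing at <= max_diff loci are linked (a
    single-locus-variant rule when max_diff is 1). Groups of size
    one correspond to singletons.
    """
    parent = {st: st for st in profiles}

    def find(st: str) -> str:
        # Union-find lookup with path halving.
        while parent[st] != st:
            parent[st] = parent[parent[st]]
            st = parent[st]
        return st

    names = sorted(profiles)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if locus_differences(profiles[a], profiles[b]) <= max_diff:
                parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for st in names:
        groups.setdefault(find(st), []).append(st)
    # Largest groups first, ties broken by member names.
    return sorted(groups.values(), key=lambda g: (-len(g), g))


# Hypothetical allelic profiles, not taken from the study.
example_profiles = {
    "ST_A": (1, 1, 1, 1, 1, 1, 1, 1),
    "ST_B": (1, 1, 1, 1, 1, 1, 1, 2),
    "ST_C": (1, 1, 1, 1, 1, 1, 3, 2),
    "ST_D": (5, 6, 7, 8, 9, 10, 11, 12),
}
example_groups = group_clonal_complexes(example_profiles)
```

Here `ST_C` differs from `ST_A` at two loci but joins the same group through `ST_B`, which is why the merge is transitive, while `ST_D` remains a singleton. Identifying a founder ST, as the study does for ST363, needs an additional rule that this sketch does not cover.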
Affiliation(s)
- Yinju Du
  - Center for Disease Control and Prevention of Liaocheng, Liaocheng, PR China
- Haijian Zhou
  - State Key Laboratory of Infectious Disease Prevention and Control, National Institute for Communicable Disease Control and Prevention, Chinese Center for Disease Control and Prevention, Beijing, PR China; Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, Hangzhou, PR China
- Fei Wang
  - Center for Disease Control and Prevention of Liaocheng, Liaocheng, PR China
- Shengnan Liang
  - Center for Disease Control and Prevention of Liaocheng, Liaocheng, PR China
- Lihong Cheng
  - Center for Disease Control and Prevention of Liaocheng, Liaocheng, PR China
- Xiaofei Du
  - State Key Laboratory of Infectious Disease Prevention and Control, National Institute for Communicable Disease Control and Prevention, Chinese Center for Disease Control and Prevention, Beijing, PR China
- Feng Pang
  - The People's Hospital of Liaocheng, Liaocheng, PR China
- Jinjing Tian
  - The Second People's Hospital of Liaocheng, Liaocheng, PR China
- Jinxing Zhao
  - Center for Disease Control and Prevention of Liaocheng, Liaocheng, PR China
- Biao Kan
  - State Key Laboratory of Infectious Disease Prevention and Control, National Institute for Communicable Disease Control and Prevention, Chinese Center for Disease Control and Prevention, Beijing, PR China; Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, Hangzhou, PR China
- Jianguo Xu
  - State Key Laboratory of Infectious Disease Prevention and Control, National Institute for Communicable Disease Control and Prevention, Chinese Center for Disease Control and Prevention, Beijing, PR China; Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, Hangzhou, PR China
- Juan Li
  - State Key Laboratory of Infectious Disease Prevention and Control, National Institute for Communicable Disease Control and Prevention, Chinese Center for Disease Control and Prevention, Beijing, PR China; Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, Hangzhou, PR China
- Furong Zhang
  - Center for Disease Control and Prevention of Liaocheng, Liaocheng, PR China