1
Orlov YL, Orlova NG. Bioinformatics tools for the sequence complexity estimates. Biophys Rev 2023; 15:1367-1378. [PMID: 37974990 PMCID: PMC10643780 DOI: 10.1007/s12551-023-01140-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 08/15/2023] [Accepted: 09/01/2023] [Indexed: 11/19/2023]
Abstract
We review current methods and bioinformatics tools for estimating text complexity (information and entropy measures). The search for DNA regions with extreme statistical characteristics, such as low-complexity regions, is important for biophysical models of chromosome function and gene transcription regulation at genome scale. We discuss complexity profiling for the segmentation and delineation of genome sequences, the search for genome repeats and transposable elements, and applications to next-generation sequencing reads. We review the complexity methods and new application fields: analysis of mutation hotspot loci, analysis of short sequencing reads with quality control, and alignment-free genome comparisons. Algorithms implementing various numerical measures of text complexity, including combinatorial and linguistic measures, were developed before the genome sequencing era. A series of tools for estimating sequence complexity use compression approaches, mainly modifications of Lempel-Ziv compression. Most of the tools are available online, providing large-scale services for whole-genome analysis. Novel machine learning applications for the classification of complete genome sequences also include sequence compression and complexity algorithms. We present a comparison of the complexity methods on different sequence sets and their applications to the analysis of gene transcription regulatory regions. Furthermore, we discuss approaches to and applications of sequence complexity for proteins. Complexity measures for amino acid sequences can be calculated by the same entropy and compression-based algorithms, but the functional and evolutionary roles of low-complexity regions in proteins have specific features that differ from those in DNA. The tools for protein sequence complexity are aimed at protein structural constraints. It has been shown that low-complexity regions in protein sequences are conserved in evolution and have important biological and structural functions. Finally, we summarize recent findings in large-scale genome complexity comparison and applications to coronavirus genome analysis.
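As an illustration of the two families of measures named in this abstract (entropy and Lempel-Ziv-style compression), the following is a minimal Python sketch. It is not taken from any of the reviewed tools: the function names, the LZ78-style phrase count used as a stand-in for the Lempel-Ziv modifications, and the window and step sizes are all assumptions made for the example.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (bits per symbol) of the symbol frequencies in seq."""
    if not seq:
        return 0.0
    n = len(seq)
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def lz78_phrase_count(seq):
    """Number of phrases in an LZ78-style parsing of seq.

    Repetitive text is parsed into fewer, longer phrases, so a low count
    relative to the sequence length suggests low complexity.
    """
    phrases = set()
    current = ""
    count = 0
    for ch in seq:
        current += ch
        if current not in phrases:
            phrases.add(current)
            count += 1
            current = ""
    if current:  # trailing fragment that repeats an earlier phrase
        count += 1
    return count


def complexity_profile(seq, window=20, step=10):
    """Entropy in sliding windows, returned as (window start, entropy) pairs."""
    return [
        (i, shannon_entropy(seq[i:i + window]))
        for i in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Hypothetical toy sequence: a poly-A run followed by a mixed region.
    toy = "A" * 20 + "ACGT" * 5
    for start, h in complexity_profile(toy, window=20, step=20):
        print(start, round(h, 3))
```

On the toy sequence, the poly-A window has the minimum possible entropy and the mixed window reaches the maximum for a four-letter alphabet, which is the kind of contrast a complexity profile is meant to expose.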
Affiliation(s)
- Yuriy L. Orlov
  - The Digital Health Institute, I.M. Sechenov First Moscow State Medical University of the Russian Ministry of Health (Sechenov University), Moscow, 119991 Russia
  - Institute of Cytology and Genetics SB RAS, 630090 Novosibirsk, Russia
  - Agrarian and Technological Institute, Peoples’ Friendship University of Russia, 117198 Moscow, Russia
- Nina G. Orlova
  - Department of Mathematics, Financial University under the Government of the Russian Federation, Moscow, 125167 Russia
2
Pagkrati I, Duke JL, Mbunwe E, Mosbruger TL, Ferriola D, Wasserman J, Dinou A, Tairis N, Damianos G, Kotsopoulou I, Papaioannou J, Giannopoulos D, Beggs W, Nyambo T, Mpoloka SW, Mokone GG, Njamnshi AK, Fokunang C, Woldemeskel D, Belay G, Maiers M, Tishkoff SA, Monos DS. Genomic characterization of HLA class I and class II genes in ethnically diverse sub-Saharan African populations: A report on novel HLA alleles. HLA 2023; 102:192-205. [PMID: 36999238 PMCID: PMC10524506 DOI: 10.1111/tan.15035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2022] [Revised: 03/11/2023] [Accepted: 03/11/2023] [Indexed: 04/01/2023]
Abstract
HLA allelic variation has been well studied and documented in many parts of the world. However, African populations have been relatively under-represented in studies of HLA variation. We have characterized HLA variation in 489 individuals belonging to 13 ethnically diverse populations, known to practice traditional subsistence lifestyles, from rural communities in the African countries of Botswana, Cameroon, Ethiopia, and Tanzania, using next-generation sequencing (Illumina) and long reads from Oxford Nanopore Technologies. We identified 342 distinct alleles among the 11 targeted HLA genes: HLA-A, -B, -C, -DRB1, -DRB3, -DRB4, -DRB5, -DQA1, -DQB1, -DPA1, and -DPB1, with 140 of those alleles containing novel sequences that were submitted to the IPD-IMGT/HLA database. Sixteen of the 140 alleles contained novel content within the exonic regions of the genes, while 110 alleles contained novel intronic variants. Four alleles were found to be recombinants of already described HLA alleles, and 10 alleles extended the sequence content of already described alleles. All 140 alleles include the complete allelic sequence from the 5' UTR to the 3' UTR, inclusive of all exons and introns. This report characterizes the HLA allelic variation in these individuals and describes the novel allelic variation present within these specific African populations.
Affiliation(s)
- Ioanna Pagkrati
  - Immunogenetics Laboratory, Department of Pathology and Laboratory Medicine, Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, USA
- Jamie L. Duke
  - Immunogenetics Laboratory, Department of Pathology and Laboratory Medicine, Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, USA
- Eric Mbunwe
  - Department of Genetics and Biology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Timothy L. Mosbruger
  - Immunogenetics Laboratory, Department of Pathology and Laboratory Medicine, Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, USA
- Deborah Ferriola
  - Immunogenetics Laboratory, Department of Pathology and Laboratory Medicine, Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, USA
- Jenna Wasserman
  - Immunogenetics Laboratory, Department of Pathology and Laboratory Medicine, Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, USA
- Amalia Dinou
  - Immunogenetics Laboratory, Department of Pathology and Laboratory Medicine, Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, USA
- Nikolaos Tairis
  - Immunogenetics Laboratory, Department of Pathology and Laboratory Medicine, Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, USA
- Georgios Damianos
  - Immunogenetics Laboratory, Department of Pathology and Laboratory Medicine, Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, USA
- Ioanna Kotsopoulou
  - Immunogenetics Laboratory, Department of Pathology and Laboratory Medicine, Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, USA
- Joanna Papaioannou
  - Immunogenetics Laboratory, Department of Pathology and Laboratory Medicine, Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, USA
- Diamantoula Giannopoulos
  - Immunogenetics Laboratory, Department of Pathology and Laboratory Medicine, Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, USA
- William Beggs
  - Department of Genetics and Biology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Thomas Nyambo
  - Department of Biochemistry, Kampala International University in Tanzania (KIUT), Dar es Salaam, Tanzania
- Sununguko W. Mpoloka
  - Department of Biological Sciences, Faculty of Science, University of Botswana, Gaborone, Botswana
- Gaonyadiwe G. Mokone
  - Department of Biomedical Sciences, Faculty of Medicine, University of Botswana, Gaborone, Botswana
- Alfred K. Njamnshi
  - Department of Neuroscience, Brain Research Africa Initiative (BRAIN), Yaoundé, Cameroon
  - Department of Neurology & Neuroscience, Central Hospital Yaoundé, Yaoundé, Cameroon
  - Neuroscience Lab, Faculty of Medicine and Biomedical Sciences, The University of Yaoundé I, Yaoundé, Cameroon
- Charles Fokunang
  - Department of Pharmacotoxicology and Pharmacokinetics, Faculty of Medicine and Biomedical Sciences, The University of Yaoundé I, Yaoundé, Cameroon
- Dawit Woldemeskel
  - Department of Microbial, Cellular and Molecular Biology, Addis Ababa University, Addis Ababa, Ethiopia
- Gurja Belay
  - Department of Microbial, Cellular and Molecular Biology, Addis Ababa University, Addis Ababa, Ethiopia
- Martin Maiers
  - National Marrow Donor Program/Be The Match, Minneapolis, Minnesota, USA
  - Center for International Blood and Marrow Transplant Research, Minneapolis, Minnesota, USA
- Sarah A. Tishkoff
  - Department of Genetics and Biology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Dimitri S. Monos
  - Immunogenetics Laboratory, Department of Pathology and Laboratory Medicine, Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, USA
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
3
de Lima MMF, Anselmo DHAL, Silva R, Nunes GHS, Fulco UL, Vasconcelos MS, Mello VD. A Bayesian Analysis of Plant DNA Length Distribution via κ-Statistics. ENTROPY (BASEL, SWITZERLAND) 2022; 24:1225. [PMID: 36141111 PMCID: PMC9497530 DOI: 10.3390/e24091225] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2022] [Revised: 08/29/2022] [Accepted: 08/31/2022] [Indexed: 06/16/2023]
Abstract
We report an analysis of the distribution of lengths of plant DNA (exons). Three species of Cucurbitaceae were investigated. In our study, we used two distinct κ distribution functions, namely, κ-Maxwellian and double-κ, to fit the length distributions. To determine which distribution provides the best fit, we performed a Bayesian analysis of the models. Furthermore, we filtered the data, removing outliers, through a box plot analysis. Our findings show that the sum of κ-exponentials is the most appropriate for fitting the distribution curves and that the values of the κ parameter do not change considerably after filtering. Furthermore, for the analyzed species, the κ parameter tends to lie within the interval (0.27, 0.43).
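For readers unfamiliar with the κ-exponential mentioned in this abstract, the sketch below evaluates it in Python, assuming that "κ-statistics" refers to the Kaniadakis deformation, in which exp_κ(x) = (sqrt(1 + κ²x²) + κx)^(1/κ). The abstract does not give the functional forms of the κ-Maxwellian or double-κ models, so those are not reproduced here; the function name and sample values are illustrative only.

```python
import math


def kappa_exp(x, kappa):
    """Kaniadakis kappa-exponential (assumed form, see lead-in).

    exp_kappa(x) = (sqrt(1 + kappa^2 x^2) + kappa x) ** (1 / kappa)
    and reduces to the ordinary exponential as kappa -> 0.
    """
    if kappa == 0:
        return math.exp(x)
    return (math.sqrt(1.0 + kappa * kappa * x * x) + kappa * x) ** (1.0 / kappa)


if __name__ == "__main__":
    # Compare tails at the two ends of the interval reported in the abstract.
    for kappa in (0.27, 0.43):
        print(kappa, kappa_exp(-5.0, kappa), math.exp(-5.0))
```

For large positive x, exp_κ(-x) behaves like (2κx)^(-1/κ), a power-law tail that decays more slowly than the ordinary exponential; this is the usual motivation for using such functions to fit heavy-tailed length distributions.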
Affiliation(s)
- Maxsuel M. F. de Lima
  - Departamento de Física, Universidade do Estado do Rio Grande do Norte, Natal 59072-970, RN, Brazil
- Dory H. A. L. Anselmo
  - Departamento de Física, Universidade do Estado do Rio Grande do Norte, Natal 59072-970, RN, Brazil
  - Departamento de Física, Universidade Federal do Rio Grande do Norte, Natal 59072-970, RN, Brazil
- Raimundo Silva
  - Departamento de Física, Universidade do Estado do Rio Grande do Norte, Natal 59072-970, RN, Brazil
  - Departamento de Física, Universidade Federal do Rio Grande do Norte, Natal 59072-970, RN, Brazil
- Glauber H. S. Nunes
  - Departamento de Ciências Vegetais, Universidade Federal Rural do Semi-Árido, Mossoró 59625-900, RN, Brazil
- Umberto L. Fulco
  - Departamento de Biofísica e Farmacologia, Universidade Federal do Rio Grande do Norte, Natal 59072-970, RN, Brazil
- Manoel S. Vasconcelos
  - Departamento de Física, Universidade Federal do Rio Grande do Norte, Natal 59072-970, RN, Brazil
- Vamberto D. Mello
  - Departamento de Física, Universidade do Estado do Rio Grande do Norte, Natal 59072-970, RN, Brazil
4
Hatzidaki E, Iliopoulos A, Papasotiriou I. A Novel Method for Colorectal Cancer Screening Based on Circulating Tumor Cells and Machine Learning. ENTROPY 2021; 23:e23101248. [PMID: 34681972 PMCID: PMC8534570 DOI: 10.3390/e23101248] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/12/2021] [Revised: 09/20/2021] [Accepted: 09/21/2021] [Indexed: 02/07/2023]
Abstract
Colorectal cancer (CRC) is one of the most common types of cancer, and it can have a high mortality rate if left untreated or undiagnosed. The fact that CRC becomes symptomatic at advanced stages highlights the importance of early screening. The reference screening method for CRC is colonoscopy, an invasive, time-consuming procedure that requires sedation or anesthesia and is recommended from a certain age and above. The aim of this study was to build a machine learning classifier that can distinguish cancer from non-cancer samples. For this, circulating tumor cells (CTCs) were enumerated using flow cytometry. Their numbers were used as a training set for building an optimized SVM classifier that was subsequently used on a blind set. The SVM classifier’s accuracy on the blind samples was found to be 90.0%, sensitivity was 80.0%, specificity was 100.0%, precision was 100.0% and AUC was 0.98. Finally, in order to test the generalizability of our method, we also compared the performance of different classifiers developed by various machine learning models, using over-sampled datasets generated by the SMOTE algorithm. The results showed that SVM achieved the best performance according to the validation accuracy metric. Overall, our results demonstrate that CTCs enumerated by flow cytometry can provide significant information, which can be used in machine learning algorithms to successfully discriminate between healthy individuals and colorectal cancer patients. The clinical significance of this method could be the development of a simple, fast, non-invasive cancer screening tool based on blood CTC enumeration by flow cytometry and machine learning algorithms.
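The four threshold-based figures quoted in this abstract follow from a confusion matrix, and the short Python sketch below shows the arithmetic. The abstract does not report the size or composition of the blind set, so the counts used here (8 true positives, 2 false negatives, 10 true negatives, 0 false positives) are hypothetical values chosen only because they reproduce the reported percentages. The function name is likewise an assumption, and AUC is not computed because it requires the classifier's scores rather than counts.

```python
def screening_metrics(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),   # recall on cancer samples
        "specificity": tn / (tn + fp),   # recall on non-cancer samples
        "precision": tp / (tp + fp),     # positive predictive value
    }


if __name__ == "__main__":
    # Hypothetical counts, not taken from the study.
    metrics = screening_metrics(tp=8, fn=2, tn=10, fp=0)
    for name, value in metrics.items():
        print(f"{name}: {value:.1%}")
```

With zero false positives, specificity and precision are both 100% regardless of how many cancer samples are missed, which is why sensitivity is the figure to watch for a screening test.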
Affiliation(s)
- Eleana Hatzidaki
  - Research Genetic Cancer Centre SA (RGCC), 53100 Florina, Greece
- Aggelos Iliopoulos
  - Research Genetic Cancer Centre SA (RGCC), 53100 Florina, Greece
- Ioannis Papasotiriou
  - Research Genetic Cancer Centre International GmbH, 6300 Zug, Switzerland