1
Ji M, Kan Y, Kim D, Lee S, Yi G. DeepPI: Alignment-Free Analysis of Flexible Length Proteins Based on Deep Learning and Image Generator. Interdiscip Sci 2024; 16:1-12. [PMID: 38568406 DOI: 10.1007/s12539-024-00618-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2023] [Revised: 02/01/2024] [Accepted: 02/03/2024] [Indexed: 09/19/2024]
Abstract
With the rapid development of next-generation sequencing (NGS) technology, the number of protein sequences has increased exponentially. Computational methods have been introduced in protein functional studies because the analysis of large numbers of proteins through biological experiments is costly and time-consuming. In recent years, new approaches based on deep learning have been proposed to overcome the limitations of conventional methods. Although deep learning-based methods effectively utilize features of protein function, they are limited to sequences of fixed length and consider information only from adjacent amino acids. Therefore, new protein analysis tools are required that extract functional features from proteins of flexible length and train models on them. We introduce DeepPI, a deep learning-based tool for analyzing proteins in large-scale databases. The proposed model, which uses Global Average Pooling, can be applied to proteins of flexible length and loses less information than existing algorithms that use fixed sizes. The image generator converts a one-dimensional sequence into a distinct two-dimensional structure, from which common parts of various shapes can be extracted. Finally, filtering techniques automatically detect representative data from the entire database and ensure coverage of large protein databases. We demonstrate that DeepPI has been successfully applied to large databases such as the Pfam-A database. Comparative experiments on four types of image generators illustrated the impact of structure on feature extraction. The filtering performance was verified by varying the parameter values and proved to be applicable to large databases. Compared to existing methods, DeepPI achieves higher family classification accuracy for protein function inference.
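The abstract's point about Global Average Pooling can be illustrated with a minimal sketch. It is not DeepPI's architecture: the function name `global_average_pool` and the positions-by-channels layout are assumptions made for illustration. It only shows why averaging each channel over all positions yields a fixed-size vector regardless of input length.

```python
def global_average_pool(feature_map):
    """Average each channel over every position.

    feature_map is a list of positions, each a list of per-channel values
    (a hypothetical layout, not the one used by DeepPI).
    """
    if not feature_map:
        raise ValueError("feature map must contain at least one position")
    n_channels = len(feature_map[0])
    sums = [0.0] * n_channels
    for row in feature_map:
        for channel, value in enumerate(row):
            sums[channel] += value
    return [total / len(feature_map) for total in sums]


# Two inputs of different length, both with two channels.
short_map = [[1.0, 2.0], [3.0, 4.0]]
long_map = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [6.0, 4.0]]

pooled_short = global_average_pool(short_map)
pooled_long = global_average_pool(long_map)
```

Both pooled vectors have one entry per channel, so a classifier placed after the pooling step sees the same input size for a short and a long protein.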
Affiliation(s)
- Mingeun Ji
  - Department of Multimedia Engineering, Dongguk University, Seoul, 04620, Korea
- Yejin Kan
  - Department of Multimedia Engineering, Dongguk University, Seoul, 04620, Korea
- Dongyeon Kim
  - Department of Artificial Intelligence, Dongguk University, Seoul, 04620, Korea
- Seungmin Lee
  - Department of Multimedia Engineering, Dongguk University, Seoul, 04620, Korea
- Gangman Yi
  - Department of Multimedia Engineering, Dongguk University, Seoul, 04620, Korea
  - Department of Artificial Intelligence, Dongguk University, Seoul, 04620, Korea
  - Division of AI Software Convergence, Dongguk University, Seoul, 04620, Korea
3
Scalia CR, Gendusa R, Basciu M, Riva L, Tusa L, Musarò A, Veronese S, Formenti A, D'Angelo D, Ronzio AG, Cattoretti G, Bolognesi MM. Epitope recognition in the human-pig comparison model on fixed and embedded material. J Histochem Cytochem 2015. [PMID: 26209082 PMCID: PMC4823807 DOI: 10.1369/0022155415597738] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Indexed: 12/13/2022]
Abstract
The conditions and the specificity by which an antibody binds to its target protein in routinely fixed and embedded tissues are unknown. Direct methods, such as staining in a knock-out animal or in vitro peptide scanning of the epitope, are costly and impractical. We aimed to elucidate antibody specificity and binding conditions using tissue staining and public genomic and immunological databases by comparing human and pig—the farmed mammal evolutionarily closest to humans besides apes. We used a database of 146 anti-human antibodies and found that antibodies tolerate partially conserved amino acid substitutions but not changes in target accessibility, as defined by epitope prediction algorithms. Some epitopes are sensitive to fixation and embedding in a species-specific fashion. We also find that half of the antibodies stain porcine tissue epitopes that have 60% to 100% similarity to human tissue at the amino acid sequence level. The reason why the remaining antibodies fail to stain the tissues remains elusive. Because of its similarity with the human, pig tissue offers a convenient tissue for quality control in immunohistochemistry, within and across laboratories, and an interesting model to investigate antibody specificity.
Affiliation(s)
- Rossella Gendusa
  - Azienda Ospedaliera San Gerardo, Monza, Italy (CRS, RG, LR, LT, AM, GC, MMB)
- Maria Basciu
  - Dipartimento di Chirurgia e Medicina Traslazionale, Università degli Studi di Milano-Bicocca, Monza, Italy (MB, GC)
- Lorella Riva
  - Azienda Ospedaliera San Gerardo, Monza, Italy (CRS, RG, LR, LT, AM, GC, MMB)
- Lorenza Tusa
  - Azienda Ospedaliera San Gerardo, Monza, Italy (CRS, RG, LR, LT, AM, GC, MMB)
- Antonella Musarò
  - Azienda Ospedaliera San Gerardo, Monza, Italy (CRS, RG, LR, LT, AM, GC, MMB)
- Silvio Veronese
  - Struttura Complessa di Anatomia Patologica, Dipartimento di Medicina di Laboratorio, Azienda Ospedaliera Ospedale Niguarda Ca' Granda, Milano, Italy (SV)
- Angelo Formenti
  - Servizio di Igiene degli Alimenti di Origine Animale, Dipartimento Veterinario, Azienda Sanitaria Locale di Monza e Brianza, Desio, Italy (AF, DD)
- Donatella D'Angelo
  - Servizio di Igiene degli Alimenti di Origine Animale, Dipartimento Veterinario, Azienda Sanitaria Locale di Monza e Brianza, Desio, Italy (AF, DD)
- Angela Gabriella Ronzio
  - Dipartimento di Prevenzione Veterinario, Distretto Veterinario 2 Legnano - Castano Primo, Azienda Sanitaria Locale Milano 1, Castano Primo, Italy (AGR)
- Giorgio Cattoretti
  - Azienda Ospedaliera San Gerardo, Monza, Italy (CRS, RG, LR, LT, AM, GC, MMB)
  - Dipartimento di Chirurgia e Medicina Traslazionale, Università degli Studi di Milano-Bicocca, Monza, Italy (MB, GC)
4
Mount DW. Using gaps and gap penalties to optimize pairwise sequence alignments. CSH Protoc 2008; 2008:pdb.top40. [PMID: 21356856 DOI: 10.1101/pdb.top40] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 11/25/2022]
Abstract
INTRODUCTION: To obtain the best possible alignment between two sequences, it is necessary to include gaps in sequence alignments and use gap penalties. For aligning DNA sequences, a simple positive score for matches and a negative score for mismatches and gaps are most often used. To score matches and mismatches in alignments of proteins, it is necessary to know how often one amino acid is substituted for another in related proteins. In addition, a method is needed to account for insertions and deletions that sometimes appear in related DNA or protein sequences. To accommodate such sequence variations, gaps that appear in sequence alignments are given a negative penalty score reflecting the fact that they are not expected to occur very often. Mathematically speaking, it is very difficult to produce the best-possible alignment, either global or local, unless gaps are included in the alignment. This article discusses how to use gaps and gap penalties to optimize pairwise sequence alignments.
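The DNA scoring scheme described in this abstract (a positive score for matches, negative scores for mismatches and gaps) can be sketched with a standard Needleman-Wunsch global alignment. This is an illustration under stated assumptions, not code from the article: the function name `global_alignment_score`, the default scores, and the use of a linear gap penalty (the same cost for every gap position, with no separate gap-opening cost) are all choices made here.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of strings a and b (Needleman-Wunsch).

    Uses a linear gap penalty: each gap position costs `gap`.
    The default scores are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing requires one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
    return score[-1][-1]


identical = global_alignment_score("GATTACA", "GATTACA")
one_deletion = global_alignment_score("ACGT", "AGT")
costly_gap = global_alignment_score("ACGT", "AGT", gap=-5)
```

Aligning `ACGT` with `AGT` needs at least one gap, so the score is three matches plus one gap penalty; making the gap more expensive lowers the score without changing the alignment in this small case.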
5
Abstract
INTRODUCTION: The original Dayhoff percent accepted mutation (PAM) matrices were developed based on a small number of protein sequences and an evolutionary model of protein change. By extrapolating from the observed changes at small evolutionary distances to large ones, it was possible to establish a PAM250 scoring matrix for sequences that were highly divergent. Another approach to finding a scoring matrix for divergent sequences is to start with a more divergent set of sequences and produce a scoring matrix from the substitutions found in those less-related sequences. The blocks amino acid substitution matrices (BLOSUM) scoring matrices were prepared this way. This article explains how BLOSUM scoring matrices were created and how they can best be used.
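The idea of deriving scores from substitutions observed in aligned sequences can be sketched as a log-odds calculation over alignment columns. This is a simplified illustration, not the article's procedure: it omits the clustering of similar sequences that gives each BLOSUM matrix its number, and the function name `blosum_like_scores` and the toy two-letter alphabet are assumptions. Scores are in half-bit units, rounded to integers.

```python
import math
from collections import Counter
from itertools import combinations


def blosum_like_scores(columns):
    """Log-odds scores from ungapped alignment columns (simplified sketch).

    columns is a list of strings, one string per alignment column.
    No sequence clustering is performed, unlike the real BLOSUM method.
    """
    # Count every unordered residue pair within each column.
    pair_counts = Counter()
    for column in columns:
        for x, y in combinations(column, 2):
            pair_counts[tuple(sorted((x, y)))] += 1

    total = sum(pair_counts.values())
    observed = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each residue, derived from the pair frequencies.
    background = Counter()
    for (x, y), freq in observed.items():
        if x == y:
            background[x] += freq
        else:
            background[x] += freq / 2
            background[y] += freq / 2

    scores = {}
    for (x, y), freq in observed.items():
        if x == y:
            expected = background[x] ** 2
        else:
            expected = 2 * background[x] * background[y]
        scores[(x, y)] = round(2 * math.log2(freq / expected))
    return scores


toy_scores = blosum_like_scores(["AAAA", "AAAB", "BBBB"])
```

In the toy input, identical pairs occur more often than chance predicts and receive positive scores, while the mixed pair occurs less often and receives a negative score.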