1
Nielsen H, Teufel F, Brunak S, von Heijne G. SignalP: The Evolution of a Web Server. Methods Mol Biol 2024; 2836:331-367. [PMID: 38995548 DOI: 10.1007/978-1-0716-4007-4_17] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/13/2024]
Abstract
SignalP ( https://services.healthtech.dtu.dk/services/SignalP-6.0/ ) is a very popular prediction method for signal peptides, the intrinsic signals that make proteins secretory. The SignalP web server has existed since 1995 and is now in its sixth major version. In this historical account, we (three authors who have taken part in the entire journey plus the first author of the latest version) describe the differences between the versions and discuss the various decisions taken along the way.
Affiliation(s)
- Henrik Nielsen
- Section for Bioinformatics, Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Denmark.
- Felix Teufel
- Bioinformatics Centre, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Digital Science & Innovation, Novo Nordisk A/S, Malov, Denmark
- Søren Brunak
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Gunnar von Heijne
- Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
- Science for Life Laboratory, Stockholm University, Solna, Sweden
2
Affiliation(s)
- Heather J. Kulik
- Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Rm 66–464, Cambridge, MA 02139, USA
3
Jørgensen IF, Brunak S. Time-ordered comorbidity correlations identify patients at risk of mis- and overdiagnosis. NPJ Digit Med 2021; 4:12. [PMID: 33514862 PMCID: PMC7846731 DOI: 10.1038/s41746-021-00382-y] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/18/2020] [Accepted: 01/05/2021] [Indexed: 11/08/2022] Open
Abstract
Diagnostic errors are common and can lead to harmful treatments. We present a data-driven, generic approach for identifying patients at risk of being mis- or overdiagnosed, here exemplified by chronic obstructive pulmonary disease (COPD). It has been estimated that 5-60% of all COPD cases are misdiagnosed. High-throughput methods are therefore needed in this domain. We have used a national patient registry, which contains hospital diagnoses for 6.9 million patients across the entire Danish population for 21 years and identified statistically significant disease trajectories for COPD patients. Using 284,154 patients diagnosed with COPD, we identified frequent disease trajectories comprising time-ordered comorbidities. Interestingly, as many as 42,459 patients did not present with these time-ordered, common comorbidities. Comparison of the individual disease history for each non-follower to the COPD trajectories, demonstrated that 9597 patients were unusual. Survival analysis showed that this group died significantly earlier than COPD patients following a trajectory. Out of the 9597 patients, we identified one subgroup comprising 2185 patients at risk of misdiagnosed COPD without the typical events of COPD patients. In all, 10% of these patients were diagnosed with lung cancer, and it seems likely that they are underdiagnosed for lung cancer as their laboratory test values and survival pattern are similar to such patients. Furthermore, only 4% had a lung function test to confirm the COPD diagnosis. Another subgroup with 2368 patients were found to be at risk of "classically" overdiagnosed COPD that survive >5.5 years after the COPD diagnosis, but without the typical complications of COPD.
Affiliation(s)
- Isabella Friis Jørgensen
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Søren Brunak
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark.
4
Vangay P, Fugett EB, Sun Q, Wiedmann M. Food microbe tracker: a web-based tool for storage and comparison of food-associated microbes. J Food Prot 2013; 76:283-94. [PMID: 23433376 DOI: 10.4315/0362-028x.jfp-12-276] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.7] [Indexed: 11/11/2022]
Abstract
Large amounts of molecular subtyping information are generated by the private sector, academia, and government agencies. However, use of subtype data is limited by a lack of effective data storage and sharing mechanisms that allow comparison of subtype data from multiple sources. Currently available subtype databases are generally limited in scope to a few data types (e.g., MLST.net) or are not publicly available (e.g., PulseNet). We describe the development and initial implementation of Food Microbe Tracker, a public Web-based database that allows archiving and exchange of a variety of molecular subtype data that can be cross-referenced with isolate source data, genetic data, and phenotypic characteristics. Data can be queried with a variety of search criteria, including DNA sequences and banding pattern data (e.g., ribotype or pulsed-field gel electrophoresis type). Food Microbe Tracker allows the deposition of data on any bacterial genus and species, bacteriophages, and other viruses. The bacterial genera and species that currently have the most entries in this database are Listeria monocytogenes, Salmonella, Streptococcus spp., Pseudomonas spp., Bacillus spp., and Paenibacillus spp., with over 40,000 isolates. The combination of pathogen and spoilage microorganism data in the database will facilitate source tracking and outbreak detection, improve discovery of emerging subtypes, and increase our understanding of transmission and ecology of these microbes. Continued addition of subtyping, genetic or phenotypic data for a variety of microbial species will broaden the database and facilitate large-scale studies on the diversity of food-associated microbes.
Affiliation(s)
- Pajau Vangay
- Department of Food Science, Cornell University, Ithaca, NY 14853, USA
5
Behler J. Neural network potential-energy surfaces in chemistry: a tool for large-scale simulations. Phys Chem Chem Phys 2011; 13:17930-55. [DOI: 10.1039/c1cp21668f] [Citation(s) in RCA: 477] [Impact Index Per Article: 36.7] [Indexed: 11/21/2022]
6
Abstract
A forensic entomological investigation can benefit from a variety of widely practiced molecular genotyping methods. The most commonly used is DNA-based specimen identification. Other applications include the identification of insect gut contents and the characterization of the population genetic structure of a forensically important insect species. The proper application of these procedures demands that the analyst be technically expert. However, one must also be aware of the extensive list of standards and expectations that many legal systems have developed for forensic DNA analysis. We summarize the DNA techniques that are currently used in, or have been proposed for, forensic entomology and review established genetic analyses from other scientific fields that address questions similar to those in forensic entomology. We describe how accepted standards for forensic DNA practice and method validation are likely to apply to insect evidence used in a death or other forensic entomological investigation.
Affiliation(s)
- Jeffrey D Wells
- Department of Biology, West Virginia University, Morgantown, WV 26506-6057, USA.
7
Jensen LJ, Skovgaard M, Brunak S. Prediction of novel archaeal enzymes from sequence-derived features. Protein Sci 2002; 11:2894-8. [PMID: 12441387 PMCID: PMC2373754 DOI: 10.1110/ps.0225102] [Citation(s) in RCA: 20] [Impact Index Per Article: 0.9] [Indexed: 10/27/2022]
Abstract
The completely sequenced archaeal genomes potentially encode, among their many functionally uncharacterized genes, novel enzymes of biotechnological interest. We have developed a prediction method for detection and classification of enzymes from sequence alone (available at http://www.cbs.dtu.dk/services/ArchaeaFun/). The method does not make use of sequence similarity; rather, it relies on predicted protein features like cotranslational and posttranslational modifications, secondary structure, and simple physical/chemical properties.
Affiliation(s)
- Lars Juhl Jensen
- Center for Biological Sequence Analysis, BioCentrum-DTU, The Technical University of Denmark, DK-2800 Lyngby, Denmark
8
Peleg O, Brunak S, Trifonov EN, Nevo E, Bolshoy A. RNA secondary structure and squence conservation in C1 region of human immunodeficiency virus type 1 env gene. AIDS Res Hum Retroviruses 2002; 18:867-78. [PMID: 12201910 DOI: 10.1089/08892220260190353] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.6] [Indexed: 10/27/2022] Open
Abstract
We have analyzed amino acid, nucleotide sequence, and RNA secondary structure variability in the env gene of human immunodeficiency virus type 1 (HIV-1). In applying algorithms for computing optimal RNA-folding patterns to a nonredundant data set of 178 env nucleotide sequences, we found a conserved RNA stem-loop structure in the first conserved (C1) region of the env gene. This detailed examination also revealed the known secondary structure conservation of the Rev-responsive element (RRE). This finding is also supported by a higher third position conservation of the translatable reading frame along these subregions. The typical folding of the C1 region consists of two isolated stem-loop structures. These highly conserved structures are likely to have a biological function. This assumption is supported by the conservation of the third position along the coding region of these structures. The third position retains a conservation level above what would be statistically expected.
Affiliation(s)
- Ofer Peleg
- Genome Diversity Center, Institute of Evolution, Haifa University, Mt. Carmel, Haifa 31905, Israel
9
Nielsen H, Brunak S, von Heijne G. Machine learning approaches for the prediction of signal peptides and other protein sorting signals. Protein Eng 1999; 12:3-9. [PMID: 10065704 DOI: 10.1093/protein/12.1.3] [Citation(s) in RCA: 420] [Impact Index Per Article: 16.8] [Indexed: 11/14/2022]
Abstract
Prediction of protein sorting signals from the sequence of amino acids has great importance in the field of proteomics today. Recently, the growth of protein databases, combined with machine learning approaches, such as neural networks and hidden Markov models, have made it possible to achieve a level of reliability where practical use in, for example automatic database annotation is feasible. In this review, we concentrate on the present status and future perspectives of SignalP, our neural network-based method for prediction of the most well-known sorting signal: the secretory signal peptide. We discuss the problems associated with the use of SignalP on genomic sequences, showing that signal peptide prediction will improve further if integrated with predictions of start codons and transmembrane helices. As a step towards this goal, a hidden Markov model version of SignalP has been developed, making it possible to discriminate between cleaved signal peptides and uncleaved signal anchors. Furthermore, we show how SignalP can be used to characterize putative signal peptides from an archaeon, Methanococcus jannaschii. Finally, we briefly review a few methods for predicting other protein sorting signals and discuss the future of protein sorting prediction in general.
Affiliation(s)
- H Nielsen
- Center for Biological Sequence Analysis, Department of Biotechnology, The Technical University of Denmark, Lyngby
10
Ostergaard L, Pedersen AG, Jespersen HM, Brunak S, Welinder KG. Computational analyses and annotations of the Arabidopsis peroxidase gene family. FEBS Lett 1998; 433:98-102. [PMID: 9738941 DOI: 10.1016/s0014-5793(98)00849-7] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.2] [Indexed: 11/25/2022]
Abstract
Classical heme-containing plant peroxidases have been ascribed a wide variety of functional roles related to development, defense, lignification, and hormonal signaling. More than 40 peroxidase genes are now known in Arabidopsis thaliana for which functional association is complicated by a general lack of peroxidase substrate specificity. Computational analysis was performed on 30 near full-length Arabidopsis peroxidase cDNAs for annotation of start codons and signal peptide cleavage sites. A compositional analysis revealed that 23 of the 30 peroxidase cDNAs have 5' untranslated regions containing 40-71% adenine, a rare feature observed also in cDNAs which predominantly encode stress-induced proteins, and which may indicate translational regulation.
Affiliation(s)
- L Ostergaard
- Department of Protein Chemistry, Institute of Molecular Biology, University of Copenhagen, Denmark
11
Abstract
We study the dynamics of on-line learning for a large class of neural networks and learning rules, including backpropagation for multilayer perceptrons. In this paper, we focus on the case where successive examples are dependent, and we analyze how these dependencies affect the learning process. We define the representation error and the prediction error. The representation error measures how well the environment is represented by the network after learning. The prediction error is the average error that a continually learning network makes on the next example. In the neighborhood of a local minimum of the error surface, we calculate these errors. We find that the more predictable the example presentation, the higher the representation error, i.e., the less accurate the asymptotic representation of the whole environment. Furthermore we study the learning process in the presence of a plateau. Plateaus are flat spots on the error surface, which can severely slow down the learning process. In particular, they are notorious in applications with multilayer perceptrons. Our results, which are confirmed by simulations of a multilayer perceptron learning a chaotic time series using backpropagation, explain how dependencies between examples can help the learning process to escape from a plateau.
Affiliation(s)
- W Wiegerinck
- Department of Medical Physics and Biophysics, University of Nijmegen, The Netherlands
12
Blom N, Hansen J, Blaas D, Brunak S. Cleavage site analysis in picornaviral polyproteins: discovering cellular targets by neural networks. Protein Sci 1996; 5:2203-16. [PMID: 8931139 PMCID: PMC2143287 DOI: 10.1002/pro.5560051107] [Citation(s) in RCA: 190] [Impact Index Per Article: 6.8] [Indexed: 02/03/2023]
Abstract
Picornaviral proteinases are responsible for maturation cleavages of the viral polyprotein, but also catalyze the degradation of cellular targets. Using graphical visualization techniques and neural network algorithms, we have investigated the sequence specificity of the two proteinases 2Apro and 3Cpro. The cleavage of VP0 (giving rise to VP2 and VP4), which is carried out by a so-far unknown proteinase, was also examined. In combination with a novel surface exposure prediction algorithm, our neural network approach successfully distinguishes known cleavage sites from noncleavage sites and yields a more consistent definition of features common to these sites. The method is able to predict experimentally determined cleavage sites in cellular proteins. We present a list of mammalian and other proteins that are predicted to be possible targets for the viral proteinases. Whether these proteins are indeed cleaved awaits experimental verification. Additionally, we report several errors detected in the protein databases. A computer server for prediction of cleavage sites by picornaviral proteinases is publicly available at the e-mail address NetPicoRNA@cbs.dtu.dk or via WWW at http://www.cbs.dtu.dk/services/NetPicoRNA/.
Affiliation(s)
- N Blom
- Center for Biological Sequence Analysis, Technical University of Denmark
13
Korning PG, Hebsgaard SM, Rouze P, Brunak S. Cleaning the GenBank Arabidopsis thaliana data set. Nucleic Acids Res 1996; 24:316-20. [PMID: 8628656 PMCID: PMC145627 DOI: 10.1093/nar/24.2.316] [Citation(s) in RCA: 44] [Impact Index Per Article: 1.6] [Indexed: 02/01/2023] Open
Abstract
Data driven computational biology relies on the large quantities of genomic data stored in international sequence data banks. However, the possibilities are drastically impaired if the stored data is unreliable. During a project aiming to predict splice sites in the dicot Arabidopsis thaliana, we extracted a data set from the A.thaliana entries in GenBank. A number of simple 'sanity' checks, based on the nature of the data, revealed an alarmingly high error rate. More than 15% of the most important entries extracted did contain erroneous information. In addition, a number of entries had directly conflicting assignments of exons and introns, not stemming from alternative splicing. In a few cases the errors are due to mere typographical misprints, which may be corrected by comparison to the original papers, but errors caused by wrong assignments of splice sites from experimental data are the most common. It is proposed that the level of error correction should be increased and that gene structure sanity checks should be incorporated--also at the submitter level--to avoid or reduce the problem in the future. A non-redundant and error corrected subset of the data for A.thaliana is made available through anonymous FTP.
Affiliation(s)
- P G Korning
- Center for Biological Sequence Analysis, Technical University of Denmark, Lyngby, Denmark
14
Hansen JE, Lund O, Nielsen JO, Hansen JE, Brunak S. O-GLYCBASE: a revised database of O-glycosylated proteins. Nucleic Acids Res 1996; 24:248-52. [PMID: 8594592 PMCID: PMC145605 DOI: 10.1093/nar/24.1.248] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.1] [Indexed: 01/31/2023] Open
Abstract
O-GLYCBASE is a comprehensive database of information on glycoproteins and their O-linked glycosylation sites. Entries are compiled and revised from the SWISS-PROT and PIR databases as well as directly from recently published reports. Nineteen percent of the entries extracted from the databases needed revision with respect to O-linked glycosylation. Entries include information about species, sequence, glycosylation site and glycan type, and are fully referenced. Sequence logos displaying the acceptor specificity for the GalNAc transferase are shown. A neural network method for prediction of mucin type O-glycosylation sites in mammalian glycoproteins exclusively from the primary sequence is made available by E-mail or WWW. The O-GLYCBASE database is also available electronically through our WWW server or by anonymous FTP.
Affiliation(s)
- J E Hansen
- Laboratory for Infectious Diseases, University of Copenhagen, Denmark
15
Abstract
Recognition of function of newly sequenced DNA fragments is an important area of computational molecular biology. Here we present an extensive review of methods for prediction of functional sites, tRNA, and protein-coding genes and discuss possible further directions of research in this area.
Affiliation(s)
- M S Gelfand
- Institute of Protein Research, Russian Academy of Sciences, Pushchino, Moscow region, Russia
16
Stephens RM, Schneider TD. Features of spliceosome evolution and function inferred from an analysis of the information at human splice sites. J Mol Biol 1992; 228:1124-36. [PMID: 1474582 DOI: 10.1016/0022-2836(92)90320-j] [Citation(s) in RCA: 220] [Impact Index Per Article: 6.9] [Indexed: 12/27/2022]
Abstract
An information analysis of the 5' (donor) and 3' (acceptor) sequences spanning the ends of nearly 1800 human introns has provided evidence for structural features of splice sites that bear upon spliceosome evolution and function: (1) 82% of the sequence information (i.e. sequence conservation) at donor junctions and 97% of the sequence information at acceptor junctions is confined to the introns, allowing codon choices throughout exons to be largely unrestricted. The distribution of information at intron-exon junctions is also described in detail and compared with footprints. (2) Acceptor sites are found to possess enough information to be located in the transcribed portion of the human genome, whereas donor sites possess about one bit less than the information needed to locate them independently. This difference suggests that acceptor sites are located first in humans and, having been located, reduce by a factor of two the number of alternative sites available as donors. Direct experimental evidence exists to support this conclusion. (3) The sequences of donor and acceptor splice sites exhibit a striking similarity. This suggests that the two junctions derive from a common ancestor and that during evolution the information of both sites shifted onto the intron. If so, the protein and RNA components that are found in contemporary spliceosomes, and which are responsible for recognizing donor and acceptor sequences, should also be related. This conclusion is supported by the common structures found in different parts of the spliceosome.
Affiliation(s)
- R M Stephens
- National Cancer Institute, Frederick Cancer Research and Development Center, Laboratory of Mathematical Biology, MD 21702-1201
17
Abstract
Analysis of an artificial neural network trained to classify DNA as coding or non-coding revealed compositional differences between sequence parts translated into protein and those that were not. The 5' end of human introns was found to have a base composition that was non-random to an extent matching the non-randomness in the 3' end that contains the polypyrimidine tract. The prevailing nucleotides in the initial 50 nucleotides of human introns are guanine and cytosine, the trinucleotide GGG was found to occur almost four times as frequently as it would in sequences with a uniform distribution of the nucleotides. The initial part of terminal exons and their associated terminal introns were shown to have a very special base composition deviating strongly from the normal picture in other exons and introns.
Affiliation(s)
- J Engelbrecht
- Department of Physical Chemistry, Technical University of Denmark, Lyngby
18
Lamperti ED, Kittelberger JM, Smith TF, Villa-Komaroff L. Corruption of genomic databases with anomalous sequence. Nucleic Acids Res 1992; 20:2741-7. [PMID: 1614861 PMCID: PMC336916 DOI: 10.1093/nar/20.11.2741] [Citation(s) in RCA: 24] [Impact Index Per Article: 0.8] [Indexed: 12/27/2022] Open
Abstract
We describe evidence that DNA sequences from vectors used for cloning and sequencing have been incorporated accidentally into eukaryotic entries in the GenBank database. These incorporations were not restricted to one type of vector or to a single mechanism. Many minor instances may have been the result of simple editing errors, but some entries contained large blocks of vector sequence that had been incorporated by contamination or other accidents during cloning. Some cases involved unusual rearrangements and areas of vector distant from the normal insertion sites. Matches to vector were found in 0.23% of 20,000 sequences analyzed in GenBank Release 63. Although the possibility of anomalous sequence incorporation has been recognized since the inception of GenBank and should be easy to avoid, recent evidence suggests that this problem is increasing more quickly than the database itself. The presence of anomalous sequence may have serious consequences for the interpretation and use of database entries, and will have an impact on issues of database management. The incorporated vector fragments described here may also be useful for a crude estimate of the fidelity of sequence information in the database. In alignments with well-defined ends, the matching sequences showed 96.8% identity to vector; when poorer matches with arbitrary limits were included, the aggregate identity to vector sequence was 94.8%.
Affiliation(s)
- E D Lamperti
- Department of Neurology, Children's Hospital, Harvard Medical School, Boston, MA 02115
19
Affiliation(s)
- J W Clark
- McDonnell Center for the Space Sciences, Washington University, St Louis, Missouri 63130
20
Brunak S, Engelbrecht J, Knudsen S. Prediction of human mRNA donor and acceptor sites from the DNA sequence. J Mol Biol 1991; 220:49-65. [PMID: 2067018 DOI: 10.1016/0022-2836(91)90380-o] [Citation(s) in RCA: 529] [Impact Index Per Article: 16.0] [Indexed: 12/30/2022]
Abstract
Artificial neural networks have been applied to the prediction of splice site location in human pre-mRNA. A joint prediction scheme where prediction of transition regions between introns and exons regulates a cutoff level for splice site assignment was able to predict splice site locations with confidence levels far better than previously reported in the literature. The problem of predicting donor and acceptor sites in human genes is hampered by the presence of numerous amounts of false positives: here, the distribution of these false splice sites is examined and linked to a possible scenario for the splicing mechanism in vivo. When the presented method detects 95% of the true donor and acceptor sites, it makes less than 0.1% false donor site assignments and less than 0.4% false acceptor site assignments. For the large data set used in this study, this means that on average there are one and a half false donor sites per true donor site and six false acceptor sites per true acceptor site. With the joint assignment method, more than a fifth of the true donor sites and around one fourth of the true acceptor sites could be detected without accompaniment of any false positive predictions. Highly confident splice sites could not be isolated with a widely used weight matrix method or by separate splice site networks. A complementary relation between the confidence levels of the coding/non-coding and the separate splice site networks was observed, with many weak splice sites having sharp transitions in the coding/non-coding signal and many stronger splice sites having more ill-defined transitions between coding and non-coding.
Affiliation(s)
- S Brunak
- Department of Structural Properties of Materials, Technical University of Denmark, Lyngby
21
Abstract
Like macroscopic machines, molecular-sized machines are limited by their material components, their design, and their use of power. One of these limits is the maximum number of states that a machine can choose from. The logarithm to the base 2 of the number of states is defined to be the number of bits of information that the machine could "gain" during its operation. The maximum possible information gain is a function of the energy that a molecular machine dissipates into the surrounding medium ($P_y$), the thermal noise energy which disturbs the machine ($N_y$) and the number of independently moving parts involved in the operation ($d_{\mathrm{space}}$): $C_y = d_{\mathrm{space}} \log_2\left[(P_y + N_y)/N_y\right]$ bits per operation. This "machine capacity" is closely related to Shannon's channel capacity for communications systems. An important theorem that Shannon proved for communication channels also applies to molecular machines. With regard to molecular machines, the theorem states that if the amount of information which a machine gains is less than or equal to $C_y$, then the error rate (frequency of failure) can be made arbitrarily small by using a sufficiently complex coding of the molecular machine's operation. Thus, the capacity of a molecular machine is sharply limited by the dissipation and the thermal noise, but the machine failure rate can be reduced to whatever low level may be required for the organism to survive.
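The machine capacity formula quoted in the abstract above can be evaluated numerically. The following is a minimal Python sketch, not code from the cited paper: the function name `machine_capacity` and its parameter names are hypothetical, and the input values are purely illustrative, not measurements.

```python
import math


def machine_capacity(d_space: float, p_y: float, n_y: float) -> float:
    """Return the machine capacity in bits per operation.

    Implements d_space * log2((p_y + n_y) / n_y), where
    d_space is the number of independently moving parts,
    p_y is the energy dissipated into the surrounding medium, and
    n_y is the thermal noise energy (same units as p_y).
    """
    if d_space < 0 or p_y < 0:
        raise ValueError("d_space and p_y must be non-negative")
    if n_y <= 0:
        raise ValueError("n_y must be positive")
    return d_space * math.log2((p_y + n_y) / n_y)


# Illustrative values only (assumed, not taken from the source):
# dissipation three times the noise energy, two independent parts.
capacity = machine_capacity(d_space=2, p_y=3.0, n_y=1.0)
print(f"{capacity} bits per operation")
```

With no dissipation (`p_y = 0`) the ratio inside the logarithm is 1 and the capacity is zero, which matches the abstract's statement that the capacity is limited by dissipation relative to thermal noise.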
Affiliation(s)
- T D Schneider
- Frederick Cancer Research and Development Center, Laboratory of Mathematical Biology, MD 21702
22
Sibbald PR, Argos P. Weighting aligned protein or nucleic acid sequences to correct for unequal representation. J Mol Biol 1990; 216:813-8. [PMID: 2176240 DOI: 10.1016/s0022-2836(99)80003-5] [Citation(s) in RCA: 59] [Impact Index Per Article: 1.7] [Indexed: 12/30/2022]
Abstract
Aligned sequences from the same family (e.g. the haemoglobins) are seldom representative of the entire family. This is because (1) the sequence databases are heavily skewed toward a small number of organisms and (2) only a minute fraction of all the different family members have been sequenced. For many applications, such as using alignments or profiles to perform database searches for distantly related family members, such unequal representation requires correction. An algorithm to perform appropriate weighting of individual sequences is presented along with examples illustrating its efficacy.
Affiliation(s)
- P R Sibbald
- European Molecular Biology Laboratory, Heidelberg, F.R.G
23
Brunak S, Engelbrecht J, Knudsen S. Neural network detects errors in the assignment of mRNA splice sites. Nucleic Acids Res 1990; 18:4797-801. [PMID: 2395643 PMCID: PMC331948 DOI: 10.1093/nar/18.16.4797] [Citation(s) in RCA: 45] [Impact Index Per Article: 1.3] [Indexed: 12/31/2022] Open
Abstract
The use of databanks in genetic research assumes reliability of the information they contain. Currently, error-detection in the manually or electronically entered data contained in the nucleotide sequence databanks at EMBL, Heidelberg and GenBank at Los Alamos is limited. We have used a subset of sequences from these databanks to train neural networks to recognize pre-mRNA splicing signals in human genes. During the training on 33 human genes from the EMBL databank seven genes appeared to disturb the learning process. Subsequent investigation revealed discrepancies from the original published papers, for three genes. In four genes, we found wrongly assigned splicing frames of introns. We believe this to be a reflection of the fact that splicing frames cannot always be unambiguously assigned on the basis of experimental data. Thus incorrect assignment appear both due to mere typographical misprints as well as erroneous interpretation of experiments. Training on 241 human sequences from GenBank revealed nine new errors. We propose that such errors could be detected by computer algorithms designed to check the consistency of data prior to their incorporation in databanks.
Affiliation(s)
- S Brunak
- Department of Structural Properties of Materials, Technical University of Denmark, Lyngby
24
Petersen SB, Bohr H, Bohr J, Brunak S, Cotterill RM, Fredholm H, Lautrup B. Training neural networks to analyse biological sequences. Trends Biotechnol 1990; 8:304-8. [PMID: 1366766 DOI: 10.1016/0167-7799(90)90206-d] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.3] [Indexed: 11/26/2022]