Protein localization prediction programs


















Comparison based on the proteins in the MetaSCL07 dataset. The problem of protein subcellular localization is commonly formulated as a multi-class classification problem.

However, it can also be viewed as several individual two-class classification problems, one for each subcellular compartment. This allows one to examine the ability of a given predictor to identify proteins localized in each of the compartments individually. The MetaSCL06 dataset was converted into four variant datasets, one for each of the four subcellular compartments examined: nuclear, cytoplasmic, mitochondrial and extracellular.
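The conversion of a multi-class dataset into per-compartment two-class datasets can be sketched as follows. This is a minimal illustration, not the procedure used to build the variant datasets; the function name `one_vs_rest` and the toy labels are assumptions made for the example.

```python
# Hypothetical sketch: turn multi-class compartment labels into one
# two-class (compartment vs. rest) label list per compartment.
COMPARTMENTS = ["nuclear", "cytoplasmic", "mitochondrial", "extracellular"]


def one_vs_rest(labels, compartments=COMPARTMENTS):
    """Return {compartment: [True/False per protein]} from multi-class labels."""
    return {c: [label == c for label in labels] for c in compartments}


# Toy labels, invented for illustration only.
labels = ["nuclear", "extracellular", "nuclear", "cytoplasmic"]
variants = one_vs_rest(labels)
```

Each entry of `variants` can then be scored independently, which is what allows a predictor to be examined one compartment at a time.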

We evaluated the predicting performance of each of the 12 element predictors, as well as that of the RAW-RAG-6 meta-predictor, using these variant datasets. Compared with the best-performing element predictor (Proteome Analyst), the biggest improvement was achieved for the extracellular compartment, with 2.

It is followed by the nuclear compartment, for which a 2. The smallest improvement is found for the cytoplasmic compartment, where 0. Overall, the RAW-RAG-6 meta-predictor achieves remarkable performance in these two-class classification problems, and consistently outperforms every element predictor in identifying proteins localized in each of the four subcellular compartments.

Predicting performance of element predictors and RAW-RAG-6 in two-class predictions for the 4 subcellular compartments. Comparison made based on datasets derived from the MetaSCL06 dataset.
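Two-class performance of this kind is commonly summarized with sensitivity, specificity, accuracy and the Matthews correlation coefficient (MCC). The sketch below computes these from confusion-matrix counts; the function name and the example counts are hypothetical and are not taken from the study's tables.

```python
import math


def two_class_measures(tp, fp, tn, fn):
    """Standard two-class measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Invented counts, for illustration only.
m = two_class_measures(tp=40, fp=10, tn=45, fn=5)
```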

When more than one prediction program is used on the same data, the programs may produce conflicting predictions. Users are often confused and frustrated by such conflicting results, because they may lack the knowledge to make a sensible choice among them. If a meta-predictor can be developed whose predicting performance exceeds that of every individual element predictor, it may resolve this quandary.

Meta-predictors cannot replace element predictors. Rather, they are enhancements. Meta-predictors are constructed from element predictors, and their performance depends on accurate predictions made by element predictors.

Without good element predictors, it is not possible to develop good meta-predictors. In addition, meta-predictors (in particular, voting-based meta-predictors) are effective only within the scope of the prediction problem that is common to multiple element predictors. Often, element predictors make unique predictions. For example, among the prediction programs discussed in this study, only PSORT II makes predictions about protein localization to secretory vesicles.

For unique predictions, one has to rely on an element predictor. We did not perform cross-validation explicitly in this development. However, all parameters of RAW-RAG-6 (the relative accuracy values of the element predictors) are calculated as sample statistics, which are insensitive to the removal of a small number of samples when the sample size is sufficiently large. The testing we performed can therefore be considered equivalent to cross-validation.

In each iteration of leave-one-out (LOO) cross-validation, the parameters of the meta-predictor (the relative accuracy values of the element predictors) are calculated from the training samples. They would be essentially the same as the relative accuracy values calculated from the whole sample dataset, because the relative accuracy of each element predictor is a sample statistic, which is insensitive to the removal of one sample from the dataset when the sample size is sufficiently large.

Therefore, each predictor obtained in the iterations of LOO cross-validation would be essentially the same as the predictor obtained from the entire sample dataset.
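The insensitivity argument can be checked numerically. The sketch below uses plain accuracy as a stand-in for the relative accuracy statistic, whose exact definition is not given in this text, and measures how far the statistic can move when any single sample is left out. The sample of 1000 proteins with 800 correct predictions is invented for illustration.

```python
def max_loo_accuracy_shift(correct):
    """Largest change in accuracy caused by leaving out any one sample.

    `correct` is a list of booleans, one per protein, marking whether a
    (hypothetical) element predictor classified that protein correctly.
    """
    n = len(correct)
    total = sum(correct)
    full_accuracy = total / n
    # Accuracy recomputed on the n - 1 samples left after removing each one.
    shifts = [abs((total - c) / (n - 1) - full_accuracy) for c in correct]
    return max(shifts)


# Invented sample: 800 correct and 200 incorrect predictions.
correct = [True] * 800 + [False] * 200
shift = max_loo_accuracy_shift(correct)
```

For accuracy on n samples the shift can never exceed 1/(n - 1), so it shrinks as the dataset grows, which is the property the argument above relies on.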

Its performance for future, unseen data is expected to be close to what was achieved in this study, assuming no changes are made to the element predictors. However, if changes take place in any of its component element predictors, the reduced voting-based meta-predictor will need adjustment. The linear voting strategies explored in this study are related to several well-known online learning algorithms, including Littlestone and Warmuth's weighted majority (WM) algorithm (21) and Freund and Schapire's Hedge algorithm (22). Unlike in online learning, the training samples here are assumed to be provided together, instead of one at a time.
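For readers unfamiliar with the online setting, the weighted majority idea can be sketched as below. This is a simplified multi-class variant written for illustration, not the method used in this study: every expert starts with weight 1, the master prediction is the label with the largest total weight, and each expert that errs in a round has its weight multiplied by a penalty factor. The function name, the `beta` value and the toy rounds are all assumptions.

```python
from collections import defaultdict


def weighted_majority(rounds, n_experts, beta=0.5):
    """Simplified weighted-majority sketch (illustrative variant).

    rounds: iterable of (expert_predictions, true_label) pairs, seen one
    at a time, as in online learning.
    """
    weights = [1.0] * n_experts
    mistakes = 0
    for predictions, truth in rounds:
        # Master prediction: label with the largest total weight.
        tally = defaultdict(float)
        for w, p in zip(weights, predictions):
            tally[p] += w
        guess = max(tally, key=tally.get)
        if guess != truth:
            mistakes += 1
        # Penalize every expert that was wrong in this round.
        weights = [w * beta if p != truth else w
                   for w, p in zip(weights, predictions)]
    return weights, mistakes


# Toy rounds, invented for illustration only.
rounds = [
    (["nuclear", "nuclear", "cytoplasmic"], "nuclear"),
    (["extracellular", "cytoplasmic", "cytoplasmic"], "extracellular"),
    (["nuclear", "cytoplasmic", "cytoplasmic"], "nuclear"),
]
weights, mistakes = weighted_majority(rounds, n_experts=3)
```

The contrast with the batch setting is that here the weights evolve sample by sample, whereas a batch approach computes them once from all samples together.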

In the same paper in which the Hedge algorithm was discussed (22), Freund and Schapire introduced the well-known AdaBoost algorithm, which applies to batch learning. The major difference between the meta-prediction problem discussed in this study and AdaBoost and other ensemble learning algorithms [e. The successful development of RAW-RAG-6 demonstrates the effectiveness of voting-based strategies in meta-prediction problems.

Proper employment of voting-based strategies is likely to lead to good meta-predictors in other life sciences problem domains. We thank Drs A. Banerjee and W. Pan at the University of Minnesota for inspiring discussions. We also thank the Supercomputing Institute, University of Minnesota for computational resources, and W. Gong for technical assistance.

Nucleic Acids Res. Published online Aug 1. Lynda B.

Abstract: Meta-prediction seeks to harness the combined strengths of multiple predicting programs with the hope of achieving predicting performance surpassing that of all existing predictors in a defined problem domain.

Figure 1.

Step 1: Assembling an unbiased set of proteins. For unbiased testing, the dataset compiled for this study should not contain data used in the development of any element prediction programs.

Step 2: Assigning class labels to proteins based on GO annotations. In this study, we focus on classifying proteins localized in four subcellular compartments: nuclear, cytoplasmic, mitochondrial and extracellular.

Step 3: Assigning class labels to proteins based on annotations in the comment field. All proteins in the unbiased protein set obtained at Step 1 were fed into a keyword filter, in which the comment field (CC) was checked against a list of keywords.

Step 4: Manual reconciliation. Finally, all proteins that received class labels in Steps 2 and 3 were subject to manual reconciliation.

Selection of element predictors. In order to be usable as an element predictor for the meta-prediction problem, a prediction program needs to be accessible online or be available in downloadable form.

Each program is discussed below.

Table 1. Summary of the 12 element predictors.

Obtaining and pre-processing prediction results of element predictors. Prediction jobs were submitted to each of the element prediction programs with the protein sequences in the MetaSCL06 and MetaSCL07 datasets.

Performance measures. For a two-class classification problem, commonly used performance measures include sensitivity, specificity, accuracy and Matthews correlation coefficient (MCC).

Comparison of element predictors. The element predictors are not completely compatible with one another in the types of predictions they make.

Unweighted voting strategy. For a given protein in the dataset, the unweighted voting meta-predictor makes prediction P_uv as.

Weighted voting strategy. The weighted voting strategy differs from the unweighted voting strategy in that the predictions made by element predictors are multiplied by a weight, which varies among predictors, before being summed up to produce the prediction of the meta-predictor.

Reduced voting strategy. Although the prediction results of all element predictors are available to the meta-predictors, it is not necessary for all of them to be used.
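The difference between unweighted and weighted voting can be sketched as below. This is a minimal illustration under stated assumptions, not the study's implementation: the predictor names, the weight values and the `vote` function are hypothetical, and the exact weighting schemes used for RAW-RAG-6 are not reproduced here. Passing only a subset of predictors corresponds to the idea that not all element predictors need to be used.

```python
from collections import defaultdict


def vote(predictions, weights=None):
    """Combine element predictions into one meta-prediction.

    predictions: {predictor_name: predicted_compartment or None}
    weights:     {predictor_name: weight}; None means every vote counts 1.
    Ties are broken in favor of the label that was seen first.
    """
    tally = defaultdict(float)
    for name, label in predictions.items():
        if label is None:  # predictor made no usable prediction
            continue
        tally[label] += 1.0 if weights is None else weights.get(name, 0.0)
    return max(tally, key=tally.get) if tally else None


# Hypothetical predictors A, B and C with invented weights.
predictions = {"A": "nuclear", "B": "cytoplasmic", "C": "cytoplasmic"}
unweighted = vote(predictions)
weighted = vote(predictions, weights={"A": 0.9, "B": 0.3, "C": 0.4})
```

In this toy case the two strategies disagree: the majority wins the unweighted vote, while a single highly weighted predictor wins the weighted one.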

Table 2. Predicting performance of element predictors using the MetaSCL06 dataset.

Unweighted voting strategy. With the performance of every element predictor assessed using the unbiased MetaSCL06 dataset, we set out to explore strategies to construct meta-predictors on top of these element predictors.

Table 3.

Weighted voting strategy. Next, we examined a weighted voting strategy.

Reduced voting strategy. The four voting schemes (unweighted voting and three weighted voting schemes) are combined with the three reduction methods [accuracy-guided reduction (AG), relative accuracy-guided reduction (RAG) and GCC-guided reduction (GG)], giving rise to a total of 12 series of reduced voting meta-predictors.
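One way to picture a reduction step is to rank the element predictors by a score and keep only the top few for voting. The sketch below does this with plain accuracy, as a stand-in for accuracy-guided reduction; how the study actually ranks predictors, and how the RAG and GG methods compute their scores, is not specified in this text. The function names, the predictor names and the cutoff `k` are hypothetical.

```python
def accuracy(predicted, truth):
    """Fraction of proteins for which the prediction matches the label."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


def reduce_by_accuracy(predictor_outputs, truth, k):
    """Keep the names of the k predictors with the highest accuracy.

    predictor_outputs: {predictor_name: [predicted label per protein]}
    """
    ranked = sorted(
        predictor_outputs,
        key=lambda name: accuracy(predictor_outputs[name], truth),
        reverse=True,
    )
    return ranked[:k]


# Invented labels and predictions, for illustration only.
truth = ["nuclear", "cytoplasmic", "mitochondrial", "extracellular"]
predictor_outputs = {
    "A": ["nuclear", "cytoplasmic", "mitochondrial", "extracellular"],
    "B": ["nuclear", "nuclear", "nuclear", "extracellular"],
    "C": ["nuclear", "cytoplasmic", "mitochondrial", "nuclear"],
}
kept = reduce_by_accuracy(predictor_outputs, truth, k=2)
```

Varying `k` from the full set down to a single predictor yields one series of reduced meta-predictors for a given voting scheme.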

Figure 2.

Table 4. Predicting performance of reduced voting meta-predictors.

RAW-RAG-6 with data not used in its development. Element predictor performance was evaluated on data not used in their development.

Table 5.

RAW-RAG-6 in individual compartment predictions.

Table 6.

Meta-predictors versus element predictors.

Cross-validation and future performance.

Linear voting strategies.

Conflict of interest statement.

None declared.

Nakai K, Kanehisa M. A knowledge base for predicting protein localization sites in eukaryotic cells.

Automatic prediction of protein function. Life Sci.

Donnes P, Hoglund A. Predicting protein subcellular localization: past, present, and future. Genomics Proteomics Bioinformatics.

Predicting subcellular localization of proteins based on their N-terminal amino acid sequence.

Nakai K, Horton P. PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization.


SLP-Local (Matsuda et al.) predicts localizations for chloroplast, mitochondria, secretory pathway, and other locations (nucleus or cytosol) for eukaryotic proteins, as well as cytoplasm, extracellular space, and periplasm for Gram-negative organisms.

TatP (Bendtsen et al.) predicts twin-arginine signal peptides in bacteria.

TBpred (Rashid et al.) is a prediction server that predicts four subcellular localizations (cytoplasmic, integral membrane, secretory, and membrane attached by lipid anchor) of mycobacterial proteins.

Augur (Billion et al.) is a computational pipeline for Gram-positive bacterial whole-genome surface protein predictions.

P-classifier (Wang et al.) predicts subcellular localizations of proteins for Gram-negative bacteria based on amino acid subalphabets and a combination of multiple support vector machines.

PSLDoc (Chang et al.) uses document classification techniques and incorporates a probabilistic latent semantic analysis with a support vector machine model, for prediction on prokaryotes and eukaryotes.

SubLoc (Hua and Sun) uses a support vector machine to assign a prokaryotic protein to the cytoplasmic, periplasmic, or extracellular sites, and a eukaryotic protein to the cytoplasmic, mitochondrial, nuclear, or extracellular sites.

AAIndexLoc (Tantoso and Li) predicts protein subcellular localization by using amino acid composition and physicochemical properties.

BaCelLo (Pierleoni et al.) is a predictor for five classes of eukaryotic subcellular localization (secretory pathway, cytoplasm, nucleus, mitochondrion and chloroplast), based on different SVMs organized in a decision tree.

DeepLoc Almagro Armenteros et al.

Discriminative HMMs (Lin et al.) predicts yeast SCLs using motifs that are present in a compartment but absent in other, nearby compartments, utilizing a hierarchical structure that mimics the protein sorting mechanism.

Euk-mPLoc (Chou and Shen; Chou and Shen) is a general eukaryotic predictor that hybridizes gene ontology information, functional domain information, and sequential evolutionary information to predict eukaryotic protein subcellular localization.

Golgi Localization Predictor (Yuan and Teasdale) predicts Golgi Type II membrane proteins and can discriminate between proteins destined for the Golgi apparatus and those destined for other post-Golgi locations.

Hum-mPLoc (Shen and Chou) is a localization predictor specific for human proteins. It uses an ensemble classifier that handles cases where a human protein has multiple possible location sites.

MultiLoc2 (Blum et al.) predicts animal, plant and fungal protein subcellular localizations; the new version of the software integrates phylogeny and Gene Ontology terms.

Plant-mPLoc (Shen and Chou) predicts plant protein subcellular localization using Gene Ontology, functional domain, and 3 modes of pseudo-amino acid composition.

Protein Prowler version 1. It classifies eukaryotic targeting signals as secretory, mitochondrion, chloroplast or other. Version 1.

SecretomeP (Bendtsen et al.) predicts eukaryotic proteins that are secreted via a non-traditional secretory mechanism.

SecretP (Yu et al.) predicts mammalian secreted proteins using PseAA and SVMs.

SherLoc2 (Briesemeister et al.) predicts animal, plant and fungal protein subcellular localizations using sequence-based and text-based features.

SignalP (Bendtsen et al.) predicts traditional N-terminal signal peptides in both prokaryotic and eukaryotic proteins.

SLPFA (Tamura and Akutsu) predicts localizations using feature vectors based on amino acid composition frequency and sequence alignment. Subcellular locations predicted include chloroplast, mitochondria, secretory pathway, and other locations (nucleus or cytosol) for eukaryotic proteins.

TargetP (Emanuelsson et al.) predicts the presence of signal peptides, chloroplast transit peptides, and mitochondrial targeting peptides for plant proteins, and the presence of signal peptides and mitochondrial targeting peptides for eukaryotic proteins.

YLoc (Briesemeister et al.; Briesemeister et al.) provides attribute explanations for users and multiple-localization prediction capabilities for animal, plant and fungal protein subcellular localizations.

EpiLoc (Brady and Shatkay) is a text-based system for predicting animal, plant and fungal protein subcellular locations.

Hum-mPLoc 2.

KnowPredsite (Lin et al.) predicts single and multiple localizations based on local similarity of proteins at different sites.

Predotar is designed to predict the presence of mitochondrial and plastid targeting peptides in plant sequences.

PROlocalizer (Laurila and Vihinen) predicts 12 animal protein localizations by integrating 11 methods.


