iGPS - Prediction of site-specific kinase-substrate relations from phosphoproteomic data

※Computational resources of protein phosphorylation:

Last updated: May 12, 2010

Introduction:

Protein phosphorylation is the most ubiquitous post-translational modification (PTM) and plays important roles in most biological processes. Identification of site-specific phosphorylated substrates is fundamental for understanding the molecular mechanisms of phosphorylation. Besides experimental approaches, prediction of potential candidates with computational methods has also attracted great attention for its convenience and speed. In this review, we present a comprehensive but brief summary of computational resources for protein phosphorylation, including phosphorylation databases, prediction of non-specific or organism-specific phosphorylation sites, prediction of kinase-specific phosphorylation sites or phospho-binding motifs, and other tools. A testing data set prepared from Phospho.ELM 6.0 is available at: Comparison_data.

We apologize that computational studies without web links to databases or tools are not included in this compendium, since it is not easy for experimentalists to use such studies directly. We are grateful for user feedback. Please contact Prof. Yu Xue or Prof. Jian Ren to add, remove or update any of the web links below.

Index:

<1> Phosphorylation Databases

<2> Prediction of non-specific or organism-specific phosphorylation sites

<3> Prediction of kinase-specific phosphorylation sites or phospho-binding motifs

<4> Miscellaneous tools

<5> Detection of potential phosphorylation sites from mass spectrometry data

==================================================================================

<1> Phosphorylation Databases:

1. Phospho.ELM 8.3 (PhosphoBase): contains 5,115 experimentally verified phosphorylated proteins from different species with 2,746 tyrosine, 15,972 serine and 3,283 threonine sites. All instances were manually collected from scientific literature (Diella, et al., 2004; Diella, et al., 2008).

2. PhosphoSitePlus: a new version of PhosphoSite, a web-based database that collects protein modification sites, including protein phosphorylation sites, from the scientific literature as well as high-throughput discovery programs. Currently, PhosphoSitePlus contains 78,022 phosphorylation sites (Hornbeck, et al., 2004).

3. PhosphoNET: PhosphoNET presently holds data on more than 74,000 phosphorylation sites in over 12,400 human proteins that have been collected from the scientific literature and other reputable websites. It features direct links to several other useful websites, and will continue to expand as a useful portal for phosphoproteomics information.

4. HPRD release 9: HPRD currently contains information for 16,972 PTMs in various categories, with phosphorylation (10,858), dephosphorylation (3,118) and glycosylation (1,860) forming the majority of the annotated PTMs. At least one responsible enzyme has been annotated for 8,960 PTMs, which resulted in the documentation of 7,253 enzyme-substrate relationships (Keshava Prasad, et al., 2009).

5. PHOSIDA (Mirror website): a phosphorylation site database that integrates thousands of high-confidence in vivo phosphosites identified by mass spectrometry-based proteomics in various species. For each phosphosite, PHOSIDA lists matching kinase motifs, predicted secondary structures, conservation patterns, and its dynamic regulation upon stimulus. Using support vector machines, PHOSIDA also predicts non-specific phosphosites (Gnad, et al., 2007; Gnad, et al., 2009).

6. PhosphoPep v2.0: contains MS-derived phosphorylation data from 4 different organisms, including fly (Drosophila melanogaster), human (Homo sapiens), worm (Caenorhabditis elegans), and yeast (Saccharomyces cerevisiae) (Bodenmiller, et al., 2008).

7. PhosPhAt 3.0: contains information on Arabidopsis phosphorylation sites which were identified by mass spectrometry in large scale experiments from different research groups with 6,282 phosphopeptides (Heazlewood, et al., 2008; Durek, et al., 2010).

8. P(3)DB 1.1: provides a database of protein phosphorylation data from multiple plants. The database was initially constructed with a dataset from oilseed rape, including 14,670 nonredundant phosphorylation sites from 6,382 substrate proteins (Gao, et al., 2009).

9. Swiss-Prot knowledge base (Mirror website): for each protein entry, the "Amino acid modifications" part of the "Sequence annotation (Features)" section lists the post-translational modification information of the protein (Farriol-Mathis, et al., 2004).

10. dbPTM 2.0: integrates experimentally verified PTMs from several databases and annotates the predicted PTMs on Swiss-Prot proteins (Lee, et al., 2006).

11. SysPTM 1.1 (Mirror website): provides a systematic and sophisticated platform for proteomic PTM research, equipped not only with a knowledge base of manually curated multi-type modification data, but also with four fully developed, in-depth data mining tools (Li, et al., 2009).

12. PhosphoPOINT: is a comprehensive human kinase interactome and phospho-protein database, containing 4,195 phospho-proteins with a total of 15,738 phosphorylation sites (Yang, et al., 2008).

13. NetworKIN 1.0 (NetworKIN-2.0 beta version): is a method for predicting in vivo kinase-substrate relationships that augments consensus motifs with context for kinases and phosphoproteins. It is a valuable resource that opens the door to the computational discovery of phospho-regulatory networks (Linding, et al., 2007; Linding, et al., 2008).

14. Phospho3D: is a database of three-dimensional structures of phosphorylation sites. It stores information retrieved from the Phospho.ELM database, enriched with structural information and annotations at the residue level (Zanzoni, et al., 2007).

15. PepCyber:P~Pep 1.2: is a database of human protein-protein interactions mediated by 10 classes of phosphoprotein binding domains (PPBDs) (Gong, et al., 2008).

16. PhosphoVariant: a database for human phosphovariants, which were defined as genetic variations that change phosphorylation sites or their interacting kinases (Ryu, et al., 2009).

17. ProMEX: a mass spectral reference database for proteins and protein phosphorylation sites, containing 4,557 manually validated spectra associated with 4,226 unique peptides from 1,367 proteins (Hummel, et al., 2009).

18. PlantsP: contains more than 300 phosphorylation sites from Arabidopsis thaliana plasma membrane proteins (Nühse, et al., 2009).

19. LymPHOS: a phosphosite database of primary human T cells, with 342 phosphorylation sites mapping to more than 200 gene sequences (Ovelleiro, et al., 2009).

20. PhosSNP 1.0: a genome-wide analysis of genetic polymorphisms that influence protein phosphorylation in H. sapiens. It was estimated that ~69.76% of nsSNPs (non-synonymous SNPs) are potential phosSNPs (phosphorylation-related SNPs) (64,035) in 17,614 proteins (Ren, et al., 2010).

21. The Phosphorylation Site Database: provides ready access to information from the primary scientific literature concerning those proteins from prokaryotic organisms, i.e., the members of the domains Archaea and Bacteria, that have been reported to undergo covalent phosphorylation on the hydroxyl side chains of serine, threonine, and/or tyrosine residues (Wurgler-Murphy, et al., 2004).

22. PhosphoGRID: a database of experimentally verified in vivo phosphorylation sites curated from the S. cerevisiae primary literature. PhosphoGRID records the positions of over 5,000 specific phosphorylated residues on 1,495 gene products (Stark, et al., 2004).

<2> Prediction of non-specific or organism-specific phosphorylation sites:

1. NetPhos 2.0: produces neural network predictions for serine, threonine and tyrosine phosphorylation sites in eukaryotic proteins (Blom, et al., 1999).

2. CRP: Cleaved Radioactivity of Phosphopeptides. CRP performs an in silico proteolytic cleavage of the sequence and reports the predicted Edman cycles in which radioactivity would be observed if a given serine, threonine or tyrosine were phosphorylated (Mackey, et al., 2003).

3. DISPHOS 1.3: uses disorder information to improve the discrimination between phosphorylation and non-phosphorylation sites, and predicts serine, threonine and tyrosine phosphorylation sites in proteins (Iakoucheva, et al., 2004).

4. NetPhosYeast 1.0: predicts serine and threonine phosphorylation sites in yeast proteins (Ingrell, et al., 2007).

5. NetPhosBac 1.0: predicts serine and threonine phosphorylation sites in bacterial proteins (Miller, et al., 2009).

6. PhosPhAt 3.0: version 2.2 used a set of 802 experimentally validated serine phosphorylation sites as the training data set, while the 3.0 predictor for phosphorylation sites in Arabidopsis was developed with an additional 1,818 threonine phosphorylation sites and 676 tyrosine sites (Heazlewood, et al., 2008; Durek, et al., 2010).

7. PHOSIDA (Mirror website): a predictor based on more than 5,000 high-confidence phosphosites, using the support vector machine (SVM) algorithm (Gnad, et al., 2007).

8. GANNPhos: uses a genetic algorithm integrated neural network (GANN) algorithm (Tang, et al., 2007). The tool is not available.

9. PHOSITE: is based on the case-based sequence analysis (Koenig and Grabe, 2004). The tool is not available.
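
The CRP entry above describes an in silico proteolytic cleavage followed by a report of the Edman cycle in which each candidate residue would be released. The idea can be illustrated with a minimal Python sketch. It is not the CRP implementation: the function names are hypothetical, and the protease rule (trypsin, cleaving after K or R unless the next residue is P) is an assumption chosen for illustration.

```python
import re

def tryptic_peptides(sequence):
    """Split a sequence after K or R, except when the next residue is P.

    This simplified trypsin rule is an assumption for illustration only.
    Returns a list of (0-based start offset, peptide) tuples.
    """
    peptides = []
    start = 0
    # Zero-width match at every position preceded by K/R and not followed by P.
    for match in re.finditer(r"(?<=[KR])(?!P)", sequence):
        end = match.start()
        if end > start:
            peptides.append((start, sequence[start:end]))
            start = end
    if start < len(sequence):
        peptides.append((start, sequence[start:]))
    return peptides

def edman_cycles(sequence):
    """List (residue, protein position, peptide, Edman cycle) for each S/T/Y.

    The Edman cycle is the 1-based position of the residue within its
    peptide, i.e. the cycle in which it would be released.
    """
    results = []
    for start, peptide in tryptic_peptides(sequence):
        for offset, residue in enumerate(peptide):
            if residue in "STY":
                results.append((residue, start + offset + 1, peptide, offset + 1))
    return results

if __name__ == "__main__":
    for row in edman_cycles("MKSAPRPTSKY"):
        print(row)
```

Each reported tuple links a candidate site in the protein to the cycle number at which radioactivity would be expected if that site carried the labeled phosphate.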

<3> Prediction of kinase-specific phosphorylation sites or phospho-binding motifs:

1. GPS 2.1: The current version of the GPS system. We renamed the tool as the Group-based Prediction System. The GPS 2.1 software was implemented in Java and can predict kinase-specific phosphorylation sites for 408 human protein kinases in hierarchy (Xue, et al., 2008).

2. GPS 1.10: The old version of GPS. We designed a novel algorithm, GPS (Group-based Phosphorylation sites Prediction), and constructed an easy-to-use web server for experimentalists (Xue, et al., 2005; Zhou, et al., 2004).

3. PPSP 1.0: We also developed another online program for prediction of kinase-specific phosphorylation sites, based on Bayesian Decision Theory (BDT) (Xue, et al., 2006).

4. ScanProsite: consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them (de Castro, et al., 2006; Hulo, et al., 2008).

5. ELM: is a resource for predicting functional sites in eukaryotic proteins (Puntervoll, et al., 2003).

6. Minimotif Miner: analyzes protein queries for the presence of short functional motifs that, in at least one protein, have been demonstrated to be involved in post-translational modifications (PTMs), binding to other proteins, nucleic acids, or small molecules, or protein trafficking (Balla, et al., 2006; Rajasekaran, et al., 2009).

7. PhosphoMotif Finder: contains known kinase/phosphatase substrate motifs as well as binding motifs that are curated from the published literature. It reports the presence of any literature-derived motif in the query sequence (Amanchy, et al., 2007).

8. PREDIKIN 1.0: produces a prediction of substrates for serine/threonine protein kinases based on the primary sequence of a protein kinase catalytic domain (Brinkworth, et al., 2003).

9. Predikin & PredikinDB 2.0: consists of two components: (i) PredikinDB, a database of phosphorylation sites that links substrates to kinase sequences and (ii) a Perl module, which provides methods to classify protein kinases, reliably identify substrate-determining residues, generate scoring matrices and score putative phosphorylation sites in query sequences (Saunders, et al., 2008; Saunders and Kobe, 2008).

10. ScanSite 2.0: searches for motifs within proteins that are likely to be phosphorylated by specific protein kinases or bind to domains such as SH2 domains, 14-3-3 domains or PDZ domains (Obenauer, et al., 2003).

11. NetPhosK 1.0: produces neural network predictions of kinase-specific eukaryotic protein phosphorylation sites. Currently NetPhosK covers the following kinases: PKA, PKC, PKG, CKII, Cdc2, CaM-II, ATM, DNA PK, Cdk5, p38 MAPK, GSK3, CKI, PKB, RSK, INSR, EGFR and Src (Blom, et al., 2004).

12. PredPhospho 1.0: implemented with an SVM algorithm, predicts kinase-specific phosphorylation sites for 4 kinase groups and 4 kinase families, respectively (Kim, et al., 2004).

13. PredPhospho 2.0: an enhanced version of the PredPhospho predictor, still implemented with an SVM algorithm, for 7 kinase groups and 18 kinase families, respectively (Ryu, et al., 2009).

14. KinasePhos 1.0: predicts kinase-specific phosphorylation sites within given protein sequences. A profile Hidden Markov Model (HMM) is trained on each group of sequences surrounding the phosphorylated residues (Huang, et al., 2005).

15. KinasePhos 2.0: a new version of the kinase-specific phosphorylation site prediction tool, based on sequence-based amino acid coupling-pattern analysis and solvent accessibility as new features for the SVM (support vector machine) (Wong, et al., 2007).

16. PhoScan: predicts kinase-specific phosphorylation sites from sequence features by a log-odds ratio approach (Li, et al., 2007).

17. pkaPS: Prediction of protein kinase A phosphorylation sites using the simplified kinase binding model (Neuberger, et al., 2007).

18. CRPhos 0.8: Prediction of kinase-specific phosphorylation sites using conditional random fields. Its source code is free for academic research and can be compiled on Linux/Unix systems (Dang, et al., 2008).

19. AutoMotif 2.0: allows for identification of PTM (post-translational modification) sites, including phosphorylation sites, in proteins. For the AutoMotif Server 2.0, a support vector machine (SVM) was trained for each type of PTM separately on proteins of the Swiss-Prot database (version 42.0) (Plewczynski, et al., 2005; Plewczynski, et al., 2008).

20. MetaPredPS: Meta-predictors make predictions by organizing and processing the predictions produced by several other predictors in a defined problem domain (Wan, et al., 2008).

21. SMALI: searches for peptide ligands in human proteins that are likely to bind to SH2 domains (Huang, et al., 2008; Li, et al., 2008).

22. NetPhorest: is a non-redundant collection of 125 sequence-based classifiers for linear motifs in phosphorylation-dependent signaling. The collection contains both family-based and gene-specific classifiers (Miller, et al., 2008).

23. SiteSeek: is trained using a novel compact evolutionary and hydrophobicity profile to detect possible protein phosphorylation sites for a target sequence (Yoo, et al., 2008). The tool is not available.

24. PostMod: is a prediction server for phosphorylation sites. The authors combined physicochemical information, motif information, and evolutionary information by simply comparing sequence similarities, and can predict phosphorylation sites for 48 different kinases (Jung, et al., 2010).
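
Several of the predictors above score the residues flanking a candidate site against what is seen in known substrates of a kinase. A log-odds ratio approach, as mentioned for PhoScan, is one way to do this. The sketch below is a generic position-specific log-odds matrix in Python, not the PhoScan algorithm itself; the function names, the pseudocount, and the toy peptides are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_log_odds(positives, background, pseudocount=1.0):
    """Build a per-position log2-odds matrix from aligned, equal-length peptides.

    positives: peptides centered on known sites of one kinase (toy data here).
    background: peptides used to estimate overall residue frequencies.
    """
    width = len(positives[0])
    bg_counts = Counter(aa for pep in background for aa in pep if aa in AMINO_ACIDS)
    bg_total = sum(bg_counts.values()) + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for i in range(width):
        column = Counter(pep[i] for pep in positives)
        column_total = len(positives) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            foreground = (column[aa] + pseudocount) / column_total
            expected = (bg_counts[aa] + pseudocount) / bg_total
            scores[aa] = math.log2(foreground / expected)
        matrix.append(scores)
    return matrix

def score_peptide(matrix, peptide):
    """Sum the per-position log-odds; unknown residues contribute 0."""
    return sum(column.get(aa, 0.0) for column, aa in zip(matrix, peptide))

if __name__ == "__main__":
    # Hypothetical toy peptides, site at index 3; not real training data.
    positives = ["RRASV", "RRGSL", "KRQSA", "RRLSI"]
    background = ["AGLSV", "DEGSP", "LLTSQ", "GAVSE"]
    matrix = build_log_odds(positives, background)
    print(score_peptide(matrix, "RRASL"), score_peptide(matrix, "DEGSP"))
```

A peptide that resembles the positive set receives a higher summed score than one that resembles the background; a real predictor would choose its cut-off from a much larger, curated training set.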

<4> Miscellaneous tools:

1. DOG 1.0: prepares publication-quality figures of protein domain structures. The scale of a protein domain and the position of a functional motif/site will be precisely calculated (Ren, et al., 2009).

2. Motif-X: is a software tool designed to extract overrepresented patterns from any sequence data set. The algorithm is an iterative strategy which builds successive motifs through comparison to a dynamic statistical background (Schwartz and Gygi, 2005).

3. Scan-X: is a software tool designed to find motifs (identified using motif-x) within any sequence data set. The first large scale scan was performed using all available human, mouse, fly and yeast phosphorylation and acetylation data to perform a scan for undiscovered sites (Schwartz, et al., 2008).

4. MoDL: finds multiple motifs in a set of phosphorylated peptides (Ritz, et al., 2009).

5. PhosphoBlast: allows the user to submit a protein query to search against the curated dataset of phosphorylated peptides (Wang and Klemke, 2008).

6. RLIMS-P: is a rule-based text-mining program specifically designed to extract protein phosphorylation information on protein kinase, substrate and phosphorylation sites from the abstracts (Hu, et al., 2005; Yuan, et al., 2006).

7. KEA: Kinase enrichment analysis (KEA) is a web-based tool with an underlying database providing users with the ability to link lists of mammalian proteins/genes with the kinases that phosphorylate them (Lachmann and Ma'ayan, 2009).
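
Scanning a sequence for a previously identified motif, as Scan-X does with motif-x patterns, can be illustrated with a short Python sketch. This is not the Scan-X code: the function name is hypothetical, and it is assumed that the motif is written as a regular expression in which "." stands for any residue.

```python
import re

def scan_motif(sequence, motif, center_index):
    """Find all (possibly overlapping) motif matches in a protein sequence.

    motif: regular expression, e.g. "R..S", where "." matches any residue.
    center_index: 0-based index of the modified residue inside the motif.
    Returns (1-based site position in the sequence, matched segment) tuples.
    """
    # A lookahead with a capture group lets overlapping matches be reported.
    pattern = re.compile("(?=(" + motif + "))")
    return [
        (match.start() + center_index + 1, match.group(1))
        for match in pattern.finditer(sequence)
    ]

if __name__ == "__main__":
    # Hypothetical sequence; "R..S" places the candidate serine at index 3.
    print(scan_motif("MARRASLLRKGSPE", "R..S", 3))
```

Reporting the position of the central residue rather than the start of the match makes the output directly comparable with phosphorylation site annotations in the databases listed in section <1>.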

<5> Detection of potential phosphorylation sites from mass spectrometry data:

1. PhosphoScore: is a phosphorylation assignment program that is compatible with all levels of tandem mass spectrometry spectra (MSn) generated through the Bioworks/Sequest platform. The program utilizes a "cost function" which takes into account both the match quality and normalized intensity of observed spectral peaks compared to a theoretical spectrum. PhosphoScore was written in Java (Ruttenberg, et al., 2008).

2. Ascore: measures the probability of correct phosphorylation site localization based on the presence and intensity of site-determining ions in MS/MS spectra (Beausoleil, et al., 2006).

3. Colander: a probability-based support vector machine algorithm for automatic screening of CID spectra of phosphopeptides prior to database search (Lu, et al., 2008).

4. DeBunker: an SVM-based program that automatically validates phosphopeptide identifications from tandem mass spectra (Lu, et al., 2007).

5. APIVASE 2.2: was developed for phosphopeptide validation by combining the information obtained from MS2 spectra and their corresponding neutral loss MS3 spectra (Jiang, et al., 2008).

6. InsPecT: a new scoring function was developed for phosphorylated peptide tandem mass spectra for ion-trap instruments, without the need for manual validation (Payne, et al., 2008).

7. Phosphopeptide FDR Estimator: is designed for analysis of phosphopeptide LC-MS/MS data (Du, et al., 2008). The tool is not available.

8. PhosTShunter: a fast and reliable tool to detect phosphorylated peptides in liquid chromatography Fourier transform tandem mass spectrometry data sets (Kocher, et al., 2006). The tool is not available.

9. PhosphoScan: a probability-based method for phosphorylation site prediction using MS2/MS3 pair information (Wan, et al., 2008). The tool is not available.

10. ArMone: a new scoring function was developed for phosphorylated peptide tandem mass spectra for ion-trap instruments, without the need for manual validation (Jiang, et al., 2010).
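
Probability-based localization scores such as Ascore ask how unlikely it is that the observed site-determining ions were matched by chance. A simplified Python sketch of this idea follows. It is not the published Ascore implementation, which additionally compares candidate sites across several peak depths; the function names and the per-ion random-match probability `p` are assumptions made for illustration.

```python
import math

def cumulative_binomial(n_matched, n_ions, p):
    """Probability of matching at least n_matched of n_ions by chance,
    when each ion matches independently with probability p."""
    return sum(
        math.comb(n_ions, k) * p ** k * (1.0 - p) ** (n_ions - k)
        for k in range(n_matched, n_ions + 1)
    )

def localization_score(n_matched, n_ions, p):
    """Return -10 * log10 of the chance-match probability (higher is better)."""
    return -10.0 * math.log10(cumulative_binomial(n_matched, n_ions, p))

def score_difference(matched_a, matched_b, n_ions, p):
    """Difference in score between two candidate sites of the same peptide."""
    return localization_score(matched_a, n_ions, p) - localization_score(matched_b, n_ions, p)

if __name__ == "__main__":
    # Hypothetical counts: site A matches 5 of 6 site-determining ions, site B 1 of 6.
    print(score_difference(5, 1, 6, 0.05))
```

A large positive difference favors the first candidate site, while a difference near zero means the spectrum does not discriminate between the two positions.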