In a Foldit "de-novo" puzzle, players are given a fixed sequence of amino acids, presented as a straight "extended chain". Unlike design puzzles, which also start with an extended chain, no mutation is allowed on de-novo puzzles. Also unlike a design puzzle, a de-novo puzzle typically has some secondary structures (helixes or sheets) defined. The puzzle comments typically state that the secondary structure predictions are "from PSIPRED".
The subject of secondary structure predictions came up in #veteran chat on 8 January 2017 (UTC-6). An edited version of the chat log appears below.
Background
Some general background on the topics discussed in the chat may be helpful.
Amino acid sequence and secondary structure notation in Foldit
The amino acid sequence (or "primary structure") of a Foldit puzzle is typically represented as a string of single-character amino acid codes. Recent Foldit puzzles typically have the sequence on the web page. For example, for Puzzle 1326, the sequence is:
TDDFREELKKMLKEYKRHSQEHYRSSRSTDDGRTSTEVRYDHDNGTSEVRSTSDNGDEEIRKQLKEMKKELKKQG
This style is often referred to as "Fasta format". (Fasta has many variations; often there's a short header that gives the sequence a name.) While the sequence shown here is in upper case, Foldit functions such as structure.GetAminoAcid and structure.SetAminoAcid use lowercase codes.
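As a rough illustration of handling this format outside Foldit, the Python sketch below reads Fasta-style text, skips an optional header line, and returns the sequence in either case. The function name `read_fasta_sequence` and the header text `puzzle_1326` are made up for this example; only the first 20 residues of the Puzzle 1326 sequence are used.

```python
def read_fasta_sequence(text):
    """Return the sequence from Fasta-style text as one upper-case string.

    Header lines (starting with ">") and blank lines are skipped, and the
    remaining lines are joined, so bare and headed input both work.
    """
    lines = [line.strip() for line in text.splitlines()]
    return "".join(
        line for line in lines if line and not line.startswith(">")
    ).upper()


# Hypothetical header; the residues are the start of the Puzzle 1326 sequence.
example = ">puzzle_1326\nTDDFREELKK\nMLKEYKRHSQ\n"
sequence = read_fasta_sequence(example)
print(sequence)          # upper case, as shown on the puzzle page
print(sequence.lower())  # lower case, as the Foldit structure functions use
```

Lowercasing is left to the caller, since the web tools and the puzzle page use upper case while Foldit's own functions use lower case.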
Many Foldit recipes use a similar format for secondary structure. The Foldit standard is to use "H" for helix, "E" for sheet, and "L" for loop. The starting secondary structure for Puzzle 1326 is
LLHHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEELLLLLLLEELLLLLLHHHHHHHHHHHHHHHHHLL
in this format. Other tools may use "-" or a blank space for loop. And just to keep things confusing, sheets are sometimes called "strands", and "coil" may be used instead of "loop". On the other hand, there's "coiled coil", where two or more helixes twist together, as seen in puzzle 479.
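Since other tools write loops as "C", "-", or a blank, a small converter can bring their output into the Foldit H/E/L convention. The following Python sketch is one way to do it; the function name `to_foldit_ss` and the exact set of symbols treated as loop are assumptions based on the description above, not taken from any Foldit recipe.

```python
# Map the loop/coil symbols used by other tools onto Foldit's "L".
# The set of symbols treated as loop here is an assumption.
_TO_FOLDIT = str.maketrans({"C": "L", "-": "L", " ": "L"})


def to_foldit_ss(prediction):
    """Convert a secondary structure string to Foldit's H/E/L codes."""
    converted = prediction.upper().translate(_TO_FOLDIT)
    unknown = set(converted) - set("HEL")
    if unknown:
        # Refuse to guess at codes this sketch does not know about.
        raise ValueError(f"unrecognized codes: {sorted(unknown)}")
    return converted


print(to_foldit_ss("CHHHCCEE--"))
```

Raising an error on unknown codes is a design choice: some tools use additional letters, and silently mapping them to loop could hide a real difference.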
Foldit recipes that work with amino acid sequence and secondary structure
The Foldit recipe Print Protein 2.4 prints the amino acid sequence and secondary structure in the format shown above. For convenience, both structures are also presented for copy and paste.
The Foldit recipes AA Edit 1.2 and SS Edit 1.2 show the current amino acid and secondary structure sequence, and allow the user to paste in new sequences.
The recipe AA Copy Paste Compare v 1.1.1 -- Brow42 combines both amino acid and secondary structure display and change in one recipe.
Tools mentioned in the chat
The chat mentioned several tools that predict secondary structure and other aspects of a fold based on the amino acid sequence. These tools are available online, and accept the simple "Fasta" format shown above for the input sequence.
The first tool is PSIPRED, which is used to produce the secondary structure prediction of most Foldit de-novos. One of PSIPRED's outputs is similar to the secondary structure format shown above.
Another popular tool is Jpred, which produces several predictions of the secondary structure based on the amino acid sequence. Jpred also attempts to find any matching or similar sequences for published proteins. Jpred's main predictions for secondary structure are similar to the format shown above.
The chat also mentioned NetSurfP, which produces secondary structure predictions as probabilities for each segment. This led to the Foldit recipe NetSurfP 1.0, which converts NetSurfP output into the secondary structure format shown above (and also reformats the NetSurfP output so it can be more easily pasted into a spreadsheet).
Finally, NetTurnP is closely related to NetSurfP, but produces a segment-by-segment analysis of where there are likely to be turns. A Foldit recipe to digest NetTurnP output is no doubt forthcoming.
Comparison of predictions
The prediction tools described above were compared for Puzzle 1326.
PSIPRED
One version of the PSIPRED prediction is a simple text file:
# PSIPRED HFORMAT (PSIPRED V3.3)
               1         2         3         4         5         6         7
      123456789012345678901234567890123456789012345678901234567890123456789012345
Conf: 915999999999999851688752057789998400155416887210011678872999999999999997439
Pred: CHHHHHHHHHHHHHHHHHHHHHHHCCCCCCCCCCCEEEEEECCCCCCCEECCCCCCHHHHHHHHHHHHHHHHHCC
  AA: TDDFREELKKMLKEYKRHSQEHYRSSRSTDDGRTSTEVRYDHDNGTSEVRSTSDNGDEEIRKQLKEMKKELKKQG
The secondary structure prediction is:
CHHHHHHHHHHHHHHHHHHHHHHHCCCCCCCCCCCEEEEEECCCCCCCEECCCCCCHHHHHHHHHHHHHHHHHCC
or translated into Foldit:
LHHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEEELLLLLLLEELLLLLLHHHHHHHHHHHHHHHHHLL
This is slightly different from the start for Puzzle 1326:
LHHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEEELLLLLLLEELLLLLLHHHHHHHHHHHHHHHHHLL (PSIPRED)
LLHHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEELLLLLLLEELLLLLLHHHHHHHHHHHHHHHHHLL (Puzzle 1326 start)
The difference suggests that PSIPRED was run with different settings when Puzzle 1326 was set up. The tool has many different modes and options. Only the default mode was used for this analysis. Some of the modes are proprietary and require a license key to run.
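Pulling the prediction out of the PSIPRED text file and translating it to Foldit codes can be scripted. The Python sketch below assumes the layout shown above, with the prediction on rows labelled "Pred:", and also assumes that a longer sequence might be wrapped over several such rows. The function names are invented for this example, and the sample text is a shortened, made-up file rather than real PSIPRED output.

```python
def psipred_prediction(horiz_text):
    """Join the "Pred:" rows of PSIPRED horizontal-format text."""
    parts = []
    for line in horiz_text.splitlines():
        label, sep, rest = line.strip().partition(":")
        if sep and label == "Pred":
            parts.append(rest.strip())
    return "".join(parts)


def psipred_to_foldit(horiz_text):
    """Return the PSIPRED prediction with coil ("C") written as loop ("L")."""
    return psipred_prediction(horiz_text).replace("C", "L")


# Shortened, made-up sample in the same layout as the file shown above.
sample = """# PSIPRED HFORMAT (PSIPRED V3.3)

Conf: 91599
Pred: CHHHH
  AA: TDDFR

Conf: 99985
Pred: HHCEE
  AA: EELKK
"""
print(psipred_to_foldit(sample))
```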
Jpred
The main Jpred prediction for the sequence from Puzzle 1326 is:
OrigSeq TDDFREELKKMLKEYKRHSQEHYRSSRSTDDGRTSTEVRYDHDNGTSEVRSTSDNGDEEIRKQLKEMKKELKKQG
Jnet    --HHHHHHHHHHHHHHHHHHH---------------EEEEE------EEEE-----HHHHHHHHHHHHHHHHH--
jhmm    --HHHHHHHHHHHHHHHHHHH---------------EEEEE------EEEE-----HHHHHHHHHHHHHHHHH--
Jnet and jhmm are two different prediction methods, but here they produced the same results. Converted to Foldit style, here's the comparison to the puzzle 1326 start:
LLHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLLLLLEEEEELLLLLLEEEELLLLLHHHHHHHHHHHHHHHHHLL (Jpred)
LLHHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEELLLLLLLEELLLLLLHHHHHHHHHHHHHHHHHLL (Puzzle 1326 start)
Jpred predicts the initial helix is shorter than shown at the start of Puzzle 1326.
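Differences in helix or sheet length are easier to see when each string is broken into numbered runs. The Python sketch below does this with `itertools.groupby`; the function name `segments` is made up for this example, and residues are numbered from 1. The two strings are the Jpred prediction and the Puzzle 1326 start shown above.

```python
from itertools import groupby


def segments(ss):
    """List (code, first, last) runs in a secondary structure string,
    numbering residues from 1."""
    runs = []
    position = 1
    for code, group in groupby(ss):
        length = len(list(group))
        runs.append((code, position, position + length - 1))
        position += length
    return runs


jpred = "LLHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLLLLLEEEEELLLLLLEEEELLLLLHHHHHHHHHHHHHHHHHLL"
start = "LLHHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEELLLLLLLEELLLLLLHHHHHHHHHHHHHHHHHLL"

for name, ss in (("Jpred", jpred), ("Puzzle 1326 start", start)):
    helixes = [run for run in segments(ss) if run[0] == "H"]
    print(name, helixes)
```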
NetSurfP
The NetSurfP output was reduced by the Foldit recipe NetSurfP 1.0:
LLHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEEELLLLLEEEEELLLLLHHHHHHHHHHHHHHHHHLL (NetSurfP)
LLHHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEELLLLLLLEELLLLLLHHHHHHHHHHHHHHHHHLL (Puzzle 1326 start)
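In the chat, the recipe's rule for turning probabilities into a structure code is described as "highest wins", with no special tie-breaking. The Python sketch below illustrates that idea only; it is not the recipe's code (Foldit recipes are written in Lua). It assumes, following the chat, that the last three columns of each NetSurfP row are the helix, sheet, and loop probabilities, and that lines starting with "#" are comments. The function names and the sample rows are made up.

```python
def highest_wins(p_helix, p_sheet, p_loop):
    """Pick the Foldit code with the largest probability.

    Ties go to the earlier entry (H, then E, then L), an arbitrary choice.
    """
    scored = [(p_helix, "H"), (p_sheet, "E"), (p_loop, "L")]
    return max(scored, key=lambda pair: pair[0])[1]


def netsurfp_to_foldit(text):
    """Build a Foldit secondary structure string from NetSurfP-style rows."""
    codes = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, comment lines, and rows too short to hold
        # the three probability columns.
        if len(fields) < 3 or line.lstrip().startswith("#"):
            continue
        p_helix, p_sheet, p_loop = (float(f) for f in fields[-3:])
        codes.append(highest_wins(p_helix, p_sheet, p_loop))
    return "".join(codes)


# Made-up rows; only the last three columns matter to this sketch.
sample = """# illustrative values only
E T seq 1 0.5 80.0 -0.1 0.10 0.05 0.85
B F seq 2 0.1 20.0  0.3 0.70 0.10 0.20
B V seq 3 0.1 20.0  0.3 0.10 0.80 0.10
"""
print(netsurfp_to_foldit(sample))
```

Keeping the probabilities alongside the chosen code, as TomTaylor5 does in a spreadsheet, makes it easier to spot close calls such as the 0.48 helix versus 0.3 to 0.37 sheet case mentioned in the chat.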
Combined
The four slightly different predictions combined in one box:
LHHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEEELLLLLLLEELLLLLLHHHHHHHHHHHHHHHHHLL (PSIPRED)
LLHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLLLLLEEEEELLLLLLEEEELLLLLHHHHHHHHHHHHHHHHHLL (Jpred)
LLHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEEELLLLLEEEEELLLLLHHHHHHHHHHHHHHHHHLL (NetSurfP)
LLHHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEELLLLLLLEELLLLLLHHHHHHHHHHHHHHHHHLL (Puzzle 1326 start)
As Susume mentions in the chat, all these tools probably share a weak spot: predicting sheets on the outside of a protein. Puzzle 1326 is likely a protein originally designed by Foldit players. Foldit designs often have a relatively flat section of two or more sheets opposite one or more helixes. This is referred to as the "hotdogs and surf board" model in the chat, where the hotdogs are the helixes and the surfboard is the sheets. In designs of this type, the sheets on the outer edge of the surfboard tend to have a lot of hydrophilic residues (the "all-blue areas" of the chat) on both the "outer" and "inner" (helix-facing) sides. The prediction services seem to have difficulty guessing that these hydrophilic sequences form sheets.
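The line-by-line comparison Susume describes in the chat (stack the predictions and see where they agree and disagree) can also be done by a script. The Python sketch below reports the residue ranges where the four strings above are not unanimous. The function names are made up, residues are numbered from 1, and the strings are copied from the combined box.

```python
def disagreements(predictions):
    """Return 1-based positions where the named predictions differ."""
    rows = list(predictions.values())
    return [
        position
        for position, column in enumerate(zip(*rows), start=1)
        if len(set(column)) > 1
    ]


def to_ranges(positions):
    """Collapse sorted positions into (first, last) ranges."""
    ranges = []
    for position in positions:
        if ranges and position == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], position)
        else:
            ranges.append((position, position))
    return ranges


predictions = {
    "PSIPRED": "LHHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEEELLLLLLLEELLLLLLHHHHHHHHHHHHHHHHHLL",
    "Jpred": "LLHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLLLLLEEEEELLLLLLEEEELLLLLHHHHHHHHHHHHHHHHHLL",
    "NetSurfP": "LLHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEEELLLLLEEEEELLLLLHHHHHHHHHHHHHHHHHLL",
    "Puzzle 1326 start": "LLHHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEELLLLLLLEELLLLLLHHHHHHHHHHHHHHHHHLL",
}

for first, last in to_ranges(disagreements(predictions)):
    print(f"residues {first}-{last}:")
    for name, ss in predictions.items():
        print(f"  {ss[first - 1:last]}  ({name})")
```

The ranges it prints are the places worth a closer look when deciding where a helix or sheet should start or end.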
The Chat
Here is the #veteran chat that discussed all these tools. The chat has been lightly edited to remove some interspersed conversations and correct a few typos. All times are UTC-6, or US Central Standard Time.
Session Start: Sun Jan 08 10:25:56 2017 | ||
Session Ident: #veteran | ||
Susume2 | when I have a design that jpred can't get anywhere close on the SS, I guess Rosetta is likely to choke on it as well :-P | 10:25 |
---|---|---|
Formula350 | I don't know for certain but I've speculated that the predictions might be only there as a hint but aren't always in the right place; thus, might need to be moved somewhere else. | 13:46 |
Susume2 | the SS predictions are from an algorithm called psipred - it has certain pros and cons - one con is that it can't recognize all-blue areas as outside sheets and predicts them as loops | 13:49 |
Susume2 | not sure why it had that extra helix on a recent one though | 13:50 |
TomTaylor5 | There are also other prediction sites you can try. | 13:50 |
Susume2 | I always check jpred as a point of comparison, though it has the same weakness on all-blue areas | 13:51 |
@TimovdL | I always get a second opinion of jpred | 13:51 |
kabubi | how do you use jpred? copying the sequence manually? | 13:52 |
@TimovdL | I use a small script that prints the sequence | 13:53 |
Susume2 | Loci has a nice script called AA Edit that will let you put the sequence in the clipboard | 13:53 |
TomTaylor5 | This would be a good one for 350. NetSurfP | 13:53 |
tokens | I think in the previous puzzle the predictor got confuzed by the leucines and arginines which are more often found on helices than on sheets | 13:53 |
Formula350 | Software to download, or a website Tom? | 13:53 |
kabubi | i had problems to positioning that helix indeed | 13:54 |
TomTaylor5 | like PSIPRED. | 13:54 |
tokens | well, leucines at least | 13:54 |
TomTaylor5 | Secondary structure predictions | 13:55 |
kabubi | i'm trying jpred now...the sequence is in the puzzle page | 13:55 |
Formula350 | Found it, thanks Tom. We'll see if I can figure out how to use it! :D | 13:56 |
Skippysk8sirc | Kabubi -- I may just be a hack, but sometimes I just change things and ignore predictions..... easier to fold | 13:56 |
Skippysk8sirc | You can also copy and paste the AA sequence from the puzzle notes..... quick and easy | 13:57 |
TomTaylor5 | Copy/paste to text box, press Submit query | 13:57 |
Formula350 | The notes on the Puzzle menu, Skip? | 13:58 |
kabubi | to see the result in applet , it ask 7gb of ram... | 13:58 |
Skippysk8sirc | yes, at bottom of page for puzzle | 13:58 |
Formula350 | How do you do that? I was just in there trying haha | 13:58 |
TomTaylor5 | ctrl-a, ctrl-v | 13:59 |
Susume2 | on jpred, I always show all results in html - it gives a nice fixed-width lineup of homologous sequences if it finds any (it won't find them on these foldit designs though) | 13:59 |
Skippysk8sirc | go from main foldit page to puzzles, click on puzzle number, and then move mouse down to sequence which is listed. | 13:59 |
Formula350 | Oh lol | 14:00 |
Formula350 | I was in the game's puzzle menu. | 14:00 |
Formula350 | Didn't know you meant on the website. | 14:00 |
kabubi | H stand for? | 14:01 |
kabubi | helix? | 14:01 |
Susume2 | H helix, E sheet, L loop | 14:02 |
kabubi | ok, thanks | 14:02 |
Skippysk8sirc | 1326 is wacky though... | 14:04 |
kabubi | this tool are for professional biochemists.. | 14:06 |
TomTaylor5 | We can fold with the best of them :) | 14:07 |
Formula350 | For the heck of it I ran the B-Turn predictor.... Wonder how long that'll take haha | 14:09 |
kabubi | how much of you are biochemist or similar? | 14:09 |
Susume2 | very few - I know I'm not - and I use jpred all the time | 14:10 |
kabubi | something similar? | 14:10 |
Formula350 | I'm quite possibly the furthest thing from a BioChemist, Chemist, or anything even remotely close to anyting in any scientific field. lol | 14:10 |
Susume2 | I was a computer programmer - now a housewife | 14:11 |
kabubi | computer programmer me too | 14:12 |
Skippysk8sirc | retired logistician who likes jigsaw puzzles LOL | 14:12 |
Formula350 | I'm an Imagineer <dot dot dot> lol | 14:13 |
Skippysk8sirc | after a while, sometimes you start to see patterns that may be about scoring system or about good protein design.... hard to tell. But it does help | 14:13 |
Skippysk8sirc | took me a year | 14:13 |
kabubi | what is an imagineer? | 14:14 |
Formula350 | Someone who Imagines stuff. | 14:14 |
kabubi | designer? | 14:14 |
Formula350 | It was just a joke to be honest :P | 14:14 |
@TimovdL | Dont you know the song Imagine? That really tells it | 14:15 |
Formula350 | Though, in the earlier days of Disney stuff, the designer and cartoonists I believe were actually called "Imagineers". | 14:16 |
kabubi | i'm italian, i don't get wordplay in english quickly | 14:17 |
Formula350 | Ah, my apologies. | 14:17 |
TomTaylor5 | Don't worry. Sometimes we don't get 350 either. lol. | 14:17 |
Formula350 | *nods* | 14:18 |
TomTaylor5 | :) | 14:18 |
kabubi | i don't get it indeed...what 350 stand for | 14:18 |
TomTaylor5 | Formula350. Too lazy to type the whole name | 14:19 |
Formula350 | "Imagineering - a term for Creative Engineering, coined by Alcoa Corporation and made famous by the Disney Company." | 14:20 |
Formula350 | My B-Turn prediction finished. Spat out a ton of stuff I can't make any sense of! lol | 14:23 |
kabubi | what about using only the jpred prediction with 9 as prediction accuracy? | 14:23 |
Formula350 | (expired link removed, see NetTurnP web page to try your own prediction) | 14:23 |
Formula350 | Looks like NetSurfP will output the same type of thing. (That was with 1326 sequence BTW) | 14:26 |
Susume2 | so it predicts turns at 29-33, 41-45, 53-56 | 14:27 |
Susume2 | pretty consistent with psipred, misses the turn around 20 that psipred also misses | 14:29 |
Formula350 | Oh, ok, I see that now. 5-turn, 5-turn, 4-turn? Which I assume in between those turns are... sheets? | 14:29 |
TomTaylor5 | (expired link removed, see "Comparison of predictions" above for NetSurfP results, or NetSurfP web page to try your own prediction) | 14:31 |
Susume2 | it is only predicting beta turns, which are typically but not always sheet-sheet turns | 14:31 |
tokens | I just look for 2 segments of aspartate/asparagine/glycine | 14:33 |
Susume2 | in tom's link, the last 3 columns are probability of helix/sheet/loop by amino acid number | 14:33 |
Formula350 | So all turns between sheets will pretty much be Aspartate, Asparagine, and Glycine? | 14:34 |
TomTaylor5 | Isn't Glycine usually only on turns? | 14:35 |
tokens | In proteins designed in foldit they usually are | 14:35 |
Susume2 | in a foldit design puzzzle, if they got full filter bonus, glycine is only in turns, and only where necessary | 14:35 |
Susume2 | in nature there are other places glycines occur | 14:36 |
Formula350 | Well, I can say I somewhat understand what is being displayed in those two prediction results, but not fully able to know (-yet-) how to apply it to FoldIt | 14:40 |
TomTaylor5 | They are just another take on the prediction given in the puzzle. | 14:41 |
TomTaylor5 | Well, the initial design with the helixes, sheets, and loops already set up. | 14:42 |
TomTaylor5 | Some other prediction could indicate the sheet is a helix. | 14:44 |
Susume2 | my favorite way to use jpred is, print puzle sequence in a text document; under that print SS predicted by psipred (--HHH--EE etc.); under that print SS predicted by jpred (on "show all as html" page of jpred results); see which AAs they agree on and which they disagree; if their loops are too long, see which AAs I might change to sheets & helices, etc. | 14:44 |
Formula350 | Something tells me I'm still geting ahead of myself a bit, as my FoldIt career is still well in its infancy... <_> | 14:46 |
Susume2 | to use the netsurfp results, i would find/write a process to translate them into the same format as the others (--EEEE--HHH-- etc.) and paste them in teh same text document | 14:46 |
LociOiling | h'mmm netsurfp | 14:47 |
LociOiling | new one on me | 14:48 |
TomTaylor5 | I copy to Excel and in a new column set to E,H,- until it looks OK. | 14:48 |
TomTaylor5 | For NetSurfP | 14:48 |
LociOiling | will take a look | 14:49 |
Formula350 | Are there any prediction programs/services that uh... *cough* outputs a picture? eh-heh | 14:49 |
Susume2 | there are, but if you mean a picture of the 3D structure, the foldit rules say that is cheating ;-) | 14:49 |
LociOiling | not a picture exactly, but maybe a pdb | 14:49 |
Formula350 | Oh, fair enough on the cheating, then. | 14:50 |
LociOiling | if we're talking denovos, chances of any modeling programs helping is kind of limited | 14:51 |
Susume2 | agreed | 14:51 |
LociOiling | I think the recent denovos are designed by us, more or less | 14:51 |
Susume2 | not denovo per se, but designed proteins | 14:51 |
LociOiling | yep, maybe rosetta@home, us or ???? | 14:52 |
LociOiling | may I take this opportunity to plug "print protein 2.4" for it's ability to print primary, secondary, etc. in both string and spreadsheet format | 14:53 |
LociOiling | also "AA Edit 1.2" and "SS Edit 1.2" for simple strings | 14:55 |
Susume2 | I love AA edit and SS edit - was going to write one of those and checked the website and woohoo, Loci already did! | 14:56 |
Vredeman | hi folks, is there a way that someone can put this conversation up somewhere for us to learn from, its going too fast for me :) | 14:57 |
Norrjane | That would be great. | 14:58 |
LociOiling | there's an IRC to HTML program somewhere, I'll look into it | 14:58 |
Vredeman | thanks you loci :) | 14:58 |
LociOiling | just seeing what netsurfp coughed up for a recent revisting puzzle | 15:03 |
LociOiling | ah spreadsheet format | 15:04 |
LociOiling | kind of appropriate, lead author's given name is "Bent" | 15:05 |
Vredeman | chuckle | 15:07 |
LociOiling | netsurfp output not pasting into spreadsheet for me, | 15:09 |
LociOiling | so folks are looking at netsurfp mainly for turns | 15:10 |
LociOiling | ? | 15:10 |
Formula350 | NetTurnP would be for Turns, no? | 15:11 |
LociOiling | I thought netsurfp was discount internet or something | 15:12 |
LociOiling | oh, I see, it's probabilities | 15:12 |
LociOiling | still getting over a cold, more spare time than usual, but not quite 100% | 15:13 |
LociOiling | reading previous chat, wouldn't have guessed Kabubi was Italian | 15:14 |
Formula350 | Yea I thought it was something like that, too, when Tom mentioned it. lol | 15:14 |
LociOiling | I confuse myself about 50 times a day, don't really need y'all for that | 15:15 |
LociOiling | I can probably whip up something to digest that netsurfp format | 15:20 |
LociOiling | speaking on lost in translation, lysine, tyrosine, and tryptophan might be lost in lo Stivale | 15:25 |
LociOiling | (how's *that* for confusing, thanks to Camilleri for the idea...something Montalbano didn't feel like explaining to Catarella*) | 15:26 |
LociOiling | wow, that cleared the room | 17:46 |
LociOiling | meanwhile, new recipe time | 17:46 |
LociOiling | NetSurfP 1.0, recipe 102278 | 17:47 |
TomTaylor5 | added | 17:47 |
LociOiling | you paste in NetSurfP output | 17:48 |
LociOiling | recipe converts to tab-delimited, and makes secondary structure string | 17:48 |
TomTaylor5 | What criteria do you use for percentage | 17:48 |
LociOiling | highest wins, it's pretty dumb at this point | 17:48 |
TomTaylor5 | That's the easiest :P | 17:49 |
LociOiling | if pHelix > pLoop then | 17:49 |
LociOiling | seems to work so far | 17:50 |
LociOiling | remember, I'm on cough medicine! | 17:50 |
TomTaylor5 | That's where I start and then if for example a sheet looks too small adjust. | 17:50 |
LociOiling | (not true, but still) | 17:50 |
LociOiling | yep, didn't try tie breaking | 17:50 |
TomTaylor5 | Haven't really seen too many exactly the same though it could occur. | 17:51 |
LociOiling | I ran it on sequence from 1323, and it matched my solo except to 49-52 | 17:51 |
LociOiling | I had sheet, NetSurfP had helix | 17:51 |
TomTaylor5 | What percentages were the second choices? | 17:52 |
LociOiling | might explain why I came in 26th place | 17:52 |
TomTaylor5 | Could be :) | 17:52 |
TomTaylor5 | I like the percentages shown. Makes me think I can improve on the prediction. | 17:53 |
LociOiling | helix was around 0.48 for 49-52, sheet 0.3 - 0.37 | 17:54 |
TomTaylor5 | That's encouraging. | 17:54 |
LociOiling | much closer than other spots, for 59-70 is 0.97 helix | 17:54 |
LociOiling | always good to have addtional source of confusion, er *information* | 17:55 |
TomTaylor5 | lol, true. | 17:55 |
TomTaylor5 | It can be a good excuse. | 17:56 |
LociOiling | oh, not 100% about line formats, had to kludge a bit to make the paste of the NetSurfP output readable on Windows | 17:56 |
LociOiling | hopefully it will work on Linux and Mac as well | 17:57 |
Formula350 | Aww I was hoping it converted the protein to what NetSurf came up with lol | 18:04 |
TomTaylor5 | Just ran NetSurfP v1.0 and it works great! | 18:11 |
Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!
Addendum
*In The Age of Doubt by Andrea Camilleri (Penguin Books. 2012. ISBN 978-0-14-312092-6.), Montalbano tries to relay the words "Kimberly Process" to Catarella over the phone. "[O]nce they got past the stumbling block at the K, there was still the Y at the end." The base Italian alphabet consists of 21 letters and does not include K (lysine), Y (tyrosine), or W (tryptophan). You'd have to know Vigata to get the rest.