Foldit Wiki

In a Foldit "de-novo" puzzle, players are given a fixed sequence of amino acids, presented as a straight "extended chain". Unlike design puzzles, which also start with an extended chain, no mutation is allowed on de-novo puzzles. Also unlike a design puzzle, a de-novo puzzle typically has some secondary structures (helixes or sheets) defined. The puzzle comments typically state that the secondary structure predictions are "from PSIPRED".

The subject of secondary structure predictions came up in #veteran chat on 8 January 2017 (UTC-6). An edited version of the chat log appears below.

Background

Some general background on the topics discussed in the chat may be helpful.

Amino acid sequence and secondary structure notation in Foldit

The amino acid sequence (or "primary structure") of a Foldit puzzle is typically represented as a string of single-character amino acid codes. Recent Foldit puzzles typically have the sequence on the web page. For example, for Puzzle 1326, the sequence is:

TDDFREELKKMLKEYKRHSQEHYRSSRSTDDGRTSTEVRYDHDNGTSEVRSTSDNGDEEIRKQLKEMKKELKKQG

This style is often referred to as "Fasta format". (Fasta has many variations; often there's a short header that gives the sequence a name.) While the sequence shown here is in upper case, Foldit functions such as structure.GetAminoAcid and structure.SetAminoAcid use lowercase.
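As a rough illustration of the format, the following Python sketch reads a single Fasta record, with or without a header line, and produces the lowercase codes that the Foldit functions expect. The function names read_fasta and to_foldit_codes are made up for this example and are not part of Foldit or any Fasta library.

```python
def read_fasta(text):
    """Return (header, sequence) from a single-record Fasta string.

    The header is None when the text is a bare sequence with no '>' line.
    The sequence is returned in upper case with line breaks removed.
    """
    header = None
    parts = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: keep the name, drop the '>' marker.
            header = line[1:].strip()
        else:
            parts.append(line)
    return header, "".join(parts).upper()


def to_foldit_codes(sequence):
    """Lowercase one-letter codes, as used by structure.GetAminoAcid."""
    return sequence.lower()
```

The sequence can then be pasted into a prediction site in upper case, or lowercased before being compared with what a recipe reads from the puzzle.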

Many Foldit recipes use a similar format for secondary structure. The Foldit standard is to use "H" for helix, "E" for sheet, and "L" for loop. The starting secondary structure for Puzzle 1326 is

LLHHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEELLLLLLLEELLLLLLHHHHHHHHHHHHHHHHHLL

in this format. Other tools may use "-" or a blank space for loop. And just to keep things confusing, sheets are sometimes called "strands", and "coil" may be used instead of "loop". On the other hand, there's "coiled coil", where two or more helixes twist together, as seen in puzzle 479.
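Because the tools disagree on the symbol for loop, a small translation step is handy before comparing anything. The Python sketch below maps "C", "-", and blank onto Foldit's "L". The function name to_foldit_ss is hypothetical, and the choice to reject any other code (rather than guess) is an assumption of this example.

```python
def to_foldit_ss(prediction):
    """Map a secondary structure string onto Foldit's H/E/L alphabet.

    H and E pass through. 'C' (coil), '-', a blank, and 'L' itself are
    all treated as loop. Any other code raises ValueError.
    """
    loop_symbols = {"C", "-", " ", "L"}
    out = []
    for ch in prediction.upper():
        if ch in ("H", "E"):
            out.append(ch)
        elif ch in loop_symbols:
            out.append("L")
        else:
            raise ValueError(f"unexpected secondary structure code: {ch!r}")
    return "".join(out)
```

The result has the same length as the input, so it still lines up with the amino acid sequence position by position.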

Foldit recipes that work with amino acid sequence and secondary structure

The Foldit recipe Print Protein 2.4 prints the amino acid sequence and secondary structure in the format shown above. For convenience, both structures are also presented for copy and paste.

The Foldit recipes AA Edit 1.2 and SS Edit 1.2 show the current amino acid and secondary structure sequence, and allow the user to paste in new sequences.

The recipe AA Copy Paste Compare v 1.1.1 -- Brow42 combines both amino acid and secondary structure display and change in one recipe.

Tools mentioned in the chat

The chat mentioned several tools that predict secondary structure and other aspects of a fold based on the amino acid sequence. These tools are available online, and accept the simple "Fasta" format shown above for the input sequence.

The first tool is PSIPRED, which is used to produce the secondary structure prediction of most Foldit de-novos. One of PSIPRED's outputs is similar to the secondary structure format shown above.

Another popular tool is Jpred, which produces several predictions of the secondary structure based on the amino acid sequence. Jpred also attempts to find any matching or similar sequences for published proteins. Jpred's main predictions for secondary structure are similar to the format shown above.

The chat also mentioned NetSurfP, which produces secondary structure predictions as probabilities for each segment. This led to the Foldit recipe NetSurfP 1.0, which converts NetSurfP output into the secondary structure format shown above (and also reformats the NetSurfP output so it can be more easily pasted into a spreadsheet).

Finally, NetTurnP is closely related to NetSurfP, but produces a segment-by-segment analysis of where there are likely to be turns. A Foldit recipe to digest NetTurnP output is no doubt forthcoming.

Comparison of predictions

The prediction tools described above were compared for Puzzle 1326.

PSIPRED

One version of the PSIPRED prediction is a simple text file:

# PSIPRED HFORMAT (PSIPRED V3.3)
               1         2         3         4         5         6         7     
      123456789012345678901234567890123456789012345678901234567890123456789012345
Conf: 915999999999999851688752057789998400155416887210011678872999999999999997439
Pred: CHHHHHHHHHHHHHHHHHHHHHHHCCCCCCCCCCCEEEEEECCCCCCCEECCCCCCHHHHHHHHHHHHHHHHHCC
  AA: TDDFREELKKMLKEYKRHSQEHYRSSRSTDDGRTSTEVRYDHDNGTSEVRSTSDNGDEEIRKQLKEMKKELKKQG

The secondary structure prediction is:

CHHHHHHHHHHHHHHHHHHHHHHHCCCCCCCCCCCEEEEEECCCCCCCEECCCCCCHHHHHHHHHHHHHHHHHCC

or translated into Foldit:

LHHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEEELLLLLLLEELLLLLLHHHHHHHHHHHHHHHHHLL

This is a little different from the starting secondary structure for Puzzle 1326:

LHHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEEELLLLLLLEELLLLLLHHHHHHHHHHHHHHHHHLL (PSIPRED)
LLHHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEELLLLLLLEELLLLLLHHHHHHHHHHHHHHHHHLL (Puzzle 1326 start)

The difference suggests that PSIPRED was run with different settings when Puzzle 1326 was set up. The tool has many modes and options; only the default mode was used for this analysis. Some of the modes are proprietary and require a license key to run.
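For anyone who wants to pull the rows out of the PSIPRED text file with a script, here is a minimal Python sketch. It assumes the layout shown above, with rows labelled "Conf:", "Pred:", and "AA:", and it joins repeated rows in case a longer sequence is wrapped into several blocks. The function names and the confidence threshold of 5 are choices made for this example, not anything defined by PSIPRED.

```python
def parse_psipred_horiz(text):
    """Collect the Conf, Pred and AA rows of PSIPRED horizontal text output.

    Rows with the same label are concatenated in order. The header line
    and the ruler lines have no matching label, so they are ignored.
    """
    rows = {"Conf": [], "Pred": [], "AA": []}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key in rows:
            rows[key].append(value.strip())
    return {key: "".join(chunks) for key, chunks in rows.items()}


def low_confidence_positions(conf, threshold=5):
    """1-based positions whose confidence digit is below the threshold."""
    return [i for i, digit in enumerate(conf, start=1) if int(digit) < threshold]
```

The low-confidence positions are a reasonable place to start looking when two tools disagree, since those are the segments PSIPRED itself is least sure about.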

Jpred

The main Jpred prediction for the sequence from Puzzle 1326 is:

OrigSeq TDDFREELKKMLKEYKRHSQEHYRSSRSTDDGRTSTEVRYDHDNGTSEVRSTSDNGDEEIRKQLKEMKKELKKQG
Jnet    --HHHHHHHHHHHHHHHHHHH---------------EEEEE------EEEE-----HHHHHHHHHHHHHHHHH--
jhmm    --HHHHHHHHHHHHHHHHHHH---------------EEEEE------EEEE-----HHHHHHHHHHHHHHHHH--

Jnet and jhmm are two different prediction methods, but here they produced the same results. Converted to Foldit style, here's the comparison to the Puzzle 1326 start:

LLHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLLLLLEEEEELLLLLLEEEELLLLLHHHHHHHHHHHHHHHHHLL (Jpred)
LLHHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEELLLLLLLEELLLLLLHHHHHHHHHHHHHHHHHLL (Puzzle 1326 start)

Jpred predicts an initial helix that is shorter than the one shown at the start of Puzzle 1326, and a second sheet that is longer.
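Spotting the differences between two 75-character strings by eye is error-prone. The Python sketch below lists each run of positions where two Foldit-style strings disagree, using 1-based segment numbers. The function name ss_differences is hypothetical, and the sketch assumes both strings have already been converted to the same H/E/L alphabet.

```python
def ss_differences(first, second):
    """Return (start, end, first_part, second_part) for each run of
    consecutive positions where two equal-length strings disagree.

    Positions are 1-based and inclusive, like Foldit segment numbers.
    """
    if len(first) != len(second):
        raise ValueError("strings must be the same length")
    runs = []
    start = None
    for i, (a, b) in enumerate(zip(first, second), start=1):
        if a != b:
            if start is None:
                start = i  # a new run of differences begins here
        elif start is not None:
            # The run ended at the previous position.
            runs.append((start, i - 1, first[start - 1:i - 1], second[start - 1:i - 1]))
            start = None
    if start is not None:
        # The strings still disagree at the final position.
        runs.append((start, len(first), first[start - 1:], second[start - 1:]))
    return runs
```

Each tuple gives the segment range and what each string says there, which is enough to decide which ends of a helix or sheet are in dispute.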

NetSurfP

The NetSurfP output was reduced by the Foldit recipe NetSurfP 1.0:

LLHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEEELLLLLEEEEELLLLLHHHHHHHHHHHHHHHHHLL (NetSurfP)
LLHHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEELLLLLLLEELLLLLLHHHHHHHHHHHHHHHHHLL (Puzzle 1326 start)

Combined

The three slightly different predictions and the Puzzle 1326 start, combined in one box:

LHHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEEELLLLLLLEELLLLLLHHHHHHHHHHHHHHHHHLL (PSIPRED)
LLHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLLLLLEEEEELLLLLLEEEELLLLLHHHHHHHHHHHHHHHHHLL (Jpred)
LLHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEEELLLLLEEEEELLLLLHHHHHHHHHHHHHHHHHLL (NetSurfP)
LLHHHHHHHHHHHHHHHHHHHHHHHLLLLLLLLLLLEEEEELLLLLLLEELLLLLLHHHHHHHHHHHHHHHHHLL (Puzzle 1326 start)
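With several aligned strings like these, one more line that marks where they agree can make the box easier to read. The Python sketch below produces two such lines: one showing only unanimous positions, and one showing the majority choice. Both function names are hypothetical, and marking ties and disagreements with "?" is simply a convention chosen for this example.

```python
from collections import Counter


def agreement_line(predictions):
    """One character per position: the shared code where every string
    agrees, or '?' where at least one differs."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("need one or more strings of equal length")
    return "".join(
        column[0] if len(set(column)) == 1 else "?"
        for column in zip(*predictions)
    )


def majority_line(predictions):
    """Most common code at each position, or '?' when the top two tie."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("need one or more strings of equal length")
    out = []
    for column in zip(*predictions):
        ranked = Counter(column).most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            out.append("?")
        else:
            out.append(ranked[0][0])
    return "".join(out)
```

Printed under the box, the "?" positions show at a glance which helix ends and sheet lengths the tools dispute.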

As Susume mentions in the chat, all these tools probably share a weak spot: predicting sheets on the outside of a protein. Puzzle 1326 is likely a protein originally designed by Foldit players. Foldit designs often have a relatively flat section of two or more sheets opposite one or more helixes. This is referred to as the "hotdogs and surfboard" model in the chat, where the hotdogs are the helixes and the surfboard is the sheets. In designs of this type, the sheets on the outer edge of the surfboard tend to have a lot of hydrophilic residues (blue in Foldit's coloring, the "all-blue areas" of the chat) on both the "outer" and "inner" (helix-facing) sides. The prediction services seem to have difficulty guessing that these hydrophilic sequences form sheets, and tend to predict them as loops.

The Chat

Here is the #veteran chat that discussed all these tools. The chat has been lightly edited to remove some interspersed conversations and correct a few typos. All times are UTC-6, or US Central Standard Time.

Session Start: Sun Jan 08 10:25:56 2017
Session Ident: #veteran
Susume2 when I have a design that jpred can't get anywhere close on the SS, I guess Rosetta is likely to choke on it as well :-P 10:25
Formula350 I don't know for certain but I've speculated that the predictions might be only there as a hint but aren't always in the right place; thus, might need to be moved somewhere else. 13:46
Susume2 the SS predictions are from an algorithm called psipred - it has certain pros and cons - one con is that it can't recognize all-blue areas as outside sheets and predicts them as loops 13:49
Susume2 not sure why it had that extra helix on a recent one though 13:50
TomTaylor5 There are also other prediction sites you can try. 13:50
Susume2 I always check jpred as a point of comparison, though it has the same weakness on all-blue areas 13:51
@TimovdL I always get a second opinion of jpred 13:51
kabubi how do you use jpred? copying the sequence manually? 13:52
@TimovdL I use a small script that prints the sequence 13:53
Susume2 Loci has a nice script called AA Edit that will let you put the sequence in the clipboard 13:53
TomTaylor5 This would be a good one for 350. NetSurfP 13:53
tokens I think in the previous puzzle the predictor got confuzed by the leucines and arginines which are more often found on helices than on sheets 13:53
Formula350 Software to download, or a website Tom? 13:53
kabubi i had problems to positioning that helix indeed 13:54
TomTaylor5 like PSIPRED. 13:54
tokens well, leucines at least 13:54
TomTaylor5 Secondary structure predictions 13:55
kabubi i'm trying jpred now...the sequence is in the puzzle page 13:55
Formula350 Found it, thanks Tom. We'll see if I can figure out how to use it! :D 13:56
Skippysk8sirc Kabubi -- I may just be a hack, but sometimes I just change things and ignore predictions..... easier to fold 13:56
Skippysk8sirc You can also copy and paste the AA sequence from the puzzle notes..... quick and easy 13:57
TomTaylor5 Copy/paste to text box, press Submit query 13:57
Formula350 The notes on the Puzzle menu, Skip? 13:58
kabubi to see the result in applet , it ask 7gb of ram... 13:58
Skippysk8sirc yes, at bottom of page for puzzle 13:58
Formula350 How do you do that? I was just in there trying haha 13:58
TomTaylor5 ctrl-a, ctrl-v 13:59
Susume2 on jpred, I always show all results in html - it gives a nice fixed-width lineup of homologous sequences if it finds any (it won't find them on these foldit designs though) 13:59
Skippysk8sirc go from main foldit page to puzzles, click on puzzle number, and then move mouse down to sequence which is listed. 13:59
Formula350 Oh lol 14:00
Formula350 I was in the game's puzzle menu. 14:00
Formula350 Didn't know you meant on the website. 14:00
kabubi H stand for? 14:01
kabubi helix? 14:01
Susume2 H helix, E sheet, L loop 14:02
kabubi ok, thanks 14:02
Skippysk8sirc 1326 is wacky though... 14:04
kabubi this tool are for professional biochemists.. 14:06
TomTaylor5 We can fold with the best of them :) 14:07
Formula350 For the heck of it I ran the B-Turn predictor.... Wonder how long that'll take haha 14:09
kabubi how much of you are biochemist or similar? 14:09
Susume2 very few - I know I'm not - and I use jpred all the time 14:10
kabubi something similar? 14:10
Formula350 I'm quite possibly the furthest thing from a BioChemist, Chemist, or anything even remotely close to anyting in any scientific field. lol 14:10
Susume2 I was a computer programmer - now a housewife 14:11
kabubi computer programmer me too 14:12
Skippysk8sirc retired logistician who likes jigsaw puzzles LOL 14:12
Formula350 I'm an Imagineer <dot dot dot> lol 14:13
Skippysk8sirc after a while, sometimes you start to see patterns that may be about scoring system or about good protein design.... hard to tell. But it does help 14:13
Skippysk8sirc took me a year 14:13
kabubi what is an imagineer? 14:14
Formula350 Someone who Imagines stuff. 14:14
kabubi designer? 14:14
Formula350 It was just a joke to be honest :P 14:14
@TimovdL Dont you know the song Imagine? That really tells it 14:15
Formula350 Though, in the earlier days of Disney stuff, the designer and cartoonists I believe were actually called "Imagineers". 14:16
kabubi i'm italian, i don't get wordplay in english quickly 14:17
Formula350 Ah, my apologies. 14:17
TomTaylor5 Don't worry. Sometimes we don't get 350 either. lol. 14:17
Formula350 *nods* 14:18
TomTaylor5 :) 14:18
kabubi i don't get it indeed...what 350 stand for 14:18
TomTaylor5 Formula350. Too lazy to type the whole name 14:19
Formula350 "Imagineering - a term for Creative Engineering, coined by Alcoa Corporation and made famous by the Disney Company." 14:20
Formula350 My B-Turn prediction finished. Spat out a ton of stuff I can't make any sense of! lol 14:23
kabubi what about using only the jpred prediction with 9 as prediction accuracy? 14:23
Formula350 (expired link removed, see NetTurnP web page to try your own prediction) 14:23
Formula350 Looks like NetSurfP will output the same type of thing. (That was with 1326 sequence BTW) 14:26
Susume2 so it predicts turns at 29-33, 41-45, 53-56 14:27
Susume2 pretty consistent with psipred, misses the turn around 20 that psipred also misses 14:29
Formula350 Oh, ok, I see that now. 5-turn, 5-turn, 4-turn? Which I assume in between those turns are... sheets? 14:29
TomTaylor5 (expired link removed, see "Comparison of predictions" above for NetSurfP results, or NetSurfP web page to try your own prediction) 14:31
Susume2 it is only predicting beta turns, which are typically but not always sheet-sheet turns 14:31
tokens I just look for 2 segments of aspartate/asparagine/glycine 14:33
Susume2 in tom's link, the last 3 columns are probability of helix/sheet/loop by amino acid number 14:33
Formula350 So all turns between sheets will pretty much be Aspartate, Asparagine, and Glycine? 14:34
TomTaylor5 Isn't Glycine usually only on turns? 14:35
tokens In proteins designed in foldit they usually are 14:35
Susume2 in a foldit design puzzzle, if they got full filter bonus, glycine is only in turns, and only where necessary 14:35
Susume2 in nature there are other places glycines occur 14:36
Formula350 Well, I can say I somewhat understand what is being displayed in those two prediction results, but not fully able to know (-yet-) how to apply it to FoldIt 14:40
TomTaylor5 They are just another take on the prediction given in the puzzle. 14:41
TomTaylor5 Well, the initial design with the helixes, sheets, and loops already set up. 14:42
TomTaylor5 Some other prediction could indicate the sheet is a helix. 14:44
Susume2 my favorite way to use jpred is, print puzle sequence in a text document; under that print SS predicted by psipred (--HHH--EE etc.); under that print SS predicted by jpred (on "show all as html" page of jpred results); see which AAs they agree on and which they disagree; if their loops are too long, see which AAs I might change to sheets & helices, etc. 14:44
Formula350 Something tells me I'm still geting ahead of myself a bit, as my FoldIt career is still well in its infancy... <_> 14:46
Susume2 to use the netsurfp results, i would find/write a process to translate them into the same format as the others (--EEEE--HHH-- etc.) and paste them in teh same text document 14:46
LociOiling h'mmm netsurfp 14:47
LociOiling new one on me 14:48
TomTaylor5 I copy to Excel and in a new column set to E,H,- until it looks OK. 14:48
TomTaylor5 For NetSurfP 14:48
LociOiling will take a look 14:49
Formula350 Are there any prediction programs/services that uh... *cough* outputs a picture? eh-heh 14:49
Susume2 there are, but if you mean a picture of the 3D structure, the foldit rules say that is cheating ;-) 14:49
LociOiling not a picture exactly, but maybe a pdb 14:49
Formula350 Oh, fair enough on the cheating, then. 14:50
LociOiling if we're talking denovos, chances of any modeling programs helping is kind of limited 14:51
Susume2 agreed 14:51
LociOiling I think the recent denovos are designed by us, more or less 14:51
Susume2 not denovo per se, but designed proteins 14:51
LociOiling yep, maybe rosetta@home, us or ???? 14:52
LociOiling may I take this opportunity to plug "print protein 2.4" for it's ability to print primary, secondary, etc. in both string and spreadsheet format 14:53
LociOiling also "AA Edit 1.2" and "SS Edit 1.2" for simple strings 14:55
Susume2 I love AA edit and SS edit - was going to write one of those and checked the website and woohoo, Loci already did! 14:56
Vredeman hi folks, is there a way that someone can put this conversation up somewhere for us to learn from, its going too fast for me :) 14:57
Norrjane That would be great. 14:58
LociOiling there's an IRC to HTML program somewhere, I'll look into it 14:58
Vredeman thanks you loci :) 14:58
LociOiling just seeing what netsurfp coughed up for a recent revisting puzzle 15:03
LociOiling ah spreadsheet format 15:04
LociOiling kind of appropriate, lead author's given name is "Bent" 15:05
Vredeman chuckle 15:07
LociOiling netsurfp output not pasting into spreadsheet for me, 15:09
LociOiling so folks are looking at netsurfp mainly for turns 15:10
LociOiling ? 15:10
Formula350 NetTurnP would be for Turns, no? 15:11
LociOiling I thought netsurfp was discount internet or something 15:12
LociOiling oh, I see, it's probabilities 15:12
LociOiling still getting over a cold, more spare time than usual, but not quite 100% 15:13
LociOiling reading previous chat, wouldn't have guessed Kabubi was Italian 15:14
Formula350 Yea I thought it was something like that, too, when Tom mentioned it. lol 15:14
LociOiling I confuse myself about 50 times a day, don't really need y'all for that 15:15
LociOiling I can probably whip up something to digest that netsurfp format 15:20
LociOiling speaking on lost in translation, lysine, tyrosine, and tryptophan might be lost in lo Stivale 15:25
LociOiling (how's *that* for confusing, thanks to Camilleri for the idea...something Montalbano didn't feel like explaining to Catarella*) 15:26
LociOiling wow, that cleared the room 17:46
LociOiling meanwhile, new recipe time 17:46
LociOiling NetSurfP 1.0, recipe 102278 17:47
TomTaylor5 added 17:47
LociOiling you paste in NetSurfP output 17:48
LociOiling recipe converts to tab-delimited, and makes secondary structure string 17:48
TomTaylor5 What criteria do you use for percentage 17:48
LociOiling highest wins, it's pretty dumb at this point 17:48
TomTaylor5 That's the easiest :P 17:49
LociOiling if pHelix > pLoop then 17:49
LociOiling seems to work so far 17:50
LociOiling remember, I'm on cough medicine! 17:50
TomTaylor5 That's where I start and then if for example a sheet looks too small adjust. 17:50
LociOiling (not true, but still) 17:50
LociOiling yep, didn't try tie breaking 17:50
TomTaylor5 Haven't really seen too many exactly the same though it could occur. 17:51
LociOiling I ran it on sequence from 1323, and it matched my solo except to 49-52 17:51
LociOiling I had sheet, NetSurfP had helix 17:51
TomTaylor5 What percentages were the second choices? 17:52
LociOiling might explain why I came in 26th place 17:52
TomTaylor5 Could be :) 17:52
TomTaylor5 I like the percentages shown. Makes me think I can improve on the prediction. 17:53
LociOiling helix was around 0.48 for 49-52, sheet 0.3 - 0.37 17:54
TomTaylor5 That's encouraging. 17:54
LociOiling much closer than other spots, for 59-70 is 0.97 helix 17:54
LociOiling always good to have addtional source of confusion, er *information* 17:55
TomTaylor5 lol, true. 17:55
TomTaylor5 It can be a good excuse. 17:56
LociOiling oh, not 100% about line formats, had to kludge a bit to make the paste of the NetSurfP output readable on Windows 17:56
LociOiling hopefully it will work on Linux and Mac as well 17:57
Formula350 Aww I was hoping it converted the protein to what NetSurf came up with lol 18:04
TomTaylor5 Just ran NetSurfP v1.0 and it works great! 18:11

Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!

Addendum

*In The Age of Doubt by Andrea Camilleri (Penguin Books. 2012. ISBN 978-0-14-312092-6.), Montalbano tries to relay the words "Kimberly Process" to Catarella over the phone. "[O]nce they got past the stumbling block at the K, there was still the Y at the end." The base Italian alphabet consists of 21 letters and does not include K (lysine), Y (tyrosine), or W (tryptophan). You'd have to know Vigata to get the rest.
