CSV File of Complete Lab Design Submission Data for Download

Hi to all the Devs!

Many of us EteRNA addicts are also “spreadsheet nuts.” I know that I, and at least a few others, would love to be able to download the complete Lab Submission Data Listing (with full sequence data) into a spreadsheet, so that we can crunch away to our hearts’ content without having to enter the data manually (which would be completely prohibitive with the current number of submissions).

What say you, Devs? Can a CSV file of the Lab Data be auto-constructed and auto-updated so that it continually reflects the current state of the data in the online Lab submission window? And can this CSV file then be made available for download?

Thanks, and Best Regards,

-d9

I wouldn’t mind having the library of puzzles as well.
Need a control dataset to figure out if there is an optimal formula.

Thanks
Berex

Did you have a look at this file?
http://eterna.cmu.edu/sites/cached_cs…
I think it’s exactly what you wanted (for the current lab).

Wow, pytho, thank you! Just perfect. :)
…where can I continue to get updated ones?

Pytho, how did you get that data?
According to that data, they’ve been synthesizing a lot more submissions…?

Most interestingly, jeehyung’s submission got a 94 and so did a christmas tree… Both in round one.

If anyone wants to find them faster, they’re nids 26607 and 27014.

I think that the “synthesis score” column in that spreadsheet gives each non-synthesized design the score of whichever of the 8 designs synthesized in that round it was most similar to.

In Round One, donald’s submission scored 94, and apparently three other solutions were considered “closest” to it out of the eight synthesized. There are the two you mention, and also anneromaine’s Bulge1.1 (nid 26657).
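If that reading is right, the bookkeeping is simple enough to sketch in a few lines of Python. This is only a guess at the scheme, not the site’s actual code; the function name, the data layout, and the `distance` metric are all assumptions (a plausible metric is sketched further down the thread).

```python
# Hypothetical reconstruction of the scoring scheme described above:
# each non-synthesized design inherits the score of the synthesized
# design (from the same round) that it is "closest" to.

def nearest_synthesized_score(design_seq, synthesized, distance):
    """synthesized: list of (sequence, score) pairs from the same round.
    distance: any sequence-distance function; the real metric is unknown."""
    closest_seq, closest_score = min(
        synthesized, key=lambda pair: distance(design_seq, pair[0])
    )
    return closest_score
```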

Whoa! Yes, tell us!

If I have a theory like “there is a correlation between how often Gs are adjacent to each other and how well the RNA folds,” it is hard to test it without the list of sequences.
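With the full sequence list in hand, a theory like that becomes testable in a few lines. A sketch, assuming you have already pulled (sequence, score) pairs out of the CSV; the names here are made up for illustration:

```python
def adjacent_g_count(seq):
    """Count positions where a G is immediately followed by another G."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a == b == "G")

def pearson(xs, ys):
    """Plain Pearson correlation, no external libraries needed."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)

# designs = [(sequence, score), ...] pulled from the downloaded CSV
# r = pearson([adjacent_g_count(s) for s, _ in designs],
#             [sc for _, sc in designs])
```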

Of even more interest is the SHAPE column – how do we interpret that? It looks excellent.

I just had a look at what the Flash player loads in the background, and apparently it’s a nice CSV file. :)
And I would just guess that the number 26603 corresponds to the ‘myVal’ number that occurs in the URL of the current lab.
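Once you’ve saved that file locally, Python’s standard csv module reads it straight into rows for crunching. A minimal sketch; the filename and column names below are guesses, so check the actual header line of the file:

```python
import csv

# "lab_designs.csv" is whatever name you saved the downloaded file under.
with open("lab_designs.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Column names here are assumptions; print rows[0].keys() for the real ones.
for row in rows[:5]:
    print(row.get("nid"), row.get("synthesis score"), row.get("sequence"))
```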

How are all those synthesis scores calculated…? I’m very curious about that.

This itself should be a question on getsatisfaction.

Ding, I could buy that except for the christmas tree example. You’re saying the christmas tree design was similar to donald’s results?

If you sort the list of Round One submissions by most-similar to the Christmas Tree, the closest one that was actually synthesized is donald’s.

I’m not saying that I think the two are similar, just that however “similarity” is calculated, the Christmas Tree was “more similar” to donald’s than to any of the others that went to synthesis.

Honestly, I think it might be a bit of a flaw in the lab rewards system. “Similar” doesn’t really mean much if you’re talking about a design that was very dissimilar from all the synthesized designs.

/agree with Ding

This has been added to the task list (case #399). Thanks for the idea!

EteRNA team

That 94% for that “Christmas Tree,” I’m thinking, must be an error or an anomaly; every single other design with 27 to 33 G-C pairs that has already been scored received the absolute minimum score… all 73 of them.

However, there are still 72 high G-C designs that have not yet received a score (as of the date of this spreadsheet download), so we will have to see whether there are any other unexpected (and suspicious) exceptions. It could be extremely enlightening if we discover that SOME high G-C designs actually do work!

If that turns out to be the case, it will raise some very interesting questions, like: “What made these one or two high G-C designs (out of nearly 200) actually work, when all the rest failed miserably?”

If these high scores were not determined in synthesis, but by the Similarity Algorithm, then it is almost certainly time for a serious review of this algorithm.

If an error can erroneously elevate a design that, by all other accounts, ought to fail or score very low, then perhaps it is also possible (or at least conceivable) that a very good design could be evaluated very negatively and wrongly receive a very low score.

8 designs are synthesized - every other design takes the score of the closest synthesized design.

For that “Christmas tree”, donald’s 94 scored design was the closest.

Here are the distances from the round 1 synthesized designs to the “Christmas tree”:

distance  nid    score
46        26626  83
58        26628  86
52        26634  82
44        26639  86
56        26654  89
45        26693  87
57        26711  77
43        26735  94 (donald’s)

That file is there for our Flash interface to get data from the database. We don’t mind at all if you use it; that’s just in case you were wondering what the file is for.

Jee, I apologize, but I cannot see HOW this all-G-C design can possibly be closest to Donald’s 94% winner when all the other all-G-C designs received the absolute minimum score. It seems something must be wrong somewhere. Please see my last entry above; edits were added since you read it.

My guess from looking at the numbers is that the “similarity” is based very simply on the # of nucleotides two designs have in the same position. So if Christmas Tree had the same orientations of GC pairs at all the positions that donald’s design had GC pairs, that would lead to relatively high “similarity”, despite the fact that other measures like overall % of different kinds of bonds/nucleotides, melting point, and free energy might be very different.
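In code, that guess is just a per-position mismatch count (a Hamming distance) between two equal-length sequences; it could also serve as the `distance` function in the earlier sketch. Again, only a guess at the lab’s real metric:

```python
def mismatch_distance(seq_a, seq_b):
    """Number of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)
```

The distances in jeehyung’s table above (e.g. 43 from the tree to donald’s design) are at least consistent with a count of this kind.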

Another thing to note is that I don’t think any actual “christmas trees” were synthesized in that particular round, and I think the comparison only goes within a round.

Ding’s right - the comparison only goes within rounds. That Christmas tree was submitted in round 1, and no all-G-C-pairs design was synthesized in round 1.

As you can see, it differs by 43 bases from the 94 design, which means about half of the bases are different; basically, they are completely different RNAs. The tree was completely different from all 8 synthesized designs, but it happened that the 94 design was relatively the closest.

This scheme was designed so that every design in the lab (even a non-synthesized one) gets at least some score, and people can get at least some reward for participating. This obviously isn’t an optimal way of evaluating. We are actively looking for a better way to evaluate non-synthesized designs, along with a better way of selecting candidates.

@Ding: I fear you may be correct, and I say “fear” because that would be a serious flaw: it would mean the same design could score 10% (the minimum) in one round and 94% in another, just because of a positional similarity of G-C pairs to a very good design, irrespective of the quality of the rest of the design. I am certain the developers would not want this perception of possible inequities in the scoring system, or possible flaws in the similarity algorithm, to proliferate. I hope something can be done.