What RNA design algorithms do the EteRNA bots use?

In the EteRNA player puzzles and the RNA lab, we have 4 “bots” that try to solve puzzles and submit their RNA sequences.

These bots are computer algorithms designed to take in the “target shape” and output RNA sequences that fold into that shape. They are called “inverse folding” algorithms, as opposed to “folding” algorithms, which try to figure out an RNA’s shape given its sequence.

The following link explains inverse folding algorithms in more detail:
http://www.stats.ox.ac.uk/__data/asse…

The 4 EteRNA bots use the following algorithms:

(1) ViennaRNA: The ViennaRNA (a.k.a. RNAfold) package’s RNAinverse program. ViennaRNA mainly runs a stochastic search - in plain words, it tries changing bases at random until it finds a sequence that folds into the target shape. It starts from a random sequence, tries random base changes, and keeps the changes that bring the sequence’s fold closest to the target shape.

Because of its random nature, ViennaBot’s performance is often unstable. It can solve a very hard puzzle in seconds when it gets lucky, but it can also get stuck on a very easy puzzle when out of luck.

http://www.tbi.univie.ac.at/RNA/
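Here is a toy sketch of that kind of adaptive random walk, assuming the ViennaRNA Python bindings (the `RNA` module) are installed. The real RNAinverse is more sophisticated, so treat this purely as an illustration of the idea described above.

```python
import random
import RNA  # ViennaRNA Python bindings (assumed installed)

def distance_to_target(seq, target):
    """Fold the sequence and count how many base pairs differ from the target."""
    structure, _ = RNA.fold(seq)
    return RNA.bp_distance(structure, target)

def adaptive_walk(target, max_steps=10000):
    n = len(target)
    seq = "".join(random.choice("AUGC") for _ in range(n))   # random starting sequence
    best = distance_to_target(seq, target)
    for _ in range(max_steps):
        if best == 0:                        # the sequence folds into the target shape
            return seq
        pos = random.randrange(n)            # try a random base change...
        candidate = seq[:pos] + random.choice("AUGC") + seq[pos + 1:]
        d = distance_to_target(candidate, target)
        if d <= best:                        # ...and keep it if it brings us closer
            seq, best = candidate, d
    return None                              # out of luck: stuck, as described above

print(adaptive_walk("((((....))))"))         # a small hairpin target
```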

(2) InfoRNA: INverse FOlding of RNA takes the same approach as ViennaRNA - a stochastic search. The main difference, however, is that it first initializes the sequence to the one with the minimum free energy in the target structure. In EteRNA terms, this is equivalent to designing an RNA to have the lowest energy in “Target mode”, regardless of whether it folds correctly in “Natural mode.” InfoRNA then runs the usual random search to come up with the answer.

Because of this, InfoRNA is extremely fast and strong. However, its designs usually have an excessive number of GC pairs.

http://vac-o.googlecode.com/svn-histo…
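To make that initialization step concrete, here is a rough sketch (my own simplification, not InfoRNA’s actual code, which finds the true energy-minimal sequence by dynamic programming): filling every paired position with a G-C pair, the strongest pair, and every unpaired position with A gives a crude “lowest energy in Target mode” starting point, and also hints at why such designs come out GC-heavy. The usual stochastic search would then start from this seed instead of a random sequence.

```python
def low_energy_seed(target):
    """Crude stand-in for InfoRNA's energy-minimizing initialization:
    G-C pairs on every paired position, A on every unpaired position."""
    seq = ["A"] * len(target)                # unpaired positions get A
    stack = []
    for i, ch in enumerate(target):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            seq[j], seq[i] = "G", "C"        # strongest pair -> lowest Target-mode energy
    return "".join(seq)

print(low_energy_seed("((((....))))"))       # GGGGAAAACCCC -> refine this stochastically
```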

(3) RNASSD: RNA Secondary Structure Design is an algorithm described in the following paper:

A New Algorithm for RNA Secondary Structure Design
M. Andronescu et al. 2003

The authors of the paper generously provided us with the source code to run the RNASSD bot. RNASSD uses stochastic searches too, but differs from the previous two algorithms in that it first decomposes the target shape into a hierarchy of substructures. It then runs stochastic searches over the hierarchy recursively until it comes up with the answer.
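As a toy illustration of that decomposition idea (not the actual RNASSD code, whose decomposition is considerably more elaborate), the sketch below just cuts a dot-bracket target wherever the base-pair nesting depth returns to zero, giving independent substructures that could each be designed separately and then merged and refined.

```python
def split_substructures(target):
    """Yield the top-level independent pieces of a dot-bracket structure."""
    depth, start = 0, 0
    for i, ch in enumerate(target):
        if ch == "(":
            if depth == 0:
                start = i
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                yield target[start:i + 1]    # one self-contained substructure

print(list(split_substructures("((..))...(((...)))")))
# ['((..))', '(((...)))'] -> solve each piece with its own stochastic search
```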

(4) NUPACK: NUcleic Acid PACKage is quite different from the other 3 in that its primary goal is not just to come up with a sequence that folds into the target shape: it looks for the sequence with the minimal “ensemble defect”. The ensemble defect is the average number of incorrectly paired nucleotides at equilibrium, taken over the ensemble of possible secondary structures (shapes).

The NUPACK bot is only used in the RNA lab right now, and its performance is surprisingly good. There also seems to be a noticeable correlation between a NUPACK design’s ensemble defect and its lab synthesis score (the lower the defect, the higher the score).

http://www.nupack.org/
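For intuition, here is a hedged sketch of the ensemble-defect calculation using base-pair probabilities from ViennaRNA’s partition function (its Python bindings, `import RNA`). NUPACK’s own code and energy model will give somewhat different numbers, so this only shows what the quantity measures.

```python
import RNA

def ensemble_defect(seq, target):
    """Average number of incorrectly paired nucleotides at equilibrium,
    measured against the target dot-bracket structure."""
    n = len(seq)
    fc = RNA.fold_compound(seq)
    fc.pf()                                  # compute the partition function
    p = fc.bpp()                             # pair probabilities, 1-indexed, p[i][j] for i < j

    partner = [0] * (n + 1)                  # target pairing partner (0 = unpaired)
    stack = []
    for i, ch in enumerate(target, start=1):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            partner[j], partner[i] = i, j

    expected_correct = 0.0
    for i in range(1, n + 1):
        j = partner[i]
        if j:                                # paired in the target: P(i pairs with j)
            expected_correct += p[min(i, j)][max(i, j)]
        else:                                # unpaired in the target: P(i is unpaired)
            expected_correct += 1.0 - sum(p[min(i, k)][max(i, k)]
                                          for k in range(1, n + 1) if k != i)
    return n - expected_correct              # defect of 0 = perfect agreement

print(ensemble_defect("GGGGAAAACCCC", "((((....))))"))
```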

Thanks for this information, jee! Is there any plan to have NUPACK join the others in attempting our player puzzles in the future?

Also, I’m curious whether you’ve done the comparison of NUPACK ensemble defect scores and lab synthesis scores for all the designs, or only for NUPACK’s own designs?

edit to add: also curious whether NUPACK’s “ensemble defect” is the same thing as Vienna/RNAfold’s “ensemble diversity”?

I just ran a sequence through both NUPACK and RNAfold and the numbers were different, but I’m not sure whether that means they’re different metrics or just that they’re using slightly different energy models (NUPACK offers a choice between “Serra and Turner, 1995” and “Mathews et al., 1999”, whereas on RNAfold I usually use the default “Turner 1999”). The choices about dangling energies are also worded differently.

FYI, the full citation is: Mathews, D. H., Sabina, J., Zuker, M. & Turner, D. H. (1999). Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. Journal of Molecular Biology, 288, 911-940.

So you see, Mathews et al. 1999 and Turner 1999 are the same paper :)

As for ensemble defect, it’s a more general (and potentially more useful) measure than ensemble diversity. For example, if I have a 5 base-pair helix, and my ensemble has some populations with just the 1st base pair misformed, some with the 2nd base pair misformed, and so on for the 3rd, 4th, and 5th positions, the “ensemble diversity” would be large because there are 5 distinct shapes in the population. However, the ensemble defect would stay small, since each of those structures is only one misformed base pair away from the target.
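To make the contrast concrete: RNAfold’s “ensemble diversity” is the mean base-pair distance over the ensemble and needs no target structure at all, while the ensemble defect is always measured against a specific target, so they really are different metrics on top of any energy-model differences. A tiny sketch, assuming the ViennaRNA Python bindings:

```python
import RNA

seq = "GGGGGAAAACCCCC"                       # toy sequence for a 5 base-pair hairpin

fc = RNA.fold_compound(seq)
fc.pf()                                      # partition function first
print("ensemble diversity:", fc.mean_bp_distance())   # no target structure needed
# The ensemble defect would additionally need a target such as "(((((....)))))"
# (see the sketch under the NUPACK entry above).
```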

Just wanted to quickly point out that the core ‘hierarchical’ algorithm in NUPACK is actually pretty similar to RNA-SSD, but NUPACK’s emphasis on minimizing thermal fluctuations (‘ensemble defect’) lets it use some interesting tricks to make the design process much more efficient. NUPACK also breaks down the structure into more pieces than RNA-SSD during its ‘hierarchical buildup’.

The idea of breaking down puzzles into subpuzzles is something we might want to explore in EteRNA as well, because I know for very large structures, it can take a while for EteRNA to ‘refold’ the structure given a sequence change.

Thanks for the citation & clarification on terms, alan.robot.

Ding, I don’t think we’ll be running NUPACK for player puzzles in the near future.

As Rhiju pointed out below, NUPACK is different in that it emphasizes minimizing the ensemble defect - I think that’s why it’s doing so well in the RNA lab, but for just solving puzzles I doubt it will do any better than InfoRNA or RNASSD.

There also seems to be a noticeable correlation between a NUPACK design’s ensemble defect and its lab synthesis score (the lower the defect, the higher the score).

Have you calculated its rank correlation coefficient?

Aldo, you can find the correlation here

http://eterna.cmu.edu/htmls/strategy…

We have only done this for the designs made by NUPACK, with the score function simply being 100 - (normalized ensemble defect * 20).

Surprisingly, there doesn’t seem to be a very clear correlation, at least across all puzzles. The results could be different if we calculated the correlation just among designs from the same puzzle…
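For anyone who wants to reproduce that kind of check, here is a minimal sketch of the comparison with made-up placeholder numbers standing in for the real per-design values (those are at the link above); it uses the score mapping quoted above and scipy’s Spearman rank correlation.

```python
from scipy.stats import spearmanr

# Placeholder values only -- not real lab data.
normalized_defects = [0.10, 0.25, 0.40, 0.05, 0.30]    # per NUPACK design
lab_scores         = [92.0, 81.0, 70.0, 95.0, 78.0]    # synthesis scores for the same designs

defect_scores = [100 - d * 20 for d in normalized_defects]   # score function quoted above
rho, p_value = spearmanr(defect_scores, lab_scores)
print("Spearman rank correlation:", rho, "p =", p_value)
```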