The development of the Eterna100 was a triumph for citizen science and for Eterna. For those interested, here is a link to the original paper from 2016.
However, as mentioned in a previous forum post by Cynwulf28:
one of the limitations present in the Eterna100 is that they are based on puzzles designed using the Turner 1999 nearest neighbor parameters with Vienna 1.x. Since that time, several RNA structure prediction packages have been integrated into the site, including Vienna2.x, NuPACK, and most recently the LinearFold versions of Vienna2.x and ContraFold. Over this past year, I published as many of the Eterna100 as possible using these other parameters. Here is a list of every puzzle in the Eterna100, along with whether its secondary structure can be solved in the other parameter sets.
Given this adventure in persistence, I decided to post the work here. We’ll start with a short description for each puzzle set, why some puzzles are unsolvable, as well as the (completely subjective) “most difficult puzzle” to solve for each model. I would like to note that some of the puzzles listed as unsolvable may actually be solvable, and listing them was my mistake. Some puzzles are very large, and I generally didn’t try to solve a puzzle if I saw a motif that had proven unsolvable in earlier puzzles. However, there were several puzzles where I probably spent hours trying to find a solution on the chance that one exists. So maybe you’ll be able to scratch one off the list somewhere.
So, it turns out that VRNA2 has roughly 20 puzzles that are unsolvable using the secondary structures of the original Eterna100. We can break these up into two categories:
- Single Base Pair Structures: A very large number of the puzzles chosen for the Eterna100 incorporate single base pair motifs into their structures. This leads to several unsolvable puzzles, such as https://eternagame.org/web/puzzle/547104/ Campfire, where the extended zigzag motif is unsolvable in Vienna2.x. There are several puzzles where the addition of a single base pair makes the structure solvable in VRNA2, but doing so may defeat the purpose of the puzzle’s difficulty.
- Number of multiloop branches: As the number of branches off of an internal multiloop increases, the free energy of the loop decreases at a rate faster than an external loop with an equal number of branches. If an external multiloop has a large number of branched helices, the sequence will form a “neck” helix, which may be only a few base pairs in length, to benefit from an internal multiloop with a large number of branches.
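Both categories can be read straight off a puzzle's dot-bracket string, so here is a minimal sketch of how one might flag them before attempting a solve. This is not how Eterna or any of the folding engines do it; the function names (`pair_table`, `lone_pairs`, `loop_children`) and the example structure are made up for illustration, and it assumes plain dot-bracket notation with no pseudoknots.

```python
def pair_table(structure):
    """Map every paired index to its partner in a dot-bracket string."""
    stack, partner = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            j = stack.pop()
            partner[i], partner[j] = j, i
    if stack:
        raise ValueError("unbalanced structure")
    return partner


def lone_pairs(structure):
    """Return base pairs (i, j) that do not stack on a neighboring pair."""
    partner = pair_table(structure)
    result = []
    for i, j in partner.items():
        if i > j:
            continue  # visit each pair once, from its opening side
        stacked_outside = partner.get(i - 1) == j + 1
        stacked_inside = j - i > 1 and partner.get(i + 1) == j - 1
        if not stacked_outside and not stacked_inside:
            result.append((i, j))
    return sorted(result)


def loop_children(structure):
    """Count helices in the external loop and inside each multiloop.

    Returns (external_helices, {closing_pair: enclosed_helices}) where only
    loops enclosing two or more helices (multiloops) are listed.
    """
    partner = pair_table(structure)

    def count_helices(start, end):
        count, k = 0, start
        while k < end:
            if k in partner and partner[k] > k:
                count += 1
                k = partner[k] + 1  # skip over the whole helix
            else:
                k += 1
        return count

    multiloops = {}
    for i, j in partner.items():
        if i < j:
            n = count_helices(i + 1, j)
            if n >= 2:
                multiloops[(i, j)] = n
    return count_helices(0, len(structure)), multiloops


# Made-up example: one hairpin, then a multiloop closed by a lone pair.
EXAMPLE = "((...))" + "(.((...)).((...)).)"
print(lone_pairs(EXAMPLE))     # lone pair closing the multiloop
print(loop_children(EXAMPLE))  # external helices and multiloop contents
```

A puzzle with many entries from `lone_pairs`, or with a large external helix count, would be a candidate for the two categories above, though whether it is actually unsolvable still depends on the energy model.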
These structural motifs account for most, if not all, of the unsolvable puzzles in VRNA2. The multiloop branching issue is also the reason why the puzzle https://eternagame.org/web/puzzle/8950194/ Taraxacum officinale by ariel19_89 is the most difficult to solve.
https://eternagame.org/web/puzzle/8950194/ Taraxacum officinale by ariel19_89
So it turns out this puzzle took me several hours over the course of several days to solve. I considered it unsolvable at one point given the amount of misfolding. Through sheer luck, I found a solution. The only player to have solved it so far is Wawan151, whose solution is not the same as my own. So at the very least, there are two unique ways of solving all of the branches off of the external loop central to this puzzle. (Author’s note: Brourd has misplaced his solution somewhere.)
So unlike VRNA2, NuPACK has the pleasure of being a far easier model to solve some puzzles in, while also having nearly 30 structures from the Eterna100 that are unsolvable. The reasons for this vary slightly, but are quite similar to those for VRNA2:
- Unpaired Nucleotides in Multiloops: It turns out that the parameters in NuPACK have a term for increasing the free energy of multiloops for each unpaired base added to the loop. This is good for certain predictions, since it leads to structures that may favor internal loops over multiloops. However, this means there are several unsolvable puzzles due to excessively large internal multiloops. It also means that external loops typically have a much lower free energy compared to internal multiloops, since the same rules for increasing free energy don’t apply to them. Several Eterna100 puzzles have short helices connecting internal multiloops and external loops, resulting in them being unsolvable in NuPACK.
- Not many special free energy bonuses: Unlike the Turner parameters, which provide free energy bonuses for specific 1-1, 2-2, and 1-2 loop sequences, the NuPACK parameters remove several of these. 1-1 and 2-2 loops are a recurring motif in the Eterna100, and this results in several puzzles being unsolvable. Finally, structures with a 1-X format cannot be boosted at all, resulting in a handful of unsolvable puzzles.
- Single Base Pair Structures: Just like VRNA2, the single base pair structures result in several motifs being unsolvable in NuPACK. This is further compounded by many of these motifs being related to 1-1 and 2-2 loops, which cannot be boosted in the same way.
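To make the multiloop trends concrete, here is a toy sketch of the affine multiloop approximation that nearest-neighbor models commonly use: a constant, plus a term per branch, plus a term per unpaired base. The coefficients below are hypothetical numbers picked only to reproduce the two trends described in this post (a per-branch bonus for the VRNA2-like case, a per-unpaired-base penalty for the NuPACK-like case). They are not the real parameters of either package.

```python
def multiloop_energy(branches, unpaired, a, b, c):
    """Affine multiloop approximation in kcal/mol: a + b*branches + c*unpaired."""
    return a + b * branches + c * unpaired


# Hypothetical coefficients for illustration only, NOT real parameter values.
BRANCH_BONUS_MODEL = dict(a=9.0, b=-1.0, c=0.0)       # each branch lowers the energy
UNPAIRED_PENALTY_MODEL = dict(a=3.0, b=0.5, c=0.25)   # each unpaired base raises it

# More branches make the loop cheaper under the branch-bonus model.
four_way = multiloop_energy(4, 0, **BRANCH_BONUS_MODEL)
eight_way = multiloop_energy(8, 0, **BRANCH_BONUS_MODEL)

# More unpaired bases make the same loop costlier under the penalty model.
tight = multiloop_energy(4, 0, **UNPAIRED_PENALTY_MODEL)
roomy = multiloop_energy(4, 10, **UNPAIRED_PENALTY_MODEL)

print(four_way, eight_way)
print(tight, roomy)
```

Under the first set of numbers, a loop with more branches is increasingly favorable, which is the pressure behind the “neck” helix. Under the second, a roomy multiloop pays for every unpaired base while an external loop with the same bases would not.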
That said, NuPACK does have some quirks that work to its advantage. In particular, G-U base pairs that act as closing base pairs for multiloops have a malus of several hundred thousand kcal/mol. While this may seem to be a negative attribute, if manipulated correctly, it allows certain puzzles with a single base pair motif connected to a multibranch loop to be solvable.
As for the most difficult puzzle, the closest candidate is https://eternagame.org/web/puzzle/8940638/ Cat’s Toy 2 by Merryskies. While this didn’t take several hours to solve, it requires extreme manipulation of the free energy of the puzzle to solve the single base pair hairpin loop, as well as the short helices connecting everything.
So while I haven’t gone through every puzzle with the LinearFold version of Vienna2.x, the assumption is that some puzzles with a very small solution space for long sequences with degenerate MFE structures may be unsolvable. In particular, Taraxacum officinale may be one such puzzle, but as stated above, the author has misplaced their solution. Otherwise, the thermodynamic parameters are identical to VRNA2, and most puzzles should be solvable in the LinearFold version.
Unlike with ViennaRNA2 and NuPACK, the author solved every LinearFold-C puzzle without initially knowing (or being able to view) the parameter set. The best description of this is the use of brute force to determine heuristics for solving these puzzles. For example, did you know?
- In Contrafold, several structural motifs can be boosted. This includes bulges, triloops, multiloops, internal loops, and other hairpin loops.
- The combination of stacking and base pairing energies for G-U base pairs results in helices containing G-U base pairs being incredibly unstable.
- Lone base pairs can be incredibly unstable, so much so that some motifs are just unsolvable.
- The number of base pairs that a helix must contain to close some loops is significantly higher than in NuPACK and Vienna.
- U-U boosting is the favored sequence for special motifs like 1-1, 2-2, and 1-2 loops.
- The base pair orientation for some loops determines their stability (lower free energy).
- Contrafold isn’t the model you want for solving repetitive structures/sequences. These motifs tend to fold into a variety of other structures that you don’t want, enough that some puzzles were (for now) marked as unsolvable.
- Kudzu took nearly 2 hours to solve AFTER knowing a number of these heuristics.
- Most likely single nucleotide bulges can be boosted, although I didn’t need to find out.
These are just some of the rules that were ingrained into my brain in order to solve a number of these puzzles. In total, I was only able to solve 45 of the Eterna100 puzzles with LinearFold-C. Once the actual model parameters are added to the game, this number may increase a bit, but at least 45 puzzles will most likely remain unsolvable, which would bring the solvable count to 55 at best. This is due to several factors:
- Short helices: The Eterna100 has an obsession with short helices, enough so that it causes a large number of puzzles to be unsolvable. Helices of 2, 3, or even 4 base pairs may not be stable in some motifs, and many lone pairs are definitely not stable at all.
- Repetitive substructures: Some large puzzles have repetitive substructures. This isn’t an issue in itself, but if the sequence required to solve a specific structure only has one variation, then the repeated copies will most likely misfold.
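For anyone who wants to screen a structure for the repetition problem before diving in, here is a rough sketch that counts identical enclosed substructures in a dot-bracket string. The function name `repeated_substructures` and the example structure are made up for illustration, and the sketch assumes a balanced, pseudoknot-free structure. It only finds exact repeats, so near-identical motifs slip through.

```python
from collections import Counter


def repeated_substructures(structure, min_length=5):
    """Return {substructure: count} for enclosed motifs that occur more than once.

    A "substructure" here is the dot-bracket text from an opening bracket to
    its matching closing bracket, inclusive.
    """
    stack = []
    counts = Counter()
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            start = stack.pop()
            counts[structure[start:i + 1]] += 1
    return {
        motif: n
        for motif, n in counts.items()
        if n > 1 and len(motif) >= min_length
    }


# Made-up example: three copies of the same small hairpin.
EXAMPLE = "((...))" + "(.((...)).((...)).)"
print(repeated_substructures(EXAMPLE, min_length=6))
```

If a repeated motif is also one of those that only has a single working sequence in a given model, that is where I would expect the misfolding to come from.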
Speaking of the most difficult puzzle, https://eternagame.org/web/puzzle/458872/ Kudzu by Quasispecies contains several short helices, very few solutions to some of the puzzle motifs, and loops that require specific boosting to work. It took nearly two hours to solve, although that could be due to solving it from first principles that were derived from iterative sequence/structure manipulation. Being unable to look at the energies of loops is definitely a pain. Some of the other LinearFold-C puzzles were difficult to solve as well, but generally weren’t impossible once there was a basic understanding of how the sequences fold.
Where do we go from here?
Once the parameters for Contrafold are added to the game, those puzzles will be published. They should be far easier to solve then.
Other than that, there was an Eterna100 summit about a year ago that I wasn’t able to attend. Whatever their final verdict was may be the direction they choose to go with for the puzzles. However, I will say this.
As we move towards parameters for secondary structure prediction that are more and more accurate, the Eterna100 may become less and less relevant to actual RNA structure design. Granted, the puzzles provide an excellent benchmark for algorithms that solve secondary structures, given how difficult they are, but there is an over-reliance on difficulty from motifs that require specific solutions for specific structures, especially if those solutions are specific to one model over another. This would mostly affect algorithms that use machine learning on player sequence designs, rather than those based on first principles like NEMO and others.
When I originally started the Alternate Eterna100, one goal of mine was to modify the structures of the Eterna100 with a minimum number of edits so that the puzzles would be solvable in a given model. This quickly spiraled out of control, as the definition of “minimum” was called into question, along with whether the puzzles would stop being unsolvable by bots once these motifs were removed. So I now have a proposal for this.
First, use the minimum number of structural mutations needed to make a puzzle solvable in every available model, while also minimizing the change in free energy from the original structure to the modified one. So why this proposal? Well, consider a puzzle where you delete a single nucleotide bulge in a helix, making the entire helix stable. Deleting the bulge decreases the overall difficulty of the puzzle, and is only a single structural mutation. However, it may be a more drastic change to free energy than the addition of a base pair. An additional constraint could be to minimize the change in length to the puzzle.
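One way to read this proposal is as a ranking: compare candidate edits first by number of structural mutations, then by the size of the free energy change, then by the change in length. Here is a minimal sketch of that ordering. The `CandidateEdit` class, the `rank_candidates` function, and every number in the example are hypothetical and only there to show the bulge versus base pair case; real energy changes would have to come from the folding engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateEdit:
    name: str
    structural_mutations: int  # number of edits to the target structure
    delta_energy: float        # change in free energy, kcal/mol
    delta_length: int          # change in puzzle length, nucleotides


def rank_candidates(candidates):
    """Sort edits by fewest mutations, then smallest |dG|, then smallest |dLength|."""
    return sorted(
        candidates,
        key=lambda c: (
            c.structural_mutations,
            abs(c.delta_energy),
            abs(c.delta_length),
        ),
    )


# Made-up numbers: both edits are a single mutation, but deleting the bulge
# is assumed to shift the free energy more than adding a base pair does.
EXAMPLES = [
    CandidateEdit("delete_bulge", 1, -3.0, -1),
    CandidateEdit("add_base_pair", 1, -1.5, 2),
]

for edit in rank_candidates(EXAMPLES):
    print(edit.name)
```

A strict ordering like this is only one option. A weighted score across the three terms would also work, but then the weights themselves become another thing to argue about.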
Anyway, that could be an interesting way to design a new Eterna100, but maybe it would be easier to just pick new puzzles. Hopefully somebody found these musings and my work to be useful. In conclusion, perhaps nothing comes of this but about 200 puzzles.