I’ve been thinking about algorithms for RNA 3D prediction (I will soon make another post asking whether anyone wants to collaborate on that), and most 3D prediction algorithms work better when given a reliable secondary structure. The recent Kaggle competition (currently still being tested) allowed the use of RibonanzaNet. It’s pretty good but not perfect, and it’s also not the kind of model I like to use, because it’s completely opaque and very large. It’s not like a force field, where you can turn energy terms on and off and see what happens.
As I’m experimenting with my own prediction algorithms, I’d like a baseline to compare against: what a good predictor with an openly available algorithm can achieve. As I understand it, EternaBot was state-of-the-art for a while, and it seems to have two versions. The first version is available on GitHub; the algorithm for the second seems to be available only as a verbal description, without any source code.
In various places I’m seeing references to a “ThreshKnot” algorithm. Where does it fall in accuracy compared to EternaBot (both versions)? Are there any newer algorithms that I should be paying attention to?
I don’t think you mean EternaBot. EternaBot is used for designing sequences; it is not a structure predictor. Do you mean EternaFold? There is only one version of EternaFold (any references to a version 2 were prospective).
ThreshKnot itself is an approach for using base pairing probabilities to predict pseudoknots. In Eterna, we run this on top of the base pairing probabilities output by EternaFold. There’s some evaluation in the Ribonanza paper: https://www.biorxiv.org/content/10.1101/2024.02.24.581671v1
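As a rough illustration of the idea, here is a minimal sketch of thresholded mutual-best-partner selection, which is how ThreshKnot is usually described: keep a pair (i, j) only if i and j are each other's most probable partner and that probability exceeds a threshold. The function name `threshknot_pairs`, the helper `make_bpp`, the toy matrix, and the threshold value of 0.3 are all assumptions made for this example; this is not the code that Eterna runs.

```python
def threshknot_pairs(bpp, theta=0.3):
    """Select pairs (i, j) that are mutual best partners with probability > theta.

    bpp is a symmetric n x n list of lists of base pairing probabilities.
    Nothing here forces the result to be nested, so crossing (pseudoknotted)
    pairs can appear in the output.
    """
    n = len(bpp)
    # Most probable partner of every position (ties go to the lowest index).
    best = [max(range(n), key=lambda k: bpp[i][k]) for i in range(n)]
    pairs = []
    for i in range(n):
        j = best[i]
        if i < j and best[j] == i and bpp[i][j] > theta:
            pairs.append((i, j))
    return pairs


def make_bpp(n, entries):
    """Hypothetical helper: build a symmetric matrix from {(i, j): probability}."""
    bpp = [[0.0] * n for _ in range(n)]
    for (i, j), p in entries.items():
        bpp[i][j] = p
        bpp[j][i] = p
    return bpp


# Toy input: (0, 5) and (2, 7) cross each other; (3, 6) is below the threshold.
toy = make_bpp(8, {(0, 5): 0.9, (2, 7): 0.8, (3, 6): 0.2})
print(threshknot_pairs(toy))  # [(0, 5), (2, 7)]
```

Because each position is considered independently of the others, the two crossing pairs in the toy matrix both survive, which a nested-only folding recursion could not return together.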
So from looking through the EternaFold source and comparing it to the paper, it seems to be a dynamic programming approach based on stacking and pairing scores, like Vienna, but with parameters that are learned “top-down” by fitting to known structures, as opposed to the “bottom-up” approach of measuring the stabilities of small constructs and assuming that the parameters transfer, as was done by Turner et al. for the parameters Vienna uses?
So then, if I understand correctly, it also doesn’t predict pseudoknots, for the same reason that the other dynamic programming methods don’t, and therefore you had to develop another algorithm (ThreshKnot) on top of it?
From the comparisons in the Ribonanza paper, it seems that the iterative form of HFold was the previous most successful method, very slightly outperforming the EternaFold-ThreshKnot combination. Am I reading that correctly?
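To make the dynamic programming point concrete, here is a deliberately simplified Nussinov-style sketch that maximizes the number of base pairs. It is not the EternaFold or Vienna model: real folding programs score stacks and loops with many parameters, whereas this toy version gives every allowed pair a score of 1. The function name `nussinov_max_pairs` and the `min_loop` parameter are assumptions made for the example.

```python
# Canonical and wobble pairs allowed in this toy model.
CAN_PAIR = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov_max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq (toy scoring, not free energy).

    dp[i][j] holds the best score for the subsequence seq[i..j]. Each entry is
    built only from intervals that lie inside (i, k) or after k, so a pair can
    never cross another pair: pseudoknots are outside the search space.
    """
    n = len(seq)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i + 1][j]  # case 1: position i stays unpaired
            # case 2: position i pairs with some k, leaving at least min_loop
            # unpaired bases between them
            for k in range(i + min_loop + 1, j + 1):
                if (seq[i], seq[k]) in CAN_PAIR:
                    inside = dp[i + 1][k - 1]
                    after = dp[k + 1][j] if k + 1 <= j else 0
                    best = max(best, 1 + inside + after)
            dp[i][j] = best
    return dp[0][n - 1]


print(nussinov_max_pairs("GGGAAACCC"))  # 3
```

The recursion splits the sequence into an interval enclosed by the pair (i, k) and an interval after k, and scores them independently. That independence is what makes the problem tractable and also what rules out crossing pairs, which is why a separate step is needed if pseudoknots are wanted.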
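For anyone setting up a baseline comparison like this, a common way to score a predicted secondary structure against a reference is the F1 score over base pairs. The sketch below assumes structures are given as lists of (i, j) index pairs; the function name `base_pair_f1` is hypothetical, and whether the Ribonanza paper's comparison uses exactly this metric is an assumption.

```python
def base_pair_f1(predicted, reference):
    """F1 score between two sets of base pairs given as (i, j) index tuples."""
    # Normalize so that (i, j) and (j, i) count as the same pair.
    pred = {tuple(sorted(p)) for p in predicted}
    ref = {tuple(sorted(p)) for p in reference}
    true_pos = len(pred & ref)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)  # fraction of predicted pairs that are correct
    recall = true_pos / len(ref)      # fraction of reference pairs that were found
    return 2 * precision * recall / (precision + recall)


# One of two predicted pairs is correct, and one of two reference pairs is found.
print(base_pair_f1([(0, 5), (2, 7)], [(0, 5), (1, 4)]))  # 0.5
```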
Since you want a light model with energy prediction, EternaFold with the ThreshKnot calculation is a good choice. When I talk with RNA biology researchers, most of them use RNAfold, which I don’t believe accounts for pseudoknots. (Interesting to note that RNAfold can detect G-quadruplexes.)
Here is a recent paper evaluating deep learning RNA prediction methods that might interest you, and of course the results of the CASP16 competition are relevant. The Chen Lab performed well with Vfold, but I don’t know how accurate their 2D prediction method is.
The energy models (Vienna) focus on nearest-neighbor parameters, which fail to take global structure (and energy) into account, so they don’t generate a very accurate prediction for a 300 nt molecule. And then of course there is the whole problem of trying to model something that is not static in nature, which is why predictions that use SHAPE reactivity data may ultimately outperform pure energy models.