I need to decide on a way of quantitatively evaluating the “clean-ness” of a dotplot before I can use RNAfold’s plots as indicators of folding accuracy.
The data are basically a list of coordinates (the plot position of each interacting pair of bases) and a probability (how dark the square is on the plot). I was thinking about adding up all of the probabilities for undesired pairings and dividing by sequence length.
Is this approach valid and/or going to give me useful information? Anyone have other ideas? It’s the simplest I could think of, but I’m concerned that it treats a large number of lower-probability interactions the same as a smaller number of higher-probability interactions.
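For concreteness, here is a minimal sketch of that scoring idea in Python. It assumes the dot plot has already been parsed into (i, j, p) tuples and that the set of desired pairs is known; the function name and data layout are placeholders of my own, not how RNAfold or EteRNA actually represent this.

```python
def undesired_pair_score(pairs, desired_pairs, seq_length):
    """Sum the probabilities of all base pairs that are not in the
    target structure, normalized by sequence length.

    pairs         -- iterable of (i, j, p) tuples from the dot plot
    desired_pairs -- set of (i, j) tuples for the target structure
    seq_length    -- number of bases in the sequence
    """
    total = sum(p for i, j, p in pairs if (i, j) not in desired_pairs)
    return total / seq_length

# Toy example: a 10-base sequence whose target has one pair, (1, 10).
pairs = [(1, 10, 0.90),   # desired pair, ignored by the score
         (2, 9, 0.05),    # undesired
         (3, 8, 0.15)]    # undesired
score = undesired_pair_score(pairs, {(1, 10)}, 10)  # (0.05 + 0.15) / 10 = 0.02
```

Note that this has exactly the weakness I mentioned: ten pairs at 2% contribute the same as one pair at 20%.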
If anyone from the EteRNA team reads this … how do EteRNA strategies evaluate dotplot data? Ideally, I’d score dotplot data in the same way instead of trying to reinvent the wheel.
I’m interested in knowing how the EteRNA team implemented Penguian’s and xmbrst’s strategies; specifically, how they took dot plot data and turned it into a score for the sequence.
I didn’t realize Penguian’s strategy got that low of an ordering score, though. I wonder if it really did treat all “dark” squares on the dot plot the same. I don’t think you’d want to treat an undesired pair with a 50% chance of forming the same as an undesired pair with a 0.5% chance of forming.
Ok, as I understand it, the dotplots are divided diagonally: the bottom left shows the ideal probabilities for the shape being tested, and the top right shows the calculated probabilities for that shape. The probability at each dot is the probability that the 2 bases defined by that intersection are interacting.
So, to get the maximum information from the dotplot, you might want to look at the differences between equivalent dots on the 2 sides of the plot. The sum of the absolute values of the differences between all the equivalent dots would be something like the total error probability. You would need both the white and dark areas, because for any given base there is a probability that it will pair with an incorrect base, won’t pair when you want it to, or will pair when you don’t want it to, and this would cover all the bases (heh, heh, sorry).
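That sum-of-absolute-differences idea could be sketched like this, assuming both halves of the plot have been parsed into dictionaries mapping (i, j) to probability (the parsing itself is left out, and treating a missing entry as probability 0 is my assumption):

```python
def total_error(target, predicted):
    """Sum |target - predicted| over every (i, j) cell that appears in
    either half of the dot plot. Cells absent from one side count as
    probability 0, so this covers pairs that should form but don't,
    and pairs that form but shouldn't."""
    cells = set(target) | set(predicted)
    return sum(abs(target.get(c, 0.0) - predicted.get(c, 0.0)) for c in cells)

target = {(1, 10): 1.0, (2, 9): 1.0}     # ideal pairing (lower left)
predicted = {(1, 10): 0.9, (3, 8): 0.2}  # calculated probabilities (upper right)
err = total_error(target, predicted)     # 0.1 + 1.0 + 0.2 = 1.3
```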
You’d probably want to get into the algorithm before the place where it bins the probability values to make the grayscale plot.
I am undecided whether it’s a good idea to add a factor to take into account the size of the molecule (dividing by number of bases). On one hand, it seems like a larger molecule would tend to have higher entropy and a higher probability of unwanted interactions. On the other hand, it seems a lot easier to stabilize a long stack or loop than a short one. It might be interesting to see whether it makes a difference if you normalize by total bases, by intended interacting bases, by intended non-interacting bases, or by some weighted combination of all three.
I think that all these strategies for using a single piece of information like the dotplot are going to be less than totally successful, because if it was that easy, it would already have been done. Since melting temp info is also included in the basic package, maybe there could be a factor added to the strategy to account for this, as well as possibly a free energy factor. The problem with adding in factors is that I suspect few of these factors are independent variables, so things will get complex quickly.
What you describe in your second paragraph is basically what I was planning to do, with the additional step of dividing by sequence length. The lower left half of the plot shows the desired pairings; the upper right shows everything. The difference is the “undesired” pairings.
The reason I wanted to factor in sequence length was that I was concerned that small, unstable molecules could score the same as longer, more stable ones. Maybe not a problem in the lab, though, where all designs are about the same size.