OPRELA stands for Optimized Revision Layers. It is a bot that uses the advanced data from the top percentage of previous lab synthesis results to compute the pairs that are most likely to bond under synthesis, while attempting to maintain a good MFE frequency. This bot is still in its infancy.
How the probability matching works
The process is really simple. It does not use the typical inverse folding models to predict the sequence. When the synthesis results are posted for a particular design, they are imported into the bot’s database. This data consists of 3 parts: the string notation, the sequence, and the result data. The result data is a string that holds the bond probabilities, from 0.0 to 1.0, where 0.0 equals 0% and 1.0 equals 100%.
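As a rough illustration of the three-part record described above, here is a minimal Python sketch. The name LabRecord, the function parse_record, the comma-separated layout of the result string, and the idea of one probability per nucleotide are all assumptions made for the example; the bot's actual storage format is not given here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LabRecord:
    """One imported synthesis result (hypothetical layout)."""
    structure: str                   # string notation, e.g. "((...))"
    sequence: str                    # nucleotide sequence, e.g. "GGAAACC"
    bond_probabilities: List[float]  # one value per position, 0.0 to 1.0


def parse_record(structure: str, sequence: str, result_data: str) -> LabRecord:
    """Build a LabRecord, assuming result_data is comma-separated floats."""
    probabilities = [float(value) for value in result_data.split(",")]
    if not (len(structure) == len(sequence) == len(probabilities)):
        raise ValueError("structure, sequence and result data differ in length")
    if any(p < 0.0 or p > 1.0 for p in probabilities):
        raise ValueError("probabilities must lie between 0.0 and 1.0")
    return LabRecord(structure, sequence.upper(), probabilities)
```

A value of 0.85 in bond_probabilities would then correspond to an 85% bond probability at that position.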
The first step the bot takes is to collect the design’s pairs and loops in a list. It then cycles through the data from the lab results, compares the probabilities, and rates each pair. The higher a pair’s rating, the more likely it is to bond, and the highest-rated pair is added to the predicted sequence.
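The first step, collecting pairs and loops, can be sketched as follows. This assumes the string notation is the common dot-bracket form, where "(" and ")" mark paired positions and "." marks unpaired (loop) positions; the function name collect_pairs_and_loops is made up for the example.

```python
from typing import List, Tuple


def collect_pairs_and_loops(structure: str) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Return (pairs, loops) for a dot-bracket string.

    pairs is a sorted list of (i, j) index tuples with i < j;
    loops is the list of unpaired indices.
    """
    open_positions = []  # stack of indices of unmatched "("
    pairs = []
    loops = []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(i)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((open_positions.pop(), i))
        elif symbol == ".":
            loops.append(i)
        else:
            raise ValueError(f"unexpected symbol {symbol!r} at position {i}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return sorted(pairs), loops
```

For a small hairpin such as "((...))", this yields the pairs (0, 6) and (1, 5) and the loop positions 2, 3 and 4.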
There are 3 modes the bot can use to compute the predicted sequence: Single pair checking, Quad-nucleotide checking, and Tri-nucleotide pair checking. Single pair checking only takes single pairs into consideration, not neighboring pairs.
Single pair checking is the only mode that is enabled right now.
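To make the idea of rating single pairs concrete, here is a small Python sketch. It assumes the rating of a pair type is simply the average bond probability observed for that pair in the lab data; the actual rating formula is not spelled out here, and the names rate_pairs and best_pair are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def rate_pairs(observations: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average the observed bond probability for each pair type.

    observations holds (pair, probability) tuples such as ("GC", 0.92).
    Averaging is an assumption made for this sketch.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for pair, probability in observations:
        totals[pair] += probability
        counts[pair] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


def best_pair(observations: Iterable[Tuple[str, float]]) -> str:
    """Return the pair type with the highest rating."""
    ratings = rate_pairs(observations)
    if not ratings:
        raise ValueError("no observations to rate")
    return max(ratings, key=ratings.get)
```

Because each pair is rated on its own, the neighbors of a pair never enter the calculation, which matches the description of Single pair checking.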
Below is an example of how the Single pair checking works.
SN* String Notation
SPC* Single Pair Checking
As you can see, right now it’s really simple, with only a few functions. It basically keeps the most probable bonding pairs. There are a few exceptions and preset measures, such as preferring A, C, and G over U in loops. In order, it prefers G, then A, then C, and last U. This is visible in the n2* and n3* columns. It also prefers AU over GC pairs in the middle of stacks and GC over AU on the ends. The n1* columns show how the bot computes the most probable pair.
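The preset preferences described above can be sketched in Python like this. The sketch assumes the loop preference only acts as a tie-breaker between equally rated bases, which may not be how the bot applies it; the names LOOP_PREFERENCE, pick_loop_base, and preferred_pair_type are hypothetical.

```python
from typing import Dict

# Loop preference order from the text: G, then A, then C, and last U.
LOOP_PREFERENCE = "GACU"


def pick_loop_base(ratings: Dict[str, float]) -> str:
    """Pick the highest-rated loop base, breaking ties by LOOP_PREFERENCE.

    Bases missing from ratings count as 0.0 (an assumption of this sketch).
    """
    return max(
        LOOP_PREFERENCE,
        key=lambda base: (ratings.get(base, 0.0), -LOOP_PREFERENCE.index(base)),
    )


def preferred_pair_type(index_in_stack: int, stack_length: int) -> str:
    """Return "GC" for the two end pairs of a stack and "AU" for the middle."""
    if not 0 <= index_in_stack < stack_length:
        raise ValueError("index_in_stack is outside the stack")
    is_end = index_in_stack in (0, stack_length - 1)
    return "GC" if is_end else "AU"
```

For a stack of four pairs, preferred_pair_type gives GC, AU, AU, GC from one end to the other.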
With the other modes, the calculation is identical to how the n1* columns are computed, except that it looks for the best Quad-nucleotides or Tri-nucleotide pairs instead of just the single pairs. It uses several passes to find the most probable occurrences.
Next step is MFE optimization
After the predicted sequence for the given shape is produced, the MFE optimization layer tries to refine it. The level of refinement is controlled by a setting from 0.0 to 1.0, depending on the integrity you want to keep. By integrity I mean how much of the predicted sequence you want to keep the same.
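One way to picture the integrity setting is as a budget on how many positions the optimization layer may change. The Python sketch below assumes exactly that: an integrity of 1.0 keeps the predicted sequence untouched and 0.0 allows every position to change. The function name apply_with_integrity, the left-to-right order of changes, and the proposed sequence (standing in for whatever the MFE layer suggests) are all assumptions for illustration.

```python
import math


def apply_with_integrity(predicted: str, proposed: str, integrity: float) -> str:
    """Apply changes from proposed to predicted, limited by integrity.

    At most floor((1 - integrity) * length) positions are changed,
    taken from left to right.
    """
    if len(predicted) != len(proposed):
        raise ValueError("sequences must have the same length")
    if not 0.0 <= integrity <= 1.0:
        raise ValueError("integrity must lie between 0.0 and 1.0")
    # The small epsilon guards against float error, e.g. (1 - 0.9) * 10.
    budget = math.floor((1.0 - integrity) * len(predicted) + 1e-9)
    result = list(predicted)
    for i, (old, new) in enumerate(zip(predicted, proposed)):
        if budget == 0:
            break
        if old != new:
            result[i] = new
            budget -= 1
    return "".join(result)
```

With a predicted sequence of 7 nucleotides and an integrity of 0.75, the budget is floor(1.75) = 1, so only the first differing position is changed.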
Right now I have only implemented stacks and tetraloops. Multiloops and bulges are ignored, but I will add them at a later time. I also want to add checking for equal loop energy distributions, as noted in Eli’s observations.
What’s next
This version of the bot is just 2 layers of calculations. What I’m hoping to achieve in the future is a stable database of synthesized designs, with an additional layer that will look at these designs and see what works and what doesn’t according to certain attributes such as stack lengths and/or mirrored energy distributions.
As more data and new attributes for the bot to compare are added, it will better understand what works and what doesn’t under actual synthesis.
I’m planning on making an online version of the bot and releasing some binaries and source once it is further developed.