How the EteRNA ensemble algorithm works

Dear all -

Now that we have an (almost) final version of the EteRNA ensemble algorithm running in the lab, we thought we should share how we create the algorithm from the player strategies.

Ensemble Classifier

First we create an “ensemble classifier” from all player strategies in the market. The ensemble classifier combines the player strategies into one single strategy that scores a given RNA design on a 0 to 100 scale. There are 2 ways to do this - with sparse features and with L2 regularization.

  1. Sparse features : In this method, we first preselect the best 5 features (strategies), using a technique called Least Angle Regression (LARS). After the 5 features are selected, we determine the weights (the importance of each strategy in the classifier) that minimize the error between the predicted score and the actual synthesis score.

  2. L2 regularization : Unlike the previous method, we don’t preselect 5 features - we use ALL the strategies and try to determine their weights directly. However, we keep the weights small by using L2 regularization, which prevents the classifier from over-optimizing itself to the existing data (overfitting).
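To make the two options concrete, here is a rough Python sketch on made-up data. It is not the lab's code. Everything in it is an assumption for illustration: the data is synthetic, the function names are hypothetical, and the preselection step uses plain greedy forward selection as a simple stand-in for LARS (it is a different algorithm, just easier to show in a few lines). The weights come from the least-squares normal equations, with an optional L2 penalty added on the diagonal.

```python
import random

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_weights(X, y, columns, l2=1e-9):
    """Weights minimizing squared error plus l2 * sum(w^2), using only `columns`.
    Solves the normal equations (X^T X + l2 * I) w = X^T y."""
    A = [[sum(row[a] * row[b] for row in X) + (l2 if a == b else 0.0)
          for b in columns] for a in columns]
    rhs = [sum(row[a] * target for row, target in zip(X, y)) for a in columns]
    return solve(A, rhs)

def squared_error(X, y, columns, w):
    """Sum of squared differences between predicted and actual scores."""
    return sum((sum(wi * row[c] for wi, c in zip(w, columns)) - target) ** 2
               for row, target in zip(X, y))

def select_sparse(X, y, n_keep=5):
    """Greedy forward selection (a stand-in for LARS, not LARS itself):
    repeatedly add the strategy that lowers the squared error the most."""
    chosen, remaining = [], list(range(len(X[0])))
    while len(chosen) < n_keep:
        best = min(remaining, key=lambda c: squared_error(
            X, y, chosen + [c], fit_weights(X, y, chosen + [c])))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Synthetic data: 40 designs, each scored by 8 hypothetical strategies (0 to 100).
rng = random.Random(0)
X = [[rng.uniform(0, 100) for _ in range(8)] for _ in range(40)]
# Made-up "synthesis score": only strategies 0 and 3 matter, plus a little noise.
y = [0.6 * row[0] + 0.4 * row[3] + rng.gauss(0, 1) for row in X]

# Method 1: preselect 5 strategies, then fit weights on those only.
sparse_columns = select_sparse(X, y, n_keep=5)
sparse_weights = fit_weights(X, y, sparse_columns)

# Method 2: keep all strategies, but shrink the weights with an L2 penalty.
all_columns = list(range(len(X[0])))
ridge_weights = fit_weights(X, y, all_columns, l2=1e5)
```

In this toy setup the sparse method ends up with weights on only 5 of the 8 strategies, while the L2 method keeps a (shrunken) weight on every strategy. The penalty strength `l2=1e5` is arbitrary here; in practice it would be tuned against held-out synthesis data.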

Both techniques are from the machine learning literature, and if you are interested, you might find these slides useful.

http://www.cs.utexas.edu/~vvasuki/wor…

Sequence Designer

Once we have the ensemble classifier, we pass it over to a “sequence designer”, which first creates a sequence with 60% GC pairs and then keeps changing bases at random positions until it finds a sequence that gets a high ensemble classifier score.
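The loop described above could look roughly like the Python sketch below. This is only an illustration under several assumptions: the target structure is given in dot-bracket notation, non-GC pairs are seeded as A-U and unpaired bases as A, a mutation is kept only if the score does not drop, and the search stops at a score of 90. None of these details are stated in the post. `toy_score` is a meaningless placeholder for the ensemble classifier, and all function names are hypothetical.

```python
import random

BASES = "AUGC"

def parse_pairs(structure):
    """Return (i, j) index pairs from a dot-bracket string such as '((..))'."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))
    return pairs

def initial_sequence(structure, gc_fraction=0.6, rng=random):
    """Seed sequence: about gc_fraction of the pairs are G-C, the rest A-U.
    Unpaired positions start as A (an assumption, not from the post)."""
    seq = ["A"] * len(structure)
    pairs = parse_pairs(structure)
    n_gc = round(gc_fraction * len(pairs))
    rng.shuffle(pairs)  # choose at random which pairs become G-C
    for k, (i, j) in enumerate(pairs):
        seq[i], seq[j] = ("G", "C") if k < n_gc else ("A", "U")
    return "".join(seq)

def design(structure, score, target=90.0, max_steps=10000, seed=0):
    """Mutate random positions until `score` reaches `target` or steps run out.
    Keeping only non-worsening mutations is an assumed acceptance rule."""
    rng = random.Random(seed)
    best = initial_sequence(structure, rng=rng)
    best_score = score(best)
    for _ in range(max_steps):
        if best_score >= target:
            break
        pos = rng.randrange(len(best))
        candidate = best[:pos] + rng.choice(BASES) + best[pos + 1:]
        candidate_score = score(candidate)
        if candidate_score >= best_score:
            best, best_score = candidate, candidate_score
    return best, best_score

def toy_score(seq):
    """Placeholder for the ensemble classifier (0 to 100): percent of G/C bases.
    It has no biological meaning; it only gives the loop something to optimize."""
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)

structure = "(((((....)))))"
designed, designed_score = design(structure, toy_score)
```

To try it with a real scorer, pass any function that maps a sequence string to a 0 to 100 score in place of `toy_score`.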

We now have 2 versions of the EteRNA ensemble algorithm running - one with the classifier using sparse features and the other with the classifier using L2 regularization. If you have a good idea for improving the algorithm, please let us know!

EteRNA team

Thanks for posting this, Jee.

I should think that statistical ‘Weight of evidence’ would be an interesting approach?
http://www.isprs.org/proceedings/XXXI…

I’m unsure which of these methods our bot uses when it runs in conventional mode. How many strategies are in it, and by which criteria are they picked?

Here is the explanation Jee gave me:

The conventional bot runs 2 very simple strategies - GC pairs must be 60% and the dotplot must be clean. It was designed to test whether EteRNA players’ non-conventional strategies are playing a role in our bot’s performance. And they do : ]

What are the current ensemble classifier rules for EteRNA?