NuPACK 4 and JupyterLab for T-Box lab

jandersonlee · January 2, 2023, 9:05pm

I was wondering about trying to use the NuPACK4/Jupyterlab environment for analyzing sequences for the T-Box lab. Does anyone know:

How to specify an aptamer bonus for a specific region forming a given shape?
How to generate/read pairing probability data for a sequence?

In particular I’d like to specify a -16.34 kcal bonus for the aptamer region forming (base pairs 6:29,14:28)

It’s been years since I did any work with NuPACK and the last time I worked with it, it was still written C++. I’m not sure I have the time to try and relearn all that stuff again, but it might be nice to play with it in python if it can do what I need since I’ve gotten more than a bit rusty with my Javascript and C++ these past few years.

LFP6 · January 3, 2023, 2:52pm

If I’m understanding correctly, I don’t think any of our models (NUPACK included) natively support binding site bonuses - we’ve actually modified the model code to insert these bonuses.

Looks like the pairs function calculates base pairing probabilities: Utilities Jobs - NUPACK 4.0 User Guide

jandersonlee · January 3, 2023, 5:31pm

Thanks. There was a version of ViennaRNA that had an aptamer bonus hook, but I cannot seem to find an example of my old code that used it. I did find some examples borrowed from Nando of using the suboptimal structures feature to estimate the ensemble with bonuses. I’ll either work with that in C++ or adopt the strategy to NuPACK in Python I guess.

wayment · January 3, 2023, 10:21pm

Hi Jandersonlee!

You might find Arnie helpful for doing aptamer bonus calculations. There’s three different ways it can handle doing it –

It can handle the vienna motif syntax for Vienna. Here’s an example in a colab notebook of how that works.

If you know exactly where the motif is, it can calculate the bonus in eternafold (via constrained partition functions)
if you know exactly where the motif is and it has one base pair that’s always closing – you can approximate it with all the packages, including Nupack.

The second two variations are more complicated but would be happy to write down how it would work if you have an example sequence and motif in mind. Arnie has a lot of capabilities that it would be awesome to get more in the hands of eterna players!

jandersonlee · January 4, 2023, 12:09am

Hmm. The old strategy I used to use with constraining subopt results does not seem to work well with a -16.34 kcal bonus. It’s too far from the MFE for meaningful results (at least with the Vienna version I have).

jandersonlee · January 4, 2023, 2:55am

Thanks. The motif bit was the part I was missing. Unfortunately I cannot seem to get it to work quite as I expect for the T-Box lab examples. I don’t know if the -16.34 kcal is too big of a bonus for it to work well or what.

jandersonlee · January 4, 2023, 12:43pm

@wayment I’ve tried using Arnie with package=“vienna_2” and the motif option but it doesn’t seem to do what I expect. I get the same MFE shape with or without a motif specified although the computed dG does change with a motif. If I compute an alternative folding by artificially forcing the motif and evaluate is free energy and add in the bonus I get a lower energy than the MFE shape with the moftif_bonus used.

#!/usr/bin/env python3

import sys, os
os.environ['ARNIEFILE'] = '/content//COLAB_ARNIE_FILE.txt'

from arnie.pfunc import pfunc
from arnie.bpps import bpps
from arnie.mfe import mfe
from arnie.free_energy import free_energy

seq='GUGGGGUGGUACCGCGAUAAUCAAUCGUCCCCUUGUGUUCACGAGGGGGCGUUUUU'
s1=seq[5:14]
s2=seq[27:29]
s3=seq[5:29]
s4=seq[2:32]
# motif_bonus=''.join([s1,'&',s2,',(.......(&)),-16.34'])
# motif_bonus=''.join([s3,',(.......(((((.....)))))),-16.34'])
apt_cons = '((((xxxxxxx(((((xxxxx)))))))))'
bonus_s='-16.34'
bonus=float(bonus_s)
motif_bonus=''.join([s4,',',apt_cons,',',bonus_s])

# compute the MFE its free_energy and dG without bonus
mfe_struct = mfe(seq,package='vienna_2')
dG = pfunc(seq, package='vienna_2', return_free_energy=True)
mfe_cons = mfe_struct.replace(".","x")
fe = free_energy(seq,mfe_struct)

# for vienna, the key is supposed to be to use the motif syntax

mfe_struct_with_bonus = mfe(seq,package='vienna_2',motif=motif_bonus)
dG_with_bonus = pfunc(seq, package='vienna_2', motif=motif_bonus, return_free_energy=True)

# construct an alternative folding that has the motif
# fold the post motif part by itself
# attach that to the motif
# estimate the free energy with the bonus

mfe_tail = mfe(seq[32:],package='vienna_2')
cons2 = 'xx'+apt_cons+mfe_tail.replace('.','x')
cons2_shape = cons2.replace('x','.')
fe2c = free_energy(seq,cons2)+bonus
fe2s = free_energy(seq,cons2_shape)+bonus

b = bpps(seq,package='vienna_2',motif=motif_bonus)

print('seq="'+seq+'"')
print('motif="'+motif_bonus+'"')

print(mfe_struct, dG, fe)
print(mfe_struct_with_bonus, dG_with_bonus)
print(cons2, fe2c)
print(cons2_shape, fe2s)

which produces:

$ ./tbox.py 
seq="GUGGGGUGGUACCGCGAUAAUCAAUCGUCCCCUUGUGUUCACGAGGGGGCGUUUUU"
motif="GGGGUGGUACCGCGAUAAUCAAUCGUCCCC,((((xxxxxxx(((((xxxxx))))))))),-16.34"
((((.......))))..........(((((((((((....)))))))))))..... -24.75 -24.54
((((.......))))..........(((((((((((....)))))))))))..... -28.85
xx((((xxxxxxx(((((xxxxx)))))))))xx((x(((xxxx)))x))xxxxxx -28.14
..((((.......(((((.....)))))))))..((.(((....))).))...... -28.58```

wayment · January 4, 2023, 4:37pm

Thanks for this example! There was a bug in how mfe.py was creating the Vienna command line call. I pushed a fix that should make the motif option in mfe() work. The output I get for your example now is

seq="GUGGGGUGGUACCGCGAUAAUCAAUCGUCCCCUUGUGUUCACGAGGGGGCGUUUUU"
motif="GGGGUGGUACCGCGAUAAUCAAUCGUCCCC,((((xxxxxxx(((((xxxxx))))))))),-16.34"
((((.......))))..........(((((((((((....)))))))))))..... -24.75 -24.54
..((((.......(((((.....)))))))))..((.(((....))).))...... -28.85
xx((((xxxxxxx(((((xxxxx)))))))))xx((x(((xxxx)))x))xxxxxx -28.14
..((((.......(((((.....)))))))))..((.(((....))).))...... -28.58

jandersonlee · January 4, 2023, 5:15pm

@wayment Awesome! Thanks for the quick fix. I’ll see if I can do some analysis of designs this way. I still don’t quite get why the dG and fe energy values are so different and the direction of their difference, but maybe I don’t quite understand what they represent.

Any chance we could get a hook added for Vienna2 subopt output as well? That’s what I prefer to use for my detailed design ensemble analysis.

Thanks again!

jandersonlee · January 4, 2023, 5:31pm

@wayment Also, the documentation for the free_energy function lists the motif option in the doctring, but it’s one of the declared parameters. An oversight? In which direction?

wayment · January 4, 2023, 7:22pm

Yup, RNAsubopt is in there too!

#Stochastically sample structures

from arnie.sample_structures import sample_structures

seq='GUGGGGUGGUACCGCGAUAAUCAAUCGUCCCCUUGUGUUCACGAGGGGGCGUUUUU'
sample_structures(seq)

output:

['((((.......)))).......((.(((((((((((....))))))))))).))..',
 '((((.......))))...........((((((((((....))))))))))......',
 '((((.......)))).......((.(((((((((((....))))))))))).))..',
 '((((.......)))).......((.(((((((((((....))))))))))).))..',
 '((((.......))))..........(((((((((((....))))))))))).....',
 '((((.......))))..........(((((((((((....))))))))))).....',
 '((((.......))))...((..((.(((((((((((....))))))))))))).))',
 '((((.......)))).......((.(((((((((((....))))))))))).))..',
 '((((.......))))..........(((((((((((....))))))))))).....',
 '((((.......)))).......((.(((((((((((....)))))))))))..)).']

Unfortunately RNAsubopt itself doesn’t accept the Vienna motif input. But with both Vienna (and EternaFold!), you can do the same stochastic sampling while enforcing a constraint for the motif in a specific position:

seq='GUGGGGUGGUACCGCGAUAAUCAAUCGUCCCCUUGUGUUCACGAGGGGGCGUUUUU'
constraint='..((((xxxxxxx(((((xxxxx)))))))))........................'

sample_structures(seq, constraint=constraint)

Out:

['..((((.......(((((.....)))))))))...(((((......))))).....',
 '..((((.......(((((.....))))))))).(((.(((....))).))).....',
 '..((((.......(((((.....)))))))))...(((((......))))).....',
 '..((((.......(((((.....)))))))))...(((((......))))).....',
 '..((((.......(((((.....)))))))))..((.(((....))).))......',
 '..((((.......(((((.....))))))))).(((.(((....))).))).....',
 '..((((.......(((((.....)))))))))..((.(((....))).))......',
 '..((((.......(((((.....))))))))).(((.(((....))).))).....',
 '..((((.......(((((.....)))))))))..((((((......))))))....',
 '..((((.......(((((.....)))))))))..((.(((....))).))......']

For eternafold:

sample_structures(seq, constraint=constraint, package='eternafold')

Out:

['(.((((.......(((((.....))))))))).)......(((......)))....',
 '(.((((.......(((((.....))))))))).)......(((......)))....',
 '(.((((.......(((((.....))))))))).).....(.....)..........',
 '(.((((.......(((((.....))))))))).........)..............',
 '((((((.......(((((.....)))))))))........))..............',
 '(.((((.......(((((.....))))))))).)........(..........)..',
 '..((((.......(((((.....))))))))).......(.(...)).........',
 '..((((.......(((((.....)))))))))............(....)......',
 '..((((.......(((((.....)))))))))........(((......)))....',
 '..((((.......(((((.....)))))))))..(((((.......).))))....',
 '..((((.......(((((.....)))))))))..((((.(......).))))....',
 '..((((.......(((((.....)))))))))........................', 
...

[I haven’t updated the eternafold hook to not return 100 structures every time, ha.]

I updated the above colab notebook with this too.

jandersonlee · January 4, 2023, 9:13pm

@wayment Thanks. Too bad the motif doesn’t work for subopt. I’ve made some changes that allow for non-stocastic results if you specify a “delta” value for Vienna2; it returns a list of (shape,energy) pairs. If you would like to look at the changes I can forward them.

>>> sample_structures('GAGAGAGAAAACUCUCUCU',delta=3)
[('(((((((....))))))).', -10.2), ('.(((((((....)))))))', -8.3), ('((((((......)))))).', -7.7), ('..((((((....)))))).', -7.3)]

wayment · January 4, 2023, 10:16pm

Thats awesome! It would probably make the most sense to structure it as a different function in Arnie for clarity, since sample_structures is used in other code to truly stochastically sample from the boltzmann distribution. Maybe calling it get_subopt_structures ?

Would you be willing to submit it as a PR to the repo? GitHub - DasLab/arnie: Python utility to estimate, compare, and reweight RNA energetics across many secondary structure algorithms.

jandersonlee · January 4, 2023, 11:08pm

I didn’t change the behavior of sample_structures when delta is the default 0, so old code will be unaffected. If you want to split it out in a separate function you can, but it shares so much code that the principle of code factoring would probably make that function and sample_structures both wrappers to a common function in any case. I’ve sent you an Eterna PM with a link to a Google doc copy of the altered file since I don’t know how to do a Github pull request.

wayment · January 5, 2023, 4:55pm

yes, a separately named function could certainly wrap sample_structures.

I’m afraid I can’t promise time to add Arnie functionalities beyond incorporating small bug fixes, but its good it sounds like you have a working script that does what you want?

Jennifer_Pearl · January 6, 2023, 5:16pm

I have nando’s aptamer code for vienna2 if you want it. I know I have the compiled code and pretty sure I have the source code. I actually think nando’s github still has it actually. BTW, nupack 4.0 now has native python libraries and that is what I am using for my new version of Sara. I can share my github with you that has my current SARA development code to give you a jumpstart. I havn’t done much in it lately, tbh, but it does grab the pairing probabilies. The new nupack version makes it really easy as it outputs teh pairs as a matrix at first and can natively convert it to a numpy array structure. You can then iterate through it and grab what you need… (i lobe python soooo much) Nupack does not have an aptamer bonus but it does allow for multistrand folding as well as controlling for concentrations of mRNA in solution. Im trying to get back into the research again but been dealing with massive burnout from everything over the last few years.

jandersonlee · January 6, 2023, 5:52pm

My power was out for a day an a half with the CA storms so I’m slow in responding. I’m fine making changes myself and forwarding them to you if you feel like including them or not. I always try to be careful not to break existing functionality unlike some agile/fragile developers. However if that’s too much work for you I’ll just keep the changes for myself.

jandersonlee · January 6, 2023, 5:57pm

@Jennifer_Pearl I have a version of Nando’s C++ based aptamer code that I’ve been working with, but Python is more in the front of my brain these days so Arnie seems to fit the bill for me at the moment. As for NuPACK 4.0, I hesitate switching horse at the moment as I’m just trying to quickly hack something for T-Box and Arnie with Vienna2 seems to do what I need, but thanks for the heads up. I think I took a quick look at it and it has very restrictive licensing too.

Jennifer_Pearl · January 8, 2023, 11:06pm

@jandersonlee I forgot to mention that Sara is right now written to the point that it will take one of the spreadsheets generated by the wetlab and store all the data in datastructures I wrote, and each lab is its own thing. Its pretty nice actually I think. You can then feed that data programmatically into anything you want. It might make your stuff a bit easier in the future… I wrote it with the intention of integrating Sara with some deap learning later on and wanted to be able to feed data to whatever I cooked up.

Right now im using it to have python cross check pairs and probs with other designs and find commonalities in structures.