Modular RNA using Junctures coupled with Index and Search

ICP200 in chat last night stated that they (he? she?) have created a database of sequences, extracted from PDB, that fold into known secondary structures. It’s a start.

I’d like to see it done in a systematic way that would assist in mix-and-match modular RNA creation. What I propose is the following:

Use stack ends of some moderate length (try 3 base pairs to start) as a “juncture” tag. With six possible pairs at each position (GC, CG, AU, UA, GU, UG), there are at most 6*6*6=216 of these. Break known-to-fold RNA sequences (from PDB, EteRNA labs, etc.) into segments based on the junctures found in their secondary structure, then classify and index them. Using a fixed-size attached stack as a juncture will make matching pieces easier.
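To make the juncture count concrete, here is a minimal sketch that enumerates the tags and reads one off a sequence. The tag format (pairs joined by dashes, read from the outermost pair inward) and the function names are my own assumptions for illustration, not an agreed convention.

```python
from itertools import product

# The six allowed pairs, each written as (5' base)(3' base).
PAIRS = ["AU", "UA", "GC", "CG", "GU", "UG"]

def all_junctures(length=3):
    """Every possible juncture tag made of `length` stacked pairs."""
    return ["-".join(combo) for combo in product(PAIRS, repeat=length)]

def juncture_at(seq, i, j, length=3):
    """Tag for the stack end whose outermost pair is (i, j), reading inward.

    Assumes positions i..i+length-1 really are paired with j..j-length+1;
    no validation is done here.
    """
    return "-".join(seq[i + k] + seq[j - k] for k in range(length))

print(len(all_junctures()))
print(juncture_at("GGCAAAAGCC", 0, 9))
```

The tag string can then serve directly as a dictionary key or a database index column.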

There will be two basic types of sub-structures created: stacks (or “sticks”) and “fobs” (simple loops, hooks/tails, bulges, hairpins, multi-loops, and combinations of these joined by stacks of length less than 3). These will be like the knobs and sticks in a Tinkertoy ™ set, except instead of one type of juncture (hole) there are many. This higher degree of constraint on matching can help to *reduce* the exponential search space by partially constraining the matches at each stage.

Stacks will have two ends, each of which will match a specific kind of juncture. Even short stacks of length 3 have two junctures: one representing the orientation for each end. They are classified by their length and indexed by the two junctures they present.
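As a rough sketch of how a stack could be keyed by its two junctures: each end is read from the outside inward, so the far end sees the pairs in reverse order and with the strands swapped. The strand convention (both strands written 5' to 3'), the tag format, and the index layout are assumptions of mine.

```python
from collections import defaultdict

def stack_junctures(strand5, strand3, k=3):
    """Return (near, far) juncture tags for a stack.

    Both strands are written 5'->3', so strand5[0] pairs with strand3[-1].
    """
    pairs = [a + b for a, b in zip(strand5, reversed(strand3))]
    near = "-".join(pairs[:k])
    # From the far end the roles of the strands swap, so flip each pair.
    far = "-".join(p[::-1] for p in reversed(pairs[-k:]))
    return near, far

def index_stacks(stacks, k=3):
    """Index stacks by (length, near juncture, far juncture)."""
    index = defaultdict(list)
    for strand5, strand3 in stacks:
        near, far = stack_junctures(strand5, strand3, k)
        index[(len(strand5), near, far)].append((strand5, strand3))
    return index

print(stack_junctures("GGC", "GCC"))
print(stack_junctures("GGCA", "UGCC"))
```

Note that the length-3 stack still yields two different tags, one per orientation, which is the point made above.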

Fobs will have one or more junctures depending on the number of attached stacks. They are classified by the secondary shape that they present when you substitute # for each attached 3-element stack and everything beyond it (*excluding* the implied main branch). For example, a tetra-loop would be “....”, a 2-3 loop would be “..#...”, a compact 4-branch multi-loop with no intervening nucleotides would be “###”, and a 1-0 bulge next to a 0-1 bulge would be “.(#.)”.
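The shape classification can be sketched in a few lines. This assumes plain dot-bracket input with one "." per unpaired nucleotide and no pseudoknots; the function names and the `min_stick` parameter are hypothetical. Any stack of 3 or more pairs (plus everything it encloses) collapses to "#", while shorter stacks stay in the shape as parentheses.

```python
def pair_table(structure):
    """Map each paired index to its partner in a dot-bracket string."""
    partner, opened = {}, []
    for i, ch in enumerate(structure):
        if ch == "(":
            opened.append(i)
        elif ch == ")":
            j = opened.pop()
            partner[i], partner[j] = j, i
    return partner

def run_length(partner, i):
    """Number of consecutively stacked pairs starting at opener i."""
    j, n = partner[i], 0
    while i + n < j - n and partner.get(i + n) == j - n:
        n += 1
    return n

def fob_shape(structure, start, end, min_stick=3):
    """Shape of the region [start, end), with each attached stick as '#'."""
    partner = pair_table(structure)
    out, pos = [], start
    while pos < end:
        if structure[pos] == "(" and run_length(partner, pos) >= min_stick:
            out.append("#")
            pos = partner[pos] + 1   # skip the stick and all it encloses
        else:
            out.append(structure[pos])
            pos += 1
    return "".join(out)

multi = "(((" + "(((....)))" * 3 + ")))"
bulges = "(((" + "." + "(" + "(((....)))" + "." + ")" + ")))"
print(fob_shape(multi, 3, len(multi) - 3))
print(fob_shape(bulges, 3, len(bulges) - 3))
```

Here `start` and `end` are the positions just inside the innermost pair of the main-branch stack, which is how the main branch gets excluded from the shape.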

Every secondary structure can be broken into fobs joined by stacks. To piece together an RNA sequence, break the target secondary structure into stacks and fobs by finding the junctures and classifying the parts, then run a recursive search with backtracking to find parts that fit the bill. Start by picking some fob (possibly the tail) and look up pieces that fit its shape classification. Then look for stacks with matching junctures to extend the structure. Given a stack that leads to another fob, try to find a fob that matches the incoming stack juncture.
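A bare-bones sketch of the recursive search might look like this. It is simplified to a linear chain of parts (one fob, one stack, one fob, and so on) rather than a full branching structure, and the library contents, juncture names, and dictionary keys are all made up for illustration.

```python
# Toy library: classification key -> candidate parts.
# "in"/"out" are the juncture tags a part presents toward the
# previous and next part in the chain (hypothetical values).
LIBRARY = {
    "....": [
        {"name": "loopA", "in": None, "out": "J1"},
        {"name": "loopB", "in": None, "out": "J2"},
    ],
    "stack4": [
        {"name": "stackX", "in": "J2", "out": "J3"},
    ],
}

def assemble(slots, library, prev_out=None):
    """Fill `slots` in order; return a list of parts, or None if stuck."""
    if not slots:
        return []
    for part in library.get(slots[0], []):
        # Junctures must agree wherever two parts meet.
        if prev_out is not None and part["in"] != prev_out:
            continue
        rest = assemble(slots[1:], library, part["out"])
        if rest is not None:
            return [part] + rest
        # Otherwise backtrack and try the next candidate for this slot.
    return None

result = assemble(["....", "stack4"], LIBRARY)
print([p["name"] for p in result])
```

In the toy data, loopA is tried first but no stack accepts its juncture, so the search backs up and settles on loopB. A real version would recurse over every branch of a multi-loop instead of a single chain.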

To reduce the search space, it might be better to start at the hairpins and build attached stacks back to connecting loops or multi-loops. Possible combinations of hairpins and stacks can be cached for multiple assembly attempts of the attached fobs as the structure is filled in back towards the tail. Any heuristics used to guard against mismatched bonding can be applied at each step so that an entire structure is not built up only to throw it away because of a possibly sliding hairpin.
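One way to sketch the caching idea is to memoize the set of valid hairpin-plus-stack "arms" per (loop shape, stack length), with the heuristic check applied at that step. Everything below is toy data, and `passes_heuristics` is only a placeholder for whatever mismatch or slide test ends up being used.

```python
from functools import lru_cache
from itertools import product

# Made-up entries: hairpins as (name, juncture),
# stacks as (name, near juncture, far juncture).
HAIRPINS = {"....": [("hp1", "GC-GC-CG"), ("hp2", "CG-CG-GC")]}
STACKS = {4: [("st1", "GC-GC-CG", "AU-GC-GC"),
              ("st2", "CG-CG-GC", "GC-AU-UA")]}

def passes_heuristics(hairpin, stack):
    """Placeholder for per-step checks; always accepts in this sketch."""
    return True

@lru_cache(maxsize=None)
def hairpin_arms(loop_shape, stack_len):
    """All (hairpin, stack, open juncture) combos, computed once per key."""
    combos = []
    candidates = product(HAIRPINS.get(loop_shape, []),
                         STACKS.get(stack_len, []))
    for hp, st in candidates:
        if hp[1] == st[1] and passes_heuristics(hp, st):
            combos.append((hp[0], st[0], st[2]))
    return tuple(combos)

print(hairpin_arms("....", 4))
print(hairpin_arms("....", 4))      # second call is served from the cache
print(hairpin_arms.cache_info().hits)
```

Rejecting bad combinations inside `hairpin_arms` means they never reach the later, more expensive assembly steps.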

In addition to single fobs, a database could also be created of known-to-fold hairpins (end loop plus its full attached stack) or larger substructures.

Thanks for documenting the overall problem. I’m trying to write a program to automate the process you’ve outlined above. I have broken it into smaller coding challenges and posted them on the social coding site…

Maybe you could help clarify the challenges I’ve posted to make it easier for others to understand.

I’ve also posted a rough web interface for the database at…

I must admit, I don’t quite fully “get” your challenges myself. They are a bit more involved than what I have in mind. It does look interesting though.

I think the solution to challenge 3 could basically “build” a secondary structure you supply from fragments (taken from PDB or maybe even a database of successful EteRNA designs), breaking base pairs as needed.

Then you copy in the sequences from each fragment to create a new sequence for your design. If you had to break any base pairs to satisfy the target structure, your sequence would have gaps that could be filled in with the solution to challenge 2.
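A small sketch of that stitching step, assuming each chosen fragment comes with the position where it lands in the target. The placeholder character "N" for unfilled positions and the function names are my own choices.

```python
import re

def stitch(length, placements, gap="N"):
    """Copy fragment sequences into a design of `length` nucleotides.

    `placements` is a list of (start index, fragment sequence);
    positions no fragment covers are left as `gap`.
    """
    seq = [gap] * length
    for start, frag in placements:
        for k, base in enumerate(frag):
            seq[start + k] = base
    return "".join(seq)

def gap_spans(seq, gap="N"):
    """Half-open (start, end) spans that still need to be filled in."""
    return [m.span() for m in re.finditer(re.escape(gap) + "+", seq)]

design = stitch(10, [(0, "GGC"), (7, "GCC")])
print(design)
print(gap_spans(design))
```

The spans returned by `gap_spans` are the stretches that would be handed to whatever solves challenge 2.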

Here’s a start. I took the PDB data that jpbida posted on github and scraped it looking for simple structures.

The Google doc spreadsheet Parts and Pieces from PDB contains a few sheets with the pentaloops, tetraloops, triloops, stacks, 1-1 loops and 2-2 loops found in the source data. Some oddities exist in the data, including some reported bonded pairs that are not GC/UA/UG, and unbonded pairs (e.g. in loops) that EteRNA would predict as bonding.
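For anyone who wants to try something similar, here is a minimal sketch of pulling hairpin loops out of a record. It assumes each record is a sequence plus a dot-bracket string of the same length, which may not match the actual layout of the PDB data; the names are hypothetical and this is not the script used to build the spreadsheet.

```python
import re
from collections import defaultdict

LOOP_NAMES = {3: "triloop", 4: "tetraloop", 5: "pentaloop"}

def hairpin_loops(sequence, structure):
    """Group hairpin loops by size, keeping the closing pair's bases."""
    found = defaultdict(list)
    # An opener, a run of unpaired positions, then a closer is a hairpin loop.
    for m in re.finditer(r"\((\.+)\)", structure):
        name = LOOP_NAMES.get(len(m.group(1)))
        if name:
            found[name].append(sequence[m.start():m.end()])
    return dict(found)

print(hairpin_loops("GGCGAAAGCC", "(((....)))"))
```

Stacks and 1-1 or 2-2 loops need the pairing partners as well, so they are easier to find from a pair table than from a regular expression.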

A rougher version of the scraped data, with 98K entries (3.6MB), is available on request. (It is too big for a Google Doc spreadsheet.)

Feedback is welcome.

Oops. Bug in the original scraped data. Should be fixed now. Google Doc updated.