Sequence Fragment / Substructure Library

Introduction

A complex RNA secondary structure can be viewed as a collection of simpler substructures. The goal of this strategy is to use past synthesis data to build a library of sequence fragments organized by the substructures they were intended to create. The information in this library can then be used to build or rank new sequences for structures that have not yet been investigated in the lab.

Sequence fragments should not be re-used if they have failed to form the desired substructure in past syntheses. Conversely, sequence fragments should be re-used if they frequently permit the desired substructure to form.

Identifying a substructure

The previous lab, “Water Strider,” will be used as an example.

Each loop (hairpin, internal, or multi-branched) has a corresponding substructure that consists of the loop and its adjoining stacks. “Fused loops” should be counted as a single entity with one corresponding substructure. Because a stack adjoins a loop at each end, the sequence for every stack would appear twice in the fragment library. This is not necessarily problematic, however.

A note on my loop nomenclature: Starting from the branch closest to the 5’ end of the molecule, go around the loop and list whether each element is a branch (B) or an unpaired base (X). You’ll end up with something like 5’-BXBB-3’.
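To make the B/X labels concrete, here is a rough sketch of how they could be generated automatically. It assumes the target structure is given in dot-bracket notation without pseudoknots, which this thread doesn't specify, and the function names are my own. It only handles loops closed by a base pair, not the open exterior loop.

```python
def pair_table(structure):
    """Map each paired index to its partner for a dot-bracket string."""
    partner = {}
    stack = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            partner[i] = j
            partner[j] = i
    return partner


def loop_label(structure, i):
    """Label the loop closed by the pair whose opening bracket is at index i.

    'B' marks a branch and 'X' an unpaired base, read 5' to 3',
    starting with the closing branch itself.
    """
    partner = pair_table(structure)
    j = partner[i]
    label = ["B"]
    k = i + 1
    while k < j:
        if k in partner:
            # A new branch starts here; skip over everything it encloses.
            label.append("B")
            k = partner[k] + 1
        else:
            label.append("X")
            k += 1
    return "".join(label)
```

For the toy structure "(.(...)(...))", loop_label(structure, 0) gives "BXBB", the same shape as the example above. A result of plain "BB" just means two stacked pairs with no loop between them.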

Describing a substructure

What information should be specified about a substructure? I envision a hierarchical organization of the fragment library, from general to specific information about the substructure.

  • All sequence fragments
    • Basic substructure types (hairpin loops, internal loops, multiloops, fused loops)
      • General substructure information (loop sizes, branch distribution in multiloops)
        • Moderately specific information (length of stacks attached to loops)
          • Very specific information (instances in molecule)

Each sequence fragment would also be described by two other pieces of information: the number of times it has been synthesized, and the number of times it has successfully folded into the intended substructure.
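As a minimal sketch of what one library entry could look like: the class name, the field names, and the tuple used for the general-to-specific description are all hypothetical, just one way of holding the two counts next to the hierarchy above.

```python
from dataclasses import dataclass


@dataclass
class FragmentRecord:
    """One sequence fragment and its track record (hypothetical layout)."""
    sequence: str         # fragment bases, 5' to 3'
    path: tuple           # general-to-specific description of the substructure
    synthesized: int = 0  # times the fragment has been synthesized
    successes: int = 0    # times it folded into the intended substructure

    def record(self, folded_as_intended):
        """Add the outcome of one synthesis."""
        self.synthesized += 1
        if folded_as_intended:
            self.successes += 1

    def success_rate(self):
        """Fraction of syntheses that folded as intended, or None if untested."""
        if self.synthesized == 0:
            return None
        return self.successes / self.synthesized


def rank(records):
    """Order tested fragments from most to least reliable."""
    tested = [r for r in records if r.synthesized > 0]
    return sorted(tested, key=lambda r: r.success_rate(), reverse=True)
```

A path might look like ("hairpin", "loop size 4", "stack length 3"). Note that a raw success rate treats one success out of one synthesis as better than nine out of ten, so a real ranking would probably need to weigh the number of syntheses as well.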

A simple example

Things obviously get more convoluted when multiloops, asymmetric internal loops, and fused loops enter the picture. Also, there is still the matter of how less-than-exact matches are used when ranking or building new sequences. I’ll be posting my ideas in the next few days, but any and all input is welcome.

The image I posted as an example isn’t showing up.

I think you assume that substructures are isolated and independent. I doubt that RNA can be treated like integrated circuits, but it might be worth pursuing to determine if there is any merit.

That is indeed a simplifying assumption made by this strategy. Making that assumption allows EteRNA to “learn” from patterns that emerge in past synthesis data. On another thread, it was mentioned that EteRNA might begin high-throughput RNA synthesis. The EteRNA group will then have the problem of putting mountains of data to actual use.

You do raise an interesting point, and one that is not limited to my strategy. For example, the parameters used by RNA structure prediction software are based on observations of very small RNAs in solution. The bulk properties of these molecules are then used to formulate parameters that are applied to far more complex systems… such as the RNAs that the EteRNA group is synthesizing.

Back to your point, though. This strategy would compensate for the “isolation” of each substructure in two main ways.

First, the information describing each substructure will give some indication of how frequently it appears in the molecule. This will indirectly account for the fact that repeated use of the same sequence fragment creates alternative pairing possibilities.

Second, there is significant overlap between substructures. Each substructure is defined by a single loop (generally a destabilizing feature of the molecule) and the local environment that stabilizes it (nearby stacking interactions). Each stack will appear in two substructures - one for the loop at each end. When a trial sequence is evaluated, EteRNA may find that the sequence fragment corresponding to the stack stabilizes one loop but destabilizes the other. This is more realistic, as a given sequence may stabilize one part of a molecule and destabilize another.

I really like the idea of looking at the substructures and thinking of solutions as a hierarchy of simpler parts. Perhaps we could even use the synthesis results to test whether the substructures really behave as isolated and independent units?

Have you already extracted the synthesis data on substructures from the previous labs?

Found this thread *after* starting a new thread on Modular RNA using Junctures coupled with Index and Search

Very similar in concept, except that I propose using fixed-size stack ends as modular “junctures”.

Here’s a start. I took the PDB data that jpbida posted on github and scraped it looking for simple structures.

The Google doc spreadsheet Parts and Pieces from PDB contains a few sheets with pentaloops, tetraloops, triloops, stacks, 1-1 loops and 2-2 loops found in the source data. Some oddities exist in the data, including some reported non-GC/UA/UG bonded pairs, and unbonded pairs (e.g., in loops) that EteRNA would predict as bonding.
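For anyone who wants to flag the first kind of oddity themselves, here is a minimal sketch. It assumes each scraped entry can be expressed as a sequence plus a dot-bracket structure of the same length, which is not necessarily how the spreadsheet stores things; the function name is made up.

```python
# Pairs treated as "normal" here: GC, UA and UG, in either orientation.
CANONICAL = {
    ("G", "C"), ("C", "G"),
    ("A", "U"), ("U", "A"),
    ("G", "U"), ("U", "G"),
}


def noncanonical_pairs(sequence, structure):
    """Return (i, j, bases) for every reported pair that is not GC/UA/UG."""
    stack = []
    odd = []
    for j, ch in enumerate(structure):
        if ch == "(":
            stack.append(j)
        elif ch == ")":
            i = stack.pop()
            if (sequence[i], sequence[j]) not in CANONICAL:
                odd.append((i, j, sequence[i] + sequence[j]))
    return odd
```

An empty list means every reported pair is one of the three expected types. The second kind of oddity (unbonded bases that EteRNA would predict as bonding) needs a folding prediction to compare against and is not covered by this check.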

A rougher version of the scraped data with 98K entries (3.6MB) is available to users on request. (It is too big for a Google Doc spreadsheet.)

Feedback is welcome.

Oops. Bug in the original scraped data. Should be fixed now. Google Doc updated.

Hmm. Sometimes the same sequence is reported as folding different ways in different contexts. For example the sequence GGCUUAGAAGCAGCC is reported as folding 4 different ways across 22 occurrences. Should the library be restricted to sequences that are only reported as folding one way?
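One way to explore that question is to group the observations by sequence and count how many distinct folds each one shows. This is only a sketch: it assumes the scraped data can be reduced to (sequence, structure) pairs, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def folds_by_sequence(observations):
    """Count how often each sequence is reported with each structure."""
    folds = defaultdict(Counter)
    for sequence, structure in observations:
        folds[sequence][structure] += 1
    return folds


def consistent_sequences(observations):
    """Keep only sequences that are always reported with the same fold."""
    return {
        sequence: next(iter(counts))
        for sequence, counts in folds_by_sequence(observations).items()
        if len(counts) == 1
    }
```

consistent_sequences implements the strict option of restricting the library to one-fold sequences. The counts from folds_by_sequence support a softer alternative: keep every sequence but store how often each fold was seen, so a mostly reliable fragment isn't thrown out because of one unusual context.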