How can we improve search?

afirst · April 24, 2012, 7:58am

Hi everyone. I just joined the EteRNA team and will be improving the RNA search functionality. A bit about myself – I’m also an engineer at Google, and am excited to have the privilege of spending my 20% time on a project that is revolutionizing the way scientific research is done.

Right now, I’m collecting ideas on how you’d like us to improve the RNA database search. Please throw any suggestions you have my way. Thanks!

Eli_Fisker · April 24, 2012, 11:01am

Hi Andrew!

A big welcome to Eterna.

Engineer from Google, awesome! That might bring us closer to Adrien’s vision that “We need to become the Google of RNA”.

You have probably already been introduced to our forum posts with ideas on the topic. But in case you haven’t, I have dug up some of the posts from the forum with our ideas on a coming RNA library system. Here are the links.

New idea on the RNA sequence search tool
[How can we handle big amounts of data](https://getsatisfaction.com/eternagame/topics/how_can_we_handle_big_amounts_of_data)
In here are some ideas on a RNA library system:
Eterna dreams
CSV file

Good luck!

Edward_Lane · April 24, 2012, 3:06pm

Nice to have more coding power, welcome and thanks in advance

Eli_Fisker · April 24, 2012, 4:18pm

One more from here:

As I understand from the science papers Rhiju posted, RNA has motifs, special 3D patterns that recur. I’m thinking that if there is an information system that specializes in saving and grouping 3D information for comparative search (for when we know something about RNA’s 3D structure), that could be useful too.

I’m thinking that some other science projects involved in e.g. protein folding, molecule structure or ribosomes already have a library structure for viewing and searching things in 3D that we could perhaps learn from.

jandersonlee · April 24, 2012, 5:19pm

Hey Andrew, Great to have you on-board!

In addition to the forum posts that Eli pointed out above, the “Discussion on new lab system with 20,000 synthesis per month” thread has some discussion near the end on searching the proposed “database” of lab results.

Searching an RNA results database seems (to me, a relative search novice) to be complicated by how many variables there are to search on and how contextual and interrelated they are. For example:

Target/Estimated Secondary Shape:

Looking for solutions to a “tetraloop” may be relatively straightforward. Assuming you want a tetraloop with an attached stack of at least 3 base-pairs, you might search with a selection clause something like “shape contains ‘(((....)))’”. But how do you tie in whether you want “successful” results or “all” results, and constrain the selection to labs with a synthesis score >= 95? Also, how do you search for a more open pattern like a four-branch multiloop with no nucleotides between the stacks? Using a typical “*” in the search pattern for an arbitrary match won’t work, because “*” would also match runs of unbalanced brackets. You would have to add a character like “#” to represent a ()-balanced sequence of structure characters, then search for something like “((((((#)))(((#)))(((#))))))”.
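A minimal sketch of how the “#” idea could work, assuming structures are stored as plain dot-bracket strings. The function names here are hypothetical, and “#” is taken to mean any non-empty, bracket-balanced run. The pattern itself should be balanced, so that every bracket in a match pairs with another bracket inside the match.

```python
def balanced_ends(structure, start):
    """Yield each end index e where structure[start:e] is a non-empty balanced run."""
    depth = 0
    for i in range(start, len(structure)):
        ch = structure[i]
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return  # ran past the enclosing pair, nothing further can balance
        if depth == 0:
            yield i + 1


def match_at(pattern, structure, p, s):
    """Return the end index if pattern[p:] matches structure from position s, else None."""
    if p == len(pattern):
        return s
    if pattern[p] == "#":
        # Hypothetical wildcard: try every balanced run that starts at s.
        for end in balanced_ends(structure, s):
            result = match_at(pattern, structure, p + 1, end)
            if result is not None:
                return result
        return None
    if s < len(structure) and structure[s] == pattern[p]:
        return match_at(pattern, structure, p + 1, s + 1)
    return None


def find_shape(pattern, structure):
    """Return (start, end) of the first match of pattern in structure, or None."""
    for start in range(len(structure)):
        end = match_at(pattern, structure, 0, start)
        if end is not None:
            return (start, end)
    return None


# Made-up example structure: three hairpins on a multiloop, no unpaired bases between stacks.
multiloop = "((((((....)))(((....)))(((....))))))"
print(find_shape("((((((#)))(((#)))(((#))))))", multiloop))
print(find_shape("(((....)))", "..(((....))).."))
```

A plain substring test already covers the tetraloop case; the backtracking is only needed once “#” appears in the pattern.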

Sequence:

Sometimes you want to find lab results that contain a particular sequence of nucleotides instead, for example “GUGU”, to see whether it causes trouble or not. But then you might like the result to include the sequence, the matching portion of the secondary shape, and the SHAPE data results for that sequence. It might also be interesting to see what (if anything) that sequence was matched with in either the Target Shape, the Estimated Shape, or both.
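As a rough illustration, a sequence query could return the matching slice of each aligned field. The record layout below (`name`, `sequence`, `target`, `shape`) is an assumption made up for this sketch, not the actual EteRNA schema, and the data is invented.

```python
def find_motif(records, motif):
    """Return one hit per occurrence of motif, with the aligned structure and SHAPE slices."""
    hits = []
    for rec in records:
        seq = rec["sequence"]
        pos = seq.find(motif)
        while pos != -1:
            end = pos + len(motif)
            hits.append({
                "design": rec["name"],
                "start": pos,
                "sequence": seq[pos:end],
                "target": rec["target"][pos:end],  # matching part of the target shape
                "shape": rec["shape"][pos:end],    # SHAPE values for the same positions
            })
            pos = seq.find(motif, pos + 1)  # step by one so overlapping hits are found
    return hits


# Invented example record; field names and values are hypothetical.
records = [{
    "name": "demo",
    "sequence": "GGGUGUCCC",
    "target": "(((...)))",
    "shape": [0.1, 0.2, 0.1, 0.9, 0.8, 0.9, 0.2, 0.1, 0.1],
}]
for hit in find_motif(records, "GUGU"):
    print(hit)
```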

SHAPE Data:

The SHAPE Data is part of the lab results. It records whether a nucleotide appeared to be bonded or unbonded in the lab test. Right now the access is mostly visual rather than numerical. Also it can be binary (bonded/unbonded) or binned/continuous measurements. Sometimes it could be useful to know how successful a particular sequence was in forming a target shape, or to search for sequences and shapes that meet some particular threshold of success. How to specify that threshold in the search is an interesting question, for instance, how to look for “near misses” that had only one mismatch. Also, some cases like GNRA tetraloops may give misleading SHAPE results; how to ignore/compensate for that?
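One way to make “near misses” concrete, sketched under assumptions: SHAPE values are normalized floats, a value at or above a cutoff (0.5 here, an arbitrary choice) is read as unbonded, and records use a made-up `name`/`target`/`shape` layout. The `ignore` argument is a hypothetical way to skip positions, such as a GNRA tetraloop, whose signal is known to mislead.

```python
def shape_mismatches(target, reactivities, threshold=0.5, ignore=()):
    """Return positions where the thresholded SHAPE call disagrees with the target shape."""
    if len(target) != len(reactivities):
        raise ValueError("target and SHAPE data must have the same length")
    skipped = set(ignore)
    mismatches = []
    for i, (ch, value) in enumerate(zip(target, reactivities)):
        if i in skipped:
            continue
        expected_unbonded = ch == "."
        observed_unbonded = value >= threshold  # assumed cutoff, not an EteRNA rule
        if expected_unbonded != observed_unbonded:
            mismatches.append(i)
    return mismatches


def near_misses(records, max_mismatches=1, threshold=0.5):
    """Return (name, mismatch positions) for designs that missed by only a few positions."""
    hits = []
    for rec in records:
        bad = shape_mismatches(rec["target"], rec["shape"], threshold)
        if 0 < len(bad) <= max_mismatches:
            hits.append((rec["name"], bad))
    return hits


# Invented data: design "a" misses at one loop position, design "b" matches everywhere.
labs = [
    {"name": "a", "target": "(((....)))",
     "shape": [0.1, 0.1, 0.1, 0.9, 0.9, 0.2, 0.9, 0.1, 0.1, 0.1]},
    {"name": "b", "target": "(((....)))",
     "shape": [0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1]},
]
print(near_misses(labs))
```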

Synthesis Score:

Lab results are assigned a synthesis score from 0 to 100. Sometimes it may be useful to constrain a search to only look at labs with a score in a given range (like score>=95). This is based on the idea that mismatches and misfolds in one part of a design could affect what might otherwise have been a successful region.
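Tying a score range to a shape clause can be as simple as applying both filters to each record. A small sketch, assuming hypothetical `score` and `target` fields and invented data:

```python
def select_labs(records, min_score=0, max_score=100, shape_contains=None):
    """Keep records whose score is in range and whose target contains the given shape."""
    selected = []
    for rec in records:
        if not (min_score <= rec["score"] <= max_score):
            continue
        if shape_contains is not None and shape_contains not in rec["target"]:
            continue
        selected.append(rec)
    return selected


# Invented records for illustration only.
results = [
    {"name": "a", "score": 97, "target": "..(((....))).."},
    {"name": "b", "score": 80, "target": "..(((....))).."},
    {"name": "c", "score": 99, "target": "((((......))))"},
]
winners = select_labs(results, min_score=95, shape_contains="(((....)))")
print([rec["name"] for rec in winners])
```

Leaving `min_score` at 0 gives the “all results” case, and raising it gives the “successful” one.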

Visual/Query-By-Example:

Ultimately it might be nice to have a visual search interface like the puzzle-maker where you could build a shape, fill in some of the nucleotides and leave others as undefined (e.g. N=any) or partially constrained (R=GA, Y=CU), select some of the nodes to define a target shape (e.g. 4 branch multi-loop), then ask for matches. This might work well for associating Secondary Shape and Sequence in a query, but how to add SHAPE Data constraints etc. is even more of a UI research question.
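Behind such an interface, the partially constrained letters could be translated into a regular expression. This sketch covers only the codes mentioned above (N, R, Y) plus the four plain bases; the function name is hypothetical.

```python
import re

# Only the codes named in the post; a full IUPAC table would have more entries.
CODES = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "[AG]",    # purine
    "Y": "[CU]",    # pyrimidine
    "N": "[ACGU]",  # any nucleotide
}


def compile_query(query):
    """Turn a query such as 'GNRA' into a compiled regular expression."""
    return re.compile("".join(CODES[ch] for ch in query.upper()))


gnra = compile_query("GNRA")
match = gnra.search("CCGAAAGG")
print(match.group(), match.start())
```

An unknown letter raises `KeyError`, which seems a reasonable way to reject a malformed query in a sketch like this.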

JavaScript:

There is talk of creating a programming mode where users can design and test (against the model) RNA sequences in JavaScript. Some way to integrate the database search with this functionality could be a great boon. For example, a design could be specified so that at least the initial value for some region is selected from a query result, or from a user library of shape data populated from a query result. Just thinking…

Adrien_Treuille · April 24, 2012, 7:50pm

I’m so excited to see progress on this side of EteRNA.

Quasispecies · April 25, 2012, 3:20am

I had the idea of building a “fragment library” a while ago.

The idea is to break every sequence into fragments that correspond to a loop and its attached stack(s). This database could be searched based on:

-Loop type (hairpins, internal loops, bulges, multiloops, and external loops)
-Loop size (number of unpaired bases, branch distribution in multiloops)
-Length of their attached stack(s)
-Synthesis score(s) of the molecule(s) where the fragment occurs
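A sketch of how a dot-bracket structure might be broken into such fragments, assuming no pseudoknots. The dictionary keys are hypothetical, and external loops are left out to keep the example short.

```python
def pair_table(structure):
    """Map each paired position to its partner."""
    stack, pairs = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            j = stack.pop()
            pairs[i], pairs[j] = j, i
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def loops(structure):
    """List the loops closed by a base pair, with type, size and closing stack length."""
    pairs = pair_table(structure)
    found = []
    for i in sorted(pairs):
        j = pairs[i]
        if j < i:
            continue  # visit each pair once, from its opening side
        unpaired, branches, k = 0, [], i + 1
        while k < j:
            if k in pairs:
                branches.append((k, pairs[k]))
                k = pairs[k] + 1  # jump over the inner branch
            else:
                unpaired += 1
                k += 1
        if not branches:
            kind = "hairpin"
        elif len(branches) == 1:
            if unpaired == 0:
                continue  # two stacked pairs, not a loop
            left = branches[0][0] - i - 1
            right = j - branches[0][1] - 1
            kind = "bulge" if 0 in (left, right) else "internal"
        else:
            kind = "multiloop"
        stem = 1  # length of the stack that closes this loop
        while pairs.get(i - stem) == j + stem:
            stem += 1
        found.append({"type": kind, "size": unpaired, "branches": len(branches),
                      "stem": stem, "closing_pair": (i, j)})
    return found


for fragment in loops("((..((....))..))"):
    print(fragment)
```

Each entry could then be stored with the synthesis score of the design it came from, which gives the four search keys listed above.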

jandersonlee · April 25, 2012, 4:03pm

Curious thought: could it be useful to include mismatched loops in this library, tagged as such? That is, if a loop forms in an alternative folding of a sequence, then in some sense, Nature is showing a *preference* for that loop/folding. Of course it depends on how accurate the alternative folding predictions are, but such an index could possibly pose a useful resource.

Also, in addition to loops, stacks and necks (yes I know necks are stacks) could be a useful library.

Quasispecies · April 25, 2012, 4:48pm

Yeah, that was sort of the idea of including synthesis score. Maybe instead of just including synthesis score of the design overall, include a score of the fragment itself (how well did the fragment fold, regardless of how well the molecule folded).
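One possible definition of such a fragment score, offered only as a sketch: the percentage of positions inside the fragment whose thresholded SHAPE call agrees with the target. The 0.5 cutoff and the 0 to 100 scale are assumptions; this is not how EteRNA computes its synthesis score.

```python
def fragment_score(target, reactivities, start, end, threshold=0.5):
    """Percent of positions in target[start:end] where SHAPE agrees with the target."""
    if not 0 <= start < end <= len(target):
        raise ValueError("fragment range is outside the design")
    agree = 0
    for ch, value in zip(target[start:end], reactivities[start:end]):
        # A dot should be unbonded (high value); a bracket should be bonded (low value).
        if (ch == ".") == (value >= threshold):
            agree += 1
    return 100.0 * agree / (end - start)


# Invented data: the stack misfolds at position 2, but the loop itself looks fine.
target = "(((....)))"
shape = [0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1]
print(fragment_score(target, shape, 0, 10))  # whole design
print(fragment_score(target, shape, 3, 7))   # loop only
```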

Hopefully players (or the EteRNA bot) could build an RNA piece-by-piece, using fragments that were successful in past labs and avoiding those that were not. I tried to do that in the aptamer “cross” labs, with varying degrees of success.

Eli_Fisker · April 25, 2012, 5:01pm

I really like the idea of a fragment score. That would mean I could go hunting for my interchangeable neck theory. And we will find out how well it is possible to play with RNA fragments as Legos.

tsuname · April 26, 2012, 5:48pm

Hi all,

I’m a member of the Das lab, where we have recently finished coding up a search tool for our repository of chemical mapping data. The repository is found at http://rmdb.stanford.edu/repository/ and the search tool at http://rmdb.stanford.edu/repository/a…. Maybe you guys could take a look at the tool and see if there is anything you like/dislike about it as a starting point for the EteRNA search tool. Just a bit of a warning: we are just starting to test the tool and it may break =P

ICP200 · April 26, 2012, 6:16pm

Hi Quasispecies and Fisker. I’ve been thinking about building a 3D homology program using component-based assembly. Essentially, breaking a large secondary structure into pieces and then seeing if a 3D structure for those pieces exists. I’ve posted all the known 3D components at github and I’m putting together some coding challenges to build an automated assembly program. You can download the sequences and secondary structures from github. Check out challenge #2 for the details.

https://github.com/jpbida/RSIM/wiki/C…

Eli_Fisker · April 26, 2012, 7:00pm

Hi Jpbida!

Sounds awesome that you are building a 3D RNA program. Thanks for sharing the details with us.

I’m having trouble with downloading and installing the file. I get the message 502 Bad Gateway, nginx/1.0.13

Eli

ICP200 · April 26, 2012, 7:18pm

Are you having problems getting to the wiki? or using git to clone the repository?

Eli_Fisker · April 26, 2012, 8:16pm

Hi Jpbida!

I’m having a problem with Git.

ICP200 · April 26, 2012, 8:28pm

Github has a lot of documentation describing how to clone a repository. Google “cloning a github repository” and you should be able to find a solution.

If you don’t need the code and only want the data you can download the files through the web.

https://github.com/jpbida/RSIM/tree/m…

All the pdb files for individual components are in comps.tgz
The secondary structures of the components are in ss_comps.txt
The sequences are in seqs.tgz

Eli_Fisker · April 26, 2012, 8:41pm

Hi Jpbida, now I know where to look for the data. I’m still not sure how to use it, as I don’t know how to see from the data whether a sequence does well or not. But time and chat discussions will probably help. Thx for pointing me in the right direction.

ICP200 · April 26, 2012, 9:40pm

Hi Andrew,

Welcome to EteRNA. I outlined a general graph representation that could be used for search and for building classifiers or scoring functions. You can check out the description here:

https://github.com/jpbida/RSIM/wiki/C…

The idea is to represent all RNA structures as graphs and use existing algorithms to search for subgraphs.
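To make the idea concrete, here is one common way to encode a structure as a graph: one node per nucleotide, with edges for backbone links and base pairs. This is an illustrative assumption and not necessarily the representation described on the wiki; the function name is hypothetical, and the subgraph search itself (which a graph library would normally provide) is not shown.

```python
def rna_graph(sequence, structure):
    """Return (nodes, edges): nodes map index -> base, edges map (i, j) -> edge type."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    nodes = dict(enumerate(sequence))
    edges = {}
    for i in range(len(sequence) - 1):
        edges[(i, i + 1)] = "backbone"  # consecutive nucleotides
    stack = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            edges[(stack.pop(), i)] = "pair"  # base pair between opening and closing position
    if stack:
        raise ValueError("unbalanced structure")
    return nodes, edges


nodes, edges = rna_graph("CCUUUAAGG", "(((...)))")
pairs = sorted(edge for edge, kind in edges.items() if kind == "pair")
print(pairs)
```

With node and edge labels in place, a motif query becomes a smaller labelled graph to look for inside each stored design.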

ICP200 · April 27, 2012, 3:23am

@Fisker
I just created a merged dataset that makes it a little easier to find sequences that match a target secondary structure.

https://raw.github.com/jpbida/RSIM/ma…

You can search for a target secondary structure in this file and find a sequence that has been shown to experimentally fold up into it.

For example,

Searching for (((...))) finds the sequence CCUUUAAGG, in the pdb file 1c2w (http://www.rcsb.org/pdb/home/home.do)). So you have the secondary structure, the sequence, and the 3D structure. Hope this helps.

Eli_Fisker · April 27, 2012, 6:21am

Hi Jpbida!

Big thx, now it looks more understandable.

I can’t access the last link, where I should be able to get hold of the 3D structure. It says: HTTP Status 404 - /pdb/home/home.do) - description The requested resource (/pdb/home/home.do)) is not available. Mat got the same problem.