Hey Andrew, great to have you on board!
In addition to the forum posts that Eli pointed out above, the thread “Discussion on new lab system with 20,000 synthesis per month” has some discussion near the end about searching the proposed “database” of lab results.
Searching an RNA results database seems (to me, a relative search novice) to be complicated by how many variables there are to search on and how contextual and interrelated they are. For example:
Target/Estimated Secondary Shape:
Looking for solutions to a “tetraloop” may be relatively straightforward. Assuming you want a tetraloop with an attached stack of at least 3 base pairs, you might search with a selection clause something like “shape contains ‘(((....)))’”. But how do you tie in whether you want “successful” results or “all” results, and constrain the selection to labs with a synthesis score >= 95? Also, how do you search for a more open pattern, like a four-branch multiloop with no nucleotides between the stacks? Using a typical “*” in the search pattern for an arbitrary match won't work, because “*” would also match unbalanced runs of parentheses. You would have to add a character like “#” to represent a ()-balanced sequence of structure characters, then search for something like “((((((#)))(((#)))(((#))))))”.
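To make the “#” idea concrete, here is a rough Python sketch of how such a matcher might work. It is only an illustration under my own assumptions: the record fields (`name`, `score`, `target`) and the function names are made up, not part of any existing lab database, and “#” is allowed to match an empty run.

```python
def balanced_ends(structure, start):
    """Yield each end index e where structure[start:e] is ()-balanced."""
    yield start  # the empty run counts as balanced
    depth = 0
    for e in range(start, len(structure)):
        if structure[e] == "(":
            depth += 1
        elif structure[e] == ")":
            depth -= 1
            if depth < 0:  # this ")" closes a pair opened before `start`
                return
        if depth == 0:
            yield e + 1


def match_here(pattern, structure, p, s):
    """Return the end index if pattern[p:] matches structure at s, else None."""
    if p == len(pattern):
        return s
    if pattern[p] == "#":
        # Try every balanced run, shortest first, and backtrack on failure.
        for e in balanced_ends(structure, s):
            end = match_here(pattern, structure, p + 1, e)
            if end is not None:
                return end
        return None
    if s < len(structure) and structure[s] == pattern[p]:
        return match_here(pattern, structure, p + 1, s + 1)
    return None


def find_shape(pattern, structure):
    """Return (start, end) spans where the pattern matches the structure."""
    spans = []
    for s in range(len(structure)):
        end = match_here(pattern, structure, 0, s)
        if end is not None:
            spans.append((s, end))
    return spans


def search(results, pattern, min_score=0, field="target"):
    """Keep records scoring at least min_score whose `field` contains the pattern."""
    return [r for r in results
            if r["score"] >= min_score and find_shape(pattern, r[field])]


# Made-up records, for illustration only.
results = [
    {"name": "A", "score": 97, "target": "..(((....)))..", },
    {"name": "B", "score": 80, "target": "..(((....)))..", },
    {"name": "C", "score": 96, "target": "(((" + "(((....)))" * 3 + ")))"},
]

tetraloops = search(results, "(((....)))", min_score=95)
multiloops = search(results, "((((((#)))(((#)))(((#))))))", min_score=95)
```

Passing a different `field` (say, a hypothetical `estimated` entry) would run the same query against the estimated shape instead of the target shape.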
Sometimes you want to find lab results that contain a particular sequence of nucleotides instead. For example, you might look for “GUGU” to see whether it causes trouble. But then you might like the result to include the sequence, the matching portion of the secondary shape, and the SHAPE data results for that sequence. It might also be interesting to see what (if anything) that sequence was paired with in the Target Shape, the Estimated Shape, or both.
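A sequence search that returns that kind of context could look roughly like the sketch below. The record layout (`sequence`, `target`, `shape_data`) is hypothetical, the SHAPE values are invented, and I am assuming one SHAPE value per nucleotide with 1 meaning “appeared unbonded”.

```python
def pair_table(structure):
    """Map each index to its partner index, or None if unpaired."""
    partner = [None] * len(structure)
    stack = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")" and stack:
            j = stack.pop()
            partner[i], partner[j] = j, i
    return partner


def find_motif(record, motif, field="target"):
    """Return one dict per occurrence of motif, with its structural context."""
    seq = record["sequence"]
    partner = pair_table(record[field])
    hits = []
    start = seq.find(motif)
    while start != -1:
        end = start + len(motif)
        hits.append({
            "start": start,
            "sequence": seq[start:end],
            "shape": record[field][start:end],
            "shape_data": record["shape_data"][start:end],
            # Nucleotide each position is paired with; "-" if unpaired.
            "paired_with": "".join(
                seq[p] if p is not None else "-" for p in partner[start:end]
            ),
        })
        start = seq.find(motif, start + 1)  # allow overlapping hits
    return hits


# Made-up record, for illustration only.
record = {
    "sequence":   "GGGUGUGAAAACACCC",
    "target":     "((((((....))))))",
    "shape_data": [0] * 6 + [1] * 4 + [0] * 6,
}

hits = find_motif(record, "GUGU")
```

Running the same call with `field` set to an estimated-shape entry would show what the motif pairs with there, so the two answers can be compared side by side.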
The SHAPE data is part of the lab results. It records whether each nucleotide appeared to be bonded or unbonded in the lab test. Right now the access is mostly visual rather than numerical. The data can also be binary (bonded/unbonded) or binned/continuous measurements. Sometimes it could be useful to know how successful a particular sequence was in forming a target shape, or to search for sequences and shapes that meet some particular threshold of success. How to specify that threshold in the search is an interesting question; for instance, how do you look for “near misses” that had only one mismatch? Also, some cases like GNRA tetraloops may give misleading SHAPE results; how do you ignore or compensate for that?
Lab results are assigned a synthesis score from 0 to 100. Sometimes it may be useful to constrain a search to only look at labs with a score in a given range (like score >= 95). This is based on the idea that mismatches and misfolds in one part of a design could affect what might otherwise have been a successful region.
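One possible way to phrase a “near miss” query is sketched below. Everything here is an assumption on my part: continuous SHAPE values are turned into unbonded/bonded calls with an arbitrary 0.5 cutoff (higher meaning more reactive, so more likely unbonded), GNRA tetraloop positions are simply skipped rather than corrected, and the record fields are invented.

```python
import re


def binarize(reactivities, threshold=0.5):
    """True where a nucleotide looks unbonded (value at or above threshold)."""
    return [value >= threshold for value in reactivities]


def gnra_loop_positions(sequence, target):
    """Indices of loop nucleotides in GNRA tetraloops closed by a base pair."""
    positions = set()
    for m in re.finditer(r"(?=G[ACGU][AG]A)", sequence):
        i = m.start()
        if i > 0 and target[i - 1:i + 5] == "(....)":
            positions.update(range(i, i + 4))
    return positions


def mismatches(target, unbonded, ignore=()):
    """Indices where the SHAPE call disagrees with the target shape."""
    return [i for i, (ch, free) in enumerate(zip(target, unbonded))
            if i not in ignore and (ch == ".") != free]


def near_misses(results, max_mismatches=1):
    """Records with at least one but at most max_mismatches disagreements."""
    hits = []
    for r in results:
        skip = gnra_loop_positions(r["sequence"], r["target"])
        bad = mismatches(r["target"], binarize(r["shape_data"]), skip)
        if 0 < len(bad) <= max_mismatches:
            hits.append((r["name"], bad))
    return hits


# Made-up record, for illustration only.
results = [{
    "name": "X",
    "sequence": "GGGGAAACCC",
    "target":   "(((....)))",
    "shape_data": [0.1, 0.1, 0.1, 0.9, 0.2, 0.8, 0.9, 0.1, 0.6, 0.1],
}]

found = near_misses(results)
```

Whether skipping GNRA loops outright is the right compensation is an open question; the `ignore` argument is just a hook where a better rule could go.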
Ultimately it might be nice to have a visual search interface like the puzzle-maker, where you could build a shape, fill in some of the nucleotides and leave others undefined (e.g. N = any) or partially constrained (R = G or A, Y = C or U), select some of the nodes to define a target shape (e.g. a four-branch multiloop), then ask for matches. This might work well for associating Secondary Shape and Sequence in a query, but how to add SHAPE data constraints and so on is even more of a UI research question.
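Underneath any such interface, the partially constrained nucleotides could be handled by translating the N/R/Y codes into a regular expression and checking the shape at the same positions. This is a minimal sketch with made-up function names; it covers only the three codes mentioned above plus the four plain bases, and it requires the shape to match exactly, position by position.

```python
import re

# Plain bases plus the partial-constraint codes mentioned above.
CODES = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "[AG]", "Y": "[CU]", "N": "[ACGU]",
}


def find_constrained(sequence, structure, seq_pattern, shape_pattern):
    """Return start indices where both the sequence and shape patterns match."""
    if len(seq_pattern) != len(shape_pattern):
        raise ValueError("sequence and shape patterns must be the same length")
    body = "".join(CODES[ch] for ch in seq_pattern.upper())
    regex = re.compile("(?=" + body + ")")  # lookahead keeps overlapping hits
    width = len(shape_pattern)
    return [m.start() for m in regex.finditer(sequence)
            if structure[m.start():m.start() + width] == shape_pattern]


# Made-up design: a GNRA-style tetraloop closed by one base pair.
starts = find_constrained(
    sequence="GGGGAAACCC",
    structure="(((....)))",
    seq_pattern="NGNRAN",
    shape_pattern="(....)",
)
```

The sequence pattern alone would also match at positions 0 and 1 in this example; it is the shape check that narrows the result to the loop itself.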