Lab Data Mining Script Prototype

I’m working on a scripting tool to find examples of synthesized substructures in the results of previous labs. For example, you can look for triloops with attached 3-stacks using a search for substruct="(((...)))", which finds nearly 3000 examples.

The “substruct” search field can be used to specify a substructure pattern to look for in the lab candidates. Dots and round brackets can be used with their usual dot-bracket meanings. In addition, a vertical bar ‘|’ can be used to indicate any balanced (pruned) substructure. For example, “(.(|).)” looks for 1-1 loops with just the closing pairs and unpaired bases, while “(((.(((|))))))” looks for 1-0 bulges with attached 3-stacks. Likewise, “((((|))((|))((|))))” looks for a 0-0-0-0 (4-way) multiloop with attached stacks of length at least 2. Adding a leading ‘|’ anchors the search to the “hook” area, so that “|(|)...(((((((...)))))))” will match cases where a lab structure is separated from a barcode by a few unpaired bases.
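To make the ‘|’ semantics concrete, here is a minimal sketch (my own code, not the actual script) of such a matcher in JavaScript, where ‘|’ greedily consumes a balanced run of the structure, stopping at the first unmatched ‘)’. The hook-anchoring form of a leading ‘|’ is not handled here.

```javascript
// Find all offsets in a dot-bracket string where a substruct pattern matches.
// '.', '(' and ')' must match literally; '|' swallows any balanced substructure.
function findSubstruct(secstruct, pattern) {
  const hits = [];
  for (let start = 0; start < secstruct.length; start++) {
    let i = start, ok = true;
    for (const p of pattern) {
      if (p === '|') {
        // consume a maximal balanced run: stop before the first unmatched ')'
        let depth = 0;
        while (i < secstruct.length) {
          const c = secstruct[i];
          if (c === ')' && depth === 0) break;
          if (c === '(') depth++;
          else if (c === ')') depth--;
          i++;
        }
      } else if (secstruct[i] === p) {
        i++;
      } else {
        ok = false;
        break;
      }
    }
    if (ok) hits.push(start);
  }
  return hits;
}
```

For example, findSubstruct("((.((....)).))", "(.(|).)") reports the 1-1 loop starting at offset 1.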

You can restrict the search to lab results scoring within a given range by specifying optional minscore and maxscore values (defaults 0 and 100, respectively).

If you are simply interested in finding which (if any) labs have a given substructure pattern, you can specify soln==“none” and it will list the locations of the structure matches found within the “secstruct” of any synthesized labs (with links to the lab results pages).

You can also restrict the search to lab results where a particular subsequence is found, such as subseq==‘GGG’ for sequences with a run of three or more Gs somewhere (anywhere) in the design (not just in the targeted substructure).

For output, the default is a JavaScript-compatible format, but by specifying mode==‘csv’ you can get an Excel-compatible CSV listing of results.

The script’s speed is still not optimal: it can take 15-20 minutes to search all synthesized lab results for a given substructure pattern, so make sure that you set the script timeout value to at least 1800 seconds (30 minutes) to be safe, then go grab a coffee or lunch.

Possible Future Work

Right now, the script merely finds and displays all matching structures, unsorted and unranked. In some cases, there may be hundreds or thousands of matches. One possibility would be to apply a metric like the one derived for Computationally Selected Elements to the SHAPE values for the substructures, then group by similar sequences and sort by average rank.

The substruct search pattern syntax also does not yet have the flexibility to specify stacks of a precise length, so you cannot, for example, search for successful stacks of length exactly 4.

Assistance with either of these goals (or other extensions) would be welcome.

Jandersonlee, I’m really glad to see this. You’ve inspired me to think more about how to find comparable structures (and/or sequences) across multiple labs, which my data mining tool currently doesn’t address at all. I’ll probably try a somewhat different approach, but the design space for a data mining application is huge. The more people we can get interested in trying out their own ideas, the faster we’ll make progress.

Yesterday I downloaded the synthesized lab data from the GET api: it’s about 5-6MB. Since it changes only once every few weeks, it might be reasonable to download and pre-process it periodically for better substructure searching. I’m sure we can do better than 15-20 minutes!

Can you describe the logical structure of the “synthesized data” you are putting together? For searching target structures, I’ve been thinking of caching lab info in a Google spreadsheet, and using Google Apps Script to implement a container-bound server (coded in JavaScript) that would implement an HTTP GET interface where you could specify the substructure you were looking for and the results would be the labs containing that result. (Of course, a new fly in the ointment is that labs are no longer constrained to have just one target structure.)
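If it helps, the serving side of that could be quite thin. Here is a hypothetical sketch (the function and variable names are mine): a pure lookup function, which a container-bound doGet handler would wrap with ContentService to return JSON. Note this does a literal substring match only, not a full substructure grammar.

```javascript
// Hypothetical lookup core for an HTTP GET substructure search.
// rows: cached lab info, e.g. [{lab: "Some Lab", secstruct: "(((...)))"}]
function labsWithSubstruct(rows, substruct) {
  return rows
    .filter(r => r.secstruct.indexOf(substruct) !== -1)
    .map(r => r.lab);
}

// In a container-bound Apps Script, something like (loadRows is hypothetical):
// function doGet(e) {
//   return ContentService
//     .createTextOutput(JSON.stringify(
//       labsWithSubstruct(loadRows(), e.parameter.substruct)))
//     .setMimeType(ContentService.MimeType.JSON);
// }
```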

I have in mind using a combination of query language and search algorithm designed as an experiment in how much we can milk out of JavaScript’s built-in regular expression capability, so that a query would take no more than a few seconds to complete. (But this still would only address searching for labs that contain a certain target motif, not specific sequences, as you are currently tackling.)

This is a good lead-in to a closely related topic. The biggest bottleneck we have right now is that it takes so many individual server queries to gather up all the data we need for a tool that mines data from multiple APIs, each of which was designed specifically to support the Eterna GUI. I’ve communicated briefly with Justin about providing an SQL-type interface to all the Eterna data tables, so that any tool could efficiently gather up all the data it needs at one time in just one query. He thought it was a good idea in principle. But he has tons of things to do, so I think it will take a groundswell of player/developer demand if it is to happen in the near future.

One of my concerns is that the current group of player/developers may not include enough people who would appreciate the leverage to be gained by using SQL directly. I’ve told Justin that I can assist anyone (through the Player Created Tools group) with the construction of the query they need. But it would help if there were one or two other players who could jump in and say “Yes, that’s just what we need!” Do you happen to fall into that category?

@Omei: “Can you describe the logical structure of the “synthesized data” you are putting together?”

For now I’m just caching the results of a labs api query. Haven’t got to any post processing yet. I’m still thinking about what sorts of queries or analysis one might possibly like to do. Some brainstorming in that regard could be fruitful. Perhaps a separate forum thread? What sorts of analysis do people now do by hand? What might they want to do if it was easier?

@Omei: “For searching target structures, I’ve been thinking of caching lab info in a Google spreadsheet, and using Google Apps Script to implement a container bound server (coded in Javascript) that would implement an HTTP GET interface where you could specify the substructure you were looking for and the results would be the labs containing that result.”

Sounds AWESOME. If you could describe how to do that with a simple example as a Google Doc it would be great. Node.js is not too hard to work with, but it means hosting your own server somewhere. If we can do this sort of thing with a Google Apps Script that would be *wonderful*.

@Omei: “I’ve communicated briefly with Justin about providing an SQL type interface to all the Eterna data tables, so that any tool could efficiently gather up all the data it needs at one time in just one query. He thought it was a good idea in principle. But he has tons of things to do, so I think it will take a groundswell of player/developer demand if it is to happen in the near future.”

I think that I’m with Justin on this at this point. A properly functioning SQL interface to all of the data tables that does not open up the barn door (so to speak) is non-trivial. I think if we can show that some particular kind of table search or join might be useful it would be simpler and less work (for Justin) to get a new API for that rather than trying to do everything. That viewpoint could change in the future.

Comment on Jandersonlee’s remark: “Some brainstorming in that regard could be fruitful. Perhaps a separate forum thread? What sorts of analysis do people now do by hand? What might they want to do if it was easier?”

I have tried to demonstrate part of what I do when I analyse lab data in the following two docs. Perhaps they can help give some inspiration for what could be helpful.

Ways to deal with lab designing - in huge labs

Ways to deal with lab voting - in huge labs

On the subject of a SQL interface, I may have given some wrong impressions. Justin was nothing but positive. But I’m sure he doesn’t have full control over his priorities, and the dev team as a whole has a lot of irons in the fire.

And I wouldn’t want Eterna to make it possible to execute anything other than SQL SELECT statements. I think the only barn door of concern would be poorly constructed queries that eat up too much CPU or other resources. A daemon monitoring those resources and killing queries that exceed some maximum should handle that concern.

And I was thinking of it as a time-saver for Justin in the intermediate-to-long run. Yes, he can create one (or several) special queries in the time it would take to expose one with the generality of a SQL query. But there are all kinds of useful utilities we could cook up if we didn’t have to ask Justin for a new query specific to each one. I’ll make up an example off the top of my head. Dennis has found that the total number of votes for a lab sequence is a very weak predictor of the sequence’s score. (So weak that he doesn’t even include it as a factor in his prediction.) But that might be due mostly to the snowball effect, where a large majority of votes seem to be cast mostly because everyone else has voted for them. I would be interested in knowing if there are certain players who are actually skilled at score prediction, because asking them why they picked this design over that could lead to better bot algorithms. As it stands now, that would take a ton of individual queries from the votes, solutions, lab and user APIs. Even if I wanted it badly enough to code it, I wouldn’t release it as a general tool because it would be too slow (in my mind, at least) to be useful. But a single SQL query could get exactly the data I needed, and in a few hours I could probably throw together a rough prototype and distribute it to other players to see if they thought it merited making an application out of.
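For illustration only, the kind of single query I have in mind might look roughly like this; every table and column name here is invented, since I don’t know the actual schema:

```sql
-- Hypothetical: per-player fraction of votes cast on designs that went on to win.
SELECT u.name,
       SUM(CASE WHEN s.score >= 94 THEN 1 ELSE 0 END) / COUNT(*) AS win_rate,
       COUNT(*) AS total_votes
FROM votes v
JOIN solutions s ON s.id = v.solution_id
JOIN users u ON u.id = v.user_id
GROUP BY u.name
ORDER BY win_rate DESC;
```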

I’m just now reading through Eli and Omei’s Google Doc on Omei’s Lab Data Mining Tool. Great stuff! Definitely some ideas I want to steal/borrow/use.

  1. Normalizing the shape data. I figured this was needed, but didn’t know how to do it. After reading the document I think I understand.

  2. Group By. Great idea. Especially when you get nearly 3000 tetraloops! And averaging the normalized shape scores for feedback seems a great idea too. Could also be extended to “binning” by synthesis scores - e.g. 100-96, 95-91, 90-86, …

  3. Drill down. Being able to “undo” the group-by for one particular grouping in order to see the explicit cases is a wonderful feature.

  4. Color coded displays. Time to brush up my HTML skills - or get someone else to jump in! Might also be interesting to have some feedback on the standard deviation of a group as well as the mean (average).

For prototyping at least, I’m curious to see if a 6MB JavaScript file would load properly into some/any browsers? Since the lab data only changes every few weeks, you could embed the latest collection of lab results into a JavaScript file along with the code and only update it as necessary. That way, once loaded, all queries would be inside the browser! Also, no cross-domain problems.

Searching for substructures across labs means that positional offsets can vary. However we could use the offsets within the search string as a proxy. To make things easier, having searched for a pattern+constraints, clicking on an “N” in the meta result string could toggle the group-by option for that position in the following section.

Still don’t have concrete ideas on searching for “stacks/stems”. Suggestions welcome!

It turns out you *can* load a 6MB JavaScript file, at least in Chrome under MacOS.

Select … from … where … is still a major undertaking for nontrivial (non-equality) where clauses. Assuming joins (multiple from relations), which are needed to avoid multiple queries, or do-it-yourself joins, it gets hairy. I’m not saying that it cannot be done, nor that it would not be useful. Just that doing it right could take some effort.

Is the hairy effort you are most concerned about Justin’s? Or player/developers’ (in learning how to compose SQL queries)? I had imagined that Justin could just append the user’s query string to "SELECT ", run it through the standard PHP sanitation function, pass it to MySQL, and return the results as the http response, perhaps after some light filtering and/or converting it to JSON in a very simpleminded way.

But then, I have never been a SQL database administrator, so I really don’t know how robust MySQL is to abuse (either intentional or unintentional).

Re Sounds AWESOME. If you could describe how to do that with a simple example as a Google Doc it would be great. Node.js is not too hard to work with, but it means hosting your own server somewhere. If we can do this sort of thing with a Google Apps Script that would be *wonderful*.

It shouldn’t take me long at all to put up a simple example and give you access to the spreadsheet and code. It looks like we’re not going to get any new lab data this week, so I’ll plan on doing that this weekend.

Re: Still don’t have concrete ideas on searching for “stacks/stems”. Suggestions welcome!

Wouldn’t your query string “.(((((((.|.))))))).” match only stems of length 7? (Well, it wouldn’t match 7-stems connected on one or both ends to a bulge, which might or might not be what you wanted.) Or is that not what you had in mind?

I’ve been a webDB admin. It’s REALLY hard to do robust sanitation of end-user queries unless you carefully control what they can do. Simple PHP sanitation is probably nowhere near enough. For example, there is a SELECT INTO syntax. And SQL joins are BAROQUE.

And thank you for your kind words about my tool. Feel free to lift whatever ideas (or code) you want.

As for other questions/comments:

[Group by] Could also be extended to “binning” by synthesis scores - e.g. 100-96, 95-91, 90-86, … . It certainly could. I’ve thought about, and in one case implemented as an “experimental” feature, other alternatives. One of those would be to generalize the grouping to handle the standard conventions for sets of bases, such as R for purine (either A or G). (I don’t remember whether the documentation mentions it, but you can use these for selecting sequences in the current version.)

Color coded displays. Time to brush up my HTML skills - or get someone else to jump in! There’s really not much to coloring. You can see how I’ve done it by looking at my source code.

Might also be interesting to have some feedback on the standard deviation of a group as well as the mean (average). My closely related thought on that was to report the standard error of the mean estimate (i.e. the standard deviation divided by the square root of the number of instances). That would give an indication of how statistically significant differences in the mean were. And I might choose to display that as a coloring of the mean, rather than dedicate a new column of numbers.

Searching for substructures across labs means that positional offsets can vary. Grouping and display of heterogeneous target structures was probably the single most significant issue that caused me to limit my initial objectives to mining data for one lab at a time. But now, with the arrival of multiple target structures in the same lab, I need to get past that block. I’ll look forward to seeing how you handle it.

Bulges, multiloops, exactly. I’ve thought of adding another meta-character like comma ‘,’ or colon ‘:’ which is a zero-length match that indicates that the bases on either side must *not* be part of the same stack, so ‘:((((:|:)))):’ would always and only match a 4-stack for instance. (Actually, ‘:((((:|))))’ should suffice for this case.)

It also turns out that the 6MB was *not* the full synthesized solution set, just the top 10 candidates for each lab. Probably some more preprocessing will be required.

Bulges, multiloops, exactly.
Another possibility would be to use [ and ] in place of ( and ) to specify the ends of stems. So the query for any stem of length 4 would be ‘[(([|]))]’. This comes from my own (developing) plan to try for a query language where the target string is first transformed into a more specific representation that encodes the secondary structure, not just the paired/unpaired distinction.

For example, if the outer pair of dots in the target structure ‘.((.((....)))).’ were part of a junction, the target structure would be changed to be ‘j[[b[[hhhh]]]]j’, where j, b and h denote junction, bulge and hairpin, and the [ and ] denote the ( and ) that occur at the ends of the stem. When specifying the query, a user could either use the generic dot and brackets or the more specific symbols. That is, behind the scenes the query processor would make the following substitutions for each character in a user’s query string:

. replaced by [bjh.] // there’s also an i and o for inner and outer
( replaced by [(\[]
) replaced by [)\]] // ‘]’ must be escaped inside a character class

and the derived string then used as a regular expression.
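A sketch of that expansion in JavaScript (my own code; it keeps the character classes above, leaving out the i and o symbols, with the brackets escaped so the classes are valid regex syntax):

```javascript
// Expand a generic dot-bracket query into a regex over the specific alphabet.
const EXPANSION = {
  '.': '[bjh.]',   // unpaired: bulge, junction, hairpin, or generic dot
  '(': '[(\\[]',   // '(' mid-stem or '[' at a stem end
  ')': '[)\\]]',   // likewise for the closing side
};

function expandQuery(query) {
  const body = query.split('').map(c =>
    // specific symbols pass through; escape any regex metacharacters
    EXPANSION[c] || c.replace(/[[\]()\\]/g, '\\$&')
  ).join('');
  return new RegExp(body);
}
```

So a generic ‘(’ in a query matches either ‘(’ or ‘[’ in the transformed string, while an explicit ‘[’ matches only a stem end.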

I guess the main difference is that instead of introducing metacharacters, I’m looking at allowing (but not requiring) the user to be more specific than the three characters used in dot bracket notation.

I still need a way to express the “well-formed substructure” you’re representing by |. This can’t be expressed as a regular expression, of course, because the definition is inherently recursive. But maybe I can get “close enough”. :slight_smile:

OK. More brainstorming on an initial cross-lab, shape-matching UI.

The first phase, which I think I’m starting to get a handle on now, is finding subshapes. The (.)|: grammar may suffice for a starting point, at least for experienced users. Omei’s idea of specifying loop type is another option: b=bulge, m=multiloop, i=internal, h=hairpin, d=dangling to match the RNA parser scheme is one possibility, perhaps leaving : for an unspecified stack break. The expansion would allow a search for ‘m((((h|h))))m’ as a 4-stack joining a multiloop to a hairpin. I’ll probably begin with just (.)| unless adding : and the rest seems simple.

The second step could be filtering, such as by score range or base grouping. For example, using ‘((.((|)).))’ as a substruct pattern would find/select 1-1 loops with closing pairs and backing pairs; adding ‘NSGSN|NSGSN’ as a base filter would restrict it to loops with double-G boosts and Strong (CG) closing pairs. Or with ‘((....))’ a ‘NSGNRASN’ filter would select only synthesized GNRA tetraloops with CG or GC closing pairs. Likewise minscore=94, maxscore=100 would select matches only from winning lab designs.
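The base-filter half is straightforward to prototype. A minimal sketch (my own code and naming), expanding IUPAC codes such as N, S and R into character classes and anchoring the result so it applies to an extracted match region:

```javascript
// Expand an IUPAC base-filter string into an anchored regular expression.
// N = any base, S = strong (C/G), R = purine (A/G), Y = pyrimidine (C/U), etc.
const IUPAC = {
  N: '[ACGU]', R: '[AG]', Y: '[CU]', S: '[CG]',
  W: '[AU]', K: '[GU]', M: '[AC]',
};

function baseFilter(pattern) {
  const body = pattern.split('').map(c => IUPAC[c] || c).join('');
  return new RegExp('^' + body + '$');
}
```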

Further grouping and analysis is a bit more fuzzy in my mind as yet. As with Omei’s tool, the user could specify group-by terms within the context of the pattern match, except that really the default is to group and the exception is to create distinct cases. Continuing the ‘((.((|)).))’ example, a ‘NXNXN|NXNXN’ grouping string would create XXXX clusters based on the closing pair bases, lumping together the N cases, which allows for “any” backing pairs. For example, the groups might be CCGG, CGCG, GCGC, GGCC in this case (since we filtered on designs with only strong closing pairs). In this case the N means group together the results for these locations, and X means that differences in the bases in this location result in a distinct group.
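That N/X grouping could be sketched like this (my own code): the bases at ‘X’ positions of each match form the group key, and ‘N’ positions are lumped together.

```javascript
// Group matched base strings by the bases at 'X' positions of the grouping string.
function groupBy(matches, grouping) {
  const groups = {};
  for (const bases of matches) {
    let key = '';
    for (let i = 0; i < grouping.length; i++) {
      if (grouping[i] === 'X') key += bases[i];
    }
    (groups[key] = groups[key] || []).push(bases);
  }
  return groups;
}
```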

Omei uses the average lab score to represent the group, but I wonder if it might also be possible to specify a “metric” for ranking the selected bases, based on a 0…1 normalized SHAPE estimation. One thought was to allow a 0…9 weighting value to be associated with each base. If the base was to be unbonded, the normalized SHAPE value is multiplied by the weight. If it was to be bonded, (1-SHAPE) is multiplied by the weight. The weighted values are summed, then divided by the total of the weights to give a 0.0 to 1.0 overall ranking for each design. Only the pattern-matched regions of the design are used in ranking, ignoring the cut ‘|’ regions. The default weights could be 1 everywhere. Using a 0 weight would exclude the shape results for a given base from affecting the ranking. That is, for this example, the default weight string would be ‘11111|11111’.
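Here is that metric as I read it, sketched in JavaScript (my own naming; SHAPE values assumed already normalized to 0..1, and the ‘|’ regions already excluded from the inputs):

```javascript
// Weighted 0..1 ranking of a design's match region against its target structure.
// target: dot-bracket over the matched bases; shape: normalized SHAPE per base;
// weights: optional 0-9 per-base weight string (default 1; 0 excludes a base).
function shapeRank(target, shape, weights) {
  let sum = 0, total = 0;
  for (let i = 0; i < target.length; i++) {
    const w = weights ? Number(weights[i]) : 1;
    if (w === 0) continue;
    // unpaired bases should be reactive (high SHAPE); paired ones protected
    const s = target[i] === '.' ? shape[i] : 1 - shape[i];
    sum += w * s;
    total += w;
  }
  return total ? sum / total : 0;
}
```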

Once we have metrics in place, my next thought would be: why should the end-user always specify the grouping strategy? Perhaps the system could try multiple grouping strategies and automatically report on which ones seem to show statistically significant differences between the groups…

Thoughts? Other ideas?

Jandersonlee, this is a great discussion!

Re useful stats for the comparative groups, I’m currently working on enhancing those. Here’s a screenshot comparing the current released version and my working version. I think the changes mesh pretty closely with your thoughts, though you might well choose different specifics.

The two new columns are “Base Score” and “Base Value”. Base Value is the (unweighted) SHAPE value, averaged over both designs and bases. I’m finding that it is useful because it characterizes how “tightly” the group is bound as a whole. Base Score is the number of points that would be awarded by the default scoring algorithm, summed over bases and averaged over designs. This is useful because it characterizes how the group contributes to the total score.

Slightly off-topic, but related: I’ve come to the conclusion that legacy scaling (probably in addition to variations in the scoring algorithm over time) has thrown so much noise into the data that, for data comparison purposes, it’s better to recalculate the design score rather than use the one recorded in the data query…