Finding Clusters of identical lab designs

The post “Stability of barcode, affects SHAPE colors elsewhere in the design” made me realize that I wish there were a more effective way to dig up identical or near-identical designs. I have discussed it with Omei, and here I pass on our ideas:

Hi Omei!
I was wondering if it would be possible to dig up clusters of similar designs. It was the hairpin skewing the SHAPE data that got me wondering. I know there are more designs with an identical main design than just mine and the one you had. I know Mat did some experiments, and I’m sure others have as well. But as things are, they are very hard to find.

Your current script for finding similarity starts from an individual design. But what if one doesn’t know which designs are similar and still wants to find twin designs? Could a script be made to dig up these twins (or clusters of designs) where, e.g., the main design is the same but the hairpin barcode is not (or any other specific area of interest)?

Eli

Date: Fri, 23 Aug 2013 10:23:01 -0700
Subject: Re: Clusters of identical designs

Eli,

There are certainly algorithms for doing clustering. I spent a few minutes searching the Web to see if I could find anything ready-made that we could easily use, but didn’t find anything very promising.

What would be pretty easy, though, would be a script that takes a lab ID as input and outputs any designs that are identical (not counting the barcode). That could be a starting point. From there we could think about generalizing it, e.g. adding features for comparing only certain positions, for finding close (non-exact) matches, for automatically processing all labs, or …
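A minimal sketch of that idea, assuming the designs of one lab are already available as a dict mapping design ID to sequence. The names `mismatches` and `find_matches`, the toy sequences, and the barcode positions are all made up for illustration; none of them come from the actual script or lab data. Setting `max_mismatches=0` gives exact twins, and raising it gives close matches.

```python
def mismatches(a, b, ignore=frozenset()):
    """Count 1-based positions where a and b differ, skipping those in `ignore`."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(
        1
        for pos, (x, y) in enumerate(zip(a, b), start=1)
        if pos not in ignore and x != y
    )


def find_matches(query_id, designs, ignore=frozenset(), max_mismatches=0):
    """Return (design_id, mismatch_count) pairs for designs close to the query."""
    query = designs[query_id]
    hits = []
    for design_id, seq in designs.items():
        # Skip the query itself and anything that cannot be aligned position by position.
        if design_id == query_id or len(seq) != len(query):
            continue
        n = mismatches(query, seq, ignore)
        if n <= max_mismatches:
            hits.append((design_id, n))
    return sorted(hits, key=lambda hit: hit[1])


# Toy data (hypothetical): B differs from A only inside the "barcode", C differs at position 4.
designs = {
    "A": "GGAAACGUUCGAAAA",
    "B": "GGAAACGUUCGCCAA",
    "C": "GGAUACGUUCGAAAA",
}
barcode = frozenset(range(12, 14))  # assumed barcode positions 12-13, illustration only

exact_twins = find_matches("A", designs, ignore=barcode)
close_twins = find_matches("A", designs, ignore=barcode, max_mismatches=1)
print(exact_twins)
print(close_twins)
```

Passing the ignored positions as a set keeps the same function usable for "compare only certain positions": ignore everything outside the area of interest.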

Do you think the simple case would be useable enough to get things started?

Roger

On 8/23/2013 5:31 AM, Eli Fisker wrote:

You already did that script that could find close designs from a lab ID. However, I do like the add-on ideas, like comparing only certain positions. And I love the idea of automatically processing all labs when searching for a specific sequence. We have the sequence search in the lab, but it is limited to that specific lab.

I think everything you mention could be really nice. And I do think the simple case could be a stepping stone to the clustering.
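For the cross-lab sequence search, a rough sketch could look like the following. It assumes the data for every lab has already been collected into a nested dict (lab name to design ID to sequence); how that data would be fetched is left out, and the function name, lab names, and sequences are hypothetical.

```python
def search_all_labs(labs, motif):
    """Return (lab_name, design_id, position) for every design containing `motif`.

    `position` is the 1-based start of the first occurrence in that design.
    """
    hits = []
    for lab_name, designs in labs.items():
        for design_id, seq in designs.items():
            idx = seq.find(motif)
            if idx != -1:
                hits.append((lab_name, design_id, idx + 1))
    return hits


# Toy data (hypothetical lab names and sequences).
labs = {
    "Lab X": {"x1": "GGAAACGUUCG", "x2": "GGAAAUUUUCG"},
    "Lab Y": {"y1": "CCGUUAA"},
}
motif_hits = search_all_labs(labs, "CGUU")
print(motif_hits)
```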

I just realized we already have an interactive script to find all the **exact** matches for a given lab – the Lab Data Mining Tool. It’s not terribly elegant, but if you specify “Group by positions” as 6-69, the summary section will consist of one group for every unique solution. If there are any duplicates, they will be at the top of the list. Click on one of those lines, and the details section will show which designs they are.

I tried it on Cloud Lab 1, and it turns out there were 6 sets of duplicates there.

This works equally well for finding duplicates on any part of the design.
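For anyone who wants the same grouping outside the tool, here is a small sketch of the "group by positions" idea. It is not the Lab Data Mining Tool's code, just an illustration with made-up function names and toy sequences; positions are taken to be 1-based and inclusive, which is an assumption about how the tool counts.

```python
from collections import defaultdict


def group_by_positions(designs, start, end):
    """Group design IDs by the bases at 1-based positions start..end (inclusive)."""
    groups = defaultdict(list)
    for design_id, seq in designs.items():
        groups[seq[start - 1:end]].append(design_id)
    # Largest groups first, so any duplicates end up at the top of the list.
    return sorted(groups.values(), key=len, reverse=True)


def duplicate_sets(designs, start, end):
    """Keep only the groups that contain more than one design."""
    return [g for g in group_by_positions(designs, start, end) if len(g) > 1]


# Toy data (hypothetical): d1, d2 and d4 agree on positions 3-8 and differ only afterwards.
designs = {
    "d1": "GGACGUACAA",
    "d2": "GGACGUACCC",
    "d3": "GGAUGUACAA",
    "d4": "GGACGUACGG",
}
groups = group_by_positions(designs, 3, 8)
dupes = duplicate_sets(designs, 3, 8)
print(groups)
print(dupes)
```

For a real lab the call would be something like `duplicate_sets(designs, 6, 69)`, matching the range used above.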

Whoops. Sorry about my sloppy HTML coding.

Oh, sweet! Thx :slight_smile:

Np, your tool just got much more amazing.

Here is a demonstration of how to do this, with picture examples:

Finding clusters of identical lab designs