The TB Challenge: How should we select A, B and C segments from the larger possibilities?

The Khatri lab at Stanford has identified three RNA molecules that, when present in human blood in certain ratios, are highly indicative of active tuberculosis.  Designing a riboswitch that selectively triggers phosphorescence in the presence of the proper ratios is the focus of the Eterna TB challenge.

As players, we have the opportunity not only to design the riboswitch, but also to help choose the specific input and output sequences (A, B and C) that the switch will be sensing. This additional choice needs to be made for two reasons:

  1. The RNA sequences that were identified are actually 50 bases long, whereas all our past experience has been with shorter segments, and
  2. We can choose to target either the RNA sequences or their complementary DNA sequences.

There has been some internal discussion about this, and there are many possible criteria for making the choice, but no obviously correct answers. So Rhiju has decided to open up the discussion to include all players.

To get things started, Wuami will soon be following up with a post about the pros and cons of targeting the DNA rather than the RNA.  I’ll also follow up with a summary of my initial thoughts on choosing subsequences to increase the possibilities for designing switches with high fold changes.  

We’ll see where it goes from there!

Hi all -

The analyses that the Khatri lab performed are from microarray experiments with data deposited in the Gene Expression Omnibus.  Typically, the RNA from blood samples is reverse transcribed into cDNA, which is then spotted on the microarray chip.  This chip contains DNA probes that bind to the cDNA.  The DNA probe sequences for the 3 targets are:

[A] GCAGGAACAACAGATGCAGGAACAGGCTGCACAGCTCAGCACAACATTCC
[B] CCATGGTGATGGATGGTTTGGAAAGGGAATGTTGGTGCCTTTTGTGCCAC
[C] ATTACTGTACATAGAGAGACAGGTGGGCATTTTTGGGCTACCTGGTTCGT

These are the same as the RNA sequences in the blood (with T->U of course), but the cDNA sequences (those actually detected in the experiments) are the reverse complements:

[A] GGAATGTTGTGCTGAGCTGTGCAGCCTGTTCCTGCATCTGTTGTTCCTGC
[B] GTGGCACAAAAGGCACCAACATTCCCTTTCCAAACCATCCATCACCATGG
[C] ACGAACCAGGTAGCCCAAAAATGCCCACCTGTCTCTCTATGTACAGTAAT
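
For anyone who wants to check the relationship between the two lists, here is a minimal Python sketch. The sequences are copied from this post; the dictionary and function names are just illustrative.

```python
# DNA probe sequences and cDNA sequences, copied from the post above.
PROBES = {
    "A": "GCAGGAACAACAGATGCAGGAACAGGCTGCACAGCTCAGCACAACATTCC",
    "B": "CCATGGTGATGGATGGTTTGGAAAGGGAATGTTGGTGCCTTTTGTGCCAC",
    "C": "ATTACTGTACATAGAGAGACAGGTGGGCATTTTTGGGCTACCTGGTTCGT",
}
CDNA = {
    "A": "GGAATGTTGTGCTGAGCTGTGCAGCCTGTTCCTGCATCTGTTGTTCCTGC",
    "B": "GTGGCACAAAAGGCACCAACATTCCCTTTCCAAACCATCCATCACCATGG",
    "C": "ACGAACCAGGTAGCCCAAAAATGCCCACCTGTCTCTCTATGTACAGTAAT",
}

# Translation table that swaps each DNA base for its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def to_rna(dna):
    """Same sequence in the RNA alphabet (T -> U)."""
    return dna.replace("T", "U")


for name, probe in PROBES.items():
    # Each cDNA sequence should be the reverse complement of its probe.
    assert reverse_complement(probe) == CDNA[name]
    print(name, to_rna(probe))
```

The printed lines are the probe sequences rewritten in the RNA alphabet, i.e. the form in which they would appear in the blood.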

You can read more about the details of how the data were collected in the following papers.  Unfortunately, one or two of them are not open access but the vast majority should be.

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3492754/
http://www.nature.com/gene/journal/v12/n1/full/gene201051a.html
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026938
http://jid.oxfordjournals.org/content/207/1/18.long
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3356621/
http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1001538
http://www.nejm.org/doi/full/10.1056/NEJMoa1303657
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0070630
http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-14-74
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0045839
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4515549/
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4734838/

My suggestion is that extensive chemical mapping analyses be performed on the sequences, which would allow us to identify the regions of the sequences that are exposed to chemical probing and therefore less likely to be participating in significant structures in solution. These may be prime candidates as sequences for binding to our RNA sequences.

As a second option, you could identify the regions of the sequences that have the greatest concentration of guanine and cytosine residues, which allows players to maximize the number of G-C base pairs within the RNA dimer.
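
The G/C idea is easy to prototype. Below is a rough Python sketch that scores every window of a probe by its G+C fraction. The 20-base window width is an assumption on my part, and `gc_windows` is just an illustrative name. Only probe A is shown.

```python
def gc_windows(seq, width=20):
    """Return (start, window, gc_fraction) for every window of the given width."""
    results = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        gc_fraction = (window.count("G") + window.count("C")) / width
        results.append((start, window, gc_fraction))
    return results


# Probe A, copied from the list of DNA probe sequences above.
probe_a = "GCAGGAACAACAGATGCAGGAACAGGCTGCACAGCTCAGCACAACATTCC"

# Rank the windows from most to least G/C rich and show the top few.
ranked = sorted(gc_windows(probe_a), key=lambda item: item[2], reverse=True)
for start, window, gc_fraction in ranked[:3]:
    print(f"start={start:2d}  {window}  GC={gc_fraction:.2f}")
```

G/C content is only one criterion, of course; a ranking like this says nothing about structure or specificity.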

A careful, slow approach to our goal would start with a first question: can we solve the [A]*[B]/[C]^2 problem for some (any?) set of {A,B,C} inputs? In that case, we possibly don’t need to worry too much about which sequences we test first. Answering this question alone is not trivial: it requires modelling the problem as an Eterna lab and being able to solve it.

I actually created prototypes on the dev server, using the “old” oligos we’re familiar with:
http://nando.eternadev.org/web/puzzle/3391075/
http://nando.eternadev.org/web/puzzle/3391094/
I won’t claim that this is the best possible model, but reducing [A]*[B]/[C]^2 into a 4-state puzzle sounds like a pretty good result already. That said, you’ll all rapidly notice the small issue I need to work on: folding engine performance. I’ve been very busy these past months, so I haven’t finalized this part, but I’m working on it, and a workaround should be available pretty soon (don’t expect miracles though).

Now, if we’re in the business of designing the TB diagnostic, everyone (I mean most players here, I’m pretty sure the scientists are all aware of what I’m gonna say) should at least try to understand what we’re dealing with. The RNAs present in plasma/serum (blood) are actually stable, which is quite surprising, since there are ribonucleases (RNA-shredders) present in blood too. So, these RNA strands are most certainly bound and packed with proteins and/or some kind of lipids. A diagnostic device will certainly need to purify the RNA before anything can be done with them.

Then, players also need to understand that the probes listed by Michelle are just that, probes. Let’s take the A example: this 50-nt probe is a marker for the presence of a transcript of the human GBP5 gene. The actual mRNA (final, after transcription and splicing) is about 4000 nts long, not just a 50-nt oligo, and that’s what the diagnostic device will be dealing with. Coming to Brourd’s first idea, we would need the Das Lab to SHAPE-probe the full-length purified mRNA… @DasLab: possible? does it make sense?

Since there will be all sorts of RNAs in the samples, one of the most important things to pay attention to is making our test as specific as possible. We don’t want the test to get polluted by extraneous signal coming from other genes. Here I’d rather trust someone who actually knows how to BLAST (I don’t), for choosing a proper set of targets.

Another factor that shouldn’t be forgotten: our tools simulate and predict at 37°C. The diagnostic device and the tested blood will most likely be at room temperature…

Those were my 2+epsilon cents, for now :slight_smile:

Thanks to the article links and the information already provided here, I feel like I am getting a clearer picture of the overall problem and potential solution.  What I don’t get is what exactly would result in the detectable fluorescent signal? I understand the basics of how RNA detection can be accomplished using the MS2 system.  But, I think it would help to know, when attempting to design, whether our design is interfering with or aiding in signal detection.  Or maybe not, I’m new here :slight_smile:

As nando said, a BLAST homology search needs to be done on the probe sequences first to see if there are any false positives that could occur. It has been a while since I’ve done a BLAST search using the Wisconsin Package, but I know NCBI’s site has BLAST available for use on the web.

OK.  I think that I asked the wrong question.  What is going to be labeled and with what? 

I’m not a scientist, but Nando’s comment about getting SHAPE data for the whole 4000-nt mRNA makes sense alongside Brourd’s comments, so that we know exactly what we are dealing with. Whenever I need to tackle something I first quantify what that is, going to whatever lengths are necessary.

@JRaiKetchum: agreed, and nice to know we have knowledgeable people around for that kind of task :slight_smile:

This said, @wuami: isn’t that exactly what you did when you selected the following set:

A (GBP5): ACAGCUCAGCACAACAUUCC
B (DUSP3): GUUGGUGCCUUUUGUGCCAC
C (KLF2): UUUUGGGCUACCUGGUUCGU

?

For the curious ones, you can see the A marker on http://www.ncbi.nlm.nih.gov/nuccore/NM_001134486
Scroll down and look at the nucleobases numbered 2058 to 2107. Notice that this segment spans the last two exons.
Check for instance http://www.ensembl.org/Homo_sapiens/Transcript/Exons?db=core;g=ENSG00000154451;r=1:89260582-89270863… where you’ll see the marker start in exon 10, jump over the intron 10-11, and end in exon 11.

These are each the last 20 bases of their respective 50-base sequences.  I think they passed wuami’s screening, but I doubt that they were selected on the basis of being superior to other possibilities.

My own initial thought for selecting the subsequences was to apply the lessons from the massive amount of switch data we have collected to first try to identify triples of sub-sequences that would be the most amenable to the design of good switches.  With multiple candidate sets, we could then evaluate them from a variety of perspectives, which would include specificity for the RNAs we’re looking for (and not others likely to be found in human blood.)

When the subject came up a week or two ago, I wrote down some of my thoughts in Choosing TB segments for the A*B/C squared challenge.  Eli agreed with the premise that the possibilities for achieving high switch scores are dependent on the subsequences we choose, and had his ideas on what is important.  He added his thoughts to the above document.

As far as I know, neither of us has gone back to revisit the question or refine our explanations.  I had hoped to do that before posting this, but that isn’t going to happen in the next day or two.  So I’ll just throw out my initial suggestions for the triples:

TB A (GBP5): CAGGAACAGGCUGCACAGCU
TB B (DUSP3): CCAUGGUGAUGGAUGGUUUG
TB C (KLF2): GAGACAGGUGGGCAUUUUUG
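
To see where each proposed 20-mer sits inside its 50-base probe, here is a small Python sketch. All sequences are copied from this thread; `locate` is just an illustrative helper name, and positions are 0-based.

```python
# 50-nt DNA probe sequences from earlier in the thread.
PROBES = {
    "A": "GCAGGAACAACAGATGCAGGAACAGGCTGCACAGCTCAGCACAACATTCC",
    "B": "CCATGGTGATGGATGGTTTGGAAAGGGAATGTTGGTGCCTTTTGTGCCAC",
    "C": "ATTACTGTACATAGAGAGACAGGTGGGCATTTTTGGGCTACCTGGTTCGT",
}
# wuami's set (A = GBP5, B = DUSP3, C = KLF2).
WUAMI = {
    "A": "ACAGCUCAGCACAACAUUCC",
    "B": "GUUGGUGCCUUUUGUGCCAC",
    "C": "UUUUGGGCUACCUGGUUCGU",
}
# The triples suggested just above.
OMEI = {
    "A": "CAGGAACAGGCUGCACAGCU",
    "B": "CCAUGGUGAUGGAUGGUUUG",
    "C": "GAGACAGGUGGGCAUUUUUG",
}


def locate(probe_dna, segment_rna):
    """0-based start of the RNA segment within the DNA probe, or -1 if absent."""
    return probe_dna.find(segment_rna.replace("U", "T"))


for label, triple in (("wuami", WUAMI), ("omei", OMEI)):
    for name, segment in triple.items():
        start = locate(PROBES[name], segment)
        print(f"{label} {name}: bases {start}..{start + len(segment) - 1}")
```

wuami's three segments all start at position 30 (the last 20 bases of each probe), while the three suggested here start at positions 16, 0 and 15 for A, B and C respectively.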

My main criticism about your proposal is precisely its basic idea: it is argued that it would be advisable to choose oligos such that inputs and outputs are vaguely related. For one, I strongly doubt that MS2 will be the signal used in the final diagnostic. So why bother taking it into account?

Also, it seems to me much more valuable scientifically speaking to develop methods that let us create devices where inputs and outputs are actually as independent as possible from each other. My arguments in favor of sequence independence:

  • I don’t think that the oligos we’ve been using until now were particularly selected for their compatibility with the MS2 signal, and yet, it seems to me that the community has been able to produce pretty good logic gates and successful A/B measurers.
  • Our next goal or target may very well not give us a choice.

In conclusion, I’d rather see the Eterna community learn how to deal with these problems, without counting too much on helpful sequence coincidences.

Why can’t you just use the BLAST available on Nando’s second reference above?  I’m very new to all this, so it may be a silly question.

Nando, I can understand that perspective.  But taken to its extreme, we needn’t even bother using sequences from the identified three TB RNAs.

Regardless of how successful this next round turns out, there will be a huge gap between a riboswitch demonstrated in the Das lab and a functioning diagnostic. Everything else being equal, the better the switching characteristics of our first efforts, the better chance people downstream in the development line have of making progress with it.  Chances are, they will identify additional constraints that we have no way of anticipating at this point, and we will be changing the rules of the game as we learn more.

So my vote remains for starting by trying to make the best switches we can to meet the requirements we currently know about.  Choosing sequences that promise to work best using the techniques we currently know best certainly doesn’t prevent discovery and development of using other techniques with the same sequences.  If anything, I suspect it will encourage the development of alternatives (clearly, we’ve already demonstrated that there are alternatives) because it creates a clearer picture of what it is possible to shoot for.

Some more input on the experimental choices we have:

I had previously asked Rhiju whether it was possible to test more than one set of inputs this round. He didn’t say no, but it didn’t sound very encouraging, since it would increase the time and expense of the experiment.  But today I asked him whether we could consider changing the lengths of the input sequences from the currently proposed 20 bases each, and he said “totally!”.

FWIW, it’s actually quite easy to do a BLAST search at https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch.  Here’s all it took for me to search for the GBP5 mRNA segment I posted below:

(The first input line, consisting of the one character ‘>’, makes the two lines together a valid FASTA formatted sequence).
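
As a rough illustration of that two-line input, here is a Python sketch that builds the text to paste into the query box. The segment used is an assumption (the TB A (GBP5) 20-mer suggested in this thread), and `fasta_query` is a made-up helper name, not part of any BLAST tool.

```python
def fasta_query(sequence, header=""):
    """Build a minimal FASTA record: a '>' header line followed by the sequence."""
    return f">{header}\n{sequence}\n"


# Assumed query: the TB A (GBP5) 20-mer suggested in this thread.
query = fasta_query("CAGGAACAGGCUGCACAGCU")
print(query, end="")
```

With an empty header, the first line is just the single character '>', as described above.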

… and it is pretty straightforward to get the gist of the graphical summary:

BLAST found two sequences that were clearly better matches than any others. Scrolling down the page a bit, we see:

So those two best matches are identified as coming from the GBP5 mRNA.  So far, so good.

But what no one here (including me) has been willing to do is claim the expertise to draw any definitive conclusions about what effect the other close-but-not-exact matches might have down the line, when the switches we design are being tested on whole blood.

@nando Indeed it is. I ran the probe sequences through BLAST and selected the 20 nt that had the least complementarity to the top hits that did not match the gene of interest.

That’s interesting.  Were they actually better than all the other possible sequences? Or was it a multi-way tie and these were just the ones that happened to get picked?  It seems like an enormous coincidence if, out of the ~30 possible subsequences for each of the three 50-base sequences, the unique best match for all three turned out to be in the same ending position.

@omei: For gene C (KLF2), the first 20 would be a reasonable choice too.  Generally, if there’s a hit that matches a large chunk in the middle of the sequence, that’s going to force your choice to be on one of the ends.
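
To illustrate why a hit in the middle pushes the choice to the ends, here is a toy Python sketch. The hit coordinates are hypothetical (not taken from any real BLAST result), and counting shared positions is a simplification of the complementarity comparison described above.

```python
def overlap(start, width, hit_start, hit_end):
    """Bases shared by window [start, start+width) and hit [hit_start, hit_end)."""
    return max(0, min(start + width, hit_end) - max(start, hit_start))


def least_overlapping_starts(seq_len, width, hit_start, hit_end):
    """Return the window starts with the smallest overlap, plus that overlap."""
    starts = range(seq_len - width + 1)
    scores = {s: overlap(s, width, hit_start, hit_end) for s in starts}
    best = min(scores.values())
    return [s for s in starts if scores[s] == best], best


# Hypothetical off-target hit covering positions 15..34 of a 50-nt probe.
starts, shared = least_overlapping_starts(50, 20, 15, 35)
print(starts, shared)  # prints [0, 30] 5
```

With a 20-base hit centered in the probe, none of the 31 possible 20-base windows avoids it completely, and the two windows with the least overlap are the first 20 and the last 20 bases.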