Pseudoknot Finder tool

DigitalEmbrace · May 10, 2023, 11:28pm

We have a Pseudoknot Finder tool! Thank you jnicol for developing this great script for us!

After adding the Pseudoknot Finder script to your Favorites, open the OpenKnot lab puzzle and initiate the booster. The dialog box will pop up. Enter a title for the organism you are searching and a description such as the source (hyperlinks are supported), and paste the long sequence. Here is an mRNA I found on RNA Central.

Click the Apply button. The script will search 100-base segments at 20-base intervals, then return a list of pseudoknots found.
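For readers curious what "100-base segments at 20-base intervals" means in practice, here is a minimal sketch of that windowing step. It is not the script's actual code: the function name `sliding_windows` is made up for illustration, and it assumes only full-length windows are kept (the real tool may handle the tail of the sequence differently).

```python
def sliding_windows(sequence, window=100, step=20):
    """Yield (start, segment) pairs for fixed-size windows along a sequence."""
    # Assumption: only full-length windows are kept; the real script may
    # treat the last few bases of the sequence differently.
    for start in range(0, len(sequence) - window + 1, step):
        yield start, sequence[start:start + window]


demo = "ACGU" * 60  # 240 bases of made-up sequence
segments = list(sliding_windows(demo))
print(len(segments))                  # how many 100-base windows were cut
print([start for start, _ in segments])  # window start offsets
```

Because neighbouring windows overlap by 80 bases, the same pseudoknot can plausibly show up in more than one segment.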

If desired, click on each Pseudoknot entry to view the folding in Natural Mode. Once you are comfortable with the tool, sequences can be submitted without reviewing each one. I’m keeping this one even though it has six A’s in a row. It will probably synthesize and test fine. And if it doesn’t, not a big deal. We have plenty of slots.

I’m keeping this one as well even though it folds with the 5’ end. Only base 26 is involved.

If you want to delete a sequence from the list (not submit it), click Enabled to change to Disabled. Please note, any mutation made to the sequences in the list will not be saved when sequences are submitted. I asked jnicol to keep this tool very simple so that new players can learn it easily, no bells and whistles.

Click the Submit 6 designs button and designs will submit!

The script can process up to 500,000 bases, but a more manageable RNA length to search for pseudoknots is in the 10,000 - 50,000 base range. (To close the dialog box, click the script in the booster list again.)

Eli_Fisker · May 11, 2023, 8:09am

Hi all!

If you are wondering where to find genomes to browse for pseudoknots with Jnicol’s new Pseudoknot Finder tool, here is a way to do it. The tool doesn’t discriminate: whether you give it RNA or DNA, it will churn out fine potential pseudoknots.

This is a demo of how to find a FASTA sequence. FASTA is just a plain-text sequence format that computers can read easily: a header line starting with “>” followed by the sequence itself, usually wrapped into short lines.

Here is how to get the FASTA. Go to NCBI. Choose the option Genome, beside the search box.

For this particular booster, viruses seem to be the perfect genome size.

Type in Flaviviridae

This will give you a bunch of viruses that are of a reasonable genome size:

I pick out Hepacivirus C. This lands me on a site with a lot of info. I scroll to the bottom:

This gives me the representative genomes for the organism, in this case two. Notice the NC in the entry name. Those I always go for; they are the quality sign of a reference genome. The page also shows how large the genome is. The first one is 9.65 kb (kb = kilobases, 1 kb = 1000 bases), meaning it is around 10000 bases long. This one is doable.

I open the post for Hepatitis C virus genotype 1:

Notice the title has complete genome. This means you are not just getting a single protein or so. You’ve got it all. It also shows the exact length of the sequence in base pairs.

Now click on FASTA (top left). And you have the DNA for the organism of your interest.

Highlight all the DNA and nothing else. Copy it and paste it into Jnicol’s Pseudoknot finder tool like this:
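If you would rather copy the whole FASTA page, header included, and strip the header afterwards, a small sketch like this does the "DNA and nothing else" step. The function name `fasta_to_sequence` and the example record are made up for illustration; it assumes a single-record FASTA.

```python
def fasta_to_sequence(fasta_text):
    """Return only the bases from a single-record FASTA text."""
    bases = []
    for line in fasta_text.strip().splitlines():
        if line.startswith(">"):
            continue  # the header line carries the name, not sequence
        # Drop any spaces or stray whitespace inside the sequence line.
        bases.append("".join(line.split()))
    return "".join(bases).upper()


# Made-up record for illustration, not a real accession.
example = """>EXAMPLE_1 made-up complete genome
gccagccccctgatggg
GGCGACACTCCAC
"""
sequence = fasta_to_sequence(example)
print(len(sequence), sequence)
```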

Click Apply. The tool starts running through the sequence and tests it for pseudoknots. Depending on the size of your genome, it may take a while.

It says it has found 79 pseudoknots, which is a reasonable number to look through.

If you want ideas for viruses in this size range, I have collected a bunch that could make nice targets in the Most wanted organism spreadsheet. I have picked mainly from viruses that are troublemakers for food production, which I found on this Wikipedia page on Plant pathogens.

If you want to work with much larger organisms than viruses, you should give the bulk submission approach a thought.

Other viral families that tend to have small genomes include: Picornaviridae, Caliciviridae, Togaviridae, Paramyxoviridae, Orthomyxoviridae, Rhabdoviridae and Coronaviridae.

mjt · May 11, 2023, 5:31pm

Huh. Wow! This is crazy good. I hope I did it right.

jnicol · May 11, 2023, 5:38pm

Great write up, thanks! Actually, that interaction with base 26 is a bug; the tool should not have allowed that sequence.

Astromon · May 12, 2023, 4:02pm

These instructions are very clear. Thanks!

mjt · May 12, 2023, 9:17pm

Something’s wrong. The pseudoknot finder doesn’t work for me anymore. Everything looks normal. It goes through the motions but no pseudoknots are found.

DigitalEmbrace · May 12, 2023, 10:33pm

@jnicol The Pseudoknot Finder tool isn’t finding any pknots for me either. Can you take a look?

jnicol · May 13, 2023, 12:44pm

I made a change to fix the base 26 issue and broke the checking. It’s working now, please test and let me know.

mjt · May 13, 2023, 4:11pm

It worked well. It still lets pseudoknots through that may have more than 5 of the same bases in a row, but that’s ok I think.

DigitalEmbrace · May 13, 2023, 5:13pm

@mjt Yes, that’s okay. Testing natural RNA segments with more than 5 of the same base in a row will be useful.
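For anyone who wants to spot such runs before submitting, here is a rough sketch that reports the longest stretch of one repeated base. It is not part of the Pseudoknot Finder; the name `longest_run` and the example sequence are made up.

```python
from itertools import groupby


def longest_run(sequence):
    """Return (base, length) for the longest run of a single repeated base."""
    best_base, best_len = "", 0
    for base, group in groupby(sequence):
        run = sum(1 for _ in group)
        if run > best_len:
            best_base, best_len = base, run
    return best_base, best_len


seq = "GGC" + "A" * 6 + "CUGC"  # made-up segment with six A's in a row
base, length = longest_run(seq)
print(base, length)
if length > 5:
    print("more than 5 of the same base in a row")
```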

Eli_Fisker · May 25, 2023, 5:42am

I wish to highlight that there is a new version of the Pseudoknot finder that Jnicol has made. It is called Customized Pseudoknot Finder.

Its main difference is that it has two extra settings. You can choose Any knot and you will get each and every pseudoknot detected. Or you can choose Kissing loops and you will get kissing loops (which are a minority among the pseudoknots).

Then there is my favorite setting - pseudoknot bindings. It introduces an extra level of filtering beyond what was already there in the starter script (a minimum of 3 pseudoknot bindings). I typically set it at 4 pseudoknot bindings. This way I get rid of some of the weaker pseudoknots without losing too many potential pseudoknots.

This setting is equivalent to filtering level -f4 in jandersonlee’s spreadsheets for bulk submission. It requires at least 4 pseudoknot brackets and 4 normal brackets. Generally, jandersonlee and I fitted the filtering to the number of pseudoknot potentials we got back. If we got few potentials back, typically for viruses, we used a lower level of filtering. If we got a ton of potentials back, typically for bacteria and eukaryotes, we raised the filtering level.

f1 = is there one set of (Basic level before more filtering)
f2 = []
f3 = [[]]
-f3 = ((([[[ )))]]]
-f4 = (((([[[[ ))))]]]]
-f5 = ((((([[[[[ )))))]]]]]
-f6 = (((((([[[[[[ ))))))]]]]]]
-f7 = ((((((([[[[[[[ )))))))]]]]]]]
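To make the bracket levels concrete, here is a minimal sketch of how such a check could look on a dot-bracket structure. This is only my reading of the list above, not jandersonlee’s or jnicol’s actual code: the function name `passes_filter` is hypothetical, and it simply counts opening brackets rather than requiring them to sit next to each other.

```python
def passes_filter(structure, level):
    """True if the structure has at least `level` '(' and `level` '[' pairs."""
    # Assumption: counting opening brackets is enough; the real filters
    # may additionally require the brackets to form one continuous stem.
    normal_pairs = structure.count("(")
    pseudoknot_pairs = structure.count("[")
    return normal_pairs >= level and pseudoknot_pairs >= level


structure = "..((((...[[[[..))))...]]]].."  # made-up fold with 4 + 4 pairs
print(passes_filter(structure, 4))  # meets level 4
print(passes_filter(structure, 5))  # too weak for level 5
```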

Eli_Fisker · June 3, 2023, 5:54pm

Submitting a bacterial size organism with Pseudoknot Finder

I have submitted a +1 million bp size organism with the Customized Pseudoknot Finder. Here is how I did it.

I took the FASTA from the monster size viral organism that I wanted to submit, including the FASTA title header. Mimivirus terra2 genome

Instead of pasting it into Pseudoknot Finder as usual, I instead pasted it into a bioinformatics tool. I set fragment length to 200000 bp and clicked Submit.

Sequence Manipulation Suite: Split FASTA

This gives me six nicely portioned 200000 bp fragments (okay, the last one was shorter). I use the fragment title as the puzzle title:
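If you prefer not to use the web tool, the same kind of split can be sketched in a few lines. This is not how Split FASTA is implemented, just an equivalent idea; `split_sequence` and the demo genome are made up for illustration.

```python
def split_sequence(sequence, fragment_length=200_000):
    """Cut a sequence into consecutive fragments; the last may be shorter."""
    return [
        sequence[i:i + fragment_length]
        for i in range(0, len(sequence), fragment_length)
    ]


genome = "ACGT" * 275_000  # 1,100,000 made-up bases, not a real genome
fragments = split_sequence(genome)
print([len(fragment) for fragment in fragments])
```

Changing `fragment_length` to 100_000 gives the smaller pieces that are more comfortable to work with.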

Then I just repeated the process in Pseudoknot Finder for each fragment.

Here is the result:

Working with 200000 base fragments was a rather painful experience; 100000 bp is probably a better size. You can adjust this to whatever size you personally prefer working with in the Pseudoknot Finder.

Now you can do bacteria or a smaller chromosome if you like.

Eli_Fisker · June 10, 2023, 8:12pm

Easier method for working with 1 Mb+ organisms

I found a way to load only the genes from an organism and search them with the Customized Pseudoknot Finder.

I had been wondering whether there was a more effective way to filter genomes. I have long believed that most pseudoknots will be in the coding sequence, partly because pseudoknots like having lots of GCs and most of the GCs are in the coding regions. I have also compared intron regions with coding regions. Introns contain a lot of weird base repeats and a much higher ratio of A’s (hard in lab). They also have fewer G’s and C’s, which are needed to form strong pseudoknots. Basically, I realised that if I only searched the genes, I would be more likely to get a good number of pseudoknots with less searching.

However, this method is not best used on viruses and bacteria, because their genes are positioned close together, with very little “junk” (non-coding regions). Plus, viruses and bacteria generally don’t have introns. Viruses and bacteria are Pseudoknot Finder’s primary scope. Where this method may be of most use is with larger organisms or chromosomes from larger organisms.

I wanted to submit Bodo saltans virus strain NG1 - a 1.3 Mb virus that infects protozoans. Protozoans are single-celled organisms, some of which cause sleeping sickness, Chagas disease and many other bad diseases. Here is how I got its genes:

This gets a file with the genes and their names:

Now to the step that makes it easier than working with the bioinformatics tool. You need a Google Sheet to dump all the sequences in, with the first row left blank. Then add a filter on the first column, A.

What I did by filtering out all the “>” was to take out all the lines with gene names. Now everything is ATGC. Which is something that Pseudoknot Finder can deal with. Voila!
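The same filtering can be sketched outside a spreadsheet. This is just an illustration of the idea, not a replacement for the steps above; the function name `sequence_rows` and the two gene records are made up.

```python
def sequence_rows(multi_fasta_text):
    """Keep only the sequence rows of a multi-record FASTA text."""
    rows = []
    for line in multi_fasta_text.splitlines():
        line = line.strip()
        # Rows starting with ">" hold gene names; blank rows hold nothing.
        if line and not line.startswith(">"):
            rows.append(line)
    return rows


# Two made-up gene records for illustration.
genes = """>gene_1 hypothetical protein
ATGGCGTTAGC
CCGTAA
>gene_2 hypothetical protein
ATGCCGGGTTGA
"""
rows = sequence_rows(genes)
print(len(rows), "sequence rows")
print("".join(rows))  # everything left is A, T, G and C
```

Like the spreadsheet filter, this joins neighbouring genes end to end, so a window in the Pseudoknot Finder can straddle two genes.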

As soon as you have the fragments in the spreadsheet, it is really just like bulk submission. Copy out as large a portion of the fragments as you find comfortable working with in the Pseudoknot Finder. I typically set my pseudoknot bindings to minimum 4, unless I’m working with a tiny sequence where I really, really want to find a pseudoknot.

It was a much less painful experience doing things this way, compared to when I submitted the slightly smaller Mamavirus using fragments from the bioinformatics tool Split FASTA.

Now, I demonstrated this on a virus. However, there is no reason why this can’t be done with full genomes also. You can just dump all the fragments directly in the spreadsheet and skip the step of removing gene names. Actually, you can skip the spreadsheet part entirely and just work from the document you downloaded. But the row numbers in the spreadsheet make it easier to keep track.

I’m not saying that the regions outside the genes are unimportant. I am just trying to make the most of the slots by focusing where I think we will find the most pseudoknots, based on the base frequencies I see in introns and non-coding regions.

These are my thoughts about making things easier. Remember it is not the only way.

jenni111 · June 11, 2023, 9:23am

Sorry to have to ask another question, but I don’t seem to be able to load the data from ‘notepad’, which is where it is ‘sent-to’, into google spreadsheets. Am I missing a step?

Eli_Fisker · June 11, 2023, 9:45am

No worries. After you ask to create the file, it will download to your browser in the blink of an eye, and after this you will have to find it among your downloaded files. Depending on your browser, you should be able to find it as your latest downloaded file. Otherwise you can find it on your computer by opening your Downloads.

jenni111 · June 11, 2023, 9:48am

I have the data, in notepad, but can’t get it into google spreadsheets. I use firefox.

Eli_Fisker · June 11, 2023, 9:53am

Sorry, I misunderstood. You need to highlight it all, either by clicking and dragging with your mouse or with CTRL + A. Then CTRL + C will copy it out. To get it into Google Sheets you will need to paste it with CTRL + V. Google Sheets is not too happy about using its menu for pasting or copying. It wants us to use the keyboard shortcuts.

jenni111 · June 11, 2023, 9:59am

ok, the CTRL+C worked - I had tried everything else. I couldn’t work out why I could get into excel but not sheets, thanks. I just need more submissions now and I will hopefully get this underway.
Thanks for your help

Eli_Fisker · June 24, 2023, 9:58am

Rare diseases and how to find complete human genes

DigitalEmbrace has brought up that Rhiju and co will first be scanning the human genome in 2024. I have an idea for where to start with Homo sapiens…

There are a lot of rare diseases caused by a mutation in a gene. Here is a story about two little boys with a genetic mutation that means they have dementia. There are 70+ such rare diseases with the same result - a child with dementia.

‘You mean there’s nothing?’ The families fighting for their children with dementia by The Guardian

I have downloaded the entire Orphanet database of rare diseases and isolated just the gene names. When I removed all the duplicates (many diseases share a gene), I ended up with 4418 human genes that have rare diseases associated with them.

Orphan diseases related genes

There is an especially useful trick when it comes to finding human genes. I earlier shared tips on how to find RefSeq genomes: a scientifically agreed-upon representative genome for the organism, and often also the most complete version so far. When it comes to human genes, there is something similar. It is just called RefSeqGene. It includes introns and exons.

To get a RefSeqGene, first you need the name of a human gene. Before I found the databases with rare diseases and gene names, I found genes behind diseases, by looking them up in Wikipedia.

Here is how to find a RefSeqGene. Open NCBI and choose Nucleotide in the menu. I decided to search for the human gene PFKM.

Most of the time you will end up with a search result like this. A direct link to the RefSeqGene.

The few exceptions I have hit upon are for genes that have not yet been curated. Then I specify Homo sapiens for organism and try to pick the first variant of the RefSeqs.

From here on, it is just like any other Customized Pseudoknot Finder run. I copy the FASTA.

I use the number of base pairs to judge how much filtering to add. If the gene is less than 50000 bp, I will typically set Pknot bindings to filter 4. Above 50000 bp I will opt for Pknot bindings filter 5, and above 100000 bp I will pick Pknot bindings filter 6. I don’t recall having pulled filter 7 on a gene yet, but if a gene gets longer than 300000 bp, I’ll probably consider it.
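Written out as a rule of thumb, those thresholds look roughly like this. It is only a sketch of my habit, not a setting in the tool: the function name `suggested_filter` is made up, a gene of exactly 50000 bp is treated as filter 4 (an assumption), and filter 7 is something I would only consider, not a fixed rule.

```python
def suggested_filter(gene_length_bp):
    """Suggest a Pknot bindings filter level from the gene length in bp."""
    if gene_length_bp > 300_000:
        return 7  # only a "probably consider", never actually used so far
    if gene_length_bp > 100_000:
        return 6
    if gene_length_bp > 50_000:
        return 5
    return 4  # assumption: exactly 50000 bp still counts as filter 4


for length in (30_000, 75_000, 150_000, 400_000):
    print(length, "bp -> filter", suggested_filter(length))
```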

I also adjust filter setting to how many pseudoknots it looks like I will get for a search. So I sometimes do a prerun, just to get an idea of how many pseudoknots I will get. If I can see I’ll get more pseudoknots than I care to look through, I raise the filter level. This was the case this time, so I set my filter at 5 instead of the normal 4.

You can pick whichever genes you want from the Orphanet document. When I work my way through them, I’ll search lab to check if it is used and register it to your name, if you have used a RefSeqGene.

When I get the Pseudoknot Finder results back, I look through all the pseudoknots alongside jandersonlee’s ArcKnot tool, so as to pick out the knots judged to be stronger.

How may this help?

We are not running pseudoknot tests on the diseased version of a gene, which may also be interesting. However, there are so many disease variations that I wouldn’t know where to start for now, and we wouldn’t have the slots to run them all. Getting to know where pseudoknots are in functional genes is, I think, where we will get more bang for our slot bucks. If there is a version of the gene with one or more mutations at the spot of such a pseudoknot, that may be valuable knowledge. Also, if pseudoknots turn out to be medical targets, they may be useful for creating medicine to slow down or speed up the function of a gene, depending on whether it is overactive or not working. Pseudoknots as medical targets are potentially akin to ASO medicine, just using a different method.

Additional data

For those who wish to look up a specific rare disease or know what disease/s their chosen gene is involved with, I have made an extra spreadsheet with the full Orphanet dataset:

Full orphanet file

DigitalEmbrace · June 24, 2023, 1:15pm

I’ll be curious to hear how long these genes are. Some of the introns can be long. Knowing if the original pre-mRNA forms a pseudoknot will be helpful for medical researchers. If a player has a personal interest in a disease caused by a specific mutation, then feel free to scan the mutated region also. Scanning the mRNA would also be useful, as the mRNA hangs around in the cell much longer than the pre-mRNA. ASOs target pre-mRNA before splicing, but small molecule drugs might target mRNA. The mRNA is shorter for scanning, but I don’t know a good source for finding the sequence.

I found the factor VIII (F8) mRNA sequence on a UCSC website. I clicked X01179 - Human mRNA for factor VIII. I’ll try to explore that site more when I have time. I used the original Pseudoknot Finder tool since the sequence isn’t terribly long.

I also entered the F8 exon16 sequence from our OpenASO round 1 puzzle in the Pseudoknot Detective puzzle. The isolated exon sequence is challenging to locate by searching online. Perhaps a task to keep in mind for future OpenKnot puzzles!