Bioinformatician Manja Marz on non-coding RNAs, the dynamic programming approach, and homology search
What is there in the parts of the genome that do not code for proteins? How many kinds of non-coding RNAs are currently known? How can non-coding RNAs in our genome be detected despite variability in sequence? These and other questions are answered by Professor for High Throughput Methods at Friedrich-Schiller-University of Jena Manja Marz.
When I went to school, I learnt that our genome, our human genome, consists of protein-coding genes, that these genes consist of exons and introns, and that between them we have so-called garbage. That’s what my teacher taught me. Now, with all these high-throughput sequencing techniques, we understand a little more about our genome. We know that we have these proteins, but they make up just a very tiny fraction of our genome. It is something like 1.5%, that’s what we thought in 2007, and now it’s corrected to nearly 3%, which would mean that something like 97% of our genome is garbage. How is that possible? Nowadays, people tend not to use the word “garbage” anymore, because otherwise we would think we’re just garbage.
Well, nowadays we know quite a lot of non-coding RNAs. Some years ago we knew there were the so-called transfer RNAs, tRNAs, which are needed for the protein-coding RNAs to be translated. Then we knew another class of non-coding RNAs, the rRNAs, ribosomal RNAs, which are also important for translating mRNAs into proteins. And maybe some people had heard about snRNAs, the spliceosomal RNAs, which act during RNA processing: they splice, for example, the introns out and the exons together. Apart from that, people didn’t know much about non-coding RNAs some years ago. Nowadays we know much more.
So, we think there are not only 3 classes, like the tRNAs, the rRNAs and the snRNAs. Nowadays we know at least 2500 classes, which are stored in special databases for non-coding RNAs. But, again, what are they doing? Why do we have them?
Why does our genome contain these non-coding RNAs? We try to understand a little more. One very famous class is the so-called microRNAs. They act in different processes. They’re also transcribed and then processed, that is, matured: a lot of proteins and other factors come in, still within the nucleus, and shorten them so that we end up with just very small active fragments. After transcription they may have a length of 120 nucleotides, so they’re not very long, but after processing they are only about 22 nucleotides long, and these 22 nucleotides are very important.
So, they go, for example, to the mRNA, the messenger RNA which wants to become a protein. They go to these sites on the mRNA, and then these genes are silenced. That means they may no longer be translated, so we don’t have a protein afterwards. It’s a very important regulatory mechanism, and most of the non-coding RNAs we know nowadays are, actually, regulatory RNAs. What can they regulate? Now you know about the very famous microRNAs, which regulate mRNAs by silencing. However, non-coding RNAs can also act in different ways: they can help during transcription by interacting with the polymerase, but they can also do completely different things. After transcription they usually form a so-called secondary structure: RNAs can interact nucleotide by nucleotide, similar to DNA, but different. In DNA, usually, you have a C interacting with a G, and a T interacting with an A. In RNAs, however, thymine is replaced by uracil, and we have interactions like G-U, which is new, in addition to G-C and U-A. So we have another possibility to interact and, therefore, forming a secondary structure gains a new flexibility. These secondary structures are important for the function. If the RNAs have the correct structure, they can function somewhere in the cell, either in the nucleus or in the cytoplasm. If they don’t have the correct structure, they don’t function. This is a very big difference to proteins, because for protein-coding RNAs it doesn’t matter what secondary structure they have; it only matters what the sequence of the nucleotides is, and 3 of these nucleotides are translated into 1 amino acid.
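To make the pairing rules concrete, here is a minimal sketch of a dynamic programming approach that counts the maximum number of nested base pairs an RNA sequence can form when G-C, A-U and G-U pairs are allowed. It is a Nussinov-style simplification chosen for illustration, not necessarily the method used in practice; real structure prediction programs minimize free energy instead of counting pairs. The function names and the `min_loop` value are assumptions of this sketch.

```python
def can_pair(a, b):
    """Return True for the pairs G-C, A-U and the wobble pair G-U."""
    return {a, b} in ({"G", "C"}, {"A", "U"}, {"G", "U"})


def max_base_pairs(seq, min_loop=3):
    """Nussinov-style DP: maximum number of nested base pairs in seq.

    min_loop is the minimum number of unpaired bases enclosed by a pair
    (an assumed value for this sketch).
    """
    n = len(seq)
    # best[i][j] holds the maximum number of pairs within seq[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            # Case 1: position j stays unpaired
            score = best[i][j - 1]
            # Case 2: position j pairs with some k, leaving >= min_loop bases between
            for k in range(i, j - min_loop):
                if can_pair(seq[k], seq[j]):
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1]
                    score = max(score, left + 1 + inner)
            best[i][j] = score
    return best[0][n - 1] if n else 0
```

For a toy hairpin such as `"GGGAAAUCCC"`, `max_base_pairs` finds three stacked pairs closing a short loop. Counting pairs ignores stacking energies and loop penalties, so the result is only a rough picture of which structures are possible.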
Now we come to the very important bioinformatical part. How can we detect these non-coding RNAs in our genome? Let’s assume we have our complete genome: how can we find our non-coding RNAs in this huge, long genome? For proteins, this is rather simple, because the sequence should remain the same in order to keep the function, and we just search for the same sequence of nucleotides.
For protein-coding DNA this can be done with different algorithms, and it’s quite easy to achieve a nice result in that respect. However, for non-coding RNAs this is different, since for the function the sequence doesn’t matter that much, but the structure does.
So we actually need completely different algorithms and programs to find them. We may start with something similar to what we do for proteins and try a sequence-based search; however, that doesn’t work very well.
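The sequence-based idea can be pictured with a very small sketch: slide the known sequence along the genome and report every position where it matches with at most a few mismatches. This is a deliberate simplification, real homology search tools use alignment with gaps and statistical scoring, and the function name, the parameter `max_mismatches` and the example strings are made up for illustration.

```python
def find_similar(genome, query, max_mismatches=1):
    """Return (start, mismatches) for every window of genome close to query."""
    hits = []
    m = len(query)
    for start in range(len(genome) - m + 1):
        window = genome[start:start + m]
        # Count positions where the window and the query differ
        mismatches = sum(1 for a, b in zip(window, query) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


# Made-up toy data: one exact copy and one copy with a single mismatch
print(find_similar("TTACGTACGATT", "ACGA"))
```

Such a search only finds copies whose letters are nearly unchanged. A non-coding RNA whose sequence has drifted while its structure stayed the same would fall below any reasonable mismatch threshold, which is why the sequence-based approach does not work very well here.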
And so, we have to find algorithms that take into account that the secondary structure of our non-coding RNA is important for the function. Take, for example, a human spliceosomal RNA, let’s say U4, and compare it to a chicken U4 snRNA. If we try to align them and lay the sequences below each other, the sequences really don’t look similar at all. They look so different, so how can we actually find them? But if you consider the secondary structure, then we’re able to really see similarities, and therefore we can assume that the function might be the same.
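A small sketch can show how two sequences may share a structure while sharing almost no letters. The two sequences and the dot-bracket structure below are invented toy data, not the real human or chicken U4 snRNAs, and all names are assumptions of this example. In dot-bracket notation, matching parentheses mark paired positions and dots mark unpaired ones.

```python
ALLOWED_PAIRS = {"GC", "CG", "AU", "UA", "GU", "UG"}


def sequence_identity(a, b):
    """Fraction of identical positions in two equally long sequences."""
    assert len(a) == len(b), "this sketch compares ungapped sequences only"
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def paired_positions(structure):
    """Turn a dot-bracket string into a list of (i, j) paired indices."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))
    return pairs


def fits_structure(seq, structure):
    """True if every paired position holds a G-C, A-U or G-U pair."""
    return all(seq[i] + seq[j] in ALLOWED_PAIRS
               for i, j in paired_positions(structure))


# Invented toy hairpin and two invented sequences that both fold into it
TOY_STRUCTURE = "((((....))))"
TOY_A = "GGCAUUCGUGCC"
TOY_B = "AUGCGAAAGCGU"

print(sequence_identity(TOY_A, TOY_B))
print(fits_structure(TOY_A, TOY_STRUCTURE), fits_structure(TOY_B, TOY_STRUCTURE))
```

In this toy case no position is identical between the two sequences, yet both are compatible with the same hairpin, because every change on one side of the stem is compensated by a change on the other side. Structure-aware search methods look for this kind of pattern.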
In reality, it is a bit different, because these non-coding RNAs also interact with proteins. However, in silico it’s not possible nowadays to predict how a non-coding RNA, or an RNA in general, actually interacts with a protein, because we don’t know anything about the interactions of RNAs and proteins. It will definitely be a challenge in the future to find out how RNAs and proteins interact. Now we know more about how we try to find non-coding RNAs in the genome; however, it’s still quite hard, because what we can do so far is this: if we go into a lab and find a non-coding RNA, one which is not translated, in one organism, then it is possible to search for related non-coding RNAs in other organisms by homology search, also considering the secondary structure.
However, what happens if you do not know anything about these non-coding RNAs? This has come up in the last few months or years: we now know there are also so-called long non-coding RNAs. Long non-coding RNAs have, by definition, a length of more than 200 nucleotides; they can range beyond 10 kilobases, and they can include introns, which are spliced out. These long non-coding RNAs are sometimes antisense to the protein-coding genes they interact with; however, they can also just act in cis somewhere in the genome. They are hard to find, because their secondary structure is interrupted by the introns. So, finding them is an even harder challenge. What we usually do is go for high-throughput sequencing data: we try to find out what is transcribed at a certain time point in the cell, we sequence this, we map it back to the genome, and then we try to find out which parts of the genome are not proteins but still transcribed. Those might be possible non-coding RNAs, and after mapping these back we can nowadays find long non-coding RNAs as well. However, their function still remains unclear from a bioinformatical point of view, and then we have to go back into the lab and try to find a function.
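The filtering step described here can be sketched as a simple interval comparison: keep every transcribed region that is longer than 200 nucleotides and does not overlap an annotated protein-coding gene. This is only an illustration under strong assumptions: regions are given as half-open `(start, end)` coordinates on a single chromosome, strand is ignored (so antisense transcripts would be discarded), and the function names and coordinates are made up.

```python
def overlaps(a, b):
    """True if two half-open intervals (start, end) share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def lncrna_candidates(transcribed, coding, min_length=200):
    """Transcribed regions longer than min_length that overlap no coding gene."""
    candidates = []
    for region in transcribed:
        start, end = region
        if end - start <= min_length:
            continue  # too short to count as a long non-coding RNA
        if any(overlaps(region, gene) for gene in coding):
            continue  # overlaps an annotated protein-coding gene
        candidates.append(region)
    return candidates


# Made-up coordinates for illustration only
transcribed_regions = [(100, 250), (1000, 1900), (5000, 5600)]
coding_genes = [(1500, 3000)]
print(lncrna_candidates(transcribed_regions, coding_genes))
```

The regions that survive this filter are only candidates. As said above, their function remains unclear from the bioinformatical side and has to be tested in the lab.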