Bioinformatics After High Throughput Sequencing

Bioinformatician Manja Marz on non-coding RNAs, the dynamic programming approach, and homology search

Manja Marz, Professor for High Throughput Methods, Friedrich-Schiller-University of Jena

What is in the parts of the genome that do not code for proteins? How many kinds of non-coding RNAs are currently known? How can non-coding RNAs in our genome be detected despite variability in sequence? These and other questions are answered by Manja Marz, Professor for High Throughput Methods at Friedrich-Schiller-University of Jena.

When I went to school, I learnt that our genome, our human genome, consists of protein-coding genes, that these genes consist of exons and introns, and that between these genes we have so-called garbage. That’s what my teacher taught me. Now, with all these high-throughput sequencing techniques, we understand a little more about our genome. We know that we have these protein-coding genes, but they make up just a very tiny fraction of our genome. It is something like 1.5%, that’s what we thought in 2007, and now it’s been corrected to nearly 3%. Does that mean that something like 97% of our genome is garbage? How is that possible? Nowadays, people tend not to use the word “garbage” anymore, because otherwise we would think we’re just garbage.

So, people try to understand what these 97% of our genome are. We do not understand too much yet, but step by step we understand a little more. We have a lot of repetitive elements there, such as SINEs and LINEs, but we also have a lot of other fragments in our genome which are interesting because they are transcribed, similar to protein-coding genes. You know about those: protein-coding genes are transcribed, the transcripts are processed while still in the nucleus, and then they go into the cytoplasm, where they are translated at the ribosomes. But these other “genes”, which are also transcribed, are called non-coding RNAs, or ncRNAs for short. The name doesn’t mean that they’re not coding for anything. They do have a function, and actually a very important one, but they do not code for proteins. So what happens with them? We still have polymerases in our nucleus, and these polymerases come to our non-coding RNA genes. These genes are transcribed, and after that the transcripts can have various functions, but usually they do not go to the ribosomes and are not translated into proteins. So, what are these genes and what are they doing?

Well, nowadays we know quite a lot of them. Some years ago we knew that there are the so-called transfer RNAs, tRNAs, which help the protein-coding RNAs to be translated. Then we knew another class of non-coding RNAs, the rRNAs, ribosomal RNAs, which are also important for translating mRNAs into proteins. And maybe some people had heard about snRNAs, the spliceosomal RNAs, which act during processing: they splice, for example, the introns out and the exons together. Apart from that, people didn’t know much about non-coding RNAs some years ago. Nowadays we know much more.

So, we think there are not only 3 classes, the tRNAs, the rRNAs and the snRNAs. Nowadays we know at least 2,500 classes, which are stored in special databases for non-coding RNAs. But, again, what are they doing? Why do we have them?

Why does our genome contain these non-coding RNAs? We try to understand a little more. One very famous class is the so-called microRNAs. They act in different processes. They are transcribed, again, and they are also processed, that is, matured: a lot of proteins and other factors come in, still within the nucleus, and shorten them, so that only a very small active fragment remains. After transcription they may have a length of 120 nucleotides, so they’re not very long, but after processing they are only about 22 nucleotides long, and these 22 nucleotides are very important.

So, they go, for example, to an mRNA, a messenger RNA which is about to become a protein. They bind to sites on it, and then the gene is silenced. That means the mRNA may no longer be translated, so we don’t get a protein afterwards. It’s a very important regulatory mechanism, and most of the non-coding RNAs we know nowadays are, actually, regulatory RNAs. What can they regulate? Now you know about the very famous microRNAs, which regulate mRNAs by silencing. However, non-coding RNAs can also act in different ways: they can help during transcription by interacting with the polymerase, and they can also do completely different things.

What they usually do after transcription is form a so-called secondary structure. RNAs can interact nucleotide by nucleotide, similar to DNA, but different. In DNA, usually, you have a C interacting with a G, and a T interacting with an A. In RNA, however, thymine is replaced by uracil, and now we have interactions like G-U, which is new, in addition to G-C and U-A. So we have another possibility to interact, and therefore forming a secondary structure has a new flexibility.

These secondary structures are important for the function. If the RNAs have the correct structure, they can function somewhere in the cell, either in the nucleus or in the cytoplasm. If they don’t have the correct structure, they don’t function. This is a very big difference to protein-coding RNAs, because for those it doesn’t matter what secondary structure they have. It only matters what the sequence of nucleotides is, and 3 of these nucleotides are translated into 1 amino acid.
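The pairing rules just described (G-C, A-U, plus the additional G-U pair in RNA) can be written down as a small lookup. This is only an illustrative sketch: the names `RNA_PAIRS`, `WATSON_CRICK_ONLY`, `can_pair` and `stem_pairs` are made up for this example, and the two short strands are invented, not real RNA sequences.

```python
# Allowed RNA base pairs: G-C and A-U (Watson-Crick) plus the G-U wobble pair.
RNA_PAIRS = {"GC", "CG", "AU", "UA", "GU", "UG"}
# The same rule set without the G-U pair, for comparison.
WATSON_CRICK_ONLY = {"GC", "CG", "AU", "UA"}


def can_pair(a, b, allowed=RNA_PAIRS):
    """Return True if nucleotides a and b may pair under the given rule set."""
    return (a + b).upper() in allowed


def stem_pairs(five_prime, three_prime, allowed=RNA_PAIRS):
    """Check whether two equally long strands can form one continuous stem.

    Both strands are written 5'->3', so the first base of one strand
    pairs with the last base of the other.
    """
    if len(five_prime) != len(three_prime):
        return False
    return all(
        can_pair(a, b, allowed)
        for a, b in zip(five_prime, reversed(three_prime))
    )


# Hypothetical toy strands: the last pair is U-G, which only RNA rules allow.
print(stem_pairs("GGGU", "GCCC"))                     # True
print(stem_pairs("GGGU", "GCCC", WATSON_CRICK_ONLY))  # False
```

The second call shows the "new flexibility" in a small way: the same two strands form a complete stem only when the G-U pair is permitted.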

Now we come to the very important bioinformatical part. How can we detect these non-coding RNAs in our genome? Let’s assume we have our complete genome. How can we find our non-coding RNAs in this huge, long genome? For proteins, this is rather simple: the sequence has to remain the same in order to keep the function, so we just search for the same sequence of nucleotides.

For protein-coding DNA this can be done with different algorithms, and it’s quite easy to achieve a nice result. For non-coding RNAs, however, this is different, since the sequence doesn’t matter that much for the function; mainly the structure does.

So we actually need completely different algorithms and programs to find them. We may start with something similar to what we do for proteins and try a sequence-based search, but that doesn’t work very well.

And so we have to find algorithms which take into account that the secondary structure of our non-coding RNAs is important for the function. Take, for example, a human spliceosomal RNA, let’s say U4, and compare it to a chicken U4 snRNA. If we try to align them and lay the sequences below each other, the sequences really don’t look similar at all. They look so different, so how can we actually find them? But if we consider the secondary structure, then we are able to really see similarities, and therefore we can assume that the function might be the same.

To calculate the secondary structure for a sequence, there are different approaches. You can do it with stochastic methods, but I think the most widely used approach is similar to what probably happens in nature: we try to find the minimum free energy of these molecules. Whenever bases in a molecule interact, energy is released; this is an exothermic reaction. We think that the more energy is released, the more stable the secondary structure is. And people think that, probably, the more stable the secondary structure is, the better the function can be attained. Therefore, we try to find the secondary structure which has the minimum free energy for that molecule. How do we find the minimum free energy for a given sequence, then? This is usually done with a dynamic programming approach. We go through all the possible structures, calculate the free energy of each one, and then compare them to find the best one, the one with the minimum free energy. That usually gives us an idea of what a sequence could look like in the nucleus, how the secondary structure is formed, and what it might look like while carrying out its function.
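To make the dynamic programming idea concrete, here is a minimal sketch in Python. It is deliberately simplified and is not the method named in the talk, which names none: instead of a real free-energy model it uses the classic Nussinov recursion, which just maximizes the number of base pairs, as if every pair released the same amount of energy. Real minimum-free-energy tools use measured energy parameters. The function names, the `min_loop` parameter and the example sequence are assumptions made for illustration.

```python
# Allowed RNA base pairs, including the G-U wobble pair.
PAIRS = {"GC", "CG", "AU", "UA", "GU", "UG"}


def can_pair(a, b):
    return (a + b) in PAIRS


def fill_table(seq, min_loop=3):
    """dp[i][j] = maximum number of base pairs in seq[i..j] (inclusive).

    min_loop is the minimum number of unpaired bases inside a hairpin loop.
    Each entry is built from smaller subsequences, so every possible nested
    structure is considered without enumerating them one by one.
    """
    n = len(seq)
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with k
                if can_pair(seq[k], seq[j]):
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp


def traceback(seq, dp, min_loop=3):
    """Recover one optimal structure in dot-bracket notation."""
    structure = ["."] * len(seq)
    stack = [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        if dp[i][j] == dp[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if can_pair(seq[k], seq[j]):
                left = dp[i][k - 1] if k > i else 0
                if dp[i][j] == left + 1 + dp[k + 1][j - 1]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return "".join(structure)


def fold(seq):
    """Return (number of pairs, dot-bracket structure) for an RNA sequence."""
    seq = seq.upper()
    dp = fill_table(seq)
    pairs = dp[0][len(seq) - 1] if seq else 0
    return pairs, traceback(seq, dp)


# Invented toy sequence: a 3-pair stem closing a 3-base loop.
print(fold("GGGAAAUCC"))  # (3, '(((...)))')
```

The table has one entry per subsequence, and each entry needs one scan over possible pairing partners, so the work grows roughly with the cube of the sequence length rather than with the exponentially large number of possible structures.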

In reality, it is a bit different, because these non-coding RNAs also interact with proteins. However, in silico it is not possible nowadays to predict how a non-coding RNA, or an RNA in general, actually interacts with a protein, because we still know very little about the interactions of RNAs and proteins. Finding out how RNAs and proteins interact will definitely be a challenge for the future. Now we know more about how we try to find non-coding RNAs in the genome. It’s still quite hard, however, because what we can do so far is this: if we go into a lab and find a non-coding RNA in one organism, a transcript which is not translated, then it is possible to use a homology search, also considering the secondary structure, to find the corresponding non-coding RNAs in other organisms.

However, what happens if you do not know anything about these non-coding RNAs? This has come up in the last few months or years: we now know that there are also so-called long non-coding RNAs. Long non-coding RNAs have, by definition, a length of more than 200 nucleotides. They can range up to more than 10 kilobases, and they usually can include introns, which are spliced out. These long non-coding RNAs are sometimes antisense to the protein-coding genes they interact with, but they can also just act in cis somewhere in the genome. Long non-coding RNAs are hard to find, because their secondary structure is interrupted by the introns. So, finding them is an even harder challenge. What we usually do is go for high-throughput sequencing data: we try to find out what is transcribed at a certain time point in the cell, we sequence this, we map it back to the genome, and then we ask what parts of the genome are not protein-coding but still transcribed. Those might be non-coding RNAs, and after mapping these back we can nowadays find long non-coding RNAs as well. However, their function still remains unclear from a bioinformatical point of view, and then we have to go back into the lab and try to find a function.
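The last step, asking which parts of the genome are transcribed but not protein-coding, can be sketched as a simple interval subtraction. This is a toy illustration under strong assumptions, not a real pipeline: it takes already-mapped transcribed regions and annotated coding regions on a single chromosome as half-open `(start, end)` coordinates, ignores strand and splicing, and uses invented numbers. The function names and `MIN_LNCRNA_LENGTH` are made up for this example; only the 200-nucleotide threshold comes from the definition above.

```python
# Long non-coding RNAs are, by definition, longer than 200 nucleotides.
MIN_LNCRNA_LENGTH = 200


def subtract_intervals(regions, masks):
    """Remove every part of `regions` that overlaps an interval in `masks`.

    All intervals are half-open (start, end) pairs on the same chromosome.
    """
    masks = sorted(masks)
    result = []
    for start, end in sorted(regions):
        cursor = start
        for m_start, m_end in masks:
            if m_end <= cursor:
                continue  # mask lies entirely before the remaining part
            if m_start >= end:
                break  # masks are sorted, so no later mask can overlap
            if m_start > cursor:
                result.append((cursor, m_start))  # uncovered piece
            cursor = max(cursor, m_end)
            if cursor >= end:
                break
        if cursor < end:
            result.append((cursor, end))
    return result


def lncrna_candidates(transcribed, coding):
    """Transcribed, non-coding stretches longer than the lncRNA threshold."""
    return [
        (start, end)
        for start, end in subtract_intervals(transcribed, coding)
        if end - start > MIN_LNCRNA_LENGTH
    ]


# Invented coordinates, for illustration only.
transcribed = [(100, 900), (1000, 1100), (2000, 2600)]
coding = [(300, 500), (2500, 3000)]
print(lncrna_candidates(transcribed, coding))  # [(500, 900), (2000, 2500)]
```

Regions that survive this filter are only candidates. As described above, what they actually do still has to be worked out in the lab.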
