Genome Annotation

Bioinformatician Tim Hubbard on the process of identifying genes, protein genes and RNA genes, and how many genes there are in the human genome

videos | May 14, 2020

Genomes are really relatively new: we only had the first genome of a whole organism in 1995, and by 2000, we had the whole genome of the human. But the critical point if you want to use that genome is understanding what parts of that genome are functional and what they do. Of course, we’ve understood for a long time that genes are the building blocks of how a genome functions, but where are those genes? You have 3 billion letters in the human genome; if you just look, it’s just a long string of four letters of DNA, the four DNA bases. How do you work out where the genes are?

You might think that it’s possible to just look at the genome and spot where the genes are, and in fact, in bacteria, you can use computers directly to work out where the genes are because they’re in nice simple blocks: you have the start of a gene, you have the main part that specifies a protein, and then you have the stop, and you can just look for that pattern. But in vertebrates, and in almost all complicated animals, genes are not simple blocks; they’re split up into fragments, the exons, with introns as the space between them. In humans those spacing blocks can be very, very long, and so it becomes very difficult to spot reliably where the gene is.
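The bacterial "start, main part, stop" pattern described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it scans the forward strand only, treats ATG as the sole start signal, and uses the three standard stop codons. The function name `find_orfs`, the `min_codons` parameter, and the toy sequence are all made up for this example; real gene finders use far richer statistical models.

```python
# Toy scan for simple "start ... stop" blocks in a DNA string.
# Assumptions: forward strand only, ATG as the only start codon.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return sorted (start, end) index pairs of ATG...stop blocks."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):  # the three possible reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate gene
            elif start is not None and codon in STOP_CODONS:
                # keep it only if enough codons precede the stop
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


# Hypothetical toy sequence: two flanking bases, then ATG AAA CCC GGG TAA.
toy = "CC" + "ATGAAACCCGGGTAA" + "CC"
print(find_orfs(toy))
```

A scan like this breaks down as soon as the protein-coding part is interrupted by long introns, which is the vertebrate problem described above.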

So people have written programs to work out where those genes are, but they’re not very reliable. In actual fact, the only way to really do that, even now, twenty years after the human genome was sequenced, is to rely on looking at what’s actually in a cell. The manufacture of a gene product starts with the DNA, then you make a copy of the gene, which is RNA, and then you process that RNA, and then that RNA is translated into a protein. You can isolate those pieces of RNA from a cell, and then you know all the fragments that were specifying genes in that cell, and then you can align those back to the DNA and use that to work out where the genes are.
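As a rough sketch of "aligning RNA back to the DNA", the toy Python below places RNA fragments on a genome string by exact matching and merges overlapping hits into covered blocks. Everything here is an assumption for illustration: the function name `locate_fragments`, the made-up sequences, and above all the exact-match rule. Real data needs error-tolerant, splice-aware alignment, and a fragment that spans an exon junction would not match at all in this simplified version.

```python
# Toy placement of RNA fragments on a genome by exact string matching.
# Assumption: fragments are error-free and lie within a single exon.
def locate_fragments(genome, rna_fragments):
    """Return merged (start, end) genome intervals covered by the fragments."""
    genome = genome.upper()
    hits = []
    for frag in rna_fragments:
        dna_frag = frag.upper().replace("U", "T")  # RNA letters -> DNA letters
        if not dna_frag:
            continue
        pos = genome.find(dna_frag)
        while pos != -1:  # record every place the fragment occurs
            hits.append((pos, pos + len(dna_frag)))
            pos = genome.find(dna_frag, pos + 1)
    merged = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1]:
            # overlapping or touching hit: extend the current block
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical genome: flank, exon 1, intron, exon 2, flank.
genome = "GGG" + "ATGCCC" + "GTTTTTTAG" + "AAAGGGTGA" + "CCC"
fragments = ["AUGCCC", "AAAGGG", "GGGUGA"]
print(locate_fragments(genome, fragments))
```

The gap between the merged blocks is where the intron sits: the RNA evidence covers the exons, and the uncovered stretch between them is inferred to be spacing.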

So that’s really the only practical way of doing things right now: you need computers, but you’re basing your identification of where the genes are on those fragments of RNA. What are the problems with that? Firstly, there’s a lot of noise in a cell, so you get lots of incomplete RNAs, and you don’t necessarily get a very clear picture. Another problem is that in every cell type, a different set of genes is active. It means that there’s a different set of RNAs available in that cell, so no matter how many cells you look at, you haven’t got a complete set; you’ve only got a partial set.

So do we know where all the human genes are? The answer is we don’t entirely, because how do we know that we’ve looked at all the different cells that exist? A human body has 37 trillion cells; we don’t know how many different cell types there are, and even if you had all those cells, those cells have different genes active at any particular time.

At some of the early stages of development, we know there are genes active that aren’t seen at any other time. So when we rely on RNA, we’re always looking at a slightly incomplete picture as to where all the genes are. But we have made a lot of progress in using that data to identify where genes are. So there are fairly robust collections that have been generated over many years and are available in databases such as Ensembl and other genome browsers around the world, which allow you to go and look at a piece of a genome and see what genes are there in that genome.

There’s another problem with identifying where genes are, and that’s around what a gene actually does. For a long time, the view was that DNA makes RNA, and everything ends up being a protein. When we annotated where these genes were, we always tended to look for the part of the gene that was going to make the protein. But as we’ve learned from looking at the RNAs that we could find, there are many, many cases where there’s no protein. In fact, there’s now a whole class of RNA genes which have been identified, many of which have been given functions; people have identified that they actually do have some functional importance.

Before the human genome was sequenced, there was a guesstimate of 100,000 genes, and there was a competition to guess how many genes there might be once we’d sequenced the whole human genome. After we first analyzed it, the estimate was only around 30,000 genes, and that progressively got smaller and came down to just under 20,000 genes, but that counts only genes that make a protein. We now realize that there may be as many as another 20,000 genes or maybe more, genes that don’t make a protein, that just make a piece of RNA, and the RNA itself has some functional significance in the cell.

How do you know if a gene makes a protein or not? You can look for a pattern which specifies the way that you translate RNA (which is four letters) into protein (which is 20 amino acids); you can look for patterns of groups of three in the sequence. But that’s kind of a statistical way of checking the sequence. It may be that there are some very, very short proteins. If you look at all the expressed RNAs, you will find lots of possibilities for making very short proteins, but most of those just wouldn’t be real. So, in fact, at the moment, because our ability to predict is so bad, we need to actually have some other experimental data to know if there’s really a protein there.
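The "groups of three" rule can be made concrete with the standard genetic code, in which each triplet of RNA letters (a codon) specifies one amino acid or a stop. The Python sketch below builds that table and translates a toy RNA from its first AUG to the first in-frame stop. The function name `translate` and the example sequence are invented for illustration, and treating the first AUG as the start is a simplifying assumption.

```python
from itertools import product

# Standard genetic code, codons ordered with bases U, C, A, G; '*' marks a stop.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(rna):
    """Translate from the first AUG to the first in-frame stop codon.

    Returns '' if there is no AUG. Unknown codons are written as 'X'.
    """
    rna = rna.upper()
    start = rna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(rna) - 2, 3):  # step through groups of three
        aa = CODON_TABLE.get(rna[i:i + 3], "X")
        if aa == "*":  # stop codon ends the protein
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical toy RNA: AUG AAA CCC GGG UAA with two flanking bases each side.
print(translate("CCAUGAAACCCGGGUAACC"))
```

Because a check like this will happily produce a short "protein" from almost any RNA that happens to contain an AUG, a successful translation on its own says little about whether the protein is real, which is the point made above.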

The people who’ve been working on gene annotation are progressively now looking at mass spectrometry data as well because that shows when a protein is really there. The annotators have been going back and using that data to correct some of the annotations, remove some cases where we thought there was a protein but actually we haven’t found any evidence of the protein, or cases where it looked like it was just an RNA gene but actually, maybe there’s a small protein that’s being made. So this business of annotating genes is an ongoing process because until we can really computationally process the whole genome and work out where everything is by direct methods, we’re going to have to rely on these experimental data sets of RNA and protein to do this annotation. Still, those data sets will always be incomplete because they come from particular cells making particular sets of proteins and RNA.

I think the main open questions for this area of gene annotation are around how many genes there are and what these RNA genes do. It’s clear that some of them are functional, but it’s not clear how many of them are functional, how many may be just noise, and what the cell generally uses them for.

It’s relatively new; we’ve only known that they existed on a large scale over the last five or ten years. Of course, all of this will link back to epigenetics, which will give us a better picture of what molecules bind to DNA and regulate how these processes of manufacture of RNA get started.

So, the future directions for genome annotation are around these RNA genes, what they do, and how many of them there are. We know functions for some, but in many cases, we don’t. In fact, this project of working out all the human genes is going to go on for a very long time. Maybe we’ve found most of the protein-coding genes, but we know that there are alternative forms of those genes, produced by alternative splicing, and the more cell types we collect data from, the more we discover new alternative forms. So there’s a whole community; there’s a project called GENCODE, which manages this annotation and will be busy annotating the human genome for many years to come.
