What an exciting day we're having. What I'd like to tell you about is one of my favorite recreations, which is snooping in the genetic databases. It's possible to snoop in the genetic databases because they're public access, and they're public access because they're tax-supported. So anybody can do this, and feel free to try it if you want to.
New genomes are being sequenced all the time. There are hundreds of new genomes, and they usually get sequenced to about the scaffold stage, which is when the fragments that have been sequenced are gradually being assembled into larger and larger pieces. At about that time, the genome will get published even though it isn't fully annotated. Not all the genes have been identified yet, and annotation can take years.
This is the Genome Data Viewer, and this is just a small subset of the species that have been annotated so far. I just want to show you a couple of lizards. Here are two groups of lizards: there's this branch and this branch. On one branch we have the common lizard along with skinks and geckos. On the other branch we have this green anole, plus monitors, iguanids, and chameleons. Snakes are also on this branch.
This is what it looks like when the annotations are complete for a particular species. You'll notice that these two kinds of lizards have different chromosome sets. Here you have what looks like a regular set of chromosomes. Over here with the anole you've got a bunch of teeny tiny chromosomes, and this is typical of what you might see in a bird. So it's a different sort of chromosomal arrangement, and we're probably going to find that the monitor has a set of chromosomes similar to what we're seeing here.
So this is my snoopee. This is Varanus komodoensis, the Komodo dragon or Komodo monitor, and it's quite a big lizard, usually over 10 feet long as an adult.
If he were sticking his tongue out, you'd see it's forked like a snake's. The Komodo genome was first reported in 2019 by Lind et al., and it's still being annotated. The genome is about half the size of the human genome and contains about 1.5 gigabases, or 1.5 billion bases, of DNA.
When a new genome gets sequenced, I like to look at the scaffolds for various genes just to see what they look like. As I mentioned, most of the databases are public, so you can do it too. It's really a lot of fun. It's kind of like finding your great-grandmother's love letters in the attic.
This is the gene that I was interested in looking for. It's the amelogenin gene, which encodes a protein involved in the mineralization of tooth enamel. It's a gene that's in every vertebrate with teeth, and in some, like birds, that don't have teeth anymore. It's an interesting gene because it's very highly conserved, meaning that it doesn't change much between species, which makes it relatively easy to identify. It's also interesting because in humans it's on the X chromosome, and it's embedded inside another gene called ARHGAP6, which has nothing to do with teeth at all.
Most genes outside of bacteria are broken up into pieces called exons, which are separated by non-coding regions called introns. The amelogenin gene is in the first intron of ARHGAP6, between ARHGAP6 exons 1 and 2.
This is the location of the human gene. It's on the X chromosome, as I said, and it's in a part of the X chromosome called the pseudo-autosomal region, which means that it's a little piece of the X chromosome that matches a little piece of the Y chromosome. The X and the Y chromosomes have to be able to recognize each other during gamete formation, so they have little pieces that do match, even though most of the X chromosome is not represented at all on the Y. The human amelogenin gene has seven exons, but only the last five of them are used for coding the protein.
This is the amino acid sequence of the protein. Each of these letters represents a different amino acid, and four of the amino acids are present at fairly high frequencies: proline, leucine, histidine, and glutamine. Proline is P and glutamine is Q, and those two especially are present at very high frequencies in this protein.
So how do you find a gene if you don't know where it is? The nice thing is that genes are older than species. I mentioned before that amelogenin is a common gene in the vertebrates. It was invented, that is, it was first produced in the genome, in the ancestor of the vertebrates, and once a gene gets invented it tends to stay in all the descendants of whoever it was that invented it. So to find a gene in a newly sequenced genome, you start with a gene that you know about from the genomes of related organisms.
One of my students [name unclear] did a similar search for amelogenin in the anole genome, which was the first reptilian genome to be sequenced, in 2011, and he started with the human sequence. But since we now know what the sequence looks like in another lizard, the anole, I can use that one to look for the amelogenin gene in Varanus.
The anole genome has a few more bases in it than the Varanus genome, so it's a little bit bigger. Anoles are similar to mammals in having X and Y chromosomes, with the males having the X and the Y, but the amelogenin gene in the anole is not on the anole X chromosome. It's on anole chromosome 3. That chromosome contains a lot of other markers found in the pseudo-autosomal region of the human X, so the anole X and Y are different chromosomes from the human X and Y.
These are some of the databases that I used in looking for the amelogenin gene, and they're all public access, so any of you can look at these too.
The NCBI, the National Center for Biotechnology Information, has a search tool called BLAST, for Basic Local Alignment Search Tool, and that's what I used to find this gene.
This is just a summary of how I found the gene in the Komodo scaffold. I used the anole messenger RNA as a query sequence to search in the Komodo genome, and the database that I searched was the Varanus komodoensis whole genome shotgun scaffolds. The scaffolds, remember, are segments that are composed of overlapping sequence but not yet identified with specific chromosomes. I found all of the anole exons, or at least exons 3 through 7, which have the coding sequence, in a single scaffold, SLA01. I was able to identify each of the Komodo exons by comparing them to the anole exons, and then I translated it to get the protein sequence.
This is the anole messenger RNA. It has 7 exons. The exons are in alternating colors here so that you can see where the breaks are. There's a break between exons 1 and 2 right here. But the coding sequence doesn't start until the third exon. This is the start codon, where the first amino acid is. The last amino acid is in the seventh exon here, and this is the stop codon. So the stop codon comes almost immediately after you get into the seventh exon.
This just shows what the search looks like. This is a little piece of scaffold SLA01, and this is the match with anole exon 6. The scaffold is a pretty good-sized one. It's just a little bit smaller than the human X chromosome, so it's got 138 million bases in it. And the exon identity is about 78%.
This shows you how I was able to identify the boundaries of the other exons. Fortunately, introns all begin and end with the same two bases: GT at the beginning of an intron, which you can see right after the end of this exon 4, and AG at the end of an intron, which you can see right before the beginning of this exon 5.
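The GT...AG rule for intron boundaries can be checked mechanically once you have candidate exon coordinates. What follows is only a minimal sketch: the sequence is a made-up toy, not Komodo data, and the function names and the 0-based, end-exclusive coordinate convention are assumptions chosen for illustration.

```python
def intron_has_canonical_ends(intron: str) -> bool:
    """Return True if the intron starts with GT and ends with AG."""
    seq = intron.upper()
    return len(seq) >= 4 and seq.startswith("GT") and seq.endswith("AG")


def introns_between(genomic: str, exons: list[tuple[int, int]]) -> list[str]:
    """Return the sequence lying between each pair of consecutive exons.

    Exons are (start, end) pairs, 0-based and end-exclusive, sorted by position.
    """
    return [
        genomic[prev_end:next_start]
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
    ]


def splice(genomic: str, exons: list[tuple[int, int]]) -> str:
    """Join the exon sequences in order, dropping the introns."""
    return "".join(genomic[start:end] for start, end in exons)


# Toy example (hypothetical sequence): two exons separated by one intron.
genomic = "ATGGCC" + "GTAAGTTTTCAG" + "CCTTAA"
exons = [(0, 6), (18, 24)]

for intron in introns_between(genomic, exons):
    print(intron, intron_has_canonical_ends(intron))
print(splice(genomic, exons))
```

If a proposed boundary leaves an intron that does not start with GT and end with AG, that is a hint to shift the boundary by a base or two and look again.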
And then these two exons go together. So I located all the exons and then put them together to get the coding sequence for the amelogenin protein. There were two possible start codons, here and here, and you'll see later that the second one turns out to be the real start codon.
Then I translated the sequence using the Transeq software from EMBL, and this is the translation that I got for the amelogenin messenger RNA. As I mentioned, this is probably the real beginning of the sequence. If we take these first three off, this has 191 amino acids, with the same four amino acids at very high frequency that we saw in the human protein.
Then I compared this with several other amelogenin proteins from different species using the DNASTAR Lasergene MegAlign software. First I compared the sequences of the anole, the monitor sequence that I had translated, and the human. I lined those up and compared them, and later I also compared an additional sequence from the horse.
This is what the alignment looks like for the anole, human, and Komodo amelogenins. As you would expect, the Komodo and the anole sequences are much more closely related to each other than the human sequence is to either of them.
These are the amelogenin sequences from monitor, anole, human, and horse. If you just eyeball these, you can see that human and horse, for example, both have this PPHVGH sequence that starts right here, and that the anole and the monitor have a slightly different one, PGHVGY. So those two are similar, and these two are similar. And notice that they all end in EEVD: glutamate, glutamate, valine, aspartate.
This shows the comparison of those four species. Here's human and horse, and here's Komodo and anole. It's interesting, I think, that the human and the horse are actually more closely related than the two lizards are. The human and the horse sequences differ by about 8 amino acids.
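Counting differences between two aligned protein sequences is simple enough to sketch by hand. The snippet below is an illustration only: it assumes the sequences have already been aligned to equal length with "-" for gaps, the function name is hypothetical, and the inputs are just the two short motifs quoted in the talk, not the full amelogenin sequences.

```python
def count_differences(a: str, b: str) -> tuple[int, float]:
    """Count mismatched columns between two aligned sequences.

    Returns (number of differing columns, percent of compared columns).
    Columns where both sequences have a gap ('-') are skipped; a gap
    against a residue counts as a difference.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differences = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x != y:
            differences += 1
    percent = 100.0 * differences / compared if compared else 0.0
    return differences, percent


# The mammalian motif versus the lizard motif mentioned in the talk.
diffs, pct = count_differences("PPHVGH", "PGHVGY")
print(diffs, round(pct, 1))
```

Dedicated alignment software does the hard part, which is deciding where the gaps go. Once the alignment exists, the percentages are just this kind of column-by-column tally.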
The Komodo and the anole sequences differ by about 15 amino acids, or 15% of their amino acids. And the lizard and mammalian sequences differ by about 38% of their amino acids.
I did this adventure about a year ago, and in the meantime they've been working on the annotation of the Komodo sequence, so I thought, well, let's just look at it and see how it compares with what I figured out here. This is the amino acid sequence for Varanus komodoensis in their annotation, and it's much, much bigger. The translation starts here, and it ends down here. So it's about three times the size of the sequence that I figured out, but the one I figured out is right here in the middle. This is basically the sequence that I figured out, except that this D is not in it. In their annotation, it goes from this V to this V, and I actually think they may have an incorrect annotation for exon 7.
The other question that I was interested in is whether amelogenin is also embedded in the ARHGAP6 gene in the monitors. So I used the anole ARHGAP6 mRNA as the query sequence to compare with the Komodo scaffold, and I found all 13 of the exons from the anole messenger RNA in the same scaffold where I found the amelogenin. This is one of the larger exons, and this shows the comparison between them. The monitor is about 86% identical to the anole sequence.
This is just a list of all of the exons that I found in that region, and the important thing about it is that they're in order. If you look at exon 8 or exon 9 or exon 10, they're lined up in the same numerical sequence. And exon 1 I found on the other side of amelogenin, which is right where it is in humans, so that looked interesting. Then I compared that with the ARHGAP6 sequence from the annotation, and there are a few differences.
First off, I was not able to identify exons 1 and 2 in the monitor scaffold, and the reason for that is that sequences that don't get translated into protein aren't under any kind of restriction by selection. They can change without killing the animal that they're in. And so they apparently have changed enough that I could not identify them. But they are annotated here, and these exons 1 and 2 are in reasonable locations relative to the rest of the exons.
The other difference is in the annotated exons 1, 2, and 3. Their exon 1 is not where my exon 1 is, and they also have two additional exons that don't match mine. So their exon 4 matches my exon 3, and then all the other exons between exon 3 and exon 13 match in both sequences.
I translated both of those, their protein and my protein. My protein has the exon 1 product here and the exon 2 product here, and exon 3 starts with these four amino acids, DGQK. The annotated sequence has three exons, a teeny tiny exon 1, their exon 2, and then a third exon, before we get to the place where it matches all the way down. So this is their fourth exon, and my third starts with those same four amino acids.
I took those two and matched them with other lizard proteins. These are the matches that I got with my ARHGAP6 sequence. With the anole, which you can see sticking his forked tongue out here, it doesn't match until monitor residue number 47, which is that DGQK. But I did find matches to exons 1 and 2 in two other lizards, or at least these are the two that I picked up. Exon 1 is here, exon 2 starts about here, and then exon 2 ends about here and goes into, let's see, where is it, well, it's in here somewhere, the third exon. And the same thing is true in this lizard. This is the viviparous lizard, Zootoca vivipara, and this is the green anole.
So I did find matches with exons 1 and 2 in two other lizards. When I tried that with the Varanus annotation, I did not find any matches to any lizards until this part where they come together. So they start at the DGQK. So I have a lot of faith in the annotation that I put together.
One thing I need to say here is that when these annotations are first put together, they're put together with a computer algorithm. I may be the first human to have actually looked at this annotation. It probably wasn't done by a human, it was probably done by an algorithm, and that is probably what accounts for the differences between them. Nevertheless, I'm going to write to NCBI and offer them my version of it.
These are just some references that you might find useful if you want to try this yourself with any gene that you like. As I said, the databases are all perfectly open, so you can do this. And it's a lot of fun.
So are there any questions? I'm sure there are, because I have not been keeping up with the chat.
I see Stephen had a question about whether they started with an mRNA. They start by looking for what are called open reading frames, that is, bits of the DNA that can be translated continuously. So they look for large open reading frames, and then they look for the signatures of the introns. That's how they identify the exons, and that's what the algorithm looks for. So they probably didn't start, as I did, with the messenger RNA from something else.
Well, I just do it for fun, because I like looking at new sequences. And it can be difficult to identify small exons. In fact, exon 1 in ARHGAP6 is actually not as highly conserved as the other exons.
Yes, it probably is valuable to look at the 3D structure, but those are not so easy to predict. They're still working on algorithms to try to figure out how to do that.
I'll just write them a note and suggest the alternative.
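The idea of an open reading frame mentioned in that answer, a stretch of DNA that can be translated continuously, can be illustrated with a very small scan. This is a toy sketch and not how the NCBI annotation pipeline works: it looks only at the forward strand, treats an ORF as ATG through the next in-frame stop codon, and ignores introns entirely. The function name, the `min_codons` parameter, and the example sequence are all made up for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Find forward-strand ORFs as (start, end) pairs, 0-based, end-exclusive.

    An ORF here runs from an ATG to the first in-frame stop codon
    (stop included). min_codons counts the codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens the reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ATG in this frame
    return orfs


# Hypothetical toy sequence with one short ORF in the third frame.
dna = "CCATGGCCCCTTAAGG"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
```

In a real eukaryotic gene the coding sequence is interrupted by introns, which is why an ORF scan alone is not enough and the algorithms also look for the intron signatures.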
I did it once before. Once upon a time, I gave my students a project with cat beta-globins. I was looking at the cat beta-globins, and I looked at the lion globin, and it didn't look right. It looked more like a primate globin than it did like a cat globin. So I wrote them a note and I said, you know, I think maybe you've got the wrong sequence for the beta-globin in the lion, so would you check on it? So they did. They checked with the people that turned it in, and they found out that it was just a mistake in the database. So there are probably other errors in the databases that people will pick up.
It is a lot of fun. It really is, because there are always several new sequences popping up in the literature every year, and you can play with them.
Oh, it's a great undergraduate project, Stephen. Yes, I always have them do something like this. I had another student who was looking at the color vision receptors in lizards, and they're quite different from species to species. I mean, they're all identifiable, but they're not all the same. The way that the color vision receptors are divvied up is not quite the same in all species.
Yes, varagon, that's the general rule, with the water-loving on the outside and the water-hating on the inside. But there are a lot of other things that contribute to that. So protein folding really is quite complicated. Yeah, the disulfide bonds from cysteines are good clues, too.
Origami is exactly the way I think about it, cisigie. In fact, I sometimes tell the students that this is like origami, and what we're trying to find out is what the folding rules are. I don't think I've ever seen a protein that looked like a swan, but I wouldn't be surprised if there was one.
That's great in Europe. I'm just going to go back through the chat and see if I've missed any... Well, I know I missed some questions. I do sometimes look for words in the amino acid sequences.
I looked for some words in titin, and I couldn't find my own name, but I did find Elvis.
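Looking for words in a protein is just a substring search over the one-letter amino acid codes. Here is a small sketch: the protein string is a made-up toy, not the real titin sequence, and the function names are hypothetical. Only the 20 standard amino acid letters are considered, so a word containing B, J, O, U, X, or Z cannot be spelled this way.

```python
# One-letter codes for the 20 standard amino acids.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def can_be_spelled(word: str) -> bool:
    """Return True if every letter of the word is a standard amino acid code."""
    return all(letter in STANDARD_AMINO_ACIDS for letter in word.upper())


def find_word(protein: str, word: str) -> list[int]:
    """Return every 0-based position where the word occurs in the protein."""
    protein, word = protein.upper(), word.upper()
    positions = []
    index = protein.find(word)
    while index != -1:
        positions.append(index)
        index = protein.find(word, index + 1)
    return positions


# Hypothetical toy protein, invented for this example.
toy_protein = "MKTELVISPQELVIS"
print(can_be_spelled("ELVIS"), find_word(toy_protein, "ELVIS"))
print(can_be_spelled("BOB"))
```

To try this on a real protein, paste in a sequence downloaded from one of the public databases in place of the toy string.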