So, we're going to start this talk with a short outline: a biology review to make sure everyone's on the same page, don't want to leave anybody behind, followed by what the hell is the second genetic code, a little bit of tutorial on information theory, how we use information theory to hack the second genetic code, and then some potential applications. Since this talk's a little biology-heavy, maybe, for a biohacking village, we'll have this review. I'm guessing most people here know that cells have a genome, which contains genes on chromosomes, and these genes are made of As, Ts, Cs, and Gs. This is essentially the genetic code for all the proteins encoded in the cell and all the functions encoded in the cell. So how does the cell actually read this? Transcription reads a gene from the DNA strand and makes a copy of it in RNA, mainly messenger RNA. This messenger RNA goes to the ribosome, and the ribosome is the transfer point between the nucleotides, the genome, and actual proteins. The ribosome reads the mRNA and brings in tRNAs, and the tRNAs recognize the mRNA through complementary triplets, so A to U, G to C, and each of those tRNAs carries an amino acid and adds it to the growing protein chain. So that's our basic review. And most people remember something like a translation table from high school biology, right, the genetic code table. What's interesting to focus on here is that over about 3 billion years of evolution, there might be four changes to this genetic code table. So this table is pretty static, and static is pretty boring, at least for biologists. So we're going to ignore it, and we're actually going to focus on what we call the second genetic code, all right? And the second genetic code essentially lives in the tRNAs.
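Since that translation table really is just a lookup table, here's a toy sketch of it in Python. The codon-to-amino-acid assignments shown are the standard ones, but the table is deliberately incomplete, just a handful of the 64 codons, so treat this as an illustration of the idea rather than a working translator.

```python
# A few entries from the standard genetic code table (mRNA codons).
# Deliberately incomplete: only a handful of the 64 codons are shown.
CODON_TABLE = {
    "AUG": "Met",  # start codon
    "UUU": "Phe", "UUC": "Phe",
    "GCU": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}

def translate(mrna):
    """Read an mRNA string codon by codon until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE.get(mrna[i:i + 3], "?")
        if aa == "Stop":
            break
        protein.append(aa)
    return protein

print(translate("AUGUUUGCAUAA"))  # ['Met', 'Phe', 'Ala']
```

The point of the slide is that this table has barely changed in billions of years; the interesting variability lives one layer down, in how the tRNAs themselves get charged.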
So tRNAs are just RNA molecules; they're genes that are transcribed in exactly the same way as any other gene, but they don't get translated into proteins, they stay as RNA. And these tRNAs are essentially the link between the genome and proteins. The tRNAs have this tertiary 3D structure that we see right here. All tRNAs have this structure, and they have this secondary structure, which is base pairing. The tRNA recognizes the mRNA with this anticodon sequence, and they actually have an amino acid attached right here, all right? And this is going to become important later on. And for anyone who's not that great at protein chemistry, we have about 20 amino acids that make up all our proteins. So the big question, again, with the second genetic code is: how do tRNAs get charged with the correct amino acids? These tRNAs are encoded in the genome, and they don't actually carry an amino acid at first. In red is the actual tRNA, and in the other color is the protein that charges the tRNA with the correct amino acid. If a tRNA gets charged with an incorrect amino acid, things go wrong; you probably die, right? So tRNAs contain a set of positive identity features which promote the right interaction, so they get charged with the right amino acid. And they also have a set of negative identity features which keep them from getting charged with an inappropriate one. What we end up with is what we've termed the tRNA interaction network. The tRNAs are constrained in structure, because they have to fit inside that ribosome, so they can't diverge. And the easiest thing for a molecule in the cell to do, if it doesn't want to get charged wrong or have the wrong interaction, is just change shape. Well, tRNAs can't change shape, otherwise they don't work in translation. So we have what we call this drive to conformity to maintain this shape right here.
But at the same time, they have to be charged with the correct amino acid, and there are 20 different amino acids, so they have to have some kind of identity features that diverge from each other. And this is the interaction network that dictates the second genetic code: what features let a tRNA get charged with the correct amino acid? So when we're talking about these features, what we have here is the secondary structure of a tRNA, which maybe you've seen if you've taken college intro biology. This is essentially just the base pairing of a tRNA. And the features we're talking about are having an A, G, C, or U at certain structural positions within the tRNA. Having an A at position one or a G at position one actually gives us a little clue about the function of that tRNA, that is, what amino acid it carries. So an example here: having an A at this position right here gives us this probability distribution, and we can see pretty clearly that most of the probability goes into an amino acid that we've labeled D. So if you have an A at this position, you most likely carry the amino acid D. And that's a pretty clear pattern. But at the same time, we have more convoluted patterns. Here, having an A at this position, again most of the probability is in carrying amino acid A, but there's a non-zero probability of having other amino acids charged there. So the picture is a little more convoluted and hard to understand. And then we incorporate the fact that different nucleotides at different positions give us different information about what amino acid you carry. Having a G at the same position here actually tells us you're probably going to carry an E amino acid. And over here we get another convoluted probability distribution, which is a little harder to interpret.
And this is going to be difficult for us to interpret just by looking, because there are about 73 nucleotides in a tRNA with four possible states each, so we have 4^73 possible combinations, which is not something we can actually interpret very easily, right? So how do we go about making sense of this? Well, back in the '80s people turned to information theory, which helps us deal with this, and it derives from regular information theory. Anyone who deals with encoding or cryptography is probably pretty familiar with information theory, right? Proposed by Shannon in 1948: he defined a measure he called entropy, and the actual equation for entropy, H = -Σᵢ pᵢ log₂ pᵢ, is this lovely equation, which most people don't enjoy looking at to begin with. So the best way to think about entropy, and actually use it, is to go through a simulation, one we've seen in beginning programming classes all the time when teaching random number generators. We roll a fair six-sided die 24 times, and we're going to pretend we don't know how to calculate the probabilities for a die; we want to do a simulation to figure them out. So we roll that die 24 times, and we get something that might look like this, right? And the next thing we do is count up the number of times each number occurs. So we do that very quickly, and we get almost equal probability for each face, represented by these bars here. But since it's a small simulation, it's not perfect, right? Reality is one-sixth for each side if it's a fair die; if we could just run this simulation to infinity, we'd get that proper probability, right? The question is, how do we describe the uncertainty we have in the outcome of this die roll? And entropy allows us to do exactly that. So we have our fair distribution here, and we actually go through the entropy calculation, which I've done for you, so no one has to do logs in their head.
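That counting simulation can be sketched in a few lines of Python; the seed and roll count are arbitrary choices here, just so a toy run is reproducible, and this is my reconstruction of the classroom exercise rather than the speaker's slide code.

```python
import random
from collections import Counter

random.seed(42)  # arbitrary seed so the toy run is reproducible

# Roll a fair six-sided die 24 times and tally the outcomes
rolls = [random.randint(1, 6) for _ in range(24)]
counts = Counter(rolls)

for face in range(1, 7):
    observed = counts[face] / 24
    print(f"face {face}: {counts[face]:2d} rolls, "
          f"frequency {observed:.3f} (true 1/6 = 0.167)")
```

With only 24 rolls the observed frequencies wobble around 1/6, which is exactly the "small simulation isn't perfect" point above; cranking up the roll count pulls them toward the true probability.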
So we run this real fast, and we get 2.58 bits as our entropy. And if you think about it, if we need six unique outcomes, we need three bits to encode them, right? So we can bring it right back to computer science and the same coding ideas. The uncertainty we have in the outcome of this die roll is measured as 2.58 bits. So how does this quantity change when we change the uncertainty? If we roll a biased die, where 40% of the probability sits on rolling a one, what does that do to our entropy calculation? Most cryptographers, anyone familiar with this, know entropy is going to go down, right? So we run our calculation again real fast, and we go down to 2.36 bits from the old 2.58 bits. So rolling a biased die, we're a little less surprised by the outcome, right? We're a little less uncertain about it. The big question is how we actually apply this to biology and extract information from genetic sequences. We do that with what we call molecular information theory. We often assume that the background distribution of nucleotides in the genome is even, so there's an equal amount of Ts, As, Cs, and Gs, which gives us what we call maximum entropy. We calculate that: it's two bits. Again, if we need to encode four unique states, we need two bits to do that, right? It runs right back to the same encoding theory. So how do we apply this idea beyond just calculating it? What we usually do is take portions of the genome, say a gene or a promoter we're interested in, and ask how variable, or how informative, the positions in that gene are. So we have an example here where I've made a fake alignment of maybe four organisms for a promoter, and we actually start calculating the entropy per position. Down here are the probability distributions that we actually observed, and we'll go through this step by step.
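The two entropy numbers above can be checked directly from Shannon's formula. Here's a minimal sketch; the biased distribution assumes the remaining 60% is spread evenly over the other five faces, which matches the numbers quoted in the talk.

```python
from math import log2

def entropy(probs):
    """Shannon entropy H = -sum(p * log2(p)), in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

fair = [1 / 6] * 6
# Biased die: 40% on one face, remaining 60% spread over the other five
biased = [0.4] + [0.6 / 5] * 5

print(round(entropy(fair), 2))    # 2.58 bits
print(round(entropy(biased), 2))  # 2.36 bits
```

Concentrating probability on one face lowers the entropy, which is the "less surprised by the outcome" intuition made quantitative.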
So the quantity here is just our maximum entropy, assuming the background is equal frequency, minus the entropy of the actual position in the alignment, and that is what molecular information theory people term information. So we go through our calculation. At the first position we have an equal distribution, so we have maximum entropy again, two bits. Two bits maximum minus two bits equals zero bits, and we form these little figures, which we call sequence logos, where at position one we have zero bits of information, so nothing appears. We essentially go and do this again for each position of the alignment. Here we have all As, so we're not uncertain about that position at all; there's zero entropy at that position, and two minus zero equals two bits, so our figure reflects that: position two goes to two bits, and because it's only an A, we draw an A. We could go through this calculation for the whole alignment, but that's something most people can probably do at home at this point. So how do we actually use information theory to hack the second genetic code? Just a quick review: remember, we have these features or states around the tRNA that dictate function, and these are probability distributions. Now we know how to deal with probability distributions and calculate uncertainty based on certain features or states, just as we did before with positions. Here we're doing it with sequence alignments of tRNAs. What we do is align the tRNAs all together, and we label each tRNA with the amino acid it carries. These carry alanine, these carry valine; it doesn't really matter that much, as long as we label them correctly. We produce the alignments, and we do the same exact calculation: maximum entropy, which in this case is set by the number of amino acid classes (about 4.3 bits for 20 amino acids), minus the entropy for a certain state and position.
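The per-position calculation behind a sequence logo can be sketched directly; this toy alignment mirrors the example above (an equal-distribution position giving zero bits, an all-A position giving two bits). The function name and the four made-up sequences are mine, not from the talk.

```python
from math import log2
from collections import Counter

def information_per_position(alignment):
    """Per-position information R_i = H_max - H_i, with H_max = 2 bits
    for four equally likely background nucleotides."""
    n = len(alignment)
    info = []
    for i in range(len(alignment[0])):
        counts = Counter(seq[i] for seq in alignment)
        h = -sum((c / n) * log2(c / n) for c in counts.values())
        info.append(2.0 - h)
    return info

# Toy "alignment" of four promoter sequences:
# position 1 has all four bases (0 bits), position 2 is all A (2 bits)
alignment = ["AAGT",
             "CAGT",
             "GAGT",
             "TAGT"]
print(information_per_position(alignment))  # [0.0, 2.0, 2.0, 2.0]
```

A real logo drawer then stacks letters at each position, scaled so the stack height equals the information and each letter's share matches its observed frequency.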
So if we go through the calculation really quickly: if we look at position one having an A, we realize that if you have an A in position one, you're only an alanine. Here that's about four bits of information, which is reflected in this graph. We're not uncertain at all: if we know you have an A in position one, you're an alanine. Same game with the C: only valine tRNAs have a C in position one, so we essentially have maximum information at position one if there's a C. So we essentially start to be able to decode the second genetic code: what features dictate what amino acid you get charged with? Eventually, when we actually run the program that I've created, you get these maps, which look confusing but are a little better than looking at 4^73 possible combinations, most of which don't matter. We produce one map each for A, C, G, and U, since we're working with RNA. And this essentially produces our structure-function map: for structural position one having an A, we know what function that gives us, or how confident we are about what function that gives us. So we can actually use these function maps for applications. Before moving on, we'll go over the software real quick, in case anyone wants to actually do this at home. The first step is to identify the tRNA genes in the genome, for which we use a program called tRNAscan-SE, which will mine the genome for tRNA genes. Then we do a secondary structure alignment, essentially mapping to that cloverleaf shape I showed previously, using a program called Cove. And if anyone's done any kind of natural language processing: it uses a stochastic context-free grammar model to do that. And then for the information statistics and the graphics, we use a program which I've named BP Logo Fun, but it needs a brand-new name, hopefully a recursive acronym, so I can stay with the theme.
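The tRNA version of the calculation above can be sketched the same way: condition on a feature ("nucleotide X at position i") and ask how much it narrows down the amino acid. This is my reconstruction of the idea, not the actual BP Logo Fun code; the function name and the four tiny labeled "tRNAs" are invented for illustration.

```python
from math import log2
from collections import Counter

N_AMINO_ACIDS = 20
H_MAX = log2(N_AMINO_ACIDS)  # ~4.32 bits, "about four bits" in the talk

def feature_information(trnas, pos, nucleotide):
    """Information that the feature 'nucleotide at pos' carries about
    the amino acid label: H_max minus the conditional entropy of the
    labels among tRNAs carrying that feature."""
    labels = [aa for seq, aa in trnas if seq[pos] == nucleotide]
    if not labels:
        return 0.0  # feature never observed; no information to report
    counts = Counter(labels)
    n = len(labels)
    h = -sum((c / n) * log2(c / n) for c in counts.values())
    return H_MAX - h

# Toy labeled alignment: A at position 1 only in alanine tRNAs,
# C at position 1 only in valine tRNAs
trnas = [("AGGC", "Ala"), ("AGGU", "Ala"), ("CGGC", "Val"), ("CGGA", "Val")]
print(round(feature_information(trnas, 0, "A"), 2))  # 4.32 — fully informative
print(round(feature_information(trnas, 0, "C"), 2))  # 4.32 — fully informative
```

When the conditional distribution is convoluted instead of concentrated on one amino acid, the conditional entropy grows and the reported information shrinks, which is exactly what the confusing regions of those maps are showing.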
This software is not peer-reviewed yet, but it's available on my GitHub and will probably be published in a scientific journal by the end of the year. But we're pretty sure it's correct. So now we actually move on to the potential of hacking the second genetic code. Why do this besides just figuring out how things work, right? The big one that we actually have data for, showing that it can happen and works, which we've developed in the lab I'm working in, is combating parasite infections. Approximately 20 million people have a parasite infection across the world at any given time, which is a lot of people, and more than a million die annually from parasite infections, which is terrible. This mainly happens in poor nations that don't have the public health infrastructure to clean up water and food sources: Africa, Asia, Latin America. They already have limited access to medical care, and they also have a lot of parasites. The big issue is that developing effective treatments for parasites has been difficult. The previous talk up here about cancer, where you're essentially trying to kill the cancer before it kills you, described treatments pretty similar to the ones for parasites: they try to kill the parasite with drugs before it kills you. And they haven't been able to develop any new ones. It's different from antibiotics, where the antibiotics attack a certain feature in bacteria that we don't have. Parasites have really similar biology to us, and there's not much of a target or surface area to attack in a parasite that doesn't harm us. So we wanted to ask: do these tRNA interaction networks look different between the parasite and the human? This graph looks a little complicated, and we used it to apply for a grant, but each of these colored circles is an area where the tRNA interaction network in the parasite significantly differs from the tRNA interaction network in the human.
And these are possible areas to attack in the parasite that will not affect the human. And you go, well, that's cool, but we've actually teamed up with some chemists at Ohio State; they ran small-molecule analysis, and we found about five potential drugs that affect the tRNA interaction network within the parasite and do not interfere with the tRNA interaction network of the human. And if you interrupt that tRNA interaction network in the parasite, it can't create proteins properly, and eventually it dies. So this is what we wrote the grant for; it will probably move to animal phases eventually, but that's a real-world application that we actually have real data for. Moving on to something that we think might be another potential application, but don't have real data for yet: invasive species. As biologists, we think about this all the time. In the US it costs us more than $120 billion a year, mainly lost from insect pests and fungal pathogens attacking our crop species. And again, a lot of these insects and fungal pathogens are very similar to the plants and very similar to us, so things that kill them can be bad for us or bad for the plants. So we're looking at whether, if we take a pest and its tRNA interaction network is significantly different from the closely related animals in the area that we don't want to kill, we can possibly engineer a pesticide that targets one specific species of organism and kills only that, with no harm to any other closely related species or to humans. And this map is just color-coded by the potential harm of invasive species across the world, with the United States losing the most money if this happens. So this is another potential application that we're excited about and haven't fully explored yet. And the last one is essentially expanding the genetic code.
So a lot of people have thought about adding nucleotides to expand the genetic code and give more combinations: with a triplet you have 64 combinations, but if you add two more nucleotides, you bump up those combinations and increase coding capacity. But to take advantage of that coding capacity, you essentially have to create new tRNAs that get charged with those exotic amino acids, or that can recognize the new nucleotides that have been inserted into the genome. And this process is actually really slow. They take a tRNA, and the protein that charges it, from a different organism, and essentially randomly mutate it. After the random mutation, they reinsert it back into the host cell and see if it interacts with the host's tRNA interaction network, and whether it can still be charged with the amino acid properly. Creating these things can take a year or two and a couple million dollars' worth of equipment and resources. But with our interaction network, we can actually see where the tRNA interaction network within a human cell, or whatever cell we want, has areas that are not being used and can be exploited and mutated, to quicken the process. So we can actually direct it instead of doing random mutations, and that drives down the cost and the time pretty fast. So that's it; talk's over. Here's my Twitter, email, and GitHub if you want any of the code. I'll be pushing the presentation up to the GitHub and to my webpage, so there's a lot of information if anyone wants to get hold of it and actually go through it slower. I'm willing to talk to anybody; I'm an academic, so that's my job, to help people. And I want to thank everyone for showing up to a talk that's relatively early for DEF CON, with all the drinking every night, and thank the biohacking team for putting this together. So, questions? Thank you.