Okay, how's everybody doing today? My name is Rohan, and today I'll be presenting my talk, "Fighting the Flu with Machine Learning." There is my Twitter handle and my GitHub. If you didn't catch that, it'll be on the last slide, so when I'm taking questions you can look at it. I am going to be an 11th grader this fall in high school. Thanks. I've been programming for two years now, mostly in Python. I've also done a little bit of Java and R, and I'm not very proud of the Java part, because I did start with Java and that gave me a terrible introduction to programming. After I started Python, I was like, this is way cooler, I'm doing Python now. Specifically, I like scientific programming, data analytics, machine learning, stuff like that, and I like doing projects. So today I'll be talking about one such project.

Let me start with the problem: the flu. Chances are you may or may not have had the flu before. Millions are affected worldwide, and currently there isn't a great way for scientists to make vaccines for the flu. They just vote on what drugs and antibodies should go into the flu vaccine for that year, and this practice has been called questionable by the National Institutes of Medicine. Sometimes the vaccines don't work; if you remember, in 2009 there was a worldwide pandemic of the H1N1 flu virus. What I wanted to do in this project was create a better way for scientists to come up with vaccines. My solution was to predict what flu strains would look like in the coming year, so that scientists could analyze that and then have specific vaccines tailored towards those specific flu strains, and people would have a lower probability of getting sick.

Here is a nice graphic of how the flu infects a host. In that first panel you've got the virus, which is the smaller circle, and the cell. Those little purple dots on the
outside of the virus are proteins, and they attach to the receptors on the host cell. After the virus gets in the cell, it dies and injects its viral DNA and RNA and its other genetic material, and as the cell grows, so do that DNA, RNA, and those proteins. Eventually, when the cell is about to die, all those viruses leave the cell and go on to infect other cells. From this research, I decided to look at those specific proteins, those little purple dots on the outside of the cell. In the flu those are called hemagglutinin and neuraminidase, and being biologists, they had to come up with extra long names for them. So for now, we'll just refer to them as HA and NA. HA regulates whether the virus can enter the cell or not, and NA regulates whether the virus can leave the cell. I thought to myself, if I could knock those proteins out, or predict how they would look in the future, then scientists could create vaccines which could take out those proteins and thus prohibit the virus from entering the cell.

So this is my approach. I would first predict future genetic sequences of the flu, which scientists could analyze, and I would do that using phylogenetics and machine learning. Phylogenetics is the study of the relationships between individuals on an evolutionary tree. This is pretty much a phylogenetic tree; you've probably seen one of these. Right there on the bottom you have E, which is the child of F. What I would do is say, hmm, how has the genetic sequence of E changed from F, and how has the genetic sequence of F changed from G? If I could train a machine learning model to analyze these relationships as the flu changes generation by generation, then maybe I could predict a child of E.
What would that look like? Would it be AACC? Would it be AATC? I wanted to figure that out.

My setup was to first obtain data from the Influenza Research Database, or IRDB, which is funded by the National Institutes of Health, read in the genetic sequences of hemagglutinin and neuraminidase, and train a machine learning model based on the phylogenetic relationships. What I wanted to do was train on the relationships between G, F, and E: how has F changed from G, and how has E changed from F? But being as inexperienced as I am, I wasn't sure how a homemade machine learning model would work, because the flu data is quite complex, and I wasn't sure if my homemade algorithm could match that level of complexity and predict well. I also wasn't sure what machine learning model to use, how to measure accuracy, or how to read in a FASTA file, which is a common file format for biological data.

This is where my first library came to my rescue, and that was Biopython. Biopython deals with everything biology in Python. I used this library to get past that initial hurdle of how to read in a FASTA file. Originally I was reading it in as a regular .txt file, and that wasn't working out very well, because when you read it in as a .txt, a bunch of seemingly random characters that are in a FASTA file would show up. So what I used to do was look for patterns, for specific characters which show up right before the gene sequence actually starts, and that wasn't working out very well.
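For readers who have not seen the format: a FASTA file is plain text in which each record starts with a header line beginning with `>`, followed by one or more lines of sequence. The sketch below is a minimal plain-Python parser written only to illustrate the hurdle described here; it is not the code from the talk, and the function name `parse_fasta` and the sample records are made up. Biopython's `SeqIO.parse(handle, "fasta")` does the same job and copes with the edge cases.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header = None
    pieces = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(pieces)))
            header = line[1:]
            pieces = []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            pieces.append(line.upper())
    if header is not None:
        records.append((header, "".join(pieces)))
    return records


# Made-up sample data, not real influenza sequences.
SAMPLE = """>strain_G example grandparent
ATGAAGGCA
ATACTAGTA
>strain_F example parent
ATGAAGGCAATACTGGTA
"""

records = parse_fasta(SAMPLE)
for header, sequence in records:
    print(header, len(sequence))
```

The header line is the "specific character" pattern mentioned in the talk: everything after a `>` line, up to the next `>` line, belongs to one sequence.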
With Biopython I was able to read in a FASTA file much more easily. After I got past that hurdle, it was time to create the algorithm myself. I did some research and found that a decision tree algorithm would be the best fit, because it would be easy for me to write and easy for me to debug, and it also matches a phylogenetic tree: a decision tree branches from one question to more questions or a decision, similar to how a phylogenetic tree branches from its parent to the children. But there were lots of problems. I wrote a classification algorithm, although what I really wanted was a regression algorithm, and the algorithm was too simple and underfit the data. So my initial hypothesis was confirmed: I could not write an algorithm that matched the level of complexity that the flu data had.

So scikit-learn helped me. As most of you probably know, scikit-learn is the go-to machine learning library for Python, and with it I was able to use lots of algorithms. I could just import them really easily; I didn't have to write them myself, so that saved a lot of time.

Now, with Biopython and scikit-learn by my side, this is my setup, updated to reflect the changes that I made. The first step stays the same: I still get the data from the Influenza Research Database. Next, I would read in the flu genetic sequences using Biopython, and I'll show you how easy it was for me to get that data using Biopython. And I had to add another step, which is encoding the data. With my previous algorithm, I would simply input the genetic sequence, which is a big string, roughly 1,700 letters long, just combinations of those four letters, A, T, G, and C, and I could input that directly into my flu prediction algorithm.
However, scikit-learn only takes numerical input, so I had to encode those letters into numbers. I had to add that step, and the last step stayed pretty much the same, with the exception that instead of writing my own algorithm I would use the scikit-learn library.

This is a diagram that I drew showing you the overview of everything I would do. I would create a phylogenetic tree for each protein, so pretty much the grandparent-to-parent-to-child relationships between flu strains. Next I would encode the data, and after that I would give the data to my scikit-learn algorithm and test the accuracy using cross-validation, and then add more data and continue doing this until I reached a favorable result.

This is what I could do with Biopython in just one line. I could get the data, which is stored in that HA 100, which is 100 FASTA files, and by putting them into a list I had one big list of all the different FASTA files. It would be a grandparent, a parent, and a child, a group of three, and then another grandparent, parent, child, and so on for a hundred of these. So instead of looking for specific characters which signaled the beginning of a genetic sequence, I could write one line and have all of the relationships ready.

This is my encoding method. Essentially, A corresponded to 1, T corresponded to 2, and so on. I would take these huge strings of As, Ts, Gs, and Cs and convert them into one big number. What I did initially was take all of these letters and put them in one big int of 1,700 digits, and I soon found out that only five of them were being stored.
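The letter-to-digit mapping described here can be sketched in a few lines. Only A → 1 and T → 2 are stated in the talk; the values for G and C, the names `BASE_TO_DIGIT` and `encode_bases`, and the sample sequence are assumptions made for illustration.

```python
# A -> 1 and T -> 2 come from the talk; G -> 3 and C -> 4 are assumed.
BASE_TO_DIGIT = {"A": "1", "T": "2", "G": "3", "C": "4"}


def encode_bases(sequence):
    """Map each base letter to a digit and return the digits as one string."""
    return "".join(BASE_TO_DIGIT[base] for base in sequence.upper())


# A short made-up sequence; a real HA sequence is roughly 1,700 letters.
digits = encode_bases("ATGGCA")
print(digits)
```

Keeping the result as a string of digits postpones the question of which numeric type can hold it, which is the problem the talk turns to next.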
So I tried putting them in a float. That worked a bit better; however, it could only store up to 15 digits. What I ended up doing was breaking them up into floats of 15 digits and then putting each float into one big list with roughly 80 elements. So I would have 1, 2, 3, 4, 1, 2, 3, 4, and so on up to 15 digits, then another element in the list with more digits, and so on for roughly 80 elements, and that would be the entire genetic sequence of the HA protein.

This was my decision tree algorithm before scikit-learn, and this was my decision tree algorithm after. And this is not even the whole thing; this is like half of it. So pretty much from 80-something lines to six. That was incredible. I was blown away. I'd never seen this before, so I was really excited, and it turns out this had really good accuracy as well.

But there were some drawbacks in using Biopython and scikit-learn; Biopython not as much. With scikit-learn, I had to stick with training my machine learning model on simply the parent-to-child relationships, because in scikit-learn I couldn't add that extra generation, the grandparent, which I would have liked to add because, you know, more data always results in a better model. You can't beat more data. So I could only go one generation, like one correlation, from parent to child. scikit-learn also couldn't take letters or characters, and I wish I could have put all of those base pairs, those As, Ts, Gs, and Cs, into one big int. I wish it could hold more significant digits, but you can only get so much.

So now I'll talk about my results. This is H1N1. H1N1 is a subtype of the flu.
There are lots of different subtypes, so I trained mine on the most common ones which infect humans, because I want to protect humans. H1N1 was the one that broke out in 2009. It caused a global pandemic resulting in a lot of deaths. It was actually the most potent flu since the Spanish flu, all the way back in 1918. This one is especially dangerous because it mutates very often. Thankfully it does not affect humans that much, but it still mutates really quickly, and that makes it really dangerous.

This is another flu subtype, called H3N2. This is more common in humans and doesn't mutate as quickly. You probably noticed that these three algorithms, decision trees, random forests, and extra trees, performed a lot better on H3N2, and I think this is because H3N2 doesn't mutate as much, so it's easier for the algorithm to pick up on those changes because there are not as many. This is H3N2 and this is H1N1. Of those two bars, the yellow is the neuraminidase, which regulates the exit from the cell, and the blue bar is the hemagglutinin, which regulates the entry into the cell. With the exception of the decision tree algorithm and the extra trees algorithm, my model performed really well, so I ended up going with the random forest algorithm. These three are all decision-tree-based.

I compared my algorithm with two previous European studies. This one right here is a German study, and this one is a study done in Spain, I believe. Neither of these uses machine learning.
They just rely on formulas to predict how the flu will change in the coming years, and as you can see, the results look very encouraging.

To sum it all up, I wanted to create a better way for scientists to make vaccines for the flu, and how I went about doing that was to predict how the flu would change in the coming year, so that scientists could create vaccines specifically tailored towards those strains. I used Biopython and scikit-learn to make my algorithm shorter, as you saw. scikit-learn helped me, and the results look very encouraging. So thank you all very much.

I'm taking questions right now. Anybody? Yeah.

Yes, they code for the amino acids. But what I decided to do was, if you can take out the sequences that are coding for those amino acids, if you can create an antibody which can perhaps change one of those sequences... I'm diving really deep into biology here. I believe there is this codon [name unclear in the recording], which is that three-base-pair-long sequence, to stop coding for any more amino acids, for any more protein. So if you can maybe create an antibody which would perhaps move that little codon to the beginning, then you can stop that hemagglutinin from being coded, so it won't be able to enter the cell. So I was just predicting that.

That was a great talk. Were you able to communicate your results to any...

Lots of APs coming up, but now it's summer, so I think that's definitely on my to-do list before school starts: communicate my results to scientists, definitely sending out some emails.

You said you were encoding a set of base pairs as numbers and storing them as long floats in a list, right? So why not store each of them as a codon instead of a set of 15?

That's actually a good question. I could actually do it that way.
I think I'll definitely try storing them as individual codons, so that scientists can see where each one is. Yeah, I could definitely try it like that. Any other questions? All right. Thank you very much.
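The two encodings discussed in the last question can be compared side by side. This is a sketch under assumptions, not the speaker's code: the digit values for G and C, the helper names, and the sample sequence are invented for illustration. It relies on one general fact, which is that every integer of up to 15 decimal digits fits exactly in a 64-bit float, so 15-digit chunks survive the conversion.

```python
# A -> 1 and T -> 2 come from the talk; G -> 3 and C -> 4 are assumed.
BASE_TO_DIGIT = {"A": "1", "T": "2", "G": "3", "C": "4"}


def to_digits(sequence):
    """Replace every base letter with its digit."""
    return "".join(BASE_TO_DIGIT[base] for base in sequence.upper())


def split_every(digits, size):
    """Cut a digit string into consecutive pieces of at most `size` digits."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]


def encode_in_chunks(sequence, size=15):
    """The approach from the talk: floats of up to 15 digits each."""
    return [float(piece) for piece in split_every(to_digits(sequence), size)]


def encode_by_codon(sequence):
    """The questioner's suggestion: one number per three-base codon."""
    return [float(piece) for piece in split_every(to_digits(sequence), 3)]


sequence = "ATGAAGGCAATACTAGTAGTTCTGCTATAT"  # 30 made-up bases
chunks = encode_in_chunks(sequence)
codons = encode_by_codon(sequence)

print(len(chunks), chunks)
print(len(codons), codons[:3])

# Each 15-digit chunk is below 2**53, so the float stores it exactly.
print(all(int(chunk) < 2**53 for chunk in chunks))
```

With per-codon values, position i in the list always corresponds to codon i, which is the readability benefit raised in the question. The trade-off is a list five times longer than with 15-digit chunks.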