I'm Barbara Han, a disease ecologist at the Cary Institute. I work with Shannon and Kathy, who are just down the hall from me. Mike asked me to talk to you today about machine learning, which seems like a really daunting field to cover. So I'm going to try to give a broad overview and really connect it to the types of data sets and questions that we might ask as ecologists, and I'm going to use some examples from the work I do in my program on infectious diseases. But I think it would be more useful for all of us if you would interject and ask questions: hey, you haven't talked about this yet, but I keep hearing this term, have you heard of it, can you tell me what it means and whether it's relevant to us, that kind of thing. So I'm going to start us out with a broad picture of where machine learning fits into the bigger world of AI and data mining, and then move toward examples and the rationale for when we would want to use these types of methods. Please feel free to just stop me and say, this might be a non sequitur, but can you talk about this other thing? Machine learning, I think, is often conflated with artificial intelligence; they're used interchangeably in the media, but really machine learning sits underneath the broader umbrella of artificial intelligence. In contrast to how this diagram shows it, though, machine learning bleeds over into and is taking over a lot of these other areas, because it's the engine that extracts the patterns from all of the big data sets that these other systems sit on top of. IBM Watson, for example, is a question-and-answer system with that kind of algorithm sitting underneath it.
That bleeds over into speech-to-text and text-to-speech, which is fairly intuitive. Have you seen those videos from that really famous robotics lab — yes, that's the one, Boston Dynamics? There are these really sort of sad videos of them training a robot to pick up a box: they knock the box out of its hands and watch what the robot does, how it picks it up again, and then they'll push the robot with a hockey stick and it'll self-correct. It's really amazing to watch the robot take in information about what's happening and reorient itself. It's just a little sad to watch it keep being bullied with a hockey stick. Then there's this really exciting area of image recognition and machine vision. If any of you have used Pinterest or Google image search, that's the kind of algorithm those are using. Underneath machine learning there's this thing called deep learning, which connects to machine vision a little bit; deep learning is an even more specialized, even more cutting-edge version of machine learning. And predictive analytics is a broad umbrella term that tries to capture the idea that what machine learning really excels at is making predictions about something, based on patterns in the input data that you give it. So what machine learning really does is identify patterns in data, learn those patterns, and figure out how to optimize some task you assign, whether that's classification, predicting a number with a regression, or separating things into multiple groups. There is a more formalized definition of this, put forward by Tom Mitchell, who wrote one of the seminal books, Machine Learning.
He says it's a computer program — so it's a program, it's an algorithm — that is said to learn from experience with respect to some class of tasks, as measured by some performance measure P. So it's a program that learns to do something based on experience, and you measure how well it does with P. And then there's this other book, which those of you who are really interested in machine learning should check out; there are a couple of different editions of it now. Hastie, Tibshirani, and Friedman said that vast amounts of data are being generated in many fields, and it's our job to make sense of it all: to extract important patterns and trends — that's the learning part — and to understand what the data say. We call this learning from the data. I think that definition is pretty agnostic to field, but it really does apply to ecologists, where we're trying to make sense of the world around us. Sometimes that means we formulate our own hypotheses based on observation and then do something to test those hypotheses in a rigorous way. That approach assumes we have a process in mind, so it's important for us to assign error and to understand how well we've gotten that process model correct. Other times you have lots of data and can make lots of observations about the world around us, sometimes at very high frequency, and we want to extract pattern from that information — pattern that's suggestive of hypotheses the data are telling us about. That's a top-down approach to generating hypotheses and gleaning information about what the system is showing us. There's also the Coursera course Machine Learning, taught by Andrew Ng, who's one of the real innovators in this area and did some of the foundational work in deep learning. The Elements of Statistical Learning is a Springer series book, and a lot of the Springer series books are free online, especially if you have a nice university subscription.
A ton of the Springer yellow series books are free, and they're all really useful. I think this one also has lots of great code. It's quite math-y; all of the formulations are in there for the math that's under the hood. When you think about machine learning, there are a couple of ways you can break it up. The two main things you'll hear about are supervised learning and unsupervised learning; reinforcement learning is way off on the other side because it's a little bit weird, and semi-supervised learning sits, of course, between supervised and unsupervised. What supervised basically means is that you have a label. If you're thinking about a matrix with rows and columns, the columns are related to the row in some way, right? So if your rows are animals and you're measuring some thing about each animal, that thing is the label. And if you want to use a bunch of variables to predict that thing, you want a supervised learning approach, because you're trying to make associations between a whole bunch of predictors and some label. When you don't have a label and you just have lots of data — lots of columns about a system — it might be that you want to learn about the patterns in that system. So you might ask an unsupervised learning algorithm to cluster the rows into groups according to how far apart you can separate them on the basis of their variables. Or you might ask it to summarize. You're probably very familiar with things like PCA; these are really common dimensionality reduction techniques, which just means you're taking a whole bunch of features and collapsing them down into a smaller number of features that capture the majority of the information.
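The dimensionality-reduction idea can be sketched in a few lines of NumPy. This is a toy illustration, not anything from the talk: the "trait matrix" below is synthetic, built so that its five columns really only contain two axes of variation, which PCA (here computed via the SVD) recovers.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic "trait matrix": 100 species x 5 features, but only two
# underlying axes of variation (columns 3-5 are mixtures of the first 2)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])

# PCA via SVD of the centered matrix
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)          # variance captured per component
scores = Xc @ Vt[:2].T                   # species projected onto 2 components

print(explained.round(3))                # first two components carry ~all of it
```

The point is the shape of `explained`: five features collapse to two components that hold essentially all the information, which is exactly the summarization job described above.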
There's another group of algorithms called association rule algorithms, which come out of market basket analysis. Have you heard of this example? Supermarkets want to know how best to market their products, and if you crunch the data on which items are found together in shopping carts, you'll find that close to Super Bowl Sunday, beer and chips and salsa are usually in the cart together — so you might want to put those things in physical proximity in the store. At any given time of year you might find that wine and diapers are often near each other; any parents in the room will identify with me on that one. These association rules are generated from data where items are grouped in some way, and they give you rules of thumb you can use to make predictions about what you might find in another basket. Then you have things like anomaly detection, where you try to establish a baseline pattern according to some group of features and then detect the anomalies — the weird things. A lot of the financial fraud algorithms are anomaly detection. Semi-supervised learning, depending on who you talk to, isn't really recognized as its own field. It's sort of a mash of supervised and unsupervised where — and this is especially common in ecological cases — you have a label for some small number of samples, but for the majority of the other samples you don't really know what the label is; you have no observation for them. What semi-supervised learning tries to do is glean pattern information from the unlabeled samples and use that information to help you do a better job of predicting from the samples you do have labels for.
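The market-basket idea can be sketched directly: count which items co-occur, then compute support (how often a pair appears) and confidence (how often one item implies the other). The baskets below are made up for illustration.

```python
from itertools import combinations
from collections import Counter

# toy "shopping baskets" (hypothetical data)
baskets = [
    {"beer", "chips", "salsa"},
    {"beer", "chips", "diapers"},
    {"wine", "diapers"},
    {"beer", "chips", "salsa", "wine"},
    {"wine", "diapers", "chips"},
]

# count every pair of items that appears together in a basket
pair_counts = Counter()
for b in baskets:
    for pair in combinations(sorted(b), 2):
        pair_counts[pair] += 1

# support = fraction of baskets containing the pair;
# confidence of the rule "chips -> beer" = P(beer | chips)
support = {p: c / len(baskets) for p, c in pair_counts.items()}
n_chips = sum("chips" in b for b in baskets)
conf_chips_beer = pair_counts[("beer", "chips")] / n_chips

print(support[("beer", "chips")], conf_chips_beer)  # 0.6 and 0.75
```

Real association-rule miners (Apriori and friends) just scale this counting up and prune by a minimum support threshold; the rules of thumb the talk mentions are exactly these support/confidence pairs.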
So it's borrowing information from both the labeled and the unlabeled examples to do a better job of prediction. In all of these cases, the job is maximizing prediction accuracy. And then reinforcement learning is this weird thing. Take walking: if you want to teach a four-legged thing to walk in simulation, you can give it a goal — say, minimize the time it takes to get from point A to point B — and put some obstacles in front of it. The way you program these things is that you don't give it the rules of the game; you don't tell it how to scale those obstacles or anything. You give it the goal, and then you give it incentives and penalties for things like dying or falling, okay? Another example is video games. There's this one really old game called Space Invaders, where aliens shoot things down, there's a tank at the bottom trying to shoot them, and there are little blocks you can hide behind. If you train a little tank to try to get rid of the aliens, in the beginning it dies really fast. It doesn't know what it's doing, it doesn't know the rules, but it knows it gets a penalty if it dies. So it gets shot right away, it dies, and it's like, oh, that was a penalty, I'm going to try again. You leave that model running overnight, and by the morning it has the whole game figured out, to the point where it shoots the last alien right before the alien reaches the spot where it could be shot. It's so good at learning that it's kind of amazing. So that's one example of reinforcement learning, and it's a little bit weird because of the way it's programmed, but it sits underneath machine learning.
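The incentives-and-penalties loop just described can be sketched with tabular Q-learning on a toy problem. Everything here — the corridor environment, the reward values, the learning parameters — is invented for illustration; it is not the algorithm behind the robot or game examples, just the same goal-plus-penalty idea in miniature.

```python
import random

random.seed(0)
N = 5                                 # states 0..4 in a corridor; goal is state 4
Q = [[0.0, 0.0] for _ in range(N)]    # Q[state][action]; action 0=left, 1=right
alpha, gamma, eps = 0.5, 0.9, 0.2     # learning rate, discount, exploration

for _ in range(500):                  # episodes: start at 0, walk until the goal
    s = 0
    while s != N - 1:
        # epsilon-greedy: mostly take the best-known action, sometimes explore
        a = random.randrange(2) if random.random() < eps else max((0, 1), key=lambda x: Q[s][x])
        s2 = max(0, s - 1) if a == 0 else min(N - 1, s + 1)
        r = 1.0 if s2 == N - 1 else -0.01   # reward at goal, small step penalty
        # Q-learning update: nudge toward reward plus discounted future value
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

# learned greedy policy per non-goal state; 1 means "step right"
policy = [max((0, 1), key=lambda x: Q[s][x]) for s in range(N - 1)]
print(policy)
```

No rules of the corridor are ever given to the agent; it only ever sees rewards and penalties, and the "go right everywhere" policy emerges from the updates — the same shape of learning as the Space Invaders story, at a much smaller scale.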
Here I just grabbed a couple of my favorite ecological use cases of machine learning. Mike Walsh did this nice study trying to understand the climate suitability for the anthrax pathogen — I'm not really sure whether he used random forests in this particular model, but it's a very recent one. And there are some older papers. Ana Davidson did this great study where she tried to predict the IUCN threat status of the world's mammal species using a bunch of their features, and the analyses suggested groups of traits that are really important — sometimes in really intuitive ways, other times in counterintuitive ways — that suggest multiple pathways to extinction, and what we should do about that from a conservation perspective. And then my collaborators J.P. Schmidt and John Drake did this great series of studies — I think there's another paper that follows up on this one — basically asking why some plant genera are more invasive than others. They do that using features from a whole huge database of plants, and I think they found things like polyploidy and the ability to adapt to certain environments — traits that are very suggestive of when plants are able to invade better than at other times. I thought that was a really ingenious way to use that data set. They followed it up with a bioeconomic forecasting model that suggests when it's most cost-effective to try to eradicate something with a certain suite of features. So those are the non-disease examples, and then there's a bunch of work I do that tries to predict when an organism should be a problem for disease. But really quickly, I wanted to talk about something — I wouldn't call it a dichotomy, but sometimes I feel like machine learning and forecasting get a little bit confused with each other as well.
If you talk to ecologists, when we talk about forecasting we're talking about making some prediction in the future; it's inherently a time-series word. When we forecast something, it's something in the future, right? But when you talk to computer scientists, they don't always mean that. In computer science, forecasting is sometimes used synonymously with machine learning, and really it just means prediction: if you're forecasting something, you're making a guess about some state of something you don't know the answer to yet, and that can also count as forecasting. In ecology, though, I think you're fundamentally identifying some underlying process model, whereas machine learning does not do that. There's no process involved; there's no hypothesis you're testing. You start from data, you use the data, and you end up with predictions based on the data — the process is not there; you're not specifying it at all. I think Mike is going to talk about data assimilation methods later, but basically what data assimilation tries to do, at a high level, is use incoming data streams to update the states of a model that you have specified, and then you update the predictions and the forecast as you go, ingesting data along the way. An analogy to that in machine learning might be online learning versus batch (or offline) learning. Offline learning is where you take a data set, split it into training and test sets, make your predictions on the test data, and you're done. When you take in a new piece of information that's not in that original data set, you want to update your whole modeling process using that new information and come up with a slightly modified model that can then make predictions on everything else.
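The online-versus-batch distinction can be sketched as a model that is updated one small batch of a data stream at a time, never revisiting old data. This is a minimal toy: the "stream" is simulated, and the model is a hand-rolled logistic regression trained by stochastic gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_batch(n):
    # simulated data stream: the first feature determines the class
    X = rng.normal(size=(n, 3))
    y = (X[:, 0] > 0).astype(float)
    return X, y

# online learning: logistic regression updated by SGD one batch at a time;
# batch (offline) learning would instead refit on the full accumulated data
w, b, lr = np.zeros(3), 0.0, 0.1
for _ in range(200):                      # 200 batches arrive over time
    X, y = make_batch(20)
    p = 1 / (1 + np.exp(-(X @ w + b)))    # current predictions on this batch
    w -= lr * X.T @ (p - y) / len(y)      # gradient step using only this batch
    b -= lr * np.mean(p - y)

X_test, y_test = make_batch(1000)
acc = np.mean(((X_test @ w + b) > 0) == (y_test > 0.5))
print(round(acc, 3))
```

Each incoming batch nudges the weights and is then discarded — which is the loose analogy to data assimilation ingesting a data stream, as opposed to the offline train/test-split-once workflow.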
That's online learning — taking in new information and updating your model — and that might be an analogy to data assimilation, though they're pretty different. I'm going to go into a series of examples from disease ecology to illustrate the utility of this for some of the work that I do, and also to walk you through a particular algorithm, boosted regression trees. The general questions of interest that I work on are: what organisms are most likely to give rise to zoonotic diseases? Zoonoses are diseases that originate in animals and are transmissible to humans, where they can cause problems. More ecologically speaking, I'm really interested in what makes that fraction of organisms unique. What is it about certain vectors that allows them to carry a lot of parasites that are infectious to humans? Which pathogens infect humans versus others? The vast majority of them don't cause any problems in humans. And for hosts: what makes some hosts really good reservoirs of disease, while others don't seem to be problems at all? The first time we did this, we used rodent data, because rodents have a bad reputation and we know they carry lots of diseases; there are lots of them, and they're fairly well studied. We started off by mapping all of the ranges to get a sense of where they are. There are 244 reservoir species that we know of, and they carry between one and 11 zoonoses. So right away you can cast this as a regression problem, because you have a count — my label is a number of zoonotic diseases — or you can cast it as a classification problem and say I just want to be able to separate reservoirs from non-reservoirs. You can do it either way. There are 2,277 total rows of data, which is probably the highest number of labels I've ever had to work with, so it was a good thing I picked rodents to begin with.
It was somewhat of a less challenging problem than the other groups I'll talk about. We used intrinsic features of the rodents: things that describe their biology, their ecology, and their life history. I didn't list all of them here — I think we had about 80 variables in the end — but they loosely captured these categories of features. One of the hallmarks of machine learning is the implication that there's lots of data to work with, and when we hear about business examples or industry-standard examples in computer science, we're talking about hundreds of thousands or millions of rows of data. That's not the case here; there are just 2,000 rows, and I've gotten this to work with less than 1,000 rows, even a couple hundred rows, before. But you have to be a little bit careful about the dimensions of the data set. If you have a data set that's really short but really, really, really wide, you get the curse of dimensionality — do you know what that is? You just don't end up having enough examples to capture most of the combinations of those features, so your algorithm can't learn, or it comes up with results that just aren't accurate. Here we had a pretty nice size: a good number of rows and a reasonable number of features. Most of them ended up not being super important, but we collected them from lots of sources, which is another part of machine learning that I didn't anticipate — I don't know why I didn't. It's like 85% data cleaning. Lots of tedious work: why is this value so far off? Primates never get that big. And they don't live in Antarctica, so that number's off too. Fixing those kinds of little things in the data set is more common than you might think.
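The curse of dimensionality can be made concrete with a quick sketch: hold the number of rows fixed at roughly the rodent data set's size and count how many of the possible feature combinations the sample actually covers. The binary features here are synthetic, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000                     # about the size of the rodent data set

# with d binary features there are 2**d possible trait combinations;
# what fraction does a fixed-size sample actually cover?
coverage = {}
for d in (2, 5, 10, 20):
    X = rng.integers(0, 2, size=(n, d))
    coverage[d] = len({tuple(row) for row in X}) / 2**d

print(coverage)              # coverage collapses as dimension grows
```

At 2 features the 2,000 rows see every combination; at 20 features they can touch at most 2,000 of the roughly one million combinations — which is exactly why a short-but-wide data set leaves the algorithm with too few examples to learn from.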
So we collected a bunch of data from multiple sources, and of course with ecological data we run into common issues all the time, right? There are hidden interactions between variables that you just can't know in advance — you wouldn't have even imagined testing for them. There's obviously collinearity among a lot of variables. There are weird outliers that are outliers because they're biologically meaningful, not because they're just inconvenient and you want to toss them out. There are missing values — as in any data set, you have lots of data for some species and not a lot for others, and some features are non-randomly missing, which is typically a huge problem. And then you have diverse data types: some things you can classify into categories, other things have really precise numbers with variances attached to them. So it spans the gamut; the data types are, I think, both a strength and a difficulty of ecological data. There are some early solutions to dealing with some of these issues, right? Classification and regression trees (CART) — you've all seen pictures of the classic regression tree. And neural networks are another option that came out of the machine learning literature in the computer science world. But these things are not without their shortcomings. CARTs are really sensitive to small perturbations in the data when you start building the tree: if you mess with the data early on, you can get a totally different tree, and the prediction accuracy is really dependent on the quality and starting point of your data. Neural networks work great, but they have a tendency to overfit, so the model is not generalizable to new records. So there are some shortfalls there, and the newer methods tend to overcome these outstanding issues.
And the one I've been working with most often is boosted regression trees, which is a combination of two algorithms that I'm going to walk through. There's a magazine, IEEE Spectrum — an engineering magazine that I had actually never heard of — and because the work I do is really weird but uses tools that are off the shelf and easy for computer scientists to understand, they thought it was cool and wanted to do a feature, a general-application type of piece. So I worked with the artists to describe what the method looks like using the traits we use. I don't know if you can read this in the back, but there are things like body mass, geographic range area, and sexual maturity — really intrinsic properties of species, right? All these data are at the species level. This top tree here is just a classic classification tree. Each of these dots is a species, and we're trying to classify them into two groups, reservoirs and non-reservoirs — so you have nos and you have yeses. The idea is that if you have a perfect tree, all of the nos and yeses will segregate, right? You'll get all the answers perfectly right. But of course, as you can see from all the open circles down here, it didn't do great: it classified the species correctly only 64% of the time. The way the tree achieves this is by selecting a variable and splitting on it into two groups, so that each group is as homogeneous as possible and the two groups are as far apart from each other as possible — maximizing the separation between the groups and the homogeneity within them. Then you keep doing that; you keep splitting until some stopping point, and that stopping point is a criterion that you tune the model for. Okay, so that's one of the hyperparameters.
So it split on these variables, it got some answers, and then the model asks, how did I do? That's how it gets this classification accuracy. And it says, okay, I did awfully — it's got its predictions and it got a lot of them wrong. Now the boosting algorithm assigns weights. It says: some of these I tanked, I did terribly, and some of these I did great; I'm going to assign my weights accordingly, and I'm going to build a new tree based on those weights. So what's effectively happening is that the boosting algorithm is chain-linking this first tree and its results to the next tree, which it's going to use to maximize classification accuracy on the residuals — on the weighted examples. So this second tree is based on the first tree; it splits on whether it's a terrestrial or burrowing animal, then continues to split and gets some answers, and the combination of the first and second trees reaches 71% accuracy. So it's doing a little better; it's climbing up. The third tree does it again, this time splitting on carnivore versus herbivore. Some of these features are repeated — geographic range is here and it's also here — so there's nothing to say it can't select the same variable again. This time it achieves 79% accuracy. You can imagine that even if you start off with a terribly weak classifier — maybe just a little better than 50% — by the time you've added 3,000 or 5,000 or 50,000 trees, this ensemble of weak predictive models together forms a really powerfully accurate ensemble classifier. So now you know how to do boosted regression. Easy, right?
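The reweighting scheme just described can be sketched in miniature. The talk's analyses use boosted regression trees (gradient boosting); the sketch below uses the closely related AdaBoost-style reweighting — upweight the examples the last weak learner got wrong, fit another weak learner, repeat — with one-split "stumps" standing in for trees. The data are synthetic stand-ins for the trait table, invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic stand-in for the trait data: 200 "species", 5 features,
# label (+1 reservoir / -1 non-reservoir) driven by the first two features
n = 200
X = rng.normal(size=(n, 5))
y = np.where(X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=n) > 0, 1, -1)

def best_stump(X, y, w):
    """Weak learner: the single feature/threshold split with lowest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for t in np.quantile(X[:, j], np.linspace(0.1, 0.9, 9)):
            for sign in (1, -1):
                pred = np.where(X[:, j] > t, sign, -sign)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best

w = np.full(n, 1 / n)        # start with uniform example weights
F = np.zeros(n)              # running ensemble score
for _ in range(30):          # 30 boosting rounds
    err, j, t, sign = best_stump(X, y, w)
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # say in the ensemble
    pred = np.where(X[:, j] > t, sign, -sign)
    w *= np.exp(-alpha * y * pred)   # upweight the examples this stump missed
    w /= w.sum()
    F += alpha * pred                # chain-link the stump into the ensemble

print(np.mean(np.sign(F) == y))      # ensemble training accuracy
```

Any single stump here is a weak classifier, but the weighted ensemble climbs well past it — the same "64%, then 71%, then 79%, then very accurate" trajectory the talk walks through, just with stumps instead of full trees.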
The way we often measure prediction accuracy is with the receiver operating characteristic (ROC) curve, which plots the false positive rate against the true positive rate. A bad model predicts no better than a coin toss — that's when there's no signal in the data, nothing relating the predictor variables to the label. And if you get a prediction with 100% accuracy, especially with ecological data, there's definitely something wrong: it's probably overfit, and life is not that easy. So you want something in between those two, and this is how our model did for the rodent data. How well can we train the algorithm to predict rodent reservoirs using trait data? It performed with about 90% accuracy on the test data. You understand this concept of training and test data: you split the data up, train the model on the training data, and set it loose on the test data. On the test data it performed a bit below the 95% it reached in training — about 89%, about 90% performance. So that tells us the model is pretty good at classifying reservoirs versus non-reservoirs just on the basis of their traits; intrinsic features of these animals are enough information for this algorithm to do a good job separating those two groups. And as soon as I got this answer, I wanted to know: which are the rodents that look like they should carry stuff but are unknown to carry anything at the moment? When we pulled out the species that the model says should carry something, but for which we have no information about whether they do, we got eight species predicted with greater than 70% probability to be novel — undiscovered — reservoirs of zoonotic diseases. And this map is so ugly, I'm sorry. The redder areas just mean there are more species ranges overlapping each other.
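The area under that ROC curve has a handy interpretation: it's the probability that a randomly chosen positive outscores a randomly chosen negative, so it can be computed from ranks alone. A minimal sketch on made-up classifier scores (everything here is simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical classifier scores: positives tend to score higher
neg = rng.normal(0.0, 1.0, 500)    # scores for true negatives
pos = rng.normal(1.5, 1.0, 500)    # scores for true positives

scores = np.concatenate([neg, pos])
labels = np.concatenate([np.zeros(500), np.ones(500)])

# AUC equals the probability a random positive outscores a random
# negative (the Mann-Whitney U statistic, computed from ranks)
order = np.argsort(scores)
ranks = np.empty(len(scores))
ranks[order] = np.arange(1, len(scores) + 1)
auc = (ranks[labels == 1].sum() - 500 * 501 / 2) / (500 * 500)

print(round(auc, 3))   # well above the 0.5 coin-toss baseline
```

Identical score distributions would give an AUC of about 0.5 (the coin-toss baseline in the talk), and a model that perfectly separated the groups would give 1.0 — which, for ecological data, is the "something's probably wrong" signal.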
But the hashed regions you see here are two particular species that were confirmed to be positive — new reservoirs of zoonotic diseases — between the time we finished the analysis and I gave the first talk at ESA, and the time I submitted the results for publication. I don't know if you do this, where you're like, okay, I'm done with my model, I just want to check it one more time to make sure I'm right. So I took the results and went through the species one by one, just to check that I hadn't mislabeled anything. And those two came back as positive labels, and I'm like, no — now I have to retrain my whole model and do the whole thing again. But it turns out that in the intervening time, scientists had independently observed these species infected with Echinococcus and Leishmania — zoonotic parasites — in the field. So it gave me some confidence, but it's not in the paper, because I couldn't say, oh, I had these model results before and then these observations confirmed them. So I updated the model — but I just wanted to throw that story in there. The thing I find gratifying, and a little bit scary, about these models is that you can generate very clear predictions: this species exists here, and we think it carries something. That's something you can go out and validate, right? Or you can look into the data and say, people have sampled it but the sample sizes are awful; maybe we should jack those up a little and see if we get something. And then when you're right, it's like, oh, that's kind of gross. I think I was giving a talk in Minnesota, and one of these species is very common over there, so people were like, what's that species again? I want to make sure I don't pick up any rodents.
The other thing that's really exciting about this work, I think, is that once you've done the modeling, you can ask the algorithm to tell you which features were most important for its prediction accuracy. There are multiple ways to do this, but one is a permutation analysis: you perturb the values of one variable at a time and see how much the prediction accuracy drops. You do that for each of the variables, and the more the accuracy drops, the more important that variable was for keeping prediction accuracy high. When you do that analysis, you get these curves, called partial dependence plots. I pulled out four just as examples, but the most important feature for prediction accuracy in our model was how many other mammal species a rodent co-occurs with across its geographic range. It turns out that rodents with really low mammal diversity across their geographic range are more likely to carry zoonoses — and more likely to carry more zoonoses. Then there was a bunch of life history features that were really predictive. The blue bars are a frequency histogram of the whole data set, so here is what most rodents look like in terms of litter size — and this is not logged. Most rodents have between two and five pups in a litter, but species that are reservoirs tend to have more than four. So even though the majority of rodents have fewer, around here, reservoirs tend to have more than that. Reservoirs also tend to reach sexual maturity much earlier than the majority of species — sorry, these figures should be blown up a little so you can see the variation. They also tend to be larger as adults, about the size of an Australian wood rat for scale, and they tend to have neonates that are slightly larger than those of the majority of species.
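The permutation idea can be sketched like this: shuffle one column at a time, which breaks that feature's link to the label while leaving its distribution intact, and record the accuracy drop. The data and the stand-in "trained model" below are invented for illustration; only the first of the three features actually matters.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical data: only the first of three features drives the label
X = rng.normal(size=(1000, 3))
y = (X[:, 0] > 0).astype(int)

def model(X):
    # stand-in for a trained classifier (perfect on this toy problem)
    return (X[:, 0] > 0).astype(int)

base_acc = np.mean(model(X) == y)

# permutation importance: shuffle one column at a time and record
# how far accuracy falls when that feature-label link is broken
drops = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    drops.append(base_acc - np.mean(model(Xp) == y))

print([round(d, 2) for d in drops])   # big drop for feature 0, ~0 for the rest
```

Permuting the informative feature costs the model dearly, while the uninformative ones cost nothing — that ranking is the importance score, and plotting predictions against each feature's values (rather than permuting them) is what gives the partial dependence curves the talk shows.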
Most species have pretty small neonates, but reservoirs tend to have babies about the size of a baby red squirrel. The thing I found really satisfying about this result is that lots of scientists have studied rodents and their immunity — what their immunity looks like under a fast versus a slow life history strategy, and where the fast life history rodents tend to live. So there's this rich, empirically driven body of understanding about rodents in general. And when we got this trait profile back out of the machine learning algorithm, it was clear that it was giving us the signal of something fast-living: things that have larger litters, that reach sexual maturity early, that tend to live in seasonally dynamic habitats. So it corroborated what all of these one-off studies have shown independently for different systems. It kind of gives you this intuition about how the system is working just on the basis of data, right? You're not out there testing a particular hypothesis about fast life histories — you can do that, but it might take us a lot longer to get there as a discipline, because you're paying attention to the lay of the land, seeing who's finding what, getting a pulse on it, and then going to test the hypothesis. This approach allows you to start from the data first, identify some hypotheses that might be worth testing, and then move forward with those. Sometimes, though, when you do this analysis it performs really well in terms of prediction accuracy — you can get really, really high prediction accuracy — but the ecological signal is not that clear. One example of that was an analysis we did of filovirus-positive bat species. Filoviruses are things like Ebola virus and Marburg virus — the hemorrhagic fevers.
We did this analysis right after the West Africa outbreak of Ebola started to get kind of out of control. We thought, well, nobody knows what the reservoir is, so we'll just start with that and see what we get. The data were really hard to work with because there were only 21 filovirus-positive labels. I don't know if you guys work with disease data, but there's something called seropositivity, which basically means your blood looks like it's been exposed, there are antibody signatures, but it doesn't mean you're actually infected. It just means you might be infected, or you might just have been in the vicinity of something, Ebola was there and you got exposed to it. So anyway, we scraped together all of those data, and that's 21 labels out of 1,116 total bat species globally. That's not a high percentage of labels, so this model was considerably harder to train. This is just a bunch of distribution information. We got a prediction of where new carriers should be, and there's a couple of things I want to point out here. We did the analysis globally, but all of the filovirus data have only ever come out of this region of the world. So when bats were popping up in the Western hemisphere, everyone's like, huh, what does that mean? Does that mean we have Ebola? That's not what it means; we probably don't have Ebola. Actually, the people who do the testing tell me that we've never had a positive sample. But the bat virologists say that's really interesting, because it might indicate that there's a niche opening, that there might be something similar to a filovirus filling the same sort of niche in Western bats, which is not a hypothesis I would have come up with, not being a bat virologist, right?
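As a side note on why those 21 labels out of 1,116 species make training hard: with positives that rare, an unweighted model can score well by just predicting "negative" for everything. A common fix is to upweight the rare class. This is a hedged sketch on made-up data that mirrors those counts; the weighting scheme is a standard balanced-class heuristic, not necessarily what the study used.

```python
# Sketch: training a classifier when positives are very rare
# (~21 of 1,116, mirroring the filovirus label counts). Data are synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1116, 10))  # stand-in trait matrix
y = np.zeros(1116, dtype=int)
y[:21] = 1                       # 21 positive labels
X[:21] += 1.0                    # give the positives some trait signal

# Upweight the rare positives so the model can't win by always saying "no".
weights = np.where(y == 1, len(y) / (2 * y.sum()),
                   len(y) / (2 * (y == 0).sum()))
model = GradientBoostingClassifier(random_state=0)
model.fit(X, y, sample_weight=weights)
```

With weighting like this, the labeled positives come back with much higher predicted probabilities than the background species.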
So the point here is to work with bat people, with people who are domain experts in the data, because it's really easy to train a model on anything and then make erroneous inferences, because you don't understand what your data actually mean or the biological dependencies in your data. We got a couple of positives that made good sense based on the fact that somebody experimentally injected this bat with filoviruses. I don't know who pays for these studies, but he got it to replicate Ebola virus in the lab, which is one of the only times we've been able to get a bat to do that. But this species was not included in our data set as a positive because it was never confirmed in the wild to carry filoviruses. So we said, well, we're just going to be conservative and call you negative. But when we got this prediction and checked the literature, it's true that it can replicate Ebola virus in its bloodstream. And this species is really closely related to Epomops franqueti, a fruit bat that is a known reservoir for Marburg, one of the only species we know for sure replicates and sheds virus into the environment, and it has been responsible for human outbreaks of Marburg. So this all makes intuitive sense. And then the coolest slash awful thing is what happened when we zoomed in on Southeast Asia. If you look at this map, there's obviously a hotspot in Southeast Asia, and when you zoom in, there are 25-plus species of bats that overlap there, all within the 90th percentile of predicted probability of being filovirus-positive. What does that mean for us? Are there filoviruses in Asia? Why are we not having outbreaks? We know that there are filoviruses there, we know that they circulate in wildlife, we know that there's high biodiversity in Southeast Asia, but we have no human outbreaks.
And the question is, is there something biological keeping the outbreaks from happening? Are there outbreaks that we're not seeing? It raises a whole bunch of surveillance questions. But since this study was published, there have been new filoviruses detected in this region, and also a new filovirus found in a new reservoir. And this guy was number five on our list; we rank-ordered everything, and this was the species with the fifth highest probability of being filovirus-positive. That paper came out maybe six months after ours had been published, and obviously I didn't know those guys were working on it; filoviruses are a really hot topic right now. But it's one of those things where you're like, yes! No, that's awful. I don't know whether to cheer or not cheer. So we found, we have this intuition now, that filoviruses tend to hide out in really megadiverse areas, and that bats that are more likely to be filovirus-positive have ranges with extraordinarily high mammal biodiversity, which is the total opposite of what the rodents showed, right? Reservoir rodents have depauperate mammalian diversity, and filovirus-positive bats have really high biodiversity. They also tend to occur in really large population groups. I don't know if you can see this picture, but this dark area is all bats, roosting together in a single cave, and these guys are in protective gear with respirators and everything because it's a huge health hazard. And the other really cool output was that filovirus-positive bats tend to have high production, which is a size-corrected measure of reproductive output, a fitness measure: how much biomass you produce per unit body mass is what production means. And here's an example of a mother, I think some species of fruit bat, and this is its offspring.
Just to give you a sense of how big this ratio can be. I mean, we can hand-wave about what we think these variables mean, but the honest answer is, I don't know. That doesn't really indicate a clear ecological story to me. So we tried to do the same type of analysis with Zika virus and primates, right after the Zika virus outbreak in the Americas. We figured, okay, the Zika virus outbreak is generally under control; we have herd immunity now, and it's probably not going to break out at the same level as it did before. But based on all of our biological understanding of flaviviruses in general, we think there's a high likelihood that if the Aedes aegypti mosquitoes are there and they're carrying Zika virus and transmitting it to humans, able to pick it up from one human and transmit it to another, then they can also pick it up and transmit it to a primate. Because in many other mosquito-borne flavivirus systems, primates are the wildlife reservoir. And so that was the motivation for this work: can we identify in advance the primates that are susceptible to becoming these spillback hosts, and therefore reservoirs of Zika virus in the long term? And what do we need to do, in terms of conservation management, decision-making, and preventative healthcare, to make sure that repeated spillovers from primates back to humans don't happen in the future? These data were really hard to work with, but I think most data are. My first foray, the rodents, those were the easiest data I've had to work with, and everything after that was totally downhill. We had traits for 285 primate species, so far fewer data than we were used to working with, and we had almost no labels for Zika virus: only two species that had ever tested positive.
But we had four other mosquito-borne flaviviruses for which many other primate species are known to test positive for one or more. The mosquito-borne flaviviruses are few; there are about five of them. So we implemented this Bayesian multi-task learning approach where we wanted to borrow information: if you are positive for one mosquito-borne flavivirus and you live in an area where we know another mosquito-borne flavivirus circulates, then we want to borrow information from those species to help us do a better job of predicting the labels we don't have. It's an inherently Bayesian way of thinking about the learning problem. And because this Bayesian multi-task learning requires a complete data set, which for primates we obviously did not have for all of our features, we did this multiple imputation by chained equations approach, which I still don't completely understand. I did this work in collaboration with a bunch of scientists from IBM Watson, which is close to Cary, in Yorktown, New York, and they were the magic behind the multiple imputation stuff. So I'm just going to skip to the predictions. We made a bunch of predictions about which species are likely to be Zika virus positive, and the takeaway is not great news, because these are all super common species and they have really high contact rates with humans. So that was more bad news. I'm all sunshine and rainbows when I give these talks; it's all bad news. Then there's the out-of-sample validation procedure I wanted to mention really quickly: one thing you can do to figure out how badly your model is doing, or whether your data are up to the task of prediction. In this example, what we did was take each of the reservoirs that we had a positive label for.
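Before getting to that validation step, the chained-equations imputation mentioned above can be sketched. The idea is that each trait with missing values is modeled as a regression on all the other traits, cycling until the filled-in values stabilize. This is a hedged illustration using scikit-learn's `IterativeImputer` on synthetic data as a stand-in; it is not the actual imputation code from the primate study.

```python
# Sketch of chained-equation (MICE-style) imputation on made-up trait data.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
traits = rng.normal(size=(285, 6))  # stand-in for 285 primate species
# Make trait 1 strongly predictable from trait 0, so imputation has a chance.
traits[:, 1] = traits[:, 0] * 2 + rng.normal(scale=0.1, size=285)

# Knock out ~30% of that trait to mimic missing field data.
mask = rng.random(285) < 0.3
traits_missing = traits.copy()
traits_missing[mask, 1] = np.nan

# Each feature with missing values is regressed on the others, iteratively.
imputer = IterativeImputer(random_state=0, max_iter=10)
filled = imputer.fit_transform(traits_missing)
```

When the missing trait really is predictable from the others, the imputed values land close to the truth; when it isn't, as for some of the poorly studied primates, no imputation scheme can save you.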
And we relabeled it to be zero, changed it from a one to a zero, retrained the model, and said, okay, we want to know what you would predict this species to be if you didn't have prior information about it. And it turns out that the model is not able to compensate for really grossly missing data. For these species, the 10-fold CV scores were really low, which means the model's not doing a good job of assigning accurate labels to them, and it's directly correlated with how much imputation had to be done for that species, because there's so much missing data for some of them. The analogy I like to use is that some of these guys, like the black howler and the ring-tailed lemur, are not conservation threats. They're not rare, and they're super easy to catch. It's like not knowing the basic life history of a robin: you see it everywhere, it's not hard to catch, it's not hard to study, but you don't know how many eggs it lays, you don't know what it eats, you don't know when it's active. It's that level of biological ignorance about some of these species. So the point I want to make here is that data are really powerful and you can get a lot out of them, but you can't make them up. You can't make up for missing data, right? You can't get something from nothing. That was one of the lessons that was really clear from that study: machine learning is really powerful, but it just won't make up for missing data and the lack of basic research. I think in general for machine learning, and especially for deep learning, the language of choice is Python. I don't code in Python, so that's unfortunate for me, but I'm learning it, and there are analogies to all of these packages in R. If you're interested in a particular package, I can look it up for you or tell you what I've used in the past.
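The relabel-and-retrain check described above is simple to sketch: flip each known positive to zero, refit, and see whether the model still ranks that species highly. This is a hedged toy version on made-up data, with logistic regression standing in for the study's actual model.

```python
# Sketch of the relabel-and-retrain validation check, on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = np.zeros(200, dtype=int)
y[:20] = 1
X[:20] += 1.5                 # give the known positives a real trait signal

recovered = []
for i in range(20):           # take each known positive in turn
    y_masked = y.copy()
    y_masked[i] = 0           # pretend we never knew this label
    model = LogisticRegression(max_iter=1000).fit(X, y_masked)
    recovered.append(model.predict_proba(X[i:i + 1])[0, 1])
```

Here the traits carry real signal, so the hidden positives come back with high predicted probabilities anyway. In the primate study, the heavily imputed species failed this check, which is exactly the "you can't get something from nothing" point.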
For boosted regression trees I generally use gbm, which is by Greg Ridgeway. There's also another package that's friendlier for spatial analysis and species distribution modeling called dismo, D-I-S-M-O, by Jane Elith, and then there's the caret package, C-A-R-E-T. caret has random forests, gbm, a bunch of SVMs, a whole bunch of different algorithms under the hood. The reason I don't use caret is that I find it really hard to fiddle with things under the hood or figure out why it did certain things. It makes decisions differently than dismo does, and dismo does things slightly differently than gbm does. gbm I find is the easiest to take apart, fix things, and know exactly what's happening. And this is a nice little cheat sheet: if you have really sparse data, it's just not going to work, so go get more data. If you're trying to predict a category and you have labeled data, then you might want to do some of these things to figure out what your labels are. If you don't have labeled data, you might want to do some clustering analyses to figure out how your data separate from each other. If you're not predicting a category, you don't have a quantity, and you're just looking around, then you might want to do something over here; and if you are predicting a quantity, here are some other things you can do. Another rule of thumb I've heard, but not validated myself, is that for wider data sets, if your data matrix is wider than it is long, random forests are supposed to perform better. I don't actually know why, and I don't think computer scientists know why either. That's part of why they use the term black box, right? It's black-boxy because you're not specifying a process, which makes sense, but it's also black-boxy because sometimes we can't figure out why these methods work so well.
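Since the talk names Python as the field's language of choice, here is a rough Python analog of a basic R gbm workflow, using scikit-learn's gradient boosting on synthetic data. The package choice and parameter values are mine, just to show where the knobs the talk alludes to live: number of trees, learning rate (shrinkage), and tree depth (interaction depth in R's gbm).

```python
# Sketch: a boosted-regression-tree fit with cross-validation, on fake data.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=8, n_informative=8,
                       noise=10.0, random_state=0)

# The knobs that matter most in boosted trees: tree count, shrinkage, depth.
model = GradientBoostingRegressor(
    n_estimators=500, learning_rate=0.05, max_depth=3, random_state=0
)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.mean())
```

The same estimator object can then be handed to the permutation-importance and partial-dependence tools mentioned earlier, which is the Python counterpart of pulling variable importance out of a fitted gbm model in R.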
We know exactly what they're doing, but the level of prediction accuracy is kind of staggering, and that's the black-box nature of machine learning, what computer scientists refer to when they call something a black box. Thanks for having me.