Thank you. Hi everyone. Today I'll be talking about different projects we are running in the lab where we use supervised and reinforcement learning to look at biodiversity, specifically aiming to estimate biodiversity in the present and through deep time, and also trying to learn from these biodiversity patterns, and from the extinction risks that biodiversity is currently facing, to use AI to guide conservation action. So, just to recap (I'm sure you've seen this before), the first thing I'd like to talk about is models based on supervised learning. In our applications these are typically deep neural networks used for classification or regression tasks: models that take a set of inputs and map them onto an output. There are many applications of such models, but one way we have been using supervised learning is to infer biodiversity in a spatial context. Here we are looking at present-day biodiversity, and specifically at plots where all species were recorded by scientists and are available in a database. The red dots here map the distribution of these plots, where we know every single species of plant that lives there. There are a lot of plots, but they are biased in their geographic distribution and do not span the entire Australian continent. Meanwhile, we have many data sets that do span the continent and the globe. The GBIF occurrence data, for example, covers the entire world and records information for hundreds of thousands of species, with billions of records. We also have climatic data that covers the whole world.
We have lots of databases that probably have some predictive value for understanding biodiversity patterns, and they span the world. On the other hand, accurate human-collected biodiversity data is much harder to come by. So here we can use supervised deep learning to try and map these predictive variables onto predictions of biodiversity. If we can do a good job at predicting biodiversity for the plots where we do know the ground truth, that is, how many species occur there, we can then use the model to extrapolate and create maps of biodiversity. This is a project we did a couple of years ago in my group. We took these predictive variables and mapped them to a quantification of biodiversity per plot. We trained our model, evaluated our accuracy across different plot sizes, and once we were happy with the accuracy, we used the trained models to extrapolate and create maps of biodiversity. We looked at two metrics of biodiversity: gamma diversity and beta diversity. Gamma diversity is basically how many species are out there; beta diversity is a quantification of turnover, how different two adjacent plots are in terms of species composition. So they measure different aspects of biodiversity. The inputs to our deep learning framework were basically occurrence records from GBIF, climate data, and a few other layers, and the trained models were used to generate projections of gamma diversity and beta diversity, so turnover, across the continent. In machine learning, there is a lot of focus on accuracy, right?
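To make the two metrics concrete, here is a minimal sketch computing gamma diversity and a Jaccard-style beta turnover from plot species lists. This is purely illustrative; the talk does not specify which turnover formula the study actually used, and the species names are invented.

```python
def gamma_diversity(plots):
    """Total number of distinct species found across a set of plots."""
    return len(set().union(*plots))

def beta_turnover(a, b):
    """Jaccard dissimilarity between two adjacent plots:
    0 = identical species composition, 1 = no shared species.
    (One common turnover metric; used here only as an example.)"""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

# Two hypothetical survey plots with made-up genus names.
plot1 = {"Eucalyptus", "Acacia", "Banksia"}
plot2 = {"Acacia", "Banksia", "Callitris", "Hakea"}

gamma = gamma_diversity([plot1, plot2])   # 5 distinct species overall
beta = beta_turnover(plot1, plot2)        # 2 shared out of 5 -> 0.6
```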
We want our models to be as accurate as possible, which is great for many things, but in ecology and evolutionary biology we are sometimes almost as interested in the uncertainty of our predictions as in their accuracy. We don't only want a model that does a good job at making predictions; we would also like a model that tells us where the predictions are less reliable. Here we can use techniques such as Monte Carlo dropout, which is basically an extra layer you add to your network to evaluate how robust the estimates are. This is what we did here, and we identified areas where the prediction is affected by high uncertainty. So now we have maps with predictions for the entire continent, and we can also identify areas where our predictions are probably not as reliable. This quantification of uncertainty is really important because it can also guide research efforts. If we wanted our model to be more accurate at predicting gamma diversity, we would probably want to add plots here; if we wanted it to be more reliable for beta diversity predictions, then we would need more training data in the areas highlighted here. So we can use deep learning models to make predictions of biodiversity patterns today. The good thing about present-day biodiversity is that there is a way to get a ground truth: we just need to go out there and count all the species. This is of course not as easy as it sounds, but at least in principle a way to get a ground truth exists, so we can train our models on plots where we are pretty confident that we know the ground truth. Things get more complicated if you are interested in understanding how biodiversity evolved over deep time.
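The Monte Carlo dropout idea can be sketched in a few lines of NumPy. The network size and weights below are made up for illustration, not the lab's trained model; the key point is that dropout stays active at prediction time, so repeated stochastic forward passes give a spread of predictions whose standard deviation serves as a per-location uncertainty.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weights for a one-hidden-layer regression net (hypothetical values;
# a real model would be trained on the plot data described in the talk).
W1 = rng.normal(size=(4, 16))
W2 = rng.normal(size=(16, 1))

def forward(x, drop_rate=0.2):
    """One stochastic forward pass; dropout is kept ON at prediction time."""
    h = np.maximum(x @ W1, 0.0)                    # ReLU hidden layer
    mask = rng.random(h.shape) > drop_rate         # Monte Carlo dropout mask
    h = h * mask / (1.0 - drop_rate)               # rescale surviving units
    return (h @ W2).item()

x = rng.normal(size=(1, 4))                        # one location's predictors
samples = [forward(x) for _ in range(200)]         # repeated stochastic passes
mean_pred = np.mean(samples)                       # point estimate of diversity
uncertainty = np.std(samples)                      # spread = model uncertainty
```

Mapping `uncertainty` across locations is what produces the "where is the model unreliable" maps mentioned above.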
This is another of the research objectives of our group: trying to understand not only how biodiversity is distributed today, but how we got there, what the evolutionary process was that led to the biodiversity we observe today. Estimating biodiversity through time, over deep time scales of millions of years, is crucial to some very fundamental problems in evolutionary biology and in our understanding of the history of life. Is there a limit to biodiversity, or can species just accumulate forever? Does biodiversity increase over time, and what are the mechanisms controlling it? These questions have been around for a long time, and answering them requires, in the first place, reliable estimates of how biodiversity changed through time. So people have been looking at the fossil record to try to plot how diversity accumulated over time, and this is basically how we know about the past mass extinctions in the history of life. The fossil record is the closest thing we have to a ground truth, because it is the most direct evidence of past biodiversity. But at the same time it is plagued by all sorts of biases, so it is by no means a ground truth. We can count species in the fossil record, but we have to be aware that the record is biased taxonomically, spatially, and temporally; all sorts of things make it incomplete. Therefore, we cannot just use standard supervised learning to estimate biodiversity through time, because we cannot easily train a model on ground truth data. And people have long been trying to interpret the fossil record, going beyond simply counting species by using other types of statistical models.
Unfortunately, or maybe fortunately as you said, a couple of high-profile papers have shown that simply applying statistics to extrapolate species diversity from the fossil record does not actually account for all the biases that affect it. Because of this intrinsic incompleteness of the fossil record, the available models out there are essentially unable to robustly estimate biodiversity through time. So we decided to take a different approach and use deep learning methods instead. The question, again, is that we don't have ground truth data. When we look at deep-time evolutionary models, we basically never have access to ground truth, even when we model how DNA evolves over time. We can ground-truth it over very short time spans, or for very fast-evolving organisms like bacteria, but we don't really have a way to validate experimentally how DNA evolved across mammals or across animals, because the time scales are millions or billions of years, so we cannot experimentally validate our models. When we build models in a probabilistic framework, so without AI, what we typically do is apply likelihood-based analyses. We develop mechanistic probabilistic models of evolution, for example models of how DNA changes over time. We define the mechanisms of evolution, build an unsupervised model, and use likelihoods to estimate our parameters of interest, for example the genetic distance between taxa. This is how we typically deal with contexts where we don't have ground truth data. What we can also do, and what we typically do when we develop new models, is generate data under the assumptions of these evolutionary mechanisms, so we can simulate data sets.
We can create realistic data sets where we do know the ground truth, run them through our unsupervised models, and verify whether we are able to recover the truth as we simulated it. Once we trust the model, of course, we can run it on empirical data. What we're doing now with deep learning is taking a similar approach. With supervised learning, we can create a generative model that reflects our understanding of evolutionary processes, use this model to generate training data, and then train a supervised learning model that is able to parse our data and make a prediction for the parameters of interest. Once we have a trained model, we can feed it empirical data and obtain our parameter estimates. So a probabilistic framework and a supervised deep learning framework are just slightly different routes to the same parameters of interest. And this is what we did for biodiversity through time. Since we cannot train a model on ground truth data, we had to create a simulator of ground truth. One of the projects in our lab is implementing a software package called DeepDive, for deep learning estimation of diversity trajectories. Part of the software is a simulator of biodiversity: we simulate biodiversity using spatially explicit birth-death processes, the stochastic models typically used to describe speciation and extinction. Then we simulate the fossil record: once we have a ground truth of diversity through time and space, we can generate a fossil record sampled from this true diversity, and here we can introduce all the biases that we know occur in the real fossil record, so spatial biases, temporal biases, taxonomic biases.
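The simulate-then-bias idea can be caricatured in a few lines. Everything here is an assumption for illustration (discrete time, constant per-lineage rates, a linearly declining preservation probability toward the past); it is not DeepDive's actual implementation, just the shape of the approach: simulate a true diversity trajectory, then sample an incomplete, biased fossil record from it.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_diversity(steps=100, birth=0.1, death=0.05, n0=10):
    """Discrete-time birth-death process: at each step, every lineage
    may speciate (birth) or go extinct (death)."""
    n, traj = n0, []
    for _ in range(steps):
        births = rng.binomial(n, birth)
        deaths = rng.binomial(n, death)
        n = max(n + births - deaths, 0)
        traj.append(n)
    return np.array(traj)

def sample_fossils(true_div, base_rate=0.5):
    """Sample an incomplete fossil record: preservation probability
    declines toward the past (a toy temporal bias)."""
    bias = np.linspace(0.2, 1.0, len(true_div))
    return rng.binomial(true_div, base_rate * bias)

truth = simulate_diversity()       # the simulated ground truth (training label)
fossil = sample_fossils(truth)     # the biased record (training input)
```

Pairs like `(fossil, truth)` are what a supervised model can then be trained on.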
Once we do that, we have a ground truth, the true diversity through time, and the sampled diversity, which is what is left: this incomplete and biased fossil record. We can generate these data sets and use them to train a deep learning model. If we generate enough data sets, covering a very wide range of settings, we can hope that our model is trained in a way that, once it is fed with empirical data, it returns a realistic estimate of diversity through time. Here we use recurrent neural networks, which are basically neural nets where the nodes are interconnected through time; this is a way to account for the temporal autocorrelation between time bins. We have inputs for each time bin, which describe the fossil record in that bin, and the output is a time series of biodiversity. We can train these models on hundreds of thousands of simulations: because we generate the data ourselves, we can generate as many as we want, and we are not stuck with some limited set of ground truths. So we generate lots of data sets, train our models, and then validate them. Here we compared our model to the state-of-the-art non-AI models for estimating diversity, and we found that under different settings and different preservation scenarios, our model consistently outperforms the alternatives. Once we are happy with our models and see that they perform well in the presence of temporal, taxonomic, and spatial biases, we can apply them to real data. This is a data set for the elephant clade. Elephants are a very charismatic clade that is represented today by only three species, but we know from the fossil record that there used to be many more elephant species roaming Earth until quite recently. So this is sampled diversity through time, in millions of years.
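The recurrent mapping from per-bin fossil summaries to a diversity trajectory can be sketched as a plain forward pass. The feature choices and layer sizes below are hypothetical, and the weights are random stand-ins; in the real workflow they would be learned from the simulated data sets. The point is that the hidden state carries information across adjacent time bins, which is how temporal autocorrelation is captured.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: 3 features per time bin (e.g. fossil counts, sampled
# localities, singletons) and 8 hidden units. Weights are untrained stand-ins.
Wx = rng.normal(scale=0.1, size=(3, 8))
Wh = rng.normal(scale=0.1, size=(8, 8))
Wo = rng.normal(scale=0.1, size=(8, 1))

def rnn_predict(fossil_features):
    """Map a sequence of per-bin fossil summaries to a diversity trajectory."""
    h = np.zeros(8)
    out = []
    for x_t in fossil_features:                # one step per time bin
        h = np.tanh(x_t @ Wx + h @ Wh)         # recurrent state update
        out.append(np.exp((h @ Wo).item()))    # exp keeps diversity positive
    return np.array(out)

features = rng.random((50, 3))                 # 50 time bins, 3 features each
trajectory = rnn_predict(features)             # estimated diversity per bin
```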
Today's diversity is down here. We can read off the fossil record that there used to be many more species of elephants than there are today, but by feeding this type of data into our deep learning framework, we get an estimate of the true biodiversity through time, and we also get these envelopes here that describe the uncertainty around the estimates. What we observe is that, of course, the fossil record provides an underestimation of the true diversity for the clade. The species richness of elephants in the recent past was way higher than it is today, and indeed we reconstruct a tenfold drop in elephant diversity in less than one million years. This is, for this clade, a mass extinction that occurred very recently. Using other types of analysis we can identify the causes of this drop, which in our findings is for the most part attributable to humans. So humans killed off most of this elephant biodiversity. What we see here is that even clades with a very long and successful evolutionary history are exposed to the risks driven by anthropogenic pressure on the environment. For the elephants, the mass extinction has already happened; for many other groups, that's not the case yet, but there are thousands and thousands of species out there that are at risk today. Under some estimates, up to one million species may be threatened with extinction today. So we can use our models to understand the evolution of biodiversity, but we can also use AI to try to do something about the present risks that biodiversity is facing. Going back to supervised learning, we can use it to improve our understanding of which, and how many, species are at risk of extinction in the first place. Some of you may be familiar with the IUCN Red List: it is basically the gold standard for assessing whether species are at risk or not.
It works with a set of labels that rank species from Least Concern up to Critically Endangered, species that are expected to go extinct within the next 10 years with a 50% chance. This evaluation is usually done by experts who look at various sources of information, including the population size dynamics of each species, its range size, where the species occurs, and whether it is threatened by poaching or other pressures. After an evaluation, one species at a time, the IUCN Red List assessors label the species. This is an extremely important task because it can guide conservation efforts, but it is also extremely time-consuming because it requires experts for every single group of species out there. If you're interested in birds or mammals, that's all good, because all birds and all mammals have been assessed by the IUCN Red List, and you can use these threat assessments to make, for example, predictions for the future. Here is a study where we looked at current trends in extinction risk in birds: if things stay as they are today, so if we don't improve the current status of species, we may lose more than 100 species of birds in the next 100 years. These are the kinds of things we can do once we have these labels. The problem arises if you care about other groups that are maybe less charismatic than birds, but equally beautiful if you ask me, because those species have not been assessed to the same extent. If you're interested in plants, only about 7% of plants have been assessed by the IUCN, partly because there are so many of them: there may be 350,000 flowering plant species out there. For invertebrates the situation is even worse, with only 2% assessed, and for fungi we know almost nothing about their conservation status so far.
Meanwhile, we have a lot of other data that spans many more species than the IUCN Red List. I mentioned the GBIF database before, which collects occurrence records for millions of species. We also have other layers that can be combined with occurrence data, for example human footprint estimates, or estimates of land use and human pressure on the environment. We recently developed an R package that collects these data across species and uses the available labels from the IUCN Red List to train a predictive model and try to fill the gaps in these groups. The model is a deep neural network whose input is multiple sources of information, for example species occurrence data from GBIF, human footprint data, and environmental data. The model can also incorporate phylogenetic information and traits, if we think they may be relevant to predicting the conservation status of species. It then maps these inputs to a prediction of whether a species is threatened or not. We recently applied this framework to a database of tree species; there are about 58,000 species of trees out there. Using this automated framework, we were able to complement the IUCN Red List by doubling the number of tree species that have an assessment, reaching 49,000. So, using a fairly simple pipeline, we collected occurrence data for 49,000 species of trees, trained a model on the trees that had already been assessed by the IUCN Red List, and then made predictions for the others. We can then use these predictions to map where the highest fraction of species at risk of extinction is found. For example, we found, unsurprisingly, that Madagascar has the highest fraction of threatened tree diversity, partly because it has an extremely high and extremely endemic diversity. Overall, we found that about 40% of tree species worldwide may be threatened with extinction.
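The input-to-prediction mapping just described can be illustrated with a minimal, untrained sketch. The feature blocks, their sizes, and the single-layer weights are hypothetical stand-ins for the occurrence, footprint, and environmental summaries mentioned above; a real model would be a trained multi-layer network.

```python
import numpy as np

rng = np.random.default_rng(7)

def predict_threat_probability(occurrences, footprint, climate, W, b):
    """Concatenate per-species feature blocks and map them to a
    probability that the species is threatened (sigmoid output)."""
    x = np.concatenate([occurrences, footprint, climate])
    logit = x @ W + b
    return 1.0 / (1.0 + np.exp(-logit))

# Hypothetical feature sizes and untrained weights, for illustration only.
W = rng.normal(scale=0.1, size=(10,))
b = 0.0
p = predict_threat_probability(
    occurrences=rng.random(4),  # e.g. range size, record count, latitude span
    footprint=rng.random(3),    # human-footprint summaries over the range
    climate=rng.random(3),      # climate summaries over the range
    W=W, b=b,
)
```

Training fits `W` and `b` (and, in practice, many more layers) on species that already have IUCN labels; prediction fills in the unassessed ones.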
We can also map threatened species by biome, looking at which types of ecosystems they live in, and we found that they are found pretty much everywhere. There isn't one biome, say tropical habitats, accumulating them all; they're found everywhere. As I mentioned earlier, with these types of models we are not only interested in making our predictions as accurate as possible; we are also interested in knowing in which cases our model doesn't actually know, so under which circumstances the model is unreliable. We can do this, for example, by mapping prediction accuracy spatially. Using training and validation data, we found that Southeast Asia is the region where our prediction accuracy is expected to be lowest. Knowing this is important because we can be a bit careful in interpreting whatever we estimate for this region, but it can also direct efforts to manually assess the conservation status of these species; we can use this information to refocus our efforts to complete the database. The other type of model I would like to talk about today is based on reinforcement learning. We've seen different instances of supervised learning, where we either had a ground truth, as in the case of modern biodiversity, or made up our ground truth using generative models and simulations, as we needed to do to infer biodiversity in deep time. Reinforcement learning, which you may have come across, is the type of AI used for dynamic tasks like driving a car or flying a drone. It deals with an environment that is dynamic in itself, and it basically learns how to interpret the environment and take an action. The context in which we use reinforcement learning in our work is conservation.
Here we have an environment that is dynamic, which is basically Earth: a system with multiple species, which may be threatened or not, each with a geographic range, so this is a spatially explicit framework, and the species are connected by phylogenetic relationships. In this environment we don't only have the biodiversity aspect, this multitude of species and their geographic ranges; we also have other factors that affect the conservation status of these species. We have land use, which may change dynamically over time and may displace species or reduce their geographic ranges. We have costs, which reflect the cost of protecting particular areas; this again can be dynamic and change over time, as a function of land use for instance. Within an environment where we want to make policy decisions about what to conserve, or what to focus our conservation efforts on, we also have a budget to work within. Ideally we would protect every species and every single corner of Earth. In practice this is not possible: we are limited by the need to generate resources from land, but also by budgets that constrain how much we can actually protect and which conservation policies we can implement. So this is the environment. In the same framework we have conservation targets, which define what we want our conservation policy to achieve, for example, in which cases the policy counts as successful. If our target is to protect as many species as possible and prevent every species from going extinct, then the reward for implementing the conservation policy will be positive every time a species doesn't go extinct and negative every time a species does go extinct.
So we can define this reward system to tell the reinforcement learning algorithm when it is doing a good thing versus a bad thing. The concept behind reinforcement learning is that you have an agent, basically the policymaker, that reads this environment and, based on what it sees, makes a decision: here, to protect a particular area. When this protection is applied, the agent collects a reward, basically like a score in a video game: did the protection action lead to a positive or a negative outcome? Based on the rewards, the agent optimizes the way it makes decisions. In the beginning it just reads the environment and translates all of this information into an action; we use a deep neural network to do that. The action might be selecting an area for protection, so that area is protected from then on. The action has repercussions on the environment, so the environment is updated: the land use in this area changes, our budget shrinks because we have spent some of it implementing the action, and there are potential repercussions for the species that occur in the area. From this updated environment, the agent gets a reward, positive if the outcome is good, negative if it is bad. Based on the updated environment and the reward, the parameters are optimized and the next action is taken, the next area to protect, say this one here, which again has repercussions on the environment. The agent basically plays this game of protecting areas over and over, and when the budget runs out, it starts again fresh with a new budget and tries again.
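The protect-observe-reward loop can be caricatured with a tiny tabular learner. This is a deliberate simplification of the deep-RL setup described in the talk (one cell per action, a fixed cost per cell, cell richness as the reward, no land-use dynamics), purely to show the mechanics of playing the game repeatedly under a budget and learning from rewards.

```python
import numpy as np

rng = np.random.default_rng(3)

N_CELLS, BUDGET, COST = 20, 5.0, 1.0
species_in_cell = rng.integers(1, 10, size=N_CELLS)  # toy richness per cell
values = np.zeros(N_CELLS)                           # learned action values

def play_episode(eps=0.1, lr=0.5):
    """One episode: spend the budget protecting cells, learn from rewards."""
    protected = np.zeros(N_CELLS, dtype=bool)
    budget = BUDGET
    while budget >= COST:
        if rng.random() < eps:                       # explore a random cell
            a = int(rng.integers(N_CELLS))
        else:                                        # exploit learned values
            a = int(np.argmax(np.where(protected, -np.inf, values)))
        if protected[a]:
            continue                                 # cell already protected
        protected[a] = True
        budget -= COST
        reward = species_in_cell[a]                  # species saved by action
        values[a] += lr * (reward - values[a])       # move toward observed reward
    return int(species_in_cell[protected].sum())

scores = [play_episode() for _ in range(200)]        # the agent replays the game
```

Over many episodes the value table shifts toward the richest cells, which is the tabular analogue of the neural network learning to map environment to action.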
Every time, the agent collects these rewards and tries to learn from them. Learning from the rewards means learning how to translate whatever is fed to the network as an environment into the right action. So the agent plays this game many, many times, optimizing its parameters and learning how to best map any input data into a decision, an action. We can use this with empirical data, for example biodiversity data across many species of plants; three are shown here, but we actually applied this to 15 more species. We can combine this biodiversity data with socio-economic and environmental data, describing for example the disturbance to the environment and the cost of each cell in the map, feed it through our trained policy, and obtain a map of conservation priorities. We can use this framework to evaluate the outcome of our policy, and of different types of policies. For example, do we need to know everything about the environment to make the best decisions, or can we get by with a proxy? Do we need to know exactly where every species is, or can we use an approximation and still get a good outcome? We can validate this through simulations, where we simulate an environment in which we know everything, where we potentially have access to every single individual of every single species, and then give only partial information to the agent to see how well it does. Here, along different axes, biodiversity, protected area, genetic diversity, and species value, we map the outcome of a policy trained through reinforcement learning. If we let the agent observe the environment only once and then spend all of its budget to conserve everything at once, we get a certain outcome, represented by this polygon.
If instead we let the agent observe, at every time step, the outcome of its previous action, we get a much better outcome. This means that if you monitor biodiversity as you apply your conservation policy, you get much better outcomes than if you monitor once and then apply an entire conservation policy that is supposed to work for the next 30 years. Even if the information is not perfect, here simulated as citizen-science-type information where we don't let the agent know exactly where every species lives, we already get a strong improvement in the outcome of the policy. And if we let the agent know every single detail about biodiversity, we get some further improvement, but it is not as crucial as monitoring biodiversity regularly throughout the implementation of the policy. We can use reinforcement learning to optimize policies that target different objectives. This was a policy focused on protecting as much biodiversity as possible, but we can also train a policy to maximize the commercial value of the protected species, and this results in a different outcome. So we can use this framework to evaluate the trade-offs between different conservation objectives: if we focus on species value, or if we focus on area, we will have trade-offs in how much biodiversity we protect. We have just started applying this framework to evaluate the 30-by-30 conservation pledge, which you might have heard of: last year a landmark agreement on biodiversity was signed by over 200 countries worldwide, aiming to protect 30% of Earth by 2030. This is an enormous task that we all hope will be successful. But the question then is: which 30% should we protect, and what are the potential outcomes of choosing that 30% under different objectives?
We can use our reinforcement learning framework to make predictions for the potential outcomes of different implementations of the 30-by-30 policy. To do that, we again simulated data sets; working with simulated data is useful because we can compare different implementations of the policy and have replicates. Here we generated 100 biodiversity data sets with different numbers of species, different species ranges, different costs of conservation, and different simulated habitat-degradation patterns. We then implemented the 30-by-30 policy under different settings. One focused on minimizing the cost of the implementation. One focused on a naive metric, simply trying to protect the areas with the highest biodiversity. One focused on mean species abundance, a metric of the intactness of the environment. And the last one was based on our reinforcement learning optimization. We can run these different policies within our framework and then evaluate their outcomes: how much each policy costs in the end, how much intactness we achieve with the protection, how much we reduce the overall threat, using the STAR metric that is commonly used for evaluating biodiversity risk, and how many of the simulated endangered species were actually protected. We did this through simulations, and now I'm going to show you the outcome of these four policies as a relative change with respect to the minimum-cost implementation, so these are percentage differences between each policy and the minimum-cost policy. One thing to note is that any policy focused on biodiversity will cost a lot more than a policy that simply minimizes cost.
This is not surprising, but it is quite important, because some governments at least will probably prefer that type of 30-by-30 implementation, so it matters to show that if we are interested in biodiversity, we are going to have to spend more than the bare minimum to do a good job. On the other hand, we see that any policy that is focused on biodiversity rather than on cost improves the outcome in terms of biodiversity: it significantly improves the intactness of the environment compared to a minimum-cost policy, it reduces threat, and it protects more species. So even without AI, whatever policy we use to prioritize the 30% to be protected by 2030, focusing on biodiversity will have a much better outcome than focusing on costs alone. What we also observe from these experiments is that using AI actually improves the outcome of the 30-by-30 implementation: we significantly reduce threat and protect significantly more threatened species with our AI framework than with the more naive policies. This is basically a justification for using a more complex model, rather than a simple metric, when prioritizing our conservation efforts. We have a preprint out that looks at many more statistics than the ones I showed, which will hopefully convince you, and hopefully the policymakers as well, that we need models to do a good job at 30-by-30 implementation. Overall, what we're doing in our group is using AI models to predict and estimate how biodiversity has evolved in deep time, and we hope to be able to use AI also to predict the future of biodiversity and hopefully help bend the curve of biodiversity loss. With this, I would like to thank my lab for contributing to the different parts of the projects I've talked about today.
I would like to thank the CIP for organizing this symposium, Patricia in particular, and all of you for listening. Thanks.