Great. So now we have with us James Nightingale, who's going to talk about PyAutoFit, a classy probabilistic programming language for data science. Welcome, James. He is an observational cosmologist and postdoctoral researcher at Durham University, where he focuses on strong gravitational lensing, devising new ways to use it to study dark matter and the distant universe. So, over to you, James.

Can I just first check, is my microphone working? Yes, yes, it's working fine. And is my screen displaying correctly? Yeah, it's on full screen. Great. Excellent.

Okay, so good afternoon everyone. I'm James Nightingale, thank you for the introduction. I'm a cosmologist at Durham University and my research typically focuses on studying galaxies and trying to understand the nature of dark matter. In the past couple of years, we've found that the statistical methods and techniques we use to do that have far-reaching applications in the data science domain. So we ended up developing this open-source software, PyAutoFit, to basically allow people to use these techniques in a far more generalized setting. For this talk, I'm going to give you a run-through of model fitting, what PyAutoFit is and what probabilistic programming is, to give you a sense of what this software does. And then I'm going to describe how we ended up here, starting with our cosmological use case of strong gravitational lensing.

I want to begin by making sure we're all in the same place in understanding what I mean when we talk about model fitting, probabilistic programming languages and so on. So I've got this initial slide just to say that when we're talking about model fitting, these are the types of things I mean: I've got some data, I've got these data points, and I'm going to fit them with a curve. I want to understand what model, corresponding to this red line, gives the best fit to the data. I've got another image here showing a pictorial representation of a Markov chain Monte Carlo analysis, or MCMC. So if you're familiar with that type of stuff, this is the domain we're talking about. And I've also got Bayes' theorem, because all of these model fitting tools, all of the things you can do with PyAutoFit, can be done in the Bayesian context, following Bayes' theorem and so on.

So to really drive home what we're talking about, I'm going to very quickly go through the simplest model fitting example one could conceive of. Here I have some data. It's a 1D dataset that clearly contains a signal: it contains a Gaussian. The data has noise. And my task as a model fitting expert is to find out what Gaussian best fits this data, what Gaussian corresponds to this signal. The way I would approach the model fitting is as follows. I would first compose my model as a Gaussian, which has three parameters: a center, an intensity and a sigma width value. I would draw a set of parameters via my model fitting algorithm; here you can see we've got a center of 60. I would then use these parameters to create a model Gaussian, the Gaussian that corresponds to these parameters; you can see it's centered on a value of 60. The next step in the model fitting process is to use this Gaussian to fit my dataset: I compare my model to the data. You can see here that this Gaussian isn't really representative of my data.
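To make this toy example concrete, here is a minimal, library-free sketch of the 1D Gaussian profile just described; the grid, parameter values and function name are purely illustrative and are not PyAutoFit's API:

```python
import numpy as np


def gaussian_1d(xvalues, center, intensity, sigma):
    """Evaluate a 1D Gaussian profile on a grid of x values."""
    return intensity * np.exp(-0.5 * ((xvalues - center) / sigma) ** 2)


# A grid of pixel coordinates and one trial parameter set drawn by the search.
xvalues = np.arange(0.0, 100.0, 1.0)
model_data = gaussian_1d(xvalues, center=60.0, intensity=1.0, sigma=10.0)
```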
I would subtract the two from one another to get residuals, and define some sort of figure of merit or likelihood function that quantifies how well this model Gaussian fitted my data. You can see here it's not done a very good job: the residuals and chi-squared values are very large. I would then repeat this process using some model fitting algorithm, or what's called a non-linear search. This non-linear search would guess lots of values of the parameters and eventually find a solution that gives us the highest likelihood, and indeed tells us what Gaussian parameters this dataset corresponds to. So this is all very simple, and hopefully you're following, but I wanted to really start at the beginning so we're all on the same page about what probabilistic programming languages do, and therefore what PyAutoFit does.

So what is a probabilistic programming language, or PPL for short? Well, these are basically software packages, frameworks or statistical inference libraries that make it straightforward to compose a probabilistic model, i.e. the Gaussian I just showed you, and fit it to data, i.e. perform inference, automatically. People who are familiar with this type of software will know of many PPLs. Some of the most popular are PyMC3 and Stan, and there are many that focus more on machine learning and deep learning techniques, like Pyro. Each of these probabilistic programming languages is suited to different problems; they have different core features, so they have strengths and weaknesses.

So there is a question here: why have we ended up developing our own PPL to do astronomy when these large open-source projects already exist? The reason is that we've found that the type of statistical inference problems, the type of model fitting challenges, that we face in astronomy and cosmology, and, we're learning, in a wider data science setting, were problems that existing PPLs weren't really suited to. I've listed a couple of examples here. In astronomy, we have these large homogeneous datasets, for example images of thousands of galaxies, and all we want to do is fit those images one by one in an identical, homogeneous fashion. We just want tools that make doing that straightforward, that make it straightforward to process large libraries of results and then do our science. Another example is that in astronomy we often have very expensive likelihood evaluations, and our model fits can take days if not months to run, whereas for a lot of PPLs a fit typically takes a minute, and the challenges you face there are very different. So PyAutoFit has lots of tools for customizing how the model fit is performed, as well as doing this in the context of massively parallel computing. We also have a need to fit each dataset with many different models and streamline model comparison, using Bayesian inference to determine what the best models are, which again isn't something that's typical of all PPLs. The way we concisely describe this is that PyAutoFit is highly customizable model fitting software for big data challenges in the many-model regime. I'll explain how PyAutoFit can be used in a second. But first, just to get the links and whatnot out of the way: PyAutoFit is obviously an open-source project; we're developing it to do cosmology, but we want as many people to use it as possible.
We have a GitHub and all the things you'd expect; if you're interested in this type of stuff, check it out, the links are listed on the schedule page. And to really drive home that we want as many people using this as possible, we have written a Jupyter notebook lecture series. For our students at Durham, we use this to teach them about statistics and model fitting, but the notebooks are publicly available, so anyone who wants to get into this domain and learn how to do this type of model fitting should absolutely check them out. They're on the readthedocs, they can be run on Binder, and we're getting very good feedback that they're a great introductory way for someone to get into Bayesian inference, statistics, model fitting and so on.

Okay, so now let's look at how we would actually use PyAutoFit, using what we call our classy interface. To demonstrate how one would approach model fitting in PyAutoFit, I'm going to use the same example problem I just showed: fitting data that contains a Gaussian with a Gaussian. How would we get PyAutoFit to find the parameters that correspond to this red curve? To set up a model fit with PyAutoFit, you basically have to undertake three steps; it requires you to write three Python classes. The reason we call this classy probabilistic programming is that it's built heavily on Python's class data structures.

First of all, we need to write our model as a class. This is an example of what we'd write: we call the class Gaussian because this is the model component we're going to fit. The crucial thing to understand is that the parameters of the Gaussian we're going to fit, the center, the intensity and the sigma we saw previously, are written as the input parameters of our constructor, our init function. So when we do the fitting, PyAutoFit will read this init constructor, recognize these parameters, and compose a model and a non-linear parameter space with these dimensions. That's the first thing to understand about the API. The other nice thing about using Python classes is that you can of course extend this Gaussian class with functions and tools that do the things you need it to do. So this function will allow us to create the model Gaussian that we compare to the data, and we're about to use it to perform our model fit.

So that's the first of the three classes we need: our model. The second thing we need to write is an analysis class. This is where our data meets our model, and where we fit the model to the data in order to get our figure of merit, our likelihood. The analysis class has two parts. It's got another init constructor, and here is where you put your data, your noise map, anything you need to do the model fit. Alongside that, you define your log likelihood function. This is the function that takes an instance of the model, fits it to the data and returns the likelihood; it tells PyAutoFit how well that model fits. The black magic, the crucial thing PyAutoFit is doing, is that this instance that comes in is, as we can see, an instance of our Gaussian class, and the values of its input parameters, which we saw here, have been set by our model fitting algorithm, the non-linear search.
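As a concrete illustration, here is a condensed sketch of those two classes, loosely following PyAutoFit's HowToFit tutorials; the method name profile_from_xvalues and the exact likelihood normalization are choices of this sketch rather than something mandated by the library:

```python
import numpy as np
import autofit as af


class Gaussian:
    def __init__(self, center=0.0, intensity=0.1, sigma=0.01):
        # PyAutoFit reads these constructor arguments as the model's free parameters.
        self.center = center
        self.intensity = intensity
        self.sigma = sigma

    def profile_from_xvalues(self, xvalues):
        # Create the model realization that is compared to the data.
        return self.intensity * np.exp(
            -0.5 * ((xvalues - self.center) / self.sigma) ** 2
        )


class Analysis(af.Analysis):
    def __init__(self, data, noise_map):
        # The dataset (and anything else the fit needs) lives on the analysis.
        self.data = data
        self.noise_map = noise_map

    def log_likelihood_function(self, instance):
        # `instance` is a Gaussian whose parameters were set by the non-linear search.
        xvalues = np.arange(self.data.shape[0])
        model_data = instance.profile_from_xvalues(xvalues)
        residual_map = self.data - model_data
        chi_squared = np.sum((residual_map / self.noise_map) ** 2.0)
        noise_normalization = np.sum(np.log(2.0 * np.pi * self.noise_map ** 2.0))
        return -0.5 * (chi_squared + noise_normalization)
```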
So if we're doing a model fit that has priors, PyAutoFit will take care of all of that behind the scenes, and you just focus on writing how the actual likelihood is computed. So we have our model and we have our analysis class. The final thing we need to do is put these together to perform a fit. We compose our model by passing the Gaussian class to PyAutoFit's model object. We create our analysis by passing it the data, which we've already loaded. We then choose a non-linear search; I've chosen a Markov chain Monte Carlo fitting algorithm called emcee, which is very popular in cosmology, but we of course also support the SciPy maximum likelihood estimators, and we have Bayesian inference tools like nested sampling, if you've heard of that. By passing the model and analysis to this emcee search, we perform the fit and get the result, which corresponds to the red curve: it gives us the Gaussian that fits the data.

And just to really emphasize, this result object you get from PyAutoFit has everything you would need to interpret and inspect how well the model fits the data. It has the best-fit red curve, but it also has tools for error analysis, it has all of the parameter samples of your non-linear search, and it has visualization tools for creating these probability distributions. You can see here that the input value of 50 corresponds to high probability when we visualize it in this way.

Okay, so that's nice: it's straightforward to compose and fit a model in PyAutoFit. But at the moment it's not clear what this library lets you do that you couldn't do in another PPL, or indeed just by writing the Python code yourself. It makes the process easier, but there's nothing overly compelling about it yet. So now let's start to look at how PyAutoFit makes it straightforward to customize different aspects of your model fitting, and this is where the use of Python classes really starts to come into its own. This has downsides, of course: Python classes are a slightly less concise interface, and they require a basic understanding of classes and object-oriented programming. But as I said, it really allows us to build a far more customizable model fitting experience for the user.

So here's an example of how one would customize the model. In this example, I don't want to fit one Gaussian to my 1D dataset; I want to fit two Gaussians, which you can see I've instantiated here. To do this, I just create the two Gaussians and then, at the end, combine them in a PyAutoFit collection object. This is the beginning of how we start building models of more complexity in PyAutoFit, by combining individual model components. Along the way, I can take a number of steps to customize the model I ultimately fit. If I'm doing Bayesian inference, I can manually set the priors on each parameter, which is shown here. I might know that the sigma value of a Gaussian is, say, 0.5; let's pretend I knew that. I can just set it to a float, which removes it as a free parameter, reducing the dimensionality of parameter space by one, and all of my model fits will use this value for this Gaussian. Along the same lines, I can link two parameters in the model; here I make my two Gaussians' centers align with one another, again reducing the parameter space's dimensionality by one. And I can do other things like make assertions.
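A rough sketch of what that fit script and model customization might look like, reusing the Gaussian and Analysis classes from the sketch above and based on the class names in PyAutoFit's documentation; the search settings, prior limits and file names are assumptions for illustration, and names may differ between versions:

```python
import numpy as np
import autofit as af

# Placeholder data files; in practice load whatever 1D dataset you are fitting.
data = np.loadtxt("data.txt")
noise_map = np.loadtxt("noise_map.txt")

# Compose the model, create the analysis and choose a non-linear search (emcee MCMC).
model = af.Model(Gaussian)
analysis = Analysis(data=data, noise_map=noise_map)
search = af.Emcee(name="fit_gaussian")

result = search.fit(model=model, analysis=analysis)
print(result.max_log_likelihood_instance.center)

# --- Customizing a two-Gaussian model ---
gaussian_0 = af.Model(Gaussian)
gaussian_1 = af.Model(Gaussian)

gaussian_0.center = af.UniformPrior(lower_limit=0.0, upper_limit=100.0)  # manual prior
gaussian_0.sigma = 0.5                 # fixed to a float: no longer a free parameter
gaussian_1.center = gaussian_0.center  # linked: the two centers share one parameter

# Combine the components into a single model via a collection.
model = af.Collection(gaussian_0=gaussian_0, gaussian_1=gaussian_1)
```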
We have lots of tools to customize the model fit to what you specifically need for your problem. The other nice thing about the model API is that, having seen that we use Python classes to define our model components, you can imagine that if you've got a problem with lots of slightly different models you want to fit and compare, the API naturally allows you to do this. So here I've got two more examples of 1D profiles: there's a variant of the Gaussian class where I've just added an extra parameter, and you could also imagine a one-dimensional exponential. So if you've got a problem where the model can be broken down into these different pieces, and you often want to compose a model by building them up together like Lego, this is the sort of API that this software really tries to facilitate.

We also have a lot of customization on the analysis. I'm only going to show one example here, but it's pretty cool. If you add a visualize function to your analysis class, PyAutoFit will output the current results of the best-fit model on the fly. So I've told this visualize function to output the images I've been showing you, that is, via matplotlib, a one-dimensional schematic of how well the model fits the data. When I'm fitting models for cosmology that could take months on the supercomputer, having this tell me after a couple of days whether the model is working or not can save me months, because I can stop the run based on this immediate feedback. Along the same line of customization, there's lots of control over the actual non-linear search itself. For all of the libraries we support, you can customize their parameters, and we also add functionality on top of these libraries. For example, for those familiar with Markov chain Monte Carlo, we have built-in tools for auto-correlation analysis. We're basically trying to add value to these libraries if one adopts PyAutoFit to undertake their model fitting problem.

I'm about to talk about the astronomy, but I just want to quickly list some of the advanced features that are a bit too technical, a bit too detailed, to discuss in a talk like this, but at least allude to the sort of thing one can do with PyAutoFit if you really start to go into it. One that I really want to highlight is our support for outputting results into a database. You can output the results to hard disk, and it will create an ordered folder structure that you can navigate with your mouse, which is all quite nice. But that only works if you've got tens of datasets; for a small number of datasets, yes, that will work. But we are fitting thousands of galaxies with hundreds of models, so our results correspond to hundreds of thousands of entries. So we had to develop an SQLite relational database that basically streamlines the management and interpretation of these results. The basic model is as follows: as you run your model fits with PyAutoFit, all of the results are automatically written to an SQLite database. You can then load this via a Jupyter notebook, query the database for the results you care about, and from there begin to investigate how well your model fits and what the results are telling you. All of our result API, the visualizations I showed earlier, are built around this database-plus-Jupyter-notebook workflow. There are also lots of advanced model fitting techniques that I'm not going to cover.
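To illustrate the visualize hook, here is a sketch that extends the Analysis class from the earlier example; the exact signature of visualize and the paths object's attributes are assumptions based on the PyAutoFit documentation I've seen, so check the current docs before relying on them:

```python
import numpy as np
import matplotlib.pyplot as plt


class VisualizingAnalysis(Analysis):  # the Analysis class sketched earlier
    def visualize(self, paths, instance, during_analysis):
        # Called on the fly by PyAutoFit with the current best-fit instance,
        # so a long-running fit can be inspected before it finishes.
        xvalues = np.arange(self.data.shape[0])
        model_data = instance.profile_from_xvalues(xvalues)

        plt.plot(xvalues, self.data, "k.", label="data")
        plt.plot(xvalues, model_data, "r-", label="model")
        plt.legend()
        plt.savefig(f"{paths.image_path}/fit.png")  # assumed output location
        plt.close()
```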
But basically these are very bespoke statistical methods that, for certain problems, could be really, really useful. The one I'll briefly mention is that we had a problem where a non-linear parameter space was so complex and so difficult to fit that we could not do it efficiently. So we built a grid search of non-linear searches, which basically carves up the parameter space over a couple of dimensions and then fits those reduced parameter spaces in a massively parallel fashion. It's a bit of a weird thing, and you probably don't want to do it too often, but if you do need it, it's a really powerful tool for overcoming the sorts of problems you can face with these types of fits. We have other tools that I don't have time to talk about; they're fully described on the readthedocs under the features tab.

Okay, so that's PyAutoFit. That's the type of model fitting we're trying to do. Hopefully you've got a sense of what this library is about, and I now want to describe how we got to PyAutoFit from our initial cosmology use case, so you get a sense of the actual application that drove this; it might ring some bells with the sorts of things that you do. To do this, I obviously first need to explain what strong gravitational lensing is. Most astronomers, when they study galaxies, look at things like this: a galaxy like the Milky Way. The typical astronomer takes their favorite telescope, points it at a galaxy, gets an image like this, and then performs model fitting on that image to study their particular scientific interest. Strong gravitational lenses are a very unique phenomenon where, instead of observing one galaxy, you observe two galaxies perfectly aligned down our line of sight. This red galaxy is the foreground galaxy that's closer to us. You can see it's emitting red light, but it also has mass, and that mass curves space-time in on itself, so the light from the background source galaxy doesn't travel straight into our telescope but bends around space-time and therefore becomes stretched, sheared and distorted into this ring-like appearance. This is called an Einstein ring, after Einstein.

So these are an example of the sort of problem that drove the development of PyAutoFit. This is a two-dimensional schematic, just to make sure we're really on the same page about what a strong gravitational lens is. We've got a foreground galaxy here; it's curving space-time such that the red light of this background source travels on this curved trajectory around it and into our telescope. This actually means the background source appears multiple times, as seen here. So if you're interested in astronomy, there are galaxies that we genuinely observe more than once due to this phenomenon: we see the same object in the universe multiple times.

I'm going to talk about how this informed PyAutoFit, but just for the machine learning aficionados in the audience, people who are into deep learning, this phenomenon has a growing literature of machine learning studies. You should check it out if you're interested in this sort of stuff. It's a great use case because you can generate large training datasets very cheaply, and there are large astronomical instruments coming that are going to find thousands of these. It really is your stereotypical needle-in-a-haystack machine learning, data science, big data problem.
So if you're interested in astronomy, you should definitely do a Google search on this and have a read of what's out there. I'll obviously say that I developed an open-source library that does this analysis, PyAutoLens. So Google PyAutoLens if you're interested; this is what ultimately led to us developing PyAutoFit. And again, we have a full set of Jupyter lectures for it, so if you want to get into this, check them out.

Okay, so that's enough shameless plugging of my other software; let's get back to the science. What drove the development of PyAutoFit? What was it about these strong gravitational lens systems that pushed us in the direction of making this open-source statistics library? It's basically how this phenomenon makes one think about model composition, in particular multi-level model composition, something I haven't yet touched on in this talk. When I look at a strong gravitational lens now, I don't look at a pretty picture and think, wow, that's some awesome thing we've seen in the universe. I think about how it decomposes into distinct model components that I then want to fit. So I have a strong gravitational lens here, and my scientist brain says: well, there's a foreground galaxy, which has a set of model parameters associated with it that I want to learn. There's a background source galaxy, which has another set of parameters I want to learn. But then I break the model down further. This lens galaxy doesn't just have one model component: it has a model describing its emission, a light model, and it also has a separate model describing its mass, which is what defines how the light in the universe is curved. Conversely, my background source only has a light model; I don't need to know anything about its mass. So this type of problem forces you to break the model you imagine into these distinct components with multiple levels, and with PyAutoFit we are now going to construct a multi-level model according to this schematic.

For the lower levels of this model, we can use the exact same API I showed previously for the Gaussian. We write model components; this is an example of a light profile, and these are the input parameters that describe the light of a galaxy. We can attach functions that we use to fit the data, which is shown here. And again, with PyAutoFit, as I alluded to before, we can write other Python classes with a very similar format, a very similar API, describing the mass of galaxies and the like. This is where we got to with the Gaussian before. But we have a slightly different problem now, because we want a multi-level model that doesn't just have light and mass profiles, but has distinct model components describing these galaxies, which may have their own parameters. We need to use Python classes to construct a multi-level model; in particular, we need to use hierarchies of Python classes to build models that go up a level. And this is where, I think, the really compelling aspect of the PyAutoFit API comes in. So this is another Python class where we've now written a Galaxy object, and the crucial thing to understand is that the inputs of its init constructor are themselves lists of the model components I just showed you. So PyAutoFit will see an object like this Galaxy and understand that its init constructor contains other PyAutoFit model objects.
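As a sketch of the idea: the class names, parameters and the use of lists for sub-components below are illustrative assumptions rather than PyAutoFit's or PyAutoLens's actual classes, and how lists of sub-models are passed may differ between versions:

```python
import autofit as af


class LightProfile:
    def __init__(self, centre=(0.0, 0.0), intensity=0.1, effective_radius=1.0):
        # Free parameters describing a galaxy's emission.
        self.centre = centre
        self.intensity = intensity
        self.effective_radius = effective_radius


class MassProfile:
    def __init__(self, centre=(0.0, 0.0), einstein_radius=1.0):
        # Free parameters describing a galaxy's mass, which deflects the source's light.
        self.centre = centre
        self.einstein_radius = einstein_radius


class Galaxy:
    def __init__(self, redshift, light_profiles=None, mass_profiles=None):
        # The constructor's inputs are themselves model components, so PyAutoFit
        # can build a multi-level model from this hierarchy of classes.
        self.redshift = redshift
        self.light_profiles = light_profiles or []
        self.mass_profiles = mass_profiles or []


# Compose a lens galaxy (light + mass) and a source galaxy (light only), then the model.
lens = af.Model(
    Galaxy,
    redshift=0.5,
    light_profiles=[af.Model(LightProfile)],
    mass_profiles=[af.Model(MassProfile)],
)
source = af.Model(Galaxy, redshift=1.0, light_profiles=[af.Model(LightProfile)])

model = af.Collection(lens=lens, source=source)
```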
And it will use this hierarchy of Python classes to construct a multi-level model. You can also add additional parameters to these objects, so you again have a very nice level of customization over the model you build. I've got functions here that will be used to compute the likelihood; I'm not going to go into the details of how they fit the data, I'm just trying to really sell the composition of these types of models. So we're trying to construct a multi-level model like this, using this Galaxy class. This is the Python code we've been writing: it's the same tools we saw before, but instead of just passing a Gaussian to the model, we are now passing it a galaxy, and within that galaxy we're filling in the light and mass profiles you would use. So if you've got a model fitting problem where you can really break the model down into these distinct components, PyAutoFit has an API that allows you to use those components to build a model with arbitrary dimensionality, arbitrary complexity, and so on. In this example, the model has 16 free parameters, but you could easily make it have hundreds just by adding more galaxies and more light profiles. This is the analysis class; there's not a lot to say here. The key point is that the instance, which previously only contained the Gaussian, now contains multiple levels: there's a galaxy here, the galaxy has a light profile, the light profile has a parameter. So the multi-level model will be constructed by PyAutoFit in the non-linear parameter space and come into this likelihood function in the most convenient, usable way you could imagine.

Just to wrap up now, and to try and sell why this is so compelling for certain problems: we then had a dataset with an object like this, which is called a galaxy cluster. It has hundreds of galaxies and hundreds of background source galaxies, and these galaxies can have multiple mass profiles and multiple light profiles. But because we designed the composition of models in the way I just described, we could compose and fit a model to this without having to write any more source code. The PyAutoFit API was extensible such that we could fit models of any nature given this new dataset. So that's really what we're going for. We're trying to create a model fitting library that compels you to design your model fitting problem in the most object-oriented way possible, so that you can then build and fit these models in a fully extensible way. So that's the summary; I think I'm at about the 25 minutes I set out for. I'll quickly mention that we have a whole other use case, to do with a study of cancer, that I've not had time to talk about today, but absolutely check it out if you're interested. Thanks for listening. Yeah, I'll stop there.

Thank you, James, for that very amazing talk. We do have some time to take a couple of questions. I'll just show them right here, and then we can take them. For someone who is new to probabilistic programming, what does intensity signify, in addition to mean and standard deviation? Okay, so let me go back to that slide. In this example, intensity is a parameter of the Gaussian, and it's nothing more than that. In this sense, the intensity is basically the normalization of the Gaussian. It's one of my three model parameters that, in the way I've chosen to parameterize the Gaussian, defines how high this red curve goes.
So if I doubled the value of intensity here, the model Gaussian I'd create would be twice as high as pictured here. So it doesn't signify anything like a mean or a sigma or a full width at half maximum; it doesn't really signify anything meaningful in the context of a Gaussian. It was just how I chose to parameterize the Gaussian in this particular setting. So sorry for the confusion there.

All right, thank you for answering that. Let's get to the next question. How does PyAutoFit compare to PyMC3, and when should one be used over the other? Yes, this is a great question; it's something I've thought long and hard about. I tried to allude to it earlier: there are a lot of use cases where you should absolutely use PyMC3 and there'd be no point using PyAutoFit. But if you've got a very data-driven use case, if you're fitting large datasets and you need a high level of customization of your model fit, PyAutoFit can be a lot more useful. To try to answer a bit more concretely: my experience of PyMC3 was that its API is very much oriented around understanding a probability distribution, for example trying to understand the integral of a function or doing some thermodynamic analysis, rather than fitting data. It was never clear to me with PyMC3 how one would feed data and a noise map through the analysis; you have to use specific tools. Whereas with PyAutoFit, the API that greets you says: this is where your data goes, this is how you fit your data. So I'd say it's an extremely hard question to answer succinctly, but the notion of having data and fitting that data with a model is something that strikes you in the face immediately with PyAutoFit, whereas other PPLs often deal with statistical problems in a slightly different context. But it's really hard to give a straightforward, simple answer about when you should use one PPL over the other.

Well, thank you for that answer. Okay, the audience would love to connect with you in the breakout room, and thank you for this amazing talk. Thank you very much.