Good. All right. Thanks. Thanks to the Montreal Python organizing folks for letting me talk while I was here in Montreal. It's great to be here. And I love coming to Montreal. I try to do it at least once a year. Thanks also to whoever opened the windows. I'm going to be talking about statistics, so combining statistics and lack of oxygen is a recipe for disaster. So I'm ostensibly a professor of biostatistics at Vanderbilt, which is in Nashville. But I'm really a data scientist. What that means is I'm the worst programmer in a room full of programmers, I'm the worst statistician in a room full of statisticians, and I'm the worst basic scientist in a room full of basic scientists. And PyMC3 — PyMC in general — is a project that I started back when I was a postdoc, when I really didn't know anything about Python programming. And so I'm going to talk a little bit about this here. Before I do that — this is way different from the first talk, which I enjoyed a great deal. I'm certainly not doing any live programming, sorry. But a couple of questions, though. How many folks in here would describe themselves as data scientists? A few? Good. How many know what Bayesian statistics is? OK, that's great. Awesome. OK, that's all I need to know. So about this time last year, I did a different meetup in London. There's a Bayes meetup over there, which is really great, and it was held at Imperial College. And much like this meetup, what you do afterwards is you go to the pub. There's a pub we go to after the Bayes meetup in London called the Artillery Arms, in East London. And right across the street, if you look out the window, there's a cemetery called Bunhill Fields. John Bunyan's buried there, and Daniel Defoe and William Blake, all these writers. But if you go just inside the gate and to your left, the big crypt that you see here is Thomas Bayes.
So you can sit there, give your talk about Bayesian statistics, and pour a little bit of beer on Bayes. Great. So first of all, what is probabilistic programming? It's really not a new concept, but I thought I should describe it before I go on and talk about software that does it. The easiest definition — or maybe the least useful one — is that it's any program that's partially dependent on random numbers. So the outputs of these programs are not deterministic, and it can be expressed in any language that has a random number generator. So certainly you can do it in Python. What it really amounts to is that you're adding another set of primitives to the programming language that you would have otherwise. With these primitives, you're able to do things like draw random numbers and calculate probabilities, given those random quantities. So you can have, for example, distributions over values. You can say something is normally distributed — bell curve — and draw random numbers from that. You can do more complicated things like having distributions over functions, so rather than drawing single integer or floating point values, each realization is an entire function. In a longer version of this talk, I talk about Gaussian processes a little bit. But an important thing this allows you to do is to condition random quantities on one another. So say you define a quantity p, like a probability, as a beta distribution — a particular kind of distribution that has values between 0 and 1, good for modeling probabilities. Well, then you can take another random stochastic primitive and condition its value on the value of that. So z here is a 0 or 1 based on that probability p. And that's the important thing: it allows you to condition things on one another, and gives you building blocks to build more complicated things. And so why do we do this?
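Those two primitives — a distribution over values, and a quantity conditioned on it — can be sketched with nothing but the standard library (the Beta(2, 2) parameters below are arbitrary, purely for illustration):

```python
import random

random.seed(42)

# Draw a probability p from a Beta(2, 2) prior -- a distribution over values.
p = random.betavariate(2, 2)

# Condition another random quantity on it: z is a Bernoulli(p) draw,
# a 0/1 outcome whose distribution depends on the random p above.
z = 1 if random.random() < p else 0
```

Everything a probabilistic program does is ultimately composed from conditioning steps like this one.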
Well, really, most of the time, we're doing it to facilitate Bayesian inference. There were lots of hands here that recognized what Bayes is, so I'm going to give like a two-minute talk about what Bayesian inference is, and then I'll move on to the software implementation. The most important thing you have to know about Bayes is that it deals with a different sort of probability than the statistics most of you probably learned in school. Most importantly, we're doing something called inverse probability, where we're reasoning about causes based on their observed effects. The only notation I'll show you here is that we have things we don't know, which are theta — parameters, unknown values, future predicted values — and then things that we know, things we've observed, which is y, our data. So everything in Bayes can be classified into two things: things we don't know and things we know, or things we haven't observed and things we've observed. And we can use these quantities in conditioning statements to help determine what the causes might be. So we've observed the effects, y, and we're going to see what the probability of the causes is. And everything can be thrown into that theta. Anything you don't know, you can put in your model, and that's one of the powerful things about it. So why do we need a whole different type of statistics? For me, it's pragmatic. Some people have philosophical preferences here; for me, way back when I was a graduate student, it helped me solve problems that I wasn't able to solve before with classical statistics. The important thing is that these methods are very useful. They allow you to build really complicated models that you couldn't fit otherwise, and ironically, that's because they let you build these things from simple building blocks. So this is why Bayesian statistics is called Bayesian statistics. This is Bayes' formula.
And the important things are along the top. We have unknowns theta that we say — before we do our experiment or collect data or observe the world — we have some information about. So the probability of theta is our prior probability: what we know before we've collected any data. What we want is, after we've conducted our experiment or our study, what do we know after we've collected all this information? So it's a process of updating priors to posteriors, and the way that we do that is through a likelihood function. Those are the three main components. And then when you're done, you can take that posterior and make it the prior the next time that you use it, go collect more data, and update it again. You can turn the Bayesian crank that way. And the big advantage to doing this is that everything here is in terms of probabilities, so all outputs from probabilistic programs tend to be entire distributions. Rather than just getting a mean or a median or some statistic, you get an entire distribution. This allows us to say things like, well, what's the probability this is greater than zero? I generally build models of infectious disease systems, and this was some co-infection effect, and we can see that it's almost certain to be greater than zero in a probabilistic sense. You can pull arbitrary values from this. So once you're able to get this posterior distribution, you get a lot of stuff for free. The stochastic program, then, is where the probabilistic programming comes into play: we're able to specify priors and likelihoods and come up with a joint distribution of everything in our model and our data. So the first step in doing Bayesian inference is to write down your model in whichever language you're going to be using. And just how do you do this? What constitutes a prior? What constitutes a likelihood?
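To make the "turning the crank" idea concrete: in the special case of a Beta prior with binomial data, the prior-to-posterior update has a closed form, so the whole cycle — prior, data, posterior, new prior, more data — is a couple of lines of arithmetic. A minimal sketch (the counts here are made up):

```python
def update(a, b, hits, misses):
    """Beta prior + binomial likelihood -> Beta posterior (conjugate update)."""
    return a + hits, b + misses

# Start with a flat Beta(1, 1) prior on a probability.
a, b = 1, 1
# First batch of data: 3 successes, 7 failures.
a, b = update(a, b, 3, 7)
# Yesterday's posterior is today's prior: update again with more data.
a, b = update(a, b, 30, 70)

posterior_mean = a / (a + b)   # (1 + 3 + 30) / (2 + 10 + 100) = 34 / 112
```

Most models have no closed form like this, which is exactly where the sampling machinery later in the talk comes in.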
So a prior distribution generally just quantifies the uncertainty in whatever variables you're interested in fitting. This is a normal distribution with a zero mean and a standard deviation of one, and it says that we're reasonably sure things are somewhere between negative three and three. This other one is also a normal distribution — that flat line across the bottom — but it's got a standard deviation of 100, so a variance of 10,000. Here we're saying we don't know: it could be essentially almost any real value. So this one reflects a lack of information, if you like; the first was highly informative, this one's not very informative. And it's best if you impart any prior information that you might have. So if I'm modeling a rare disease — doing a model of rare disease prevalence — I might pick something like a Beta(1, 50), which has all its probability way down here, right? Most people don't get the disease. Say I'm a baseball fan and my favorite player gets a hit in his first three at-bats of the season: what's the probability he's going to hit .400, or .300, for the season? Well, you wouldn't put a flat prior on that, right? Because we've been playing baseball for over 100 years. This prior actually comes from all of the data on batting since the turn of the century: on average, major leaguers hit .261, with a standard deviation of .034. That's prior information. There's no way he's going to hit .900, and there's no way he's going to hit .000 — he wouldn't be in the major leagues long enough. So that's the idea: you put whatever you know about the problem, before you collect your data, into the problem that way. How about the likelihood? This is where our data comes into play. What we're coming up with here is a data generating mechanism, if you like. So how did the data come to be?
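As an aside, that batting-average prior can be pinned down numerically: choose Beta shape parameters whose mean and standard deviation match the historical .261 and .034, by the method of moments. A quick sketch (the variable names are mine):

```python
# Method of moments for a Beta(alpha, beta) prior matching the historical
# batting numbers quoted above: mean 0.261, standard deviation 0.034.
m, s = 0.261, 0.034
common = m * (1 - m) / s**2 - 1      # this equals alpha + beta
alpha = m * common                    # ~43.3
beta = (1 - m) * common               # ~122.6
```

The resulting Beta(~43, ~123) puts essentially no mass near .000 or .900, which is exactly the point: the prior encodes a century of batting data.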
And here too, it comes down to picking an appropriate distribution. This is the knack — the art, if you like — of probabilistic programming: seeing which distributions should be used in which cases. So, for example, our data might be normally distributed. Human heights and weights tend to be normally distributed, blood pressure measurements, things like that. If it's baseball, right — binomial distribution. In N chances — N at-bats — you get X hits. You can model that with a binomial distribution, and the batting average would be the p in there. If we're running a website and we want to know how many unique visitors we get per month or year, we might pick something like a Poisson distribution, which is for counts. So different distributions are good for different things. And then we combine all of these things together to get a posterior distribution. There's our likelihood, there's our prior. And this little symbol here means proportional to. It's not quite equal — it's equal up to a constant. And the constant is the one I kind of glossed over when I showed you Bayes' formula: this probability of y. I was like, hmm, the probability of the data? This is a marginal probability. It's just the numerator integrated over all of the thetas — you integrate out all your variables. Now, I'm a really bad mathematician. My background's in biology; I'm not very good at calculus. I can do integration of one variable, or at least I could about 20 years ago. Two, I have a lot of trouble. Three, forget it — and most of you probably can't do three, even if you're a mathematician. Most models will have hundreds, maybe thousands of variables. So doing Bayes is really hard, and that's one of the reasons you didn't learn it in school, particularly if you went to school as long ago as I did. It was just really impractical to use.
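For a model small enough, you can dodge the calculus and do that integral on a grid, which makes "posterior is proportional to likelihood times prior" concrete. A standard-library-only sketch with made-up count data and an Exponential(1) prior on a Poisson rate — chosen because in this conjugate case the exact posterior mean is (1 + Σy) / (1 + n) = 15/6 = 2.5, so the grid answer can be checked:

```python
import math

data = [3, 2, 4, 3, 2]                              # made-up yearly counts

def log_post(lam):
    # log(likelihood x prior), up to a constant
    loglike = sum(y * math.log(lam) - lam - math.lgamma(y + 1) for y in data)
    logprior = -lam                                 # Exponential(1) prior on the rate
    return loglike + logprior

# Evaluate the unnormalized posterior on a grid, then normalize by its sum --
# that sum plays the role of the intractable p(y) integral.
grid = [i * 0.01 for i in range(1, 1501)]
weights = [math.exp(log_post(lam)) for lam in grid]
total = sum(weights)
post_mean = sum(l * w for l, w in zip(grid, weights)) / total
```

A grid works for one variable; the whole point of the talk is that it cannot work for hundreds or thousands of them, because the grid size explodes exponentially with dimension.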
And so with probabilistic programming, we're able to use complicated numerical methods to approximate that, and probabilistic programming abstracts those away so that we don't have to be experts on all of these things. And like I said, probabilistic programming is not new. When I was a graduate student, there was really only one way to do this: a package called WinBUGS. The Win, of course, means it only ran on Windows. But it was really great. In the mid-90s, you had this dashboard here and you could specify models and draw samples. You could watch the samples come in live and get summaries of everything. It really made it easy to describe and build and fit and share Bayesian models, particularly for non-experts. Again, I was a biologist; most people that use this are not statisticians or mathematicians. And even better, there's this really nice domain-specific language that it used, called BUGS — sort of an R-like syntax. For any of you familiar with R, this looks very R-like, right? This is a complete model — a complete hierarchical model — in seven lines or so. And this was really great. It allowed me and others to do lots and lots of things. But after a while, you kind of hit your head on the ceiling, right? There were a few things wrong with it. A, it was closed source. And again, it's this domain-specific language where you had to get everything into BUGS and then get everything out when it was finished. And it was coded in Object Pascal — which, now that it's open source... how many Pascal programmers in the room? Oh wow, okay. Yeah. I gave a talk in Denmark recently, asked that, and like half the room had their hands up. I was like, this is a weird place. So, you know, there were a lot of problems with that. Back when I was a graduate student, I had lots of time. I didn't think I had a lot of time then, but I really did have a lot of time.
So I kind of cobbled something together using a language that I liked, and which I did use for most of my other work — I tried to re-implement this stuff in Python. I started this project back in 2003 at the University of Georgia. And really what it is is a probabilistic programming framework for fitting arbitrary probability models. It's not restricted to any one particular class of models — not just regression models. Any model you can write down in math, you should be able to implement in PyMC. It's based on Theano; this is why I come to Montreal a fair bit. Theano is a package that was produced by the LISA lab — they changed their name, right? It's the Mila lab now, or something like that — at the University of Montreal. And it implements what we call next-generation Bayesian inference methods that use gradient information, and I'll talk about that in a little bit. We have over 100 contributors now, about a dozen or so core contributors. It's used quite a lot in academia and industry: companies like Quantopian and Monetate, Grubhub, Channel 4 over in England, Allianz Insurance, things like that. And of course it's on GitHub and freely available. So what Theano does — this is the computational engine behind PyMC3. PyMC2 was largely a Fortran project: mostly Fortran with a little bit of crunchy chocolate coating made out of Python on the outside, so people didn't have to code in Fortran. Same thing here; Theano is the engine now. And what Theano is, is sort of a meta-language for specifying and evaluating mathematical expressions using tensors, which are just generalizations of vectors and matrices. It's really built to do deep learning — that's why it came about — in much the same way as TensorFlow or Torch; Theano kind of led the way, in fact. And what it does is dynamically generate C code from that. And so this is kind of what it looks like.
So what we're going to do here is construct a matrix, populate it with values, and then take some gradients. So I specify a matrix — I'll call it x. And then here's a function of that matrix: the inverse logit transformation, which transforms the values to the zero-one domain. And then this is the cool part: just take the gradient of it automatically, just like that. That's the magic. And then you turn that into a function. Now, up till now, no calculation has occurred whatsoever. All that's being done here is that a graph is being built — a static graph that Theano can use. It will optimize it; it will work out how to do the gradients over that whole thing. And the only time it actually does anything is when you call the function for the first time: it will compile it to C and run it, and so on. And so this is the gradient of a transformed matrix. So this is great, and this powers everything that's done. Now, it's always best to show real-world examples for this sort of stuff, so I'm just going to show you what a model looks like in PyMC3 using an example that's actually in our set of tutorial examples in PyMC. This is a data set from Britain, across the turn of the last century, of coal mining disasters. Safety in mines back then wasn't what it is now, I guess, so there were a fair number of disasters in coal mines. These are just counts from about the middle of the 19th century to the middle of the 20th century. And it's a good example because it's nice count data, but you can see the counts change, right? They're kind of high — around three — near the beginning, and then somewhere around the turn of the century it drops. You still get some bad years, but on the whole they tend to be lower. And so what we're going to do here is model this count process.
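(An aside on that automatic gradient: for the inverse logit, the derivative that autodiff works out symbolically is s · (1 − s), and you can convince yourself of that with a standard-library finite-difference check — the function names below are mine, not Theano's.)

```python
import math

def inv_logit(x):
    # The transformation from the example: maps the real line to (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def analytic_grad(x):
    # What symbolic differentiation derives: d/dx inv_logit(x) = s * (1 - s).
    s = inv_logit(x)
    return s * (1 - s)

# Compare against a central finite difference at a few points.
eps = 1e-6
for x in (-2.0, 0.0, 1.5):
    numeric = (inv_logit(x + eps) - inv_logit(x - eps)) / (2 * eps)
    assert abs(numeric - analytic_grad(x)) < 1e-6
```

Theano does this derivation on the whole computation graph at once, for every variable, which is what the gradient-based samplers later in the talk depend on.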
And we're going to hypothesize that there's some early mean that's kind of high, and a late mean that's kind of low, but we don't know where the switch point between them is, right? And the great thing about Bayes is that anything you don't know, you just make it a variable and estimate it. And that's what we're going to do. So there are going to be three variables — the early mean, the late mean, and the switch point — and they'll all be random. Okay, so first step: prior distributions. We talked about these before, right? I'm going to choose exponential distributions for these rates. Why? Because they're positive continuous values — rates can't be negative, obviously. You can choose other ones; it's always good to test whether you picked good priors, but that's a different subject. And then for the switch point — I didn't look at the data. I know it's somewhere in the middle, but we're just going to allow it to be uniform across the time series. It could be anywhere in there, right? And the PyMC3 code looks like this. The cool thing is that it hijacks the context manager, which you usually use for things like opening files or sockets, and we're going to use it here to open a model and populate the model with variables. That saves us from doing the sort of thing that you see in Keras, if you've used Keras, where you have to add each variable with all these add statements. This kind of does it with a bit of magic: any time you declare a PyMC3 variable inside the context manager, it gets added to that model, which is given a nice name, disaster_model.
So there's my switch point, uniform between the lowest year and the highest year; early rate, exponential; late rate, exponential. And PyMC3 has 40 or 50 pre-specified probability distributions — all the ones you'd probably ever need — but you can customize it to do weird distributions that nobody ever uses. And the point here — the driving motivation behind PyMC3 — is this high-level language for specifying these models, where you have almost the same number of lines of code as you have of math, right? There's very little extra stuff going on here. So what happens now? What do we have? Well, if we look at some of these things — the type of the early rate, for example — it's this PyMC3 object called a transformed RV. It's actually been transformed to the whole real line rather than just positive values, because that makes sampling more efficient. And I can do things with it. This is the probabilistic programming thing: these are our primitives now. I can do things with them; they have attributes. So here's the log probability of the value 2.1, which happens to be -2.1. I can take four random values from that distribution. I can do anything I want with it. And then I can transform variables arbitrarily. So my rate here is going to be the early mean if t, the time, is less than the switch point, or the late mean otherwise, and I can use this switch statement for that. And notice I don't have to write a loop here — everything's vectorized, right? This is really just Theano code in disguise. And there's nothing random about these; they're just deterministic transformations. And then our likelihood: here we're going to use a Poisson. It's what I used for the website visit counts before; now I'm using it for disasters. So whatever the rate is for any given year, there'll be a Poisson draw from that. And my data are the disasters.
And in PyMC3, the only thing that distinguishes this from a prior is that it's got observed. The observed flag includes the data, and it says, essentially, these are fixed: I've observed these, don't change them, okay? So the next step is: how do you get posterior distributions? This is the obstacle, right? This is the hard bit. It's analytically impossible most of the time, and even calculating them numerically is challenging. So over the years, statisticians have come up with different approximations. There's the MAP estimate, which really just does optimization and finds the peak — but it's not really fully Bayesian, because you don't get any distributions, you just get a value. You can do weird things like rejection sampling, where you take random values and look at each one to see whether it looks like it's from the distribution or not. There's a whole slew of things. The de facto standard for doing this is something called Markov chain Monte Carlo, or MCMC. And — even quicker than my description of Bayes — the Markov chain part of MCMC is this: if we had a really simple model, we could sample directly from our posterior distribution, but we usually can't. So we can't sample independently from it, but we can generate a dependent sample. And a Markov chain is a dependent sample where the next value depends on the current value, but not on any of the past ones. If I can generate a Markov chain with a particular property called reversibility — if it satisfies this detailed balance equation — then if I sample from that Markov chain long enough, I'm going to get samples that are indistinguishable from samples from the true posterior distribution. That's the magic that the math guarantees us, right? And in practice, MCMC is a class of algorithms. There isn't one algorithm called MCMC; there are lots of specific implementations of it.
The most famous one is something called Metropolis sampling. You may have heard of Gibbs sampling; there are lots of different ones. Metropolis sampling is the easiest to describe. What you do is initialize your parameters to arbitrary values — whatever you want — and then you have some way of proposing new values: some distribution that's easy to sample from. Then you evaluate that proposed value according to the log probability of the whole model that you've specified, and you either accept it or reject it. If you accept it, you take that value and add it to your bag of values; otherwise, you keep the current value. And then you go back up and do it again, over and over. And when you do that, you get something like this. This is a big but very simple model: a thousand-dimensional multivariate normal — a multivariate normal of a thousand values. So it's big, but it's very simple. And we can see that Metropolis sampling is not doing very well because of the correlation here, and it's stopping and starting. When it stops, it means proposals are getting rejected all the time — it's not really finding the meat of the distribution. Oh, this is the marginal of just two of them — sorry, I forgot to say that. This is two of the thousand, just to see what it looks like; it's hard to visualize in larger dimensions. Pardon me? Oh, arbitrarily, yeah, it's the first two. It doesn't matter — they'll all look like this, right? And the problem is that it's a random walk: I'm randomly selecting a candidate value and then evaluating it. It works fine for small models, but not really well for big ones. And so the whole idea of PyMC3 is to use newer, more sophisticated algorithms — in particular, we're going to use gradient information from the posterior distribution to propose better values. So it's not going to be random walks anymore.
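The loop just described fits in a dozen lines of plain Python. A sketch for a one-dimensional standard normal target (the starting value and proposal width are arbitrary choices):

```python
import math
import random

random.seed(0)

def log_target(x):
    # Log-density of a standard normal, up to a constant.
    return -0.5 * x * x

samples = []
current = 5.0                                    # arbitrary starting value
current_lp = log_target(current)
for _ in range(20000):
    proposal = current + random.uniform(-1.0, 1.0)   # random-walk proposal
    proposal_lp = log_target(proposal)
    # Accept with probability min(1, p(proposal) / p(current)).
    if proposal_lp - current_lp > math.log(random.random()):
        current, current_lp = proposal, proposal_lp
    samples.append(current)

burned = samples[2000:]                          # drop burn-in from the bad start
mean = sum(burned) / len(burned)
var = sum((x - mean) ** 2 for x in burned) / len(burned)
```

In one dimension this recovers the target fine; the correlated, stop-and-start behavior described above only bites as the dimension grows.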
What we do is essentially simulate this as a physical system. So think of our posterior distribution as a landscape — like a skateboarding park, right? The skateboarder is your point, and you're rolling them along the surface; you're simulating the physics, if you like. So we add an auxiliary variable: we have the position and the velocity, and we move this thing around according to that. So no more random walk. If you're at the top of the hill, you have lots of potential energy and not much kinetic, and as you go down the hill, you've got lots of kinetic energy and not much potential energy — you're essentially simulating that system here. And so, derivatives, right? We need to see how the posterior changes, and that's why we require Theano or something like it: integrals are impossible to do automatically, but derivatives you can do automatically. You just need the right technology to do that, and that's what's provided. And so Hamiltonian Monte Carlo looks like this. You sample a new velocity from a Gaussian distribution — you essentially give the marble, or the skateboarder, a push in a random direction — and then you simulate the continuous system using deterministic, discrete steps. Once you get to the other side, you stop, you take a point, and you repeat the thing over and over again. And now what you get is this. Works much better, right? Near-independent sampling across the distribution. It characterizes it very quickly, has a very high acceptance rate, and it's applicable to much larger models than Metropolis-style sampling. The downside is that there's a lot of tuning to be done. You've got to pick how many leapfrog steps to take, where to stop, right?
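Those deterministic, discrete steps are usually leapfrog steps, and the physics picture can be checked directly: for a standard normal "landscape", the total energy should be nearly conserved along the simulated path. A dependency-free sketch (the step size and step count are arbitrary):

```python
# Leapfrog integration of the "skateboarder" dynamics for a standard normal
# target: potential U(q) = q^2 / 2, kinetic K(p) = p^2 / 2.
def leapfrog(q, p, step=0.1, n_steps=20):
    grad = q                         # dU/dq for U(q) = q^2 / 2
    p -= 0.5 * step * grad           # half step for the momentum
    for i in range(n_steps):
        q += step * p                # full step for the position
        grad = q
        if i < n_steps - 1:
            p -= step * grad         # full momentum step, except the last
    p -= 0.5 * step * grad           # final half step
    return q, p

def hamiltonian(q, p):
    # Total energy: potential + kinetic.
    return 0.5 * q * q + 0.5 * p * p

q0, p0 = 1.0, 0.5                    # position, plus a random "push"
q1, p1 = leapfrog(q0, p0)
energy_error = abs(hamiltonian(q1, p1) - hamiltonian(q0, p0))
```

Because the energy error stays tiny, almost every proposed endpoint is accepted — which is where HMC's high acceptance rate comes from.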
Because what happens if you're on your skateboard and you get to the other side of the skateboard park, right? You start sliding back along the path you just took, and you don't want to do that. And so Andrew Gelman and one of his students came up with an automated, self-tuning version of Hamiltonian Monte Carlo called the No-U-Turn Sampler, or NUTS, which, as the name says, tries to prevent that U-turn — the path coming back upon itself. And you don't have to know all that when you use PyMC3, right? Again, we black-box that; we abstract it away. All you do is call sample. That's it. And it determines that it should use NUTS for the late rate and the early rate, and Metropolis for the switch point, and you get a few thousand samples in a few seconds. And then, as promised earlier on, once you have that posterior distribution, you get a bunch of stuff for free. So here are the samples; you're able to get means and standard deviations and credible intervals and everything else you need, which is fantastic. So that's the primary way you do inference and machine learning using PyMC3. But even with more sophisticated algorithms, MCMC can be very slow, particularly for large data sets. It doesn't scale well with large data sets, because that likelihood has to be evaluated for every data point at every step of the sampling algorithm. In those cases, we can use a different type of algorithm that has very recently been added to PyMC3, called variational inference. It's a very different approach from MCMC. The blue curve here is some posterior that we don't know. Then we take another distribution that we're familiar with — like a normal distribution, something that's easy to work with — and we transform it and select parameter values for it so that it gets as close as possible to the posterior distribution.
So we're changing the problem from a sampling problem — which is great, but can be slow — to a straight optimization: we're going to optimize this approximation as well as we possibly can. And what do we mean by "as close as possible"? Well, the measurement we use is essentially an information distance called the Kullback-Leibler divergence. It tells us how far away one distribution is from another. So Q is our approximation, P is our true posterior distribution. And all you need to know about the math is that it gives us an expected value in terms of Q — which is the thing we know; if it's a normal distribution, we can work with it. We can't optimize this directly, because it contains the posterior and we don't know what that is, but with a little bit of math that's way over my head, we can rearrange it and get a quantity that we can deal with. This is called the evidence lower bound, or ELBO. We're going to try to maximize this evidence lower bound, which is the same as minimizing the Kullback-Leibler divergence. But again, just like with NUTS, there are choices and tuning to be done here. We have to pick Q. How do we pick Q? It's got to be a distribution that's useful, and we may not know what our posterior distribution looks like. Things like that. And similarly, in the last few years — 2016 — Alp Kucukelbir, also out of Columbia, the same as Gelman and Hoffman, came up with an automated version called ADVI — automatic differentiation variational inference — that just starts with normal distributions, transforms them into a real coordinate space, and standardizes everything so that it works across any problem. And so what we get when we do variational inference is this. It kind of looks like MCMC, but these aren't the parameter values — this is that ELBO. So I've hit some sort of asymptote here, and whether that's a good place or a bad place depends on how good my approximation is.
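For two univariate normals, that Kullback-Leibler divergence has a closed form, which makes the "as close as possible" criterion easy to see: it's zero when q equals p and grows as q drifts away. A small sketch (the parameter values here are arbitrary):

```python
import math

def kl_normal(mu_q, sd_q, mu_p, sd_p):
    """KL(q || p) between two univariate normal distributions, in closed form."""
    return (math.log(sd_p / sd_q)
            + (sd_q**2 + (mu_q - mu_p)**2) / (2 * sd_p**2) - 0.5)

# A made-up "true posterior" p, and candidate approximations q.
mu_p, sd_p = 2.0, 1.5
exact = kl_normal(2.0, 1.5, mu_p, sd_p)      # q identical to p: divergence 0
near  = kl_normal(1.8, 1.4, mu_p, sd_p)      # close q: small divergence
far   = kl_normal(-1.0, 0.5, mu_p, sd_p)     # distant q: large divergence
```

Variational inference searches over q's parameters (mu_q, sd_q here) to drive this quantity down — except it does so via the ELBO, since the true p is unknown.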
And to give you an idea of what that looks like, here's a beta distribution that I'm estimating — it's this dashed line here — and each of these curves is an approximation based on 100 through 10,000 optimization iterations. And this is just straight optimization — BFGS or Nelder-Mead or whatever optimizer you want to use — and it's fast. And you can see that, in this case, it does a reasonable job. And in PyMC3, all you do is take your model, whatever it's called, and rather than calling sample, you call fit. Fit will choose an appropriate approximation, and what we get out the other end is not a bunch of samples but this approximation — which, again, is a distribution that's been fit. But because it's a distribution, we can draw samples from it. So we take that approximation and sample from it rather than from the true posterior distribution, and we get what look like MCMC samples, but these are just approximations. And as they say in the machine learning world, there is no free lunch. These are approximations, and they generally aren't as good unless you get lucky or your problem is simple enough. So the blue line is the ADVI approximation, and NUTS gives a better approximation to the posterior. And what you see in general is that you tend to underestimate the variance, because ADVI does what's called a mean field approximation: it kind of assumes all the variables in the model are independent of one another. But it works a lot faster, so if you have lots and lots of data, you may be willing to make that trade-off. And it's made faster still by the fact that we can minibatch. What I mean by minibatch is that rather than throwing all of the data at the problem at once, at every iteration of my optimization I throw just a random batch at it instead. This has two advantages. One, the computation time decreases. And two, it does what's called stochastic gradient descent, which tends to be more robust.
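The minibatch trick relies on the batch gradient being an unbiased estimate of the full-data gradient once you rescale by n / batch_size. A quick standard-library sketch with a squared loss and made-up data (with theta fixed at 0, the full gradient is exactly -sum(data), so it's easy to check):

```python
import random

random.seed(7)
data = [random.gauss(3.0, 1.0) for _ in range(1000)]   # made-up observations
theta = 0.0

def grad(batch, n_total):
    # Gradient of sum_i (x_i - theta)^2 / 2 over the FULL data set,
    # estimated from a batch by rescaling to the full size.
    return (n_total / len(batch)) * sum(theta - x for x in batch)

full = grad(data, len(data))                 # exact full-data gradient

# Individual minibatch estimates are noisy, but unbiased: their average
# converges on the full-data gradient.
estimates = [grad(random.sample(data, 50), len(data)) for _ in range(400)]
avg = sum(estimates) / len(estimates)
```

Each iteration of the optimizer only ever touches 50 points instead of 1000, which is where the speedup comes from; the noise in each estimate is the "stochastic" in stochastic gradient descent.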
It's kind of noisy gradients rather than non-noisy gradients, so they tend to converge faster. So that's really cool. So MCMC and ADVI are kind of the two main ways of using PyMC3. And this fits very appropriately into machine learning. You can combine it with probabilistic programming; call it Bayesian machine learning if you like. The thing about machine learning models is that they tend not to account for uncertainty particularly well. You tend to get a prediction or a probability, not an entire distribution. And so they can sometimes be easy to fool and harder to interpret. You can totally fit machine learning models in PyMC3. And here's an example that Thomas Wiecki, who's one of our core developers, demonstrated a couple of years ago: this is a Bayesian deep learning model. And so this is just a neural network with two hidden layers. Deep learning is anything with more than one hidden layer; it's deep, so this is deep learning. And all that we do to Bayesianize it is we take the weights of the neural network and we put priors on them. Here we're just putting normal(0, 1) priors. And this is the whole program in PyMC3. So here are the weights from the input to the first hidden layer, first to second, hidden layer to output, and then a set of activation functions. And our output is a binary classification, so it's a Bernoulli random variable. And what you see on the side here is more than you usually get. This is not a decision boundary, which is what you would get if you did a support vector machine or a deep neural network or something like that. What's being shown here is the posterior standard deviation of the estimated probability of classification. So you can see the darker it is, the more uncertain it is. So everything close along the boundary is uncertain; everything light and away from the boundary is very certain. And so you can get an idea of how reliable your predictions are, which is important, right? 
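The minibatch point from a moment ago can be seen in a stdlib-only toy (not PyMC3 code): fitting a line by least squares where every gradient step sees only a small random batch of the data. The dataset and learning rate here are made up for illustration:

```python
import random

# Toy stochastic gradient descent: fit y = w*x + b, but each update
# uses only a random minibatch, giving a cheap but noisy gradient.
random.seed(0)
data = [(x, 2.0 * x + 1.0 + random.gauss(0, 0.1))
        for x in (random.uniform(-1, 1) for _ in range(500))]

w, b, lr, batch_size = 0.0, 0.0, 0.1, 32
for step in range(2000):
    batch = random.sample(data, batch_size)   # the "minibatch"
    # Gradient of mean squared error computed on the batch only
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / batch_size
    gb = sum(2 * (w * x + b - y) for x, y in batch) / batch_size
    w -= lr * gw
    b -= lr * gb

print(w, b)  # should land near the true slope 2 and intercept 1
```

Each step costs a fraction of a full-data gradient, and the noise in the updates is the "stochastic" part of stochastic gradient descent that the talk credits with faster, more robust convergence.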
You wanna know, if you're gonna use this to make decisions, what the risk is. Okay, I'm almost out of time. So in terms of the future, we're lucky enough this year to have some Google Summer of Code slots. We've got a student in Argentina, Agustina Arroyuelo, implementing approximate Bayesian computation, so yet another algorithm for fitting these models. Bill Engels is gonna continue working on Gaussian processes, which is really great. And then we have another student, Sharan Yalburgi, who's gonna be working on a TensorFlow backend for PyMC3. Unfortunately, the Theano project is shutting down after many, many years. It's kind of served its purpose; it essentially prodded lots of companies to make very robust and powerful open-source deep learning engines. And so we're going to transition into what we would call PyMC4, hopefully using TensorFlow. That's what we're gonna try first, anyway, and Google Summer of Code is gonna start that for us. For those of you interested in learning more about this, our project, the PyMC repository, has a whole bunch of Jupyter notebooks full of well-documented, deep examples, everything from regressions to survival models to machine learning models. And of course, I'm biased towards PyMC. We live in a great time now, compared to back in the 90s when you just had WinBUGS. These are just the Python tools for doing probabilistic programming, not even counting other languages. There's Edward, which is also now TensorFlow-based. The Stan team has a Python interface, and so on. And then if you wanna learn a bit more about Bayes, a local Montrealer, actually, Cam Davidson-Pilon, I think he works for Shopify, wrote a nice open-source textbook years ago based on PyMC2, and it was recently ported by Max Margenot and Thomas Wiecki to PyMC3 and Python 3, yay. And so you can go on there and learn all about the basics of PyMC and probabilistic programming. So with that, I'll close. This is our core team. 
I'd like to thank all of them. This went from a three-person operation for PyMC2 to more than a dozen now. And as a result, we have a much, much better project. If I've done anything for the project, it's that I've been able to recruit people who are smarter and better at all this stuff than I am. So I thank them, and I thank you for hanging out and listening. So we can't do integrals, but we can do derivatives. Does that mean that for the Monte Carlo simulation, it's actually symbolically computing the derivatives? It is. Theano does symbolic differentiation, yes. And that's what's needed; that's what's built in. You need that somehow. Stan has built its own engine for doing that in C++. TensorFlow can do it, and we've just hijacked a deep learning engine to do probabilistic programming, is the idea. So you can do it with any of those. But yeah, that's what the graph does. You build this big static graph, and then it's able to reason over that graph and come up with the gradients. Were you able to find the change point for the British mines? Yeah, thank you. Oh, didn't I show you all the... You saw the results. Yeah, it was around halfway. I mean, that's the cool thing, is that you get a distribution. One of those Jupyter notebooks has the whole worked example, but it was around about, I can't remember which year, the 40th observation, I think the 40th year, and you get a nice distribution. You don't just get one value. Did it correlate with some regulations or anything like that? I would imagine so. Yeah, I'm not sure. And that's a simplification, right? It probably wasn't just a switch point. In my longer version of this talk, I show how to fit a Gaussian process, which, as I mentioned earlier, is a distribution over functions. So we posit just a function of some kind, maybe one that does that, maybe one that wiggles a little bit. 
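The "build a graph, get the gradients for free" idea that Theano, Stan, and TensorFlow each implement can be illustrated in miniature. This stdlib-only toy uses dual numbers (forward-mode automatic differentiation), which gives the flavor of the idea rather than Theano's actual reverse-mode graph machinery:

```python
import math

class Dual:
    """A number carrying its value and its derivative, so that
    evaluating f(Dual(x, 1)) differentiates f as a side effect."""
    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.der + other.der)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)
    __rmul__ = __mul__

def exp(x):
    # Chain rule: (e^u)' = e^u * u'
    return Dual(math.exp(x.val), math.exp(x.val) * x.der)

# d/dx [x*x + 3x + exp(x)] at x = 2 is 2x + 3 + e^x = 7 + e^2
x = Dual(2.0, 1.0)          # seed the derivative dx/dx = 1
y = x * x + 3 * x + exp(x)
print(y.val, y.der)
```

Hamiltonian Monte Carlo needs exactly this kind of machinery, scaled up: gradients of the log-posterior with respect to every parameter, computed mechanically rather than derived by hand.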
And so you can come up with a function that's more complicated, that allows it to vary within those two intervals. So you can do something simple like that or something more complicated, depending on what you need. Could be, or yeah, I bet it was some technology or something like that, or it could have been regulations. I mean, turn of the century in England, you know? Yeah, I'm watching Peaky Blinders right now, so you can't do that stuff. Yeah. For that example that you gave us, when you were fitting it, I guess that was in a Jupyter notebook or something, right? Yeah. But I mean, how fast is it really for something that size? That was it? So that was like real time? That was real time, yeah. It wasn't very big, though: 111 data points and three parameters. The thing about the Hamiltonian Monte Carlo algorithm is that each iteration of it is slower than you would get with Metropolis, but it's a far more efficient sampler, so you're gonna keep all of the samples you get. If you use Metropolis sampling, you throw away 75% of them for the most efficient algorithm. So, I don't know, it took four seconds; I'm happy with that. And again, it scales well with the size of the model. If you increase the number of parameters, it's still quite fast. When the data increase, though, it doesn't scale quite as well. So when n gets large, things can slow down quite a lot. No, no, no, that was rudimentary. Remember, I had three parameters, 111 data points. I mean, typically I fit biomedical examples that have dozens to hundreds of parameters, and I don't know, it'll take five minutes, seven minutes, something like that. So yeah, it depends on what you need. Maybe you could code something that's faster, but it's not as general as this tool, right? 
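The Metropolis-versus-HMC efficiency point can be seen in a stdlib-only toy: a random-walk Metropolis sampler for a Beta(2, 5) target, where a large share of proposals is rejected and the chain repeats itself (the "thrown away" samples mentioned above). The target and proposal width here are illustrative choices:

```python
import math
import random

# Toy random-walk Metropolis on a Beta(2, 5) target. Unlike HMC/NUTS,
# each step proposes a blind local move and often rejects it, so the
# chain repeats values and carries high autocorrelation.
A, B = 2.0, 5.0

def log_target(x):
    if not 0.0 < x < 1.0:
        return float("-inf")        # outside the support
    return (A - 1) * math.log(x) + (B - 1) * math.log(1 - x)

random.seed(42)
x, samples, accepted = 0.5, [], 0
for _ in range(20000):
    prop = x + random.gauss(0.0, 0.3)            # random-walk proposal
    # Accept with probability min(1, p(prop) / p(x))
    if math.log(random.random()) < log_target(prop) - log_target(x):
        x, accepted = prop, accepted + 1
    samples.append(x)

accept_rate = accepted / len(samples)
mean = sum(samples) / len(samples)
print(accept_rate, mean)  # mean should be near Beta(2,5)'s mean, 2/7
```

Every rejected proposal leaves the chain sitting at the same value, which is why Metropolis needs many more iterations than HMC for the same effective number of samples, even though each individual iteration is cheaper.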
So you could have a super-optimized algorithm for your particular domain's problem, but then when you go and do your next problem, you've gotta do that all over again. And so the other thing with PyMC3 is that you can apply it to any probabilistic programming application. Thanks.