Hello everyone. I hope you're doing well and staying sane. I'm going to talk about Bayesian inference, which is the core business of probabilistic programming languages. This is my personal take on it, but it's also the result of joint work, on the one hand with Ilias Garnier and Vincent Danos, and on the other hand with Dexter Kozen. I'll start with some motivation and intuition. Consider the following question: what does a division by zero mean? Having studied philosophy myself, I would recommend avoiding philosophy and tackling this question through specifications. A mathematician's specification would look something like this: they would axiomatize the arithmetic operations on real numbers, saying something along the following lines: the real numbers form a field, that is, a commutative ring in which every nonzero element has a multiplicative inverse. That means that three divided by zero means nothing; it's an invalid expression. So how would a computer scientist translate this specification? The computer scientist's specification might use types: create a type of nonzero numbers, and then define division as an operation from ordinary numbers cross nonzero numbers into ordinary numbers. In this context, three divided by zero is a badly typed expression, which reflects the mathematician's specification very accurately. But what happens in reality is that floating-point arithmetic is governed by the IEEE 754 standard, which takes a different approach: it assigns special values to computations of the form x divided by zero. So here you've got the three possible values.
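To make the IEEE 754 behavior concrete, here is a minimal Python sketch using NumPy's floating-point division; the three printed values are the standard's three special results for x divided by zero:

```python
import numpy as np

# IEEE 754 does not forbid x/0; it returns special values instead.
with np.errstate(divide="ignore", invalid="ignore"):
    print(np.divide(3.0, 0.0))   # inf
    print(np.divide(-3.0, 0.0))  # -inf
    print(np.divide(0.0, 0.0))   # nan
```

(Plain Python `3.0 / 0.0` raises `ZeroDivisionError` instead, which is closer to the type-based specification.)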
So in this case, three divided by zero type checks, but we can still implement what the mathematicians meant by treating these three special return values as error flags and catching those errors. The type approach would prevent you from writing three divided by zero, whereas this approach lets you write it but catches the error afterwards. I would say they both implement what the mathematicians meant. But now suppose we have another mathematical specification for division, which looks like this: when the denominator is nonzero, we perform an ordinary division; but when the denominator is zero, we say that x divided by zero can be anything. So now three divided by zero means something. Arguably it doesn't mean much, but it means something; it's a legitimate expression. And you can view it as an uncountable collection, an equivalence class, of equations of the shape three divided by zero equals r, where r runs over all real numbers. So this is another possible specification for division by zero. Now, how would a computer scientist implement such a specification? An uncountable equivalence class is not a convenient object for a computer scientist; even an equivalence class indexed by all floating-point numbers, say, would be a really big and unwieldy structure. So the sensible option is just to pick a representative of the equivalence class. Here is a silly example of what you might do: you might say that in this context x divided by zero will be assigned the height of the lead developer in furlongs. This is just a silly example to show you that any choice is equally valid and equally arbitrary. So from this perspective, the previous mathematical specification, which says that x divided by zero could be anything, is quite problematic. So why am I telling you all of this?
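The "pick a representative" option can be sketched like this; the constant `ARBITRARY` and the name `total_div` are made up, which is exactly the point: any choice would work equally well and be equally arbitrary.

```python
# A total division that picks an arbitrary representative for x/0.
# ARBITRARY is a made-up constant -- say, the height of the lead
# developer in furlongs. Any other value would be equally valid.
ARBITRARY = 10.5

def total_div(x: float, y: float) -> float:
    return x / y if y != 0.0 else ARBITRARY
```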
So as I've highlighted, you can deal with operations like division by zero reasonably easily, with types or with error messages. But it's very unclear how to deal with the "three divided by zero means anything" sort of specification: any implementation is arbitrary. And the crucial point, from the perspective of Bayesian inference, is that this kind of under-specification, saying that any value is acceptable as a return value, happens all the time in probability theory. In fact, probability theory features an even more spectacular kind of under-specification than the one I've just shown you, and this under-specification is at the heart of the mathematical machinery behind Bayesian inference. So the objectives of this talk are the following. I'm going to show you how to avoid this mathematical under-specification of Bayesian inference. Essentially, to use the analogy I've just presented, I'm going to turn the funny division into a standard division where division by zero is forbidden; I'm going to turn under-specification into forbidden behavior. This will make the maths much nicer. Then I'll present a type system for Bayesian inference which implements this idea. So let's start by describing classical Bayesian inference. The mathematical engine necessary to do Bayesian inference is called disintegration. In a picture, this is what the disintegration theorem is all about. Let's start with the information in black on this drawing. We've got a set X and a map f into another set Y; we can assume that f is surjective. That means we can look at the information in red: every point y in Y has a fiber lying over it, which is just its inverse image. Now let's add probabilistic information to this picture, in blue. First of all, I'm going to put a probability distribution on X. I'm calling it P because it stands for the prior: this is my initial state of belief about the space X.
Now this probability P can be pushed forward through f to give a distribution Q on Y; this is what I've sketched as a little density underneath Y. You can view it as the probability that Y takes a particular value y, or as the pushforward of P under f. What the disintegration theorem says is that we can find a stochastic map G going in the reverse direction, from Y to X, such that each y is sent to a probability distribution on the fiber over y. So G(y) is zero everywhere else; its support is the inverse image of y. I've sketched this as a little density to the right of X. And the crucial property is that when you average all these distributions on the fibers, when you take their Q-average, you get back the prior P. This is the key property that the stochastic map G should satisfy. Intuitively, you can think of f as a kind of observable and y as a particular observation that you make on the system X, and G(y) is the probability distribution conditioned on having made this observation y. So the important features of disintegration are the following. First of all, it's a mathematically reasonable way of making sense of conditioning on events of probability zero: G(y) can be shown to exist even if the probability that Q assigns to a particular observation y is zero. But there's no free lunch, and the flip side of this ability to condition on events of probability zero is that G is highly underspecified, in the sense I used in the introduction. In particular, if you have a point y which has probability zero, then you can change G and define a G tilde which takes any distribution you like on the inverse image of y, and this G tilde is another disintegration of P along f.
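In symbols (my notation, matching the picture): the evidence is the pushforward of the prior, each G(y) lives on the fiber over y, and averaging the G(y) against Q recovers P:

```latex
Q = f_* P, \qquad
\operatorname{supp} G(y) \subseteq f^{-1}(y), \qquad
P(A) = \int_Y G(y)(A)\,\mathrm{d}Q(y) \quad \text{for all measurable } A \subseteq X.
```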
So G tilde and G differ at one point, but they're equally good disintegrations of P along f; they're equally good descriptions of conditioning on the observations in Y. In fact, it's even worse than this. If you take any set of observations in Y of Q-measure zero, then you can change G on all of it completely arbitrarily and define a new disintegration G tilde. So it's not just underspecified at one particular point; it can be underspecified at uncountably many points. This is an even more extreme case of under-specification than the example with the funny division I showed in the introduction. Once you have disintegration, you can do Bayesian inference. The intuition is in this picture. We still have a space X and a space Y, but now the map between X and Y is stochastic: to each x in X, I associate a distribution F(x) on Y, sketched as these little densities that run perpendicular to Y. Now you can take the P-average of all these little densities, and that gives you a distribution on Y, which I've sketched on the right-hand side and which I'm going to call Q again. So this is a more sophisticated version of pushing P forward through f. And now you have to take my word for it: provided this disintegration can be carried out, you can find a stochastic map G which goes in the reverse direction and which associates to every observation in Y a distribution over X. I've sketched this as the little density G(y) to the right of X. Again, the defining feature of this map G is that if you take the Q-average of all the densities G(y), you get back P. So it's very similar to disintegration; it's just that in this case we allow F to be stochastic as well.
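Written out (again in my notation), the stochastic version reads: the evidence Q is the P-average of the likelihoods F(x), and the Bayesian inverse G must average back to the prior:

```latex
Q(B) = \int_X F(x)(B)\,\mathrm{d}P(x), \qquad
P(A) = \int_Y G(y)(A)\,\mathrm{d}Q(y).
```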
Okay, and again, we can think of Y as a set of observations, and G(y) is the probability distribution conditioned on having made the observation y. The formal connection between G and F is essentially Bayes' law. This is one way of writing Bayes' law, but I prefer to write it like this: the probability of landing in B, starting somewhere in A and weighted by the measure P, should be the same as the probability of landing in A, starting somewhere in B and weighted by Q. There's a nice symmetry between F and G here. But G is highly underspecified. That means that if you take a set C of Q-probability zero, which could be huge, you can find a new map going in the opposite direction, a new Bayesian inverse of F, call it G tilde, which takes the same values as G outside of C but can take any values you like on C. And again, C could be infinite, so we've got this extreme form of under-specification. If Q is a continuous probability measure, that is to say, if Q assigns measure zero to every singleton, then G as a function is underspecified on the whole of the observation space Y: G of a particular y could be anything. And if you write Bayes' law in the usual format, as a quotient, the format you would see on the Wikipedia entry for Bayes' law, this highly underspecified behavior corresponds to the division by zero being interpreted by the funny division from the introduction: something divided by zero can be anything. That is why I chose this example in the introduction. So how do we address this extreme under-specification of Bayesian conditioning? The strategy is to avoid points; we'll go pointless. We've seen that points, and I mean points in a general sense, really sets of measure zero, though these often happen to be points, are in general problematic.
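The symmetric form of Bayes' law alluded to here can be written as follows (my transcription of the slide's formula, using the notation of the previous paragraphs):

```latex
\int_A F(x)(B)\,\mathrm{d}P(x) \;=\; \int_B G(y)(A)\,\mathrm{d}Q(y)
\qquad \text{for all measurable } A \subseteq X,\; B \subseteq Y.
```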
And the idea is to redefine the Bayesian inverse as a map which acts on another class of objects than points, using the following very well-known fact: every Markov kernel defines a linear operator. So we're going to move to the world of linear operators. The advantage is that, at the end of the day, everything will become completely well specified and well defined. Everything will live in a very friendly world, that of linear algebra, and we'll see that Bayesian inference takes a very elegant and natural form. This will also allow us, subsequently, to define a type system for Bayesian inference. The fact that Markov kernels can be seen as linear operators is very well known. If you start with a Markov chain, so working in finite dimension, any textbook or undergraduate class on Markov chains will show you that it can be represented as a stochastic matrix. For example, the probability of jumping from state s1 to itself is p1; that's the (1,1) entry of the matrix T. And the probability of jumping from s1 to s3 is 1 - p1; that's the first-column, third-row entry of the matrix, and so on. This is very standard, and it holds much more generally: if you have an arbitrary stochastic map from X to Y, meaning that to every input x I associate a distribution on Y, then this defines a linear operator taking measures on X to measures on Y, using the following formula. Again, this is completely standard. Now, this hasn't solved the problem yet. Remember the setup for the Bayesian inference problem: we have a stochastic map F going from X to Y, and X comes equipped with a prior P, which is a distribution on X. We have also constructed this inverse G, which goes from Y to X, the conditioning map if you like, and Y is equipped with the probability distribution Q, which is often called the evidence.
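As a minimal sketch (the three states and the value of p1 are made up), the column convention described here, where the probability of jumping from s1 to s3 sits in the first column, third row, makes the kernel act on distributions by matrix-vector multiplication:

```python
import numpy as np

p1 = 0.4  # made-up transition probability
# Column j is the distribution of the next state given current state j.
T = np.array([
    [p1,      0.0, 0.5],
    [0.0,     1.0, 0.5],
    [1 - p1,  0.0, 0.0],
])
assert np.allclose(T.sum(axis=0), 1.0)  # each column is a probability distribution

mu = np.array([1.0, 0.0, 0.0])  # Dirac distribution on state s1
print(T @ mu)                   # pushforward of mu through the kernel
```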
Okay, and the problem was that G is defined only up to sets of Q-probability zero, i.e. Q-almost surely. The problem is that we can still encode points in the picture I've just shown you, by associating to a point y the Dirac delta over that point, that is, the probability distribution which assigns probability one to any set containing y and zero otherwise. So points reappear through the back door. And through this back door, we can define another Bayesian inverse G tilde in the same way as earlier, which means that we can create two linear operators which morally encode the same Bayesian inversion but which disagree on the point y represented as a probability measure. So how do we fix this? The solution is to use all the information that we have been given, and that must include the prior and the evidence, because if you change the prior, you change the Bayesian inverse; it's an essential part of the problem, and it should be included in its formalization. The way we're going to use the prior is via a very well-known notion in probability theory, that of absolute continuity. You say that a measure P is absolutely continuous with respect to a measure Q if, whenever Q assigns mass zero to an event A, then so does P. Another way to say this is that if Q cannot see an event, then P shouldn't be allowed to see it either. And with this notion, we can restrict the linear operator I've described above to the measures which are absolutely continuous with respect to the evidence. So I'm not going to define the linear operator as acting on arbitrary measures; I'm only going to allow in the domain those measures which are absolutely continuous with respect to the evidence.
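On a finite space, absolute continuity is just a support-inclusion check; here is a sketch (the helper name and the numbers are made up):

```python
import numpy as np

def abs_cont(p: np.ndarray, q: np.ndarray) -> bool:
    """p << q on a finite space: wherever q vanishes, p must vanish too."""
    return bool(np.all(p[q == 0.0] == 0.0))

q = np.array([0.5, 0.5, 0.0])
print(abs_cont(np.array([0.2, 0.8, 0.0]), q))  # True
print(abs_cont(np.array([0.0, 0.0, 1.0]), q))  # False: Dirac on a q-null point
```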
And this prevents the problem I've shown earlier, because if the probability of an observation is zero under Q, then the Dirac delta over this point is not absolutely continuous with respect to Q. So these are forbidden; they're excluded from the domain. Using a cheesy Zen slogan: what the evidence cannot see is not evidence. You cannot condition on something that is not seen by the evidence. Now, with this representation as a linear operator between spaces of measures absolutely continuous with respect to the corresponding reference measures, we have a very nice mathematical picture. First of all, we've replaced unspecified behavior by forbidden behavior: we cannot condition on an event of measure zero anymore, because we've set up the domain of the linear operator specifically to forbid this. The second really nice aspect is that these spaces of measures absolutely continuous with respect to a reference measure are really well known in probability theory. In fact, one of the most important results in probability theory, the Radon-Nikodym theorem, shows that this space is isomorphic to the space L1(X, P), the space of P-integrable functions on X, and these spaces are of fundamental importance in probability theory. If you open any probability theory textbook, the Lp spaces, the Lebesgue spaces, will probably take up a whole chapter. So you end up with very familiar, well-known and important spaces. And finally, and perhaps most importantly, in this picture the operation of computing the Bayesian inverse boils down to an adjointness relation.
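In the finite case the Radon-Nikodym theorem is elementary: an absolutely continuous measure p is represented by its density dp/dq, a q-integrable function. A sketch with made-up numbers (the function name is mine):

```python
import numpy as np

def rn_derivative(p: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Density dp/dq on a finite space; requires p << q."""
    assert np.all(p[q == 0.0] == 0.0), "p is not absolutely continuous w.r.t. q"
    d = np.zeros_like(p)
    nz = q > 0.0
    d[nz] = p[nz] / q[nz]
    return d

q = np.array([0.25, 0.25, 0.5])
p = np.array([0.5, 0.25, 0.25])
d = rn_derivative(p, q)
assert np.allclose(d * q, p)  # integrating the density against q recovers p
```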
The relationship between F and G that I mentioned earlier can be rewritten as an adjointness situation between the two spaces of measures which are absolutely continuous with respect to the prior and the evidence respectively. And crucially, we in fact no longer need to compute the highly underspecified stochastic map G. We can directly compute the adjoint of T_F: F, being the data of your Bayesian inference problem, is specified correctly, uniquely and precisely; so T_F is also uniquely and precisely specified, and its adjoint is again completely uniquely specified. Now, Bayesian types are the probabilistic-language embodiment of all the ideas above, which we proposed with Dexter Kozen. It works like this. Given a probabilistic type T, that is to say, a type whose elements might have been drawn at random, and a term mu of type T, a Bayesian type is just a pointed type (T, mu), where mu is to be understood as the prior. So with T morally a set of probabilistically picked elements, mu is in fact a distribution on this set of elements; hence it can serve as a prior. The semantics is of this sort: the denotation of the type T will be a space of measures over some space T, and mu will be an element of this space of measures; to be precise, a probability distribution on T. And now we can build the denotation of the pointed type as the space of all measures on T which are absolutely continuous with respect to the prior mu. Using this idea, we can formalize the mathematical description of the Bayesian inverse that I've given in the previous slides as this inference rule. There's no need to go through it in detail; just note that the keyword observe is used in some probabilistic programming languages to mean condition.
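In finite dimensions the whole story can be checked in a few lines; all numbers here are made up. The likelihood F is a column-stochastic matrix, the evidence is q = F p, and the entrywise Bayes relation G[x, y] q[y] = F[y, x] p[x] determines the adjoint uniquely wherever q is nonzero:

```python
import numpy as np

# Made-up finite Bayesian inference problem.
# Column x of F is the likelihood F(x), a distribution on Y.
F = np.array([
    [0.9, 0.2],
    [0.1, 0.8],
])
p = np.array([0.3, 0.7])  # prior on X
q = F @ p                 # evidence: the P-average of the likelihoods

# Entrywise Bayes' law: G[x, y] * q[y] = F[y, x] * p[x].
G = (F.T * p[:, None]) / q[None, :]

assert np.allclose(F.T * p[:, None], G * q[None, :])  # the adjointness relation
assert np.allclose(G.sum(axis=0), 1.0)                # each G(y) is a distribution on X
assert np.allclose(G @ q, p)                          # the Q-average of the G(y) recovers p
```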
And as you can see, observe E is a map which goes from a Bayesian type T, equipped with the syntactic embodiment of the evidence, into S, equipped with the prior. Some concluding remarks. As I've shown you, the classical description of Bayesian inference inherits an extreme form of under-specification from measure theory, namely that everything is only defined up to a set of measure zero. That means that possibly infinitely many points are actually underspecified: you can change the definition of your Bayesian inverse on these points and you still get an equivalent object. Of course this is problematic from a computer-science specification point of view, because you are going to have to choose a representative. This is addressed by first moving to this pointless representation in terms of linear operators, and by cleverly selecting the spaces to avoid the problem of under-specification coming back through the back door. And pointless is in fact far from pointless; it's very fruitful. First of all, you live in the much nicer world of linear operators, with tons of results. It connects with completely fundamental objects of probability theory. And you've got this really nice description of Bayesian inference as computing the adjoint of a linear operator. Bayesian types are simply a syntactic discipline to avoid the under-specification that comes from the classical description of Bayesian inference via the disintegration theorem. And I would even argue that Bayesian types are practical, because of the Radon-Nikodym theorem: every distribution that is absolutely continuous with respect to a reference measure has a density, and densities are the way probability distributions are typically represented in a computer. You would have maybe a piecewise polynomial interpolation of your density or something like that, but the density is the fundamental object.
So having a type of measures which are absolutely continuous with respect to a reference measure is in fact very sensible from an implementation perspective, because you will end up implementing your distributions as densities anyway. So thank you. For further information, I will point you to my paper with Dexter Kozen. Thank you very much.