Hello everyone. Thanks for coming to my talk, thanks for listening, and thanks to the organizers for the invitation. I'd like to speak today about joint distributions and probabilistic semantics.

Probabilistic semantics is a subject that goes back to around the late 70s and early 80s, starting with Giry's discovery of the so-called Giry monad, which is very important in probabilistic semantics and has been for quite a while. It was actually known to Lawvere back in 1962, although that was before the relationship between monads and adjunctions was discovered. Around that time, the early 80s, there were a lot of papers developing the semantics from different points of view; some of them are listed here. You can consider this to be prehistory. The subject lay dormant for a while until the discovery that probabilistic programming languages would be very useful in machine learning and in the statistical analysis of large datasets, and there's been a renewed flood of interest in this area since around the early 2000s.

Some of the features required in modern probabilistic programming languages that were not available in the early versions are Bayesian inference and conditioning, which are very useful in machine learning: you would like to take samples and then update a prior to a posterior guess of some probabilistic model based on the outcomes of your samples. In addition, there's interest in higher-order constructs. Since the early 2000s, people have been designing programming languages for this kind of probabilistic programming, and a very long list of them has been developed since that time; just a few are listed here. There's also renewed interest in probabilistic semantics to deal with conditioning, Bayesian inference, and higher-order programming. Here is a list of a few more modern approaches; these are a little more categorical in nature and higher-order, as I say. They all relate to each other at some level but look very different in practice, and it's still an open question exactly how they relate to each other.

To model Bayesian inference, most of the approaches, at least the ones I know about, use a technique called disintegration. It has become a very important technique for modeling conditional probabilities, conditional expectation, and Bayesian inference, and I'll talk about it in some detail later on.

In this talk, I'd like to go back to a model of Abramsky, Blute, and Panangaden from 1999 called PRel, which is a symmetric monoidal category of joint distributions. PRel limited its objects to spaces that admit disintegration, so that these probabilistic relations could be composed. What I want to do here is show that you can actually do this without disintegration, and in so doing I'm going to generalize the category to other spaces, including spaces that don't admit disintegration. We wanted disintegration so that we could do the composition, and to get it, we had to restrict the category; typically you restrict to standard Borel spaces. What I'll do is give a faithful embedding of a category called Kern, of Markov kernels, into the category of joint distributions. In fact, restricted to standard Borel spaces, the two categories are equivalent. One advantage of the joint-distribution category is that there's a symmetry between input and output.
So you don't really need disintegration, and it's a step towards a point-free approach. As an added bonus, I'll give an enhanced Radon-Nikodym construction. This is a standard technique for producing the derivative of one measure with respect to another, and I'll show how it can be done in a little more generality; this is actually necessary for the construction.

Okay, let's go back to the original system that was used to model probabilistic programs, and that's Markov kernels. They have lots of names in the literature: stochastic kernels, stochastic relations, measurable kernels. They're defined as follows. Take two measurable spaces, each consisting of a set of points or outcomes and a σ-algebra of events or measurable sets. A Markov kernel from X to Y is a map P that takes a point in the input set X and a measurable set on the output side and gives you a real number, and it has to satisfy two properties. First, for fixed first argument, P(s, −) must be a probability measure on Y. Sometimes one takes subprobability measures, but in this talk I'll restrict to probability measures; you can always recover subprobabilities by adding an extra point in which you stick all the non-halting probability. Second, for fixed second argument, P(−, B) must be a measurable function on X. The reason you want that property is so you can compose these things sequentially by Lebesgue integration; that's how they get composed. It's the measure-theoretic analog of multiplication of stochastic matrices.

These Markov kernels form the morphisms of a category called SRel, which was studied by Panangaden and by Doberkat in the early 2000s, and I'll write P : X → Y when I'm thinking of a kernel as a morphism in this category. SRel is isomorphic to the Kleisli category of the Giry monad.

Okay, so this is what a Markov kernel looks like. It's a little asymmetric: on the input side you just have a point s, while on the output side you have a measurable set B. You can think of P(s, B) as the probability that, if you start in state s and run the process P, the output state lands in the set B, or equivalently, that the event B happens upon output. This is a Kleisli arrow of the Giry monad.

You compose these things by integrating up the middle. If you have two kernels P and Q, you use the fact that P is a measure in its second argument for fixed first argument, and Q is a measurable function in its first argument for fixed second argument, and you set (P;Q)(s, B) = ∫_Y Q(t, B) P(s, dt). That is, I divide the mediating space Y into little infinitesimal regions on which the integrand doesn't vary too much, take the probability of getting into B from a point t of each region, sum over all of these little regions, and take the limit. That's the Lebesgue integral.

You can also do Kleisli lifting for the Giry monad. Instead of an input point, you assume there's a measure μ coming in on the input side, and again you integrate: you divide up, take the sum over infinitesimal regions, and take the limit with respect to μ.
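In the finite case all of this collapses to linear algebra. Here's a minimal sketch, not from the talk and with made-up numbers: kernels are row-stochastic matrices, composition is matrix multiplication, and Kleisli lifting is a vector-matrix product.

```python
import numpy as np

# A finite Markov kernel P : X -> Y is a row-stochastic matrix:
# P[s, t] = P(s, {t}), and each row is a probability measure on Y.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3]])   # kernel X -> Y with |X| = 2, |Y| = 3
Q = np.array([[1.0, 0.0],
              [0.3, 0.7],
              [0.5, 0.5]])        # kernel Y -> Z with |Y| = 3, |Z| = 2

# Sequential composition "integrates up the middle"; over a finite Y the
# Lebesgue integral is a finite sum, i.e. matrix multiplication:
# (P;Q)(s, {z}) = sum_t P(s, {t}) * Q(t, {z}).
PQ = P @ Q
assert np.allclose(PQ.sum(axis=1), 1.0)    # still row-stochastic

# Kleisli lifting: an input measure mu on X pushes forward to mu;P on Y,
# (mu;P)(B) = sum_s mu({s}) * P(s, B).
mu = np.array([0.4, 0.6])
nu = mu @ P
print(nu, nu.sum())                        # a probability measure on Y
```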
I denote this lifting by μ;P, sort of like sequential composition. What you get is a measure ν defined by (μ;P)(B) = ∫_X P(s, B) dμ(s); again, it's just Lebesgue integration. This becomes a bounded positive linear map from measures on X to measures on Y, so it's a map whose inputs are measures.

Now, this was made into a category called Kern by Dahlqvist et al. in 2018; this was at MFPS two years ago. The objects of the category are probability spaces (X, A, μ), where X is a set, A is a σ-algebra of measurable sets, typically a standard Borel space, and μ is a probability measure on that space. The morphisms are equivalence classes, modulo μ-null sets, of Markov kernels from one space of this form to another, such that ν, the probability measure on the output space, is the lift μ;P. What they show in this paper is that Kern is a dagger, a so-called involutive category, with an involutive functor from Kern to Kern^op, and this is done by disintegration. What the involution does is take this map P, which looks like a Markov kernel, and turn it around in the other direction. It satisfies these two properties: if ν is the output measure from input μ under P, then under the involution, ν becomes the input and μ the output, and P† goes in the opposite direction. So here's the picture: P goes from left to right, P† goes from right to left. You can think of P†(t, A) as the probability that the input state was in A, given that t was the output state. In this way you can draw inferences about the input distribution based on samples taken from the output distribution, and this is used to model Bayesian inference. The nice thing about doing it this way is that it works for continuous distributions, even when individual outcomes occur with probability zero, as they do with a continuous measure; it still makes sense.

Now I want to talk about joint distributions. This was done by Abramsky et al. in 1999, and what they observed was that P and P† both generate joint distributions on the product of the input space and the output space. The way you do this is take the kernel P and make a new kernel P̂ that remembers its input state: on a measurable rectangle A × B, P̂(s, A × B) = P(s, B) if s ∈ A, and zero if not. Okay, very simple. We can do the same thing for the dagger, the transpose. If I lift P̂ using the input measure μ, I get a joint measure on X × Y; and if I lift the hat of P† by the output measure ν, I get a joint measure on Y × X. It turns out these are the same measure, or rather, just transposes of each other. So we get a joint distribution in both cases, and I can draw it out like this: it looks like a big uncountable stochastic matrix, if that makes any sense. So I've got the lift of P̂ by μ, a measure on X × Y, and the lift of the hat of P† by ν, a measure on Y × X. They have marginals μ and ν, and they satisfy the property that they're transposes of each other.
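Here's the same thing in the finite sketch from before (again my own toy numbers, not the talk's): the hat construction is a pointwise product, and the dagger is just Bayes' rule wherever the output marginal is positive.

```python
import numpy as np

# Hat construction: from input measure mu and kernel P, the joint measure
# on X x Y is theta[s, t] = mu[s] * P[s, t], with marginals mu and mu;P.
mu = np.array([0.4, 0.6])
P  = np.array([[0.5, 0.5, 0.0],
               [0.1, 0.6, 0.3]])
theta = mu[:, None] * P
nu = theta.sum(axis=0)                       # output marginal nu = mu;P
assert np.allclose(theta.sum(axis=1), mu)    # input marginal is mu

# Dagger (Bayesian inversion): P_dagger[t, s] is the probability the input
# was s given output t -- Bayes' rule wherever nu({t}) > 0.
with np.errstate(divide="ignore", invalid="ignore"):
    P_dagger = np.where(nu[None, :] > 0, theta / nu[None, :], 0.0).T

# Lifting the hat of P_dagger by nu gives the transposed joint measure.
theta_rev = nu[:, None] * P_dagger
assert np.allclose(theta_rev, theta.T)       # transposes of each other
```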
Okay, so using this idea, it was suggested that maybe we should come up with a category whose morphisms are these uncountable stochastic matrices, if you will. Abramsky et al. did this, and they called it PRel. The objects are the same as in Kern, restricted to standard Borel spaces so that disintegration can be done. The morphisms are the joint distributions. The question now is, how do you compose these things? You can't really multiply uncountable matrices. But what you can do, if you can disintegrate, is decompose and then use integration. I'll show you how that works.

Before I do that, I have to say what disintegration is. This is a little different from the standard presentation, but it's equivalent. Disintegration is the inverse of the operation I did before, where I took a kernel and an input measure and lifted them to a joint distribution. Disintegration takes a joint distribution θ with marginals μ and ν and gives you back a kernel P such that ν is the lift μ;P, and such that if I take this kernel obtained by disintegration and lift it again, I get θ back. And P is unique up to a μ-null set. So μ and ν are parameters of the construction: given a joint distribution with marginals μ and ν, I can construct this P, and it's unique up to a μ-null set.

Unfortunately, this doesn't work for all spaces; there are counterexamples. So people typically restrict to standard Borel spaces, and that's what everybody does. You can do that without losing any practical applications, because just about all the spaces you ever care about are standard Borel spaces, or finite or discrete spaces. The most general result known in this area is that you can disintegrate when the spaces are countably generated and the measures are perfect; that's a recent result of Culbertson and Sturtz.

Now, once you can disintegrate, you can compose in PRel. The way that works is: if you have two joint distributions, θ from X to Y with marginals μ and ν, and η from Y to Z with marginals ν and ξ, so the mediating marginal ν is the same in both cases, then you disintegrate to get two kernels, compose the kernels in the category Kern using Lebesgue integration to get a kernel P;Q, and then lift again to get a joint distribution on X and Z. That's how it's done.

Okay, so the main result I want to show you is that you can actually do this composition independently of disintegration. You don't need disintegration at all, so you don't need to restrict to standard Borel spaces; this holds for all measurable spaces. The technique I'm going to use is an enhanced Radon-Nikodym theorem for derivatives.

Okay, so I'll define a new category that's a generalization of PRel. The objects are arbitrary probability spaces, so it's a supercategory of Kern and PRel. A morphism from (X, A, μ) to (Y, B, ν) is a joint distribution on the product space with marginals μ and ν. And I'm going to give a faithful functor from Kern to JDist that is the identity on objects: it just takes an equivalence class, modulo μ-null sets, of kernels and lifts it to a joint distribution with marginals μ and ν, as I showed before.
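Before moving on, here's the finite-case version of the PRel recipe, a sketch with invented numbers; `disintegrate` is my name for it. With finitely many points, disintegration is just dividing by the input marginal.

```python
import numpy as np

def disintegrate(theta):
    """Finite-case disintegration: recover (mu, P) from a joint theta on
    X x Y via P[s, t] = theta[s, t] / mu[s] wherever mu[s] > 0."""
    mu = theta.sum(axis=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        P = np.where(mu[:, None] > 0, theta / mu[:, None], 0.0)
    return mu, P

theta = np.array([[0.20, 0.20, 0.00],        # joint on X x Y
                  [0.06, 0.36, 0.18]])
eta = np.array([[0.26, 0.00],                # joint on Y x Z, sharing the
                [0.16, 0.40],                #   mediating marginal nu
                [0.09, 0.09]])

mu, P = disintegrate(theta)                  # kernel X -> Y
nu, Q = disintegrate(eta)                    # kernel Y -> Z
assert np.allclose(theta.sum(axis=0), nu)    # mediating marginals agree

# Compose in Kern, then lift back up: a joint distribution on X x Z.
composite = mu[:, None] * (P @ Q)
print(composite, composite.sum())
```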
Back to JDist: I'll show that this gives a symmetric monoidal category, with joint distributions as the tensors and transpose as the symmetry. So a morphism looks just like this: the dagger is literally the transpose of that uncountable matrix. So it's got a very pleasing symmetry, unlike Kern. Sorry, unlike Kern, not unlike PRel; it's the same as PRel there.

Okay, one lemma we need to prove is that the following are equivalent: that P and Q are equivalent as kernels modulo μ, and that P and Q have the same image under this functor J up to JDist. I didn't actually define that equivalence relation; it just means that for any measurable set B in the output space, the set of input points on which P(−, B) and Q(−, B) differ has μ-measure zero. What the lemma says is that the functor J is faithful.

Okay, we need Radon-Nikodym approximants, so here's the definition. Let μ and ν be finite measures, and for a measurable set B consider the set of quotients G_B = { ν(C)/μ(C) : C a measurable subset of B with μ(C) > 0 }. This set is nonempty if μ(B) > 0, and in that case it has a finite infimum, since ν(B)/μ(B) is a member, but it may be unbounded above. What I show is that for any ε > 0, there is a countable measurable partition D of the input space X such that for every B in D of positive μ-measure, the set G_B is bounded, and moreover the difference between its supremum and its infimum is at most ε. There's another inequality that I also show, for technical reasons. Moreover, these properties are preserved under refinement, where refinement means the usual refinement of countable measurable partitions of the space.

Using these definitions, I get the Radon-Nikodym approximants. These are the same approximants you get in the proof of the Radon-Nikodym theorem: step functions built from these quotient values. And it turns out you can prove this enhanced version of the Radon-Nikodym theorem. The first statement gives Lebesgue decomposition: you can decompose ν into two measures ν₀ + ν₁, and there's a measurable set F of μ-measure one such that ν₁(F) = 0, so ν₁, the singular part, lives on the μ-null complement of F. The second is the Radon-Nikodym theorem itself: you get a derivative f such that ν₀ is its integral, ν₀(A) = ∫_A f dμ; that's the piece of ν carried on F, and it vanishes outside of F. These two parts are what you find in any textbook. But what you don't find, and what I need, is a uniform convergence property of the approximants. Usually when you see the statement of the Radon-Nikodym theorem, you don't see the approximants anymore; they kind of go away.
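As a toy illustration of these approximants (my own example, not from the talk): on X = [0,1] with μ Lebesgue measure and ν given by density 2x, the step function taking value ν(B)/μ(B) on each block B of a partition lies between the infimum and supremum of G_B, and refining the partition squeezes it onto the derivative.

```python
import numpy as np

# mu = Lebesgue measure on [0,1]; nu has density 2x, so nu([a,b]) = b^2 - a^2.
# On each block B = [a,b] of a partition, the approximant takes the constant
# value nu(B)/mu(B) = (b^2 - a^2)/(b - a) = a + b, a member of G_B.
def approximant(n, x):
    """Step-function approximant for the partition of [0,1] into n equal
    intervals, evaluated at x."""
    k = min(int(x * n), n - 1)        # index of the block containing x
    a, b = k / n, (k + 1) / n
    return (b ** 2 - a ** 2) / (b - a)

# Refinement drives the approximants uniformly onto the true derivative
# d(nu)/d(mu) = 2x; the sup-error over [0,1] is 1/n (up to grid resolution).
xs = np.linspace(0.0, 1.0, 1001)
for n in [2, 8, 32, 128]:
    err = max(abs(approximant(n, x) - 2 * x) for x in xs)
    print(f"n = {n:4d}   sup-error ~ {err:.4f}")
```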
What the uniform convergence property says is that there's a sequence of lower approximants that is monotone non-decreasing and a sequence of upper approximants that is monotone non-increasing, and both converge pointwise to the derivative f and converge uniformly on F, where F is a set of μ-measure one.

Okay, so if ν is absolutely continuous with respect to μ, and you take ν₀ and ν₁ to be ν and zero respectively, then you get the classical Radon-Nikodym derivative dν/dμ. There's really nothing new in anything I've done here; it's all in the standard references. This is the scaffolding that goes away when you state the theorem, but all of the stuff I mentioned is implicit in the proof, and you just have to go in and dig it out. Unfortunately, I need it; I can't do without it for what I want to do.

I'd like to note that the value of the integral is independent of the choice of the sequence of countable measurable partitions, but F and f are not, and nothing works for all sequences at once, which is why the Radon-Nikodym derivative is only defined up to a μ-null set.

Okay, so here's how you do composition in JDist. Given two joint distributions θ and η, with θ having marginals μ and ν and η having marginals ν and ξ, you compose them this way: the value of the composition on a measurable rectangle A × C is the limit, over countable measurable partitions D of the mediating space, of this sum:

  (θ;η)(A × C) = lim_D Σ_{B ∈ D, ν(B) > 0} θ(A × B) · η(B × C) / ν(B).

Now, I have to emphasize that this is a kind of weird limit, because nothing works for all countable measurable partitions; you can't have a single F and f from the last slide that works for all of them, there are too many. But what you can show is that if you have two sequences of countable measurable partitions that both converge, then they converge to the same value, and moreover there's a least common refinement of the two that gives the same value. So you get a unique limit, if you get a limit at all, and the composition is just defined this way.

Okay, so you have to show that this limit exists, and that comes from the Radon-Nikodym approximants I mentioned; you have to go through the argument that everything converges the way you would like. We only defined the composition on measurable rectangles, but it extends to a joint probability measure on the product space by the Carathéodory–Hahn–Kolmogorov extension theorem.

So what we get is a faithful embedding of Kern into JDist, a bona fide symmetric monoidal category. It's completely symmetric in the input and output spaces, so there's no difference; you can just turn things around. The tensors are the joint distributions. I would like to get some kind of closed structure on this if I can, because after all, joint distributions are distributions, and that's what the objects are, so there might be a possibility of getting some kind of closed structure. It's also a dagger category, an involutive category: it has an involution, a symmetry, and that's just composition with the transpose operation on the underlying space.
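To make the composition formula concrete, here's the finite instance, a sketch with my own numbers; with Y finite, the partition into singletons refines every other partition, so the limit is attained there and no kernel is ever constructed.

```python
import numpy as np

def jdist_compose(theta, eta):
    """Disintegration-free composition on finite spaces:
    (theta;eta)(A x C) = sum_t theta(A x {t}) * eta({t} x C) / nu({t}),
    summing over the singleton partition of the mediating space Y."""
    nu = theta.sum(axis=0)                     # mediating marginal on Y
    assert np.allclose(nu, eta.sum(axis=1))    # marginals must match
    out = np.zeros((theta.shape[0], eta.shape[1]))
    for t in range(len(nu)):
        if nu[t] > 0:                          # skip nu-null blocks
            out += np.outer(theta[:, t], eta[t, :]) / nu[t]
    return out

theta = np.array([[0.20, 0.20, 0.00],          # joint on X x Y
                  [0.06, 0.36, 0.18]])
eta = np.array([[0.26, 0.00],                  # joint on Y x Z
                [0.16, 0.40],
                [0.09, 0.09]])

composite = jdist_compose(theta, eta)          # joint on X x Z
print(composite, composite.sum())
```

On these inputs this agrees with the disintegrate-compose-lift recipe sketched earlier, as it should.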
So in conclusion, I've given a definition of composition of joint distributions without reference to disintegration, so it holds in more generality than previously known; you don't need to assume standard Borel spaces. You get a symmetric monoidal, though not closed, category JDist of joint distributions, into which the Markov kernels embed faithfully. For the future, I'd like to give a point-free treatment and a closed structure, if possible. Okay, thanks very much. And thanks to these people you see listed here, all of whom helped out, and especially to Sam, who caught a serious error in an earlier draft of this paper. Thanks very much. See you soon.