My name is Ohad Kammar, and I want to thank you for the invitation to speak at the MFPS special session on probabilistic programming languages. Probabilistic programming languages have become a whole topic in the area, and in the last few years we've seen quite a lot of languages that go by that name. What's nice about this area is that the practitioners seem to be listening carefully to semanticists and vice versa, so there's a very nice interplay between the semantics and the language design, and for me that's very exciting. Matching this proliferation of languages, we're seeing a proliferation of semantics across the major sub-communities in our area: MFPS especially, which has been doing it for many years, but also LICS and POPL and the more applied conferences like NeurIPS and PLDI. Several years ago Prakash asked me to try to build bridges between very different semantics for probabilistic programming languages. That's very hard, so I thought of doing something slightly easier: finding ways to build bridges and connections between languages. There are so many languages out there, and what I'm going to do today is propose a way of organizing them. It's based around two axes: a probabilistic programming language will have some constructs for sampling and some constructs for conditioning, and I've identified one axis for each; this talk is about fleshing these out. Now, I'm not saying every probabilistic programming language fits nicely into this square; some fall outside it. But if two languages are in the square we can try to build bridges between them, and if a language is not, then we know it's something quite different.

So, the two axes. The first, for sampling, is about being graded or non-graded, which I'll explain over the rest of the talk. The second, for conditioning, is whether you condition based on a density or based on a distribution. This is part of ongoing work, so I'll spend most of the talk in the top left of the square, explaining what a graded, density-based language and its semantics look like. Then I'll go down the sampling axis and spend some time on what a non-graded language looks like and what its semantics looks like, and then spend very little time on the remaining corner: I don't have a good answer for the fully general non-graded, distribution-based language, but I'll outline some thoughts I've been having there, and hopefully that can start further discussion about building bridges between different languages and semantics.

First, I'll talk generally about probabilistic programming languages in this context, and then go left to right, top to bottom. Very basically, we're in the game of building statistical distributions. What does that usually look like? We have some data set, in this case three pairs of (x, y) inputs, and we're trying to explain statistically some process that generated those data points. Here I'm looking at a very simple Bayesian regression, in fact a linear one: we posit some linear process y = a*x, we don't know what a is, there's some distribution over this a, and our goal is to find that distribution. What we do is write a little generative model for generating this data. Say a is a priori distributed normally around zero with standard deviation 2; if you look at the density function for that, it's a very wide distribution around zero. Then we take the three data points: we're observing at x = 1, x = 2, x = 3, so y should be a*1, a*2, a*3, and we add some measurement noise with standard deviation 0.25. We condition on the three observed data points, and that's what the model looks like. And what Bayesian posterior distribution does it represent? We can describe its density function: the probability that a lies between a lower bound l and an upper bound u under the posterior is proportional to

  integral from l to u of pdf_Normal(0, 2)(a) * product over i = 1..3 of pdf_Normal(a*x_i, 0.25)(y_i) da,

so you integrate against the prior's density and then multiply by the three observation densities, normally distributed with the appropriate numbers. If you plug the data in, this is what you get, and when you normalize this integral you get this density, just slightly below one. That's a very basic intro to probabilistic programming: we want to write these models in a programming language, so that machines can understand them, analyze them, run them, and also scale them up to many more data points.
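As a minimal numeric sketch of this calculation in Python (illustrative only; the three observed y values below are hypothetical, since the talk doesn't give them):

```python
import numpy as np
from scipy.stats import norm

xs = np.array([1.0, 2.0, 3.0])
ys = np.array([2.1, 3.9, 6.2])   # hypothetical observations

def unnorm_posterior(a):
    # prior density Normal(0, 2) times the three observation densities
    return norm.pdf(a, 0.0, 2.0) * np.prod(norm.pdf(ys, a * xs, 0.25))

# normalize by numerically integrating over a grid of slopes `a`
grid = np.linspace(-10.0, 10.0, 20001)
values = np.array([unnorm_posterior(a) for a in grid])
evidence = values.sum() * (grid[1] - grid[0])   # normalizing constant
posterior = values / evidence                   # integrates to ~1 on the grid
```

The normalizing constant computed here is the model evidence that comes up again later in the talk.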
Abstracting away from this example: a probabilistic programming language, in the setting I'm talking about, has two core constructs, one for sampling and one for conditioning. In the previous example we sampled an a out of a normal distribution; in general, we sample some x out of a distribution and bind it in the rest of the program. The other construct says: we have some value and some other distribution, and rather than sampling, we update our distribution based on this value. Previous presentations might use the same notation for both constructs, so I'm adopting a notation by Ramsey and Shan where we add a little arrow to the sampling symbol telling us which way the information flows: x <~ mu means that from this distribution I'm binding a value into x, and going the other way around, M ~> mu means I'm using the value M to update my distribution. These are the two core constructs of our probabilistic programming language. Then we have other constructs, the most important one for this talk being sequencing: these two probabilistic constructs are effectful constructs, so we sequence them the way we would in any other functional language. And of course you can add lots of other features, higher-order functions, inductive data types, state and other effects. I won't talk about those in this talk; I want to focus on the probability theory, first order, so it's a very simple semantic setting, though there are lots of extensions in all kinds of directions, and it's a very fruitful area for research.
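To fix intuitions about these two constructs and the arrow notation, here is a tiny hypothetical embedding in Python; the names sample and observe and the running-weight design are illustrative, not any real library's API.

```python
from scipy.stats import norm

weight = 1.0   # running model weight, multiplied into by observe

def sample(dist):
    # x <~ dist : draw a value from the distribution and bind it
    return dist.rvs()

def observe(value, dist):
    # value ~> dist : condition, multiplying in the density at `value`
    global weight
    weight *= dist.pdf(value)

# the regression model from the opening example, with hypothetical data
a = sample(norm(0.0, 2.0))                        # prior on the slope
for x, y in [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]:
    observe(y, norm(a * x, 0.25))                 # condition on each point
```

Running the model draws a slope from the prior and leaves in weight the product of the three observation densities, exactly the multiplicative structure in the posterior above.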
Concretely, then, let's build up this language as we go through the taxonomy I'm proposing. All the languages we'll consider have some base types: a finite discrete type, that is, a finite set of labels; the countable natural numbers; and continuous intervals, from 1 to 2, say, or from minus infinity to plus infinity. We allow ourselves both, so it's a very simple collection of types, some continuous, some discrete. I think it's important to include the continuous ones because they force you to really work out the measure-theoretic foundations of what's happening. Some probabilistic programming languages have no continuous types, but this talk will include both discrete and continuous. I'll also have deterministic variable contexts, just tuples of those base types. This part is the same in all the quadrants I'll consider in this talk.

So we're starting at the top left corner: graded sampling with density-based conditioning. To set this up, we have a syntactic class of stock measures. What will those be? They're the measures we'll use to integrate over the types from the previous slide. The simplest is for a finite type with labels l1 to ln; statisticians call distributions over it categorical distributions, and the stock measure there will be, as we'll see in a few slides, the counting measure: we're just counting how many l1s we have, how many lns, and so on. Similarly, for the natural numbers we take the counting measure over the naturals: give it a subset of the naturals and it assigns how many elements are in it, including infinity. Then we have the Lebesgue measure over a closed interval [a, b], and the Lebesgue measure over the whole real line. These are the four stock measures we'll be using; in particular applications you might choose different stock measures, with slightly different sampling constructs, but to demonstrate what's happening I'll just take these four. Using those stock measures we create sample spaces, which are just contexts of stock measures.

Once we have a stock measure, we can talk about probability distributions that have a density with respect to it. Matching them against the stock measures: first, categorical distributions, where m1 to mn are the relative weights of the points, with density with respect to the counting measure on the finite set; the geometric distribution over the naturals, parametrized by the probability of success in each experiment; the uniform distribution over a closed interval [a, b], where a and b have to be finite; and the normal distribution over the whole real line, for which we supply a mean and a standard deviation. So we have a little syntax for stock measures, and once we know what the stock measure is, we can talk about measures that have a density with respect to it.
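Here's a small Python sketch of these four stock measures and the distributions-with-densities matched against them, under an ad-hoc illustrative representation: each distribution is a density function paired with the stock measure it's taken against.

```python
import math

# the four stock measures, as symbolic tags
COUNTING_FIN = ("counting", "finite")     # counting measure on a finite label set
COUNTING_NAT = ("counting", "nat")        # counting measure on the naturals
LEBESGUE_INT = ("lebesgue", "interval")   # Lebesgue measure on a closed [a, b]
LEBESGUE_R   = ("lebesgue", "real")       # Lebesgue measure on the real line

def categorical(weights):
    # density w.r.t. the counting measure: relative weight of each label
    total = sum(weights.values())
    return (lambda l: weights[l] / total), COUNTING_FIN

def geometric(p):
    # density (pmf) w.r.t. the counting measure on the naturals
    return (lambda n: (1 - p) ** n * p), COUNTING_NAT

def uniform(a, b):
    # density w.r.t. Lebesgue measure on [a, b]; a and b must be finite
    return (lambda x: 1.0 / (b - a) if a <= x <= b else 0.0), LEBESGUE_INT

def normal(mean, sd):
    # density w.r.t. Lebesgue measure on the whole real line
    return (lambda x: math.exp(-0.5 * ((x - mean) / sd) ** 2)
                      / (sd * math.sqrt(2 * math.pi))), LEBESGUE_R
```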
Now let's start describing the type system. We'll have two kinds of judgment. The first is the term judgment: it says that M, in context Gamma, assuming a sample space Omega, has type A. For example, if I have M and N and I want to sequence them, let x = M in N, then if M has sample space Omega1 and N has sample space Omega2, I concatenate the two sample spaces together. So it's a graded type system, a very standard concept in programming language theory, and that is what the gradedness means: we're syntactically keeping track of the shape of the samples we've drawn. In languages like Stan you really do have to carefully declare what shape of sample space your program uses; it has a very rigid structure. That's one kind of judgment, the term judgment. The other is the distribution judgment, where we say with respect to which stock measure the distribution term mu has a density. For example, if I have a term calculating the mean, a real number in some sample space Omega1, and another term calculating the standard deviation, of type (0, infinity) in some sample space Omega2, then I can form the normal distribution with that mean and standard deviation, and it has a density with respect to the Lebesgue measure. That's the second kind of judgment I'll be making.

The two core typing rules are then the rule for sampling and the rule for conditioning. If I want to sample, x <~ mu in N, then mu must have a density with respect to some stock measure P, and N, once x has been bound deterministically, has some type A and some sample space; I then put the sample spaces together, with P in the middle. And of course you can think of endless extensions: manipulating this sample-space syntax structurally to shuffle things around, conditionals, traces, and so on. There's an infinite rabbit hole you can go down there, and I won't go down it today; we'll just treat sample spaces completely rigidly. That's sampling. Conditioning is a bit simpler from the type system's point of view: we have a term M and a distribution mu with a density with respect to P, and we condition by writing M ~> mu. The only wiggle is that since mu has a density with respect to P, and P has some underlying space it's a measure over, M has to have that type: if P is the Lebesgue measure, that's the whole real line; if P is a bounded Lebesgue measure, it's the interval it lives on; if P is the infinite counting measure, it's the natural numbers; and so forth. Conditioning itself just has the unit type, the single finite type with the single constructor called star. So far, that's the language: graded sampling with density-based conditioning.
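Before the semantics, a toy Python rendering of how these graded judgments track sample spaces; the representation is purely illustrative, not from the talk.

```python
# a sample space is a list of stock measures; each rule computes the
# sample space of the compound term

LEB = "lebesgue"   # Lebesgue measure on the real line
CNT = "counting"   # counting measure

def seq(omega_m, omega_n):
    # let x = M in N : concatenate the two sample spaces
    return omega_m + omega_n

def sample(p, omega_n):
    # x <~ mu in N, where mu has density w.r.t. stock measure p:
    # p contributes one dimension, placed before N's sample space
    return [p] + omega_n

def observe():
    # M ~> mu : conditioning draws nothing; empty sample space, unit type
    return []

# example: sample a slope (one Lebesgue dimension), then condition three times
omega = sample(LEB, seq(observe(), seq(observe(), observe())))
assert omega == [LEB]
```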
For the semantics: every base type, a finite type, the natural numbers, or a continuous interval, denotes a very well-behaved measurable space, a standard Borel space, and a stock measure P denotes a sigma-finite measure over that space. In general there's a semantic question mark over exactly what property these measures should satisfy, but in this very specific case I just take the intended measure, the counting measure or the Lebesgue measure. For example, the stock measure for the categorical type over l1 to ln denotes the measure that counts the occurrences of each label; the Lebesgue measure means integrating with respect to the Lebesgue measure; and so forth. A sample space Omega denotes the product of its stock measures, taking the product measure between them. That's the straightforward part.

To interpret terms: a term is interpreted as two component functions. One is a valuation: given an element of the environment and a choice of the probabilistic choices in our sample space, we evaluate to a value in the space denoted by A. The other is a density: for every choice of environment and of values for the sample space, it gives back a weight, a number between zero and infinity, with both endpoints included. You can think of all kinds of semantic invariants you might want to enforce on these functions, for example invariance under changes of the sample space corresponding to independent choices, and so on; that relates to work by Kozen, Panangaden, Scott, and others, so there's a potential bridge there, but I'm not going to go down the path of pinning down those exact invariants. The semantics we give is very intensional in that sense. Similarly, we define distributions to be densities: mu is a distribution with density with respect to P if, for every choice of environment and of parameters, mu being parametrized by some sample space Omega, it gives a weight; the meaning of mu is just the density of the distribution with respect to its stock measure.

So, for example, the semantics of sampling: what is the density of x <~ mu in M? Give me the choices: w1, the choices used in defining mu; a, the value in the space mu distributes over; and w2, the choices to be made in M. The density contributed at a is the density of getting a according to the measure mu, and sequencing multiplies in the density of M at w2. The semantics of conditioning: once I know what M and mu are, I just multiply the density of producing M by the density of mu at the point M. That's very natural, and it gives exactly the semantics on the opening slide, where we integrate against the prior and then multiply by the three observation densities.

The rest of the semantics is very standard once we realize that what we have here is a graded monad. I won't go into too much depth, but it's a kind of graded reader-writer monad: once you give me the elements of the sample space, I give you back a value and a weight, and the weights combine multiplicatively according to the monoid structure we have on [0, infinity]. Once you see that, you can work out how the rest of the semantics fans out, and if you start adding more features, so long as they behave nicely with respect to sampling and conditioning, their semantics will be given in the usual graded-monad style.

So far, that's the semantics of the language. I want to add two more points about probabilistic programming semantics. First, the semantics we're giving is quite intensional; you can ask what a more extensional semantics would be. Every model, that is, a term M in context Gamma with sample space Omega and type A, determines a kernel from Gamma to A. What kernel? Given any choice of deterministic variables in Gamma and a measurable subset U of A, the probability of landing in U is obtained by integrating over the sample space the density at each point, restricted to where the value lands in U; you can put the U on the integral or just integrate against the characteristic function of U. So every model describes a kernel, and every kernel has a model evidence function, which is just its total measure: the model evidence of M, once I've chosen the deterministic parameters in Gamma, is the total measure of the whole space under that kernel.
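Here's a rough executable sketch of this graded reader-writer structure, with illustrative simplifications throughout: sample spaces are tuples of real-valued choices, a term of grade n maps an environment and an n-tuple to a value and a weight, sequencing splits the tuple and multiplies the weights, and the model evidence integrates the weight over the sample space. Passing the continuation's grade to bind explicitly is a shortcut for what the type system computes syntactically.

```python
from scipy.stats import norm

# a term of grade n is a pair (n, f) with f(env, omega) = (value, weight),
# where omega is an n-tuple of real-valued choices

def ret(x):
    # pure value: empty sample space (grade 0), weight 1
    return (0, lambda env, omega: (x, 1.0))

def bind(m, k, k_grade):
    # sequencing: split the sample space, multiply the weights
    n1, f = m
    def g(env, omega):
        v1, w1 = f(env, omega[:n1])
        _, h = k(v1)
        v2, w2 = h(env, omega[n1:])
        return v2, w1 * w2
    return (n1 + k_grade, g)

def sample(density):
    # x <~ mu : one fresh dimension; the weight is mu's density there
    return (1, lambda env, omega: (omega[0], density(omega[0])))

def observe(value, density):
    # value ~> mu : grade 0; multiply in mu's density at `value`
    return (0, lambda env, omega: ((), density(value)))

def evidence(model, env, grid):
    # model evidence for a grade-1 model: integrate the weight numerically
    n, f = model
    assert n == 1
    return sum(f(env, (w,))[1] for w in grid) * (grid[1] - grid[0])

# the regression model again, with the same hypothetical data
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

def conditioned(a):
    m = ret(a)
    for x, y in data:
        m = bind(observe(y, norm(a * x, 0.25).pdf), lambda _u, m=m: m, 0)
    return m

model = bind(sample(norm(0.0, 2.0).pdf), conditioned, 0)   # total grade 1
```

For this model the weight at a sample-space point a is the prior density times the three observation densities, so evidence reproduces the normalizing constant from the first sketch.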
I don't have a very good semantic understanding of model evidence beyond what I've just said; it's just that statisticians seem to really care about it, and they use it to somehow debug their models. If it's very close to zero they take it as a very bad fit, and if it's very close to infinity they also worry, and say the model might diverge, and so on. When you use languages like Stan they really do tell you what's happening to this model evidence, and you can use it to check whether your model is robust across different data sets. So there's some semantic story there, and a semantics should account for model evidence: if your semantics always normalizes to one, there's something missing, something that people in practice seem to care about that you're not modeling.

So that's a very quick description of the top left corner, the very simple corner: graded sampling, density-based conditioning. A lot of how statisticians think, and how probabilistic programmers think in the big languages like Pyro and Stan, really is in these terms: an explicit description of the sample space, very fine control over it, and conditioning always with respect to densities. That's the corner where we more or less know what's happening, and even this short description means you can do quite a lot of good work there from a semantics perspective, just because the semantics is so straightforward.

What I glossed over in this description are some foundational issues. I won't spend many slides on them, and this is just one take on how you might overcome them, but I do want to say a couple of words. In the course of describing the semantics we talked about W^Omega, the space of functions from Omega to the space W of weights [0, infinity], which in general is not going to be a measurable space: when you exponentiate measurable spaces, there's just no nice measurable structure there. Many people in this session have talked about that, so I won't go into too much detail, but the way I get around it is a mathematical structure we've been developing over the last few years called quasi-Borel spaces; the first paper appeared in 2017 at LICS, with Chris Heunen, Sam Staton, and Hongseok Yang. Briefly, to give you a flavor, a quasi-Borel space is a pair of two things: a set, the space of points, together with a collection of functions from the real line into the space X, closed under some axioms. We call these functions random elements, because they correspond to the random elements of probability and statistics, and the axioms say things like: every constant function is a random element, and precomposing a random element with a measurable function on the reals gives a random element. The point is that every measurable space has a quasi-Borel space structure, obtained by taking the random elements to be all the measurable functions from the reals into that space, so measurable spaces embed into this universe. And it really is a universe: it's a category. A morphism from a space X to a space Y is a function from the points of X to the points of Y such that whenever I compose a random element of X with this function, I get a random element of Y; it's a rather mild closure condition.
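Purely as a toy to fix the shape of this definition, here's a symbolic Python rendering; it is illustrative only, since the closure axioms aren't computationally checkable in general, and the representation is not from the talk.

```python
from typing import Any, Callable

# a quasi-Borel structure on a carrier X: a predicate saying which
# functions R -> X count as random elements (the closure axioms are
# stated in comments; they can't be decided mechanically)
RandomElements = Callable[[Callable[[float], Any]], bool]

def qbs_of_measurable_space(is_measurable_map: RandomElements) -> RandomElements:
    # every measurable space yields a quasi-Borel space: the random
    # elements are exactly the measurable functions from the reals into it
    return is_measurable_map

def is_morphism(f, rand_x: RandomElements, rand_y: RandomElements, alphas):
    # f : X -> Y is a morphism iff composing each random element of X
    # with f yields a random element of Y (checked on a finite sample here)
    return all(rand_y(lambda r, alpha=alpha: f(alpha(r)))
               for alpha in alphas if rand_x(alpha))
```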
The point about this category QBS is that if I look at standard Borel spaces either as measurable spaces or as quasi-Borel spaces, the morphisms between them are the same, so over the well-behaved spaces, the standard Borel ones, we have a conservative extension of the universe of measurable spaces. The benefit is that quasi-Borel spaces form a very well-behaved semantic universe where measurable spaces do not: we have function spaces, we have quotients, we have subspaces, and so on. So the development I do here, whenever I need function spaces and the like, happens internally to quasi-Borel spaces, and of course I instantiate everything at the first-order fragment of standard Borel spaces. So I haven't changed anything; it's as if I'd been working with standard Borel spaces all along. And it's very liberating: I don't have to check measurability conditions at all, everything just works out fine. Even though I'm not actually doing higher-order semantics, I'm defining my semantics in a metatheory that is higher order, and it's very pleasant. I strongly recommend trying that. Quasi-Borel spaces are how I do it, because I'm familiar with them; if you have a better semantic framework, try that and see how it works.

That was the first note on foundations. Second, in order to talk about distributions, we have a monad on quasi-Borel spaces called the distribution monad, which you can read about in a POPL 2018 paper. On a standard Borel space S, its set of points is the s-finite measures on S, and its random elements are the s-finite kernels from the reals into S; an s-finite measure is just an s-finite kernel from the singleton space. So what's an s-finite kernel? It's a countable combination of probability kernels: k is s-finite if there's a countable collection of probability kernels k_n and a countable collection of weights between zero and infinity such that k is their weighted sum, and the collection of all such kernels is the s-finite kernels. They're suitable for probabilistic programming semantics: Sam Staton proved in 2017 that you can define all of them in a very minimal first-order probabilistic programming language, so somehow it's a very nice semantic universe to stay within, and that's the monad I work with. I should say this is a monad on quasi-Borel spaces; we currently don't know whether there's a monad on standard Borel spaces that gives you the s-finite kernels, so this works specifically for quasi-Borel spaces. But again, I'm only talking about first-order programs, so I'm quite happy to live with that restriction.
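A small sketch of the s-finite condition, with a finite truncation that is an illustrative shortcut: the kernel is represented by a stream of weighted probability kernels, and measures are computed as truncated weighted sums.

```python
from itertools import islice
from scipy.stats import norm

def sfinite_kernel(make_weighted_kernels):
    # an s-finite kernel as a countable weighted sum of probability kernels:
    # make_weighted_kernels() yields pairs (w_n, k_n) with w_n in [0, inf)
    # and each k_n(x) a probability distribution
    def measure(x, lo, hi, terms=1000):
        # (truncated) measure of the interval [lo, hi] at parameter x
        return sum(w * (k(x).cdf(hi) - k(x).cdf(lo))
                   for w, k in islice(make_weighted_kernels(), terms))
    return measure

def weighted_kernels():
    # example stream: x maps to a weighted mixture of shifted normals
    n = 0
    while True:
        yield 0.5 ** n, (lambda x, n=n: norm(x + n, 1.0))
        n += 1

m = sfinite_kernel(weighted_kernels)
print(m(0.0, -1.0, 1.0, terms=50))   # approximate measure of [-1, 1] at x = 0
```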
Now let's move on. Previously I talked about the top left corner, where sampling is graded and conditioning is density-based; we're going to go down the sampling axis, with sampling becoming non-graded. What does that look like? Programming with an explicit description of the sample space Omega is tedious, in the same way that manually annotating the resources in your program is tedious: you'd rather have an analysis. So we prefer to drop the Omega from the judgments and just say that M, in a context of variables, has type A, without keeping track of the shape of the sample space, and similarly for the distribution judgments. Semantically, instead of working with densities for everything, we work with s-finite kernels: every term denotes an s-finite kernel from the variables to the return type. But we're still on the density side of the conditioning axis: when we condition, we still think of distributions as densities. So we can now include arbitrary probability distributions as sampling primitives in our language; whatever distribution you want to sample from, you can add it, and of course you can sample from arbitrary subterms. But when you condition, you need to know a density: only when mu has a density with respect to some stock measure can I write the conditioning M ~> mu. So I've changed only sampling, not conditioning, and as a consequence the modeling language has become more convenient.

There's a trade-off, though, and the downside is that inference becomes more difficult; or rather, when I know the shape of the sample space, I can do very efficient inference over it. For example, Stan employs a very efficient inference method called Hamiltonian Monte Carlo, and also ADVI, an automatic differentiation variational inference algorithm. It's a big system, so it's a bit more complicated than that, but these specific algorithms inside Stan really depend on having an explicit description of the sample space: once you know it up front, you can do very efficient inference.

So what can we do? I'm not advocating any one of these, just outlining the current approaches out there. One option is to pay the price: if you want a richer modeling language, you sacrifice inference a bit and move towards non-graded sampling. Or the other way around: if you really care about inference, you have a lot of data and want very accurate answers, then you do what we do in program optimization and change the program. You rewrite the model a bit, reparameterizing it, to make it clear to the runtime system or the compiler how to run efficient inference on it, and you get the extra benefit in inference. That's one thing people do, and Stan is highly used, easily thousands of users. The other option is to use static analysis to compile a non-graded program into a graded program. For example, there's very nice work from last year by Gorinova, Gordon, and Sutton called SlicStan, where you write a model in a non-graded language, they run a static analysis on it, an information-flow analysis, to generate a graded program, and they feed that into Stan. Another piece of recent nice work, by Lerau, uses Haskell's very expressive type system, extended with row types, to keep track of the sample space in the types as you write. It feels like you're writing in a non-graded setting, but under the hood the compiler works out what the grading is, and if it can't work it out, it complains. So you're programming in a graded world while feeling like you're in the non-graded world most of the time. Very impressive recent work.
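To illustrate the kind of model rewriting mentioned above, here's a toy reparameterization in Python (a hypothetical example in pseudo-model style, not actual Stan): the direct version samples from a parameter-dependent distribution, while the rewritten version fixes the sample space up front as standard normal draws and applies a deterministic transform, which is the explicit shape that algorithms like Hamiltonian Monte Carlo exploit.

```python
from scipy.stats import norm

# direct style: sample straight from parameter-dependent distributions;
# the shape of the sample space is left implicit
def model_direct():
    mu = norm(0.0, 1.0).rvs()
    y = norm(mu, 0.5).rvs()     # this distribution depends on an earlier draw
    return y

# rewritten style: the sample space is fixed up front as two standard
# normal draws, and everything else is a deterministic transform
def model_reparameterized():
    z1, z2 = norm(0.0, 1.0).rvs(), norm(0.0, 1.0).rvs()
    mu = 0.0 + 1.0 * z1
    y = mu + 0.5 * z2
    return y
```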
That was going down the sampling axis. Very briefly, let me talk about going along the other axis, turning densities into distributions for conditioning. This is still ongoing work, so it's a bit loose, and I'll have just one slide on it. We put the grading back, so we always know what sample space we're in, but we allow conditioning with respect to arbitrary distributions. They still have to have a density with respect to the stock measure P, so there is a density, I just don't know what it is, and I allow arbitrary terms there. Previously we knew exactly what the distribution term was, a Gaussian or a uniform or a categorical and so on; now it can be any term. How does the semantics change? We keep track of a distribution over the sample space: the semantics has two components, a distribution that has a density with respect to the sample space's stock measure, together with a random variable. I call the first component the latent semantics, since it's a distribution over the latent variables, as the statisticians would call them, which here means over the sample space; and then there's a valuation, much as before, except that I keep track of this distribution rather than of a density. Now, when I condition, I have to do a little wiggle, and that has to do with disintegration. Roughly, I have my sample space and a valuation from it into the space P distributes over; I disintegrate with respect to the stock measure, which I require to be sigma-finite; and then, given the distribution I'm conditioning against, I push things along and extract whatever density there is. You have to use disintegration to make that work. That's roughly what happens, but I don't want to go into too much detail, because it's still ongoing work, and I wouldn't stand too firmly behind it until I've calculated a few more examples, made sure of it, and worked out the relationships it gives you across the square.

What I'm hoping is that it will allow us to build bridges; that's the point. There's very recent work by Shan, Ramsey, Narayanan, and others on Hakaru, where they use disintegration to give semantics to probabilistic programs, and I'd really like to present that work in this setting of distribution-based conditioning. There's also good unpublished work by Om Matjesen, taking disintegration forward with sophisticated algorithms for computing disintegrations, because at the end of the day you have a program and you want to disintegrate it, which is very hard to do, so you have to come up with algorithms for it. And this formulation is very close to impressive work by Fredrik Dahlqvist and collaborators on Bayesian inversion; that's where I'm taking this idea from: conditioning as a kind of Bayesian update, via disintegration. One thing to watch out for is that there's some foundational care to take. Disintegrating sigma-finite measures is fine, but we then need disintegration, as an operation, to itself be measurable; that can also be fine, there's a theorem by Kallenberg about a uniform disintegrator, but you have to be careful to make it work. And the Bayesian inversion likewise needs to itself be measurable. So there are a few steps still in progress; I'm just trying to excite you about some current developments in this area.
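As a very rough discretized picture of what disintegration provides (a hypothetical sketch that ignores all the measure-theoretic care just mentioned): given a joint density on a grid, disintegrating along one coordinate yields, for almost every value of that coordinate, a normalized density over the other.

```python
import numpy as np

def disintegrate(joint, xs, ys):
    # joint[i, j] ~ density p(x_i, y_j) on a grid; returns the marginal
    # on y and a kernel j -> conditional density p(x | y_j), where defined
    dx = xs[1] - xs[0]
    marginal = joint.sum(axis=0) * dx        # p(y_j) = integral of p(x, y_j) dx
    def kernel(j):
        if marginal[j] == 0:
            raise ValueError("disintegration undefined on a null slice")
        return joint[:, j] / marginal[j]     # density of x given y_j
    return marginal, kernel

# example joint: x ~ Normal(0, 1), y = x plus Normal(0, 0.5) noise
xs = np.linspace(-4.0, 4.0, 401)
ys = np.linspace(-4.0, 4.0, 401)
X, Y = np.meshgrid(xs, ys, indexing="ij")
joint = np.exp(-0.5 * X**2) * np.exp(-0.5 * ((Y - X) / 0.5) ** 2)
marginal, kernel = disintegrate(joint, xs, ys)
cond = kernel(250)   # conditional density of x given y = ys[250] (= 1.0)
```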
I haven't talked at all about the fourth square, the bottom right of this taxonomy: non-graded sampling with distribution-based conditioning. I don't know which languages sit in that square yet. Maybe there's a new one to be discovered, maybe some of the languages already out there fit it, or maybe one has to compile across the square, from the bottom right corner up to the top left corner, to make it work. I don't have much that's specific to say about this area yet; watch this space.

So, to summarize: I outlined this square. Again, not all languages fit in it; some fit elsewhere. But the point is that if we know whether we can place a language inside the square or outwith it, we already have some crude ways of relating different languages and different semantics and comparing them. Within the square, I understand the sampling axis very well; the conditioning axis is still ongoing work for me, really exciting, and there's potentially cool maths there. Once we have a good understanding of both axes, hopefully we can fill in the fourth corner. And as a side note, as I said, I used quasi-Borel spaces as the foundation for this work. That's not mandatory, but I recommend trying them out, because otherwise you really have to carry a lot of measurability requirements along on the side, whereas here you only have to check them in very specific places, and it was very rewarding. So I recommend you use that, or at least try using that kind of theory. Thank you for your time.