Good morning, everybody. I want to begin by thanking PyGotham and the United Nations for this beautiful venue, and you all for coming to the talk. I'm especially excited to be speaking at the United Nations. I think my parents are going to be so excited to hear this, because I thought this would happen after I became president of the United States or something a bit more impressive than software engineering, so this is a super leap forward.

Let's go ahead and get started on our topic. I'm going to be talking about probabilistic graphical models in Python, giving you a quick crash course: just the very basic concepts, and the code you would use to build a simple model. Really, though, I should be calling this a Bayesian networks crash course, because I'm not going to be talking about other kinds of probabilistic graphical models. I want to throw that out there at the beginning: there are other ways of thinking about problems graphically in a probabilistic sense. I'm going to be focusing on directed, causal sorts of models.

So first off, what are they, and why should you use them? Why am I at this talk, or why am I giving this talk? Well, firstly, what they are. When I'm talking about a Bayesian network, and thinking about probabilistic graphical models, I'm thinking about solving problems like this. If you've watched TV over the last few weeks, you've noticed there have been a lot of Olympic trials. So I thought, let's think about how Olympic trials work. Well, it ends with an offer to join the team. We're not even going to get into the middle, but it ends with an offer to join the team. What is that based on, almost always, and kind of shockingly to some extent? Well, it's almost always based just on the Olympic trials. You can be amazing, awesome, have a fantastic career record, but most of the time, in most sports, if you tank at the trials, you're done. It doesn't actually matter that you were awesome at everything else. If you mess up the trials, it's kind of over for you.

And let's think: how do you get to the trials? You probably have good genetics, right? Let's be realistic. Not everyone is going to become Michael Phelps. I don't think I ever had a shot. But most of the time, it's clearly not enough to have good genetics. You also have to practice, right? So this is an example where we know intuitively how part of this works, and we know formally how part of this works. The intuitive bit is that I know it has something to do with genetics and something to do with practice. And formally, I know that, at least officially, many of the Olympic teams say: look, we're only looking at the trials. So this is the model we're going to be looking at.

What do we do then? I just drew out a graph. What am I going to do with that? Firstly, you're going to use what you know about direct relationships between variables to get information about more complicated relationships. Think about how you might write this out as a probability distribution, especially a joint probability distribution of all four variables I've just been talking about: genetics, practice, Olympic trial outcome, and whether you get an offer. That's pretty complicated. What I've done is substitute that one really complicated function with, if we count the arrows, one, two, three much simpler, much more intuitive relationships.
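To make that concrete: the graph says the joint distribution factors into one term per node, each conditioned only on that node's parents. With names for the four variables (my notation, but this is the standard Bayesian network factorization):

P(Genetics, Practice, Trials, Offer) = P(Genetics) x P(Practice) x P(Trials | Genetics, Practice) x P(Offer | Trials)

Two small unconditional distributions and two small conditional ones, instead of one giant table over every combination of all four variables.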
So that's what they are, what they're doing, and why you want to use them: they make things more intuitive and more explanatory, right? If I wrote this out as a joint distribution of all the variables, in some cases it would actually be really difficult to get a sense of how they relate to one another. You can also store your data slash probabilities in a form where it's easy to input evidence and get out updated information. That's the other thing we're going to be doing with these today. For example, once I know the overall probability distributions of these variables and how they relate to one another, I might get different kinds of evidence I want to use. I might want to go forward: I might say, I have this athlete, she practices a lot, but I have a feeling her genes are not so good; what are her expectations, knowing nothing else? I also might want to backtrack: I might say, well, I know this athlete did awesome at the trials, so what is my best guess about her genetic status? So that's what they are, in a high-level view.

What they're not is equally important, right? Because there's all sorts of graphical stuff going on, but this is not your intro CS algorithms, graph-traversal class all over again. Those ideas are certainly useful, and they're certainly applied, but the most important word here is probabilistic. We have nodes that are probabilistically connected; they're not deterministically connected. For the same reason, it's not the same as a network, right? If you want to model a Facebook network where you have pairs of friends who are connected, again, that's a related idea, but not the same idea. The reason is we're looking at causal analysis and probabilistic ways of describing relationships. The key words are probabilistic graphical models. And finally, think Markov, because the other idea is that your status is explained by your immediate neighbors. To go back to my graph example: if I know the Olympic trials, that's all I need to know to assess your probability of getting an offer to join the team. That tells me everything I need to know about your genetics and your practice. That's the Markovian bit.

So, some common applications, also to get you motivated before we take a look. Medical diagnostics: physicians and people who work in the medical field love these, because they model how a physician thinks, and is even formally taught to think. As a physician, you're taught to think: I have this category, I have this category, these are my inputs, what are my probabilities for an output? But there are other applications too that might feel more familiar. Image processing, for example, labeling pixels: if I have one pixel and I know it belongs to a person, and I have another pixel and I don't know what it is, but the person is right next to it, that's certainly going to have some causal input into my assessment of what the unknown pixel is. And natural language processing: one way you can approach this is also with probabilistic graphical models. Say, given that I was preceded by a verb, and preceded two words ago by a noun, what am I likely to be as a part of speech? So that's another common application.

Okay, so we're also going to quickly review some basic concepts in probability, to get you thinking about how these relate to probabilistic graphical models. We'll start with the super basic.
What is the probability of event A happening? Just to make sure we're all on the same page: the basic idea is you want to count up all the As and divide that by the total number of possibilities. We all intuitively understand that, and most of us probably work with this pretty often. So, just to remind you, the probability of A would be: count up the number of As, then divide by all the possibilities we have.

Now, the real world is not so discrete. Multiple things of interest can happen simultaneously, and that's usually what we're more interested in, certainly when we're modeling things graphically. So what's the probability of A and B, right? Situations where they can overlap. What needs to happen for A and B to happen? One way to think about it is: A needs to happen, and then, with A already having happened, I would need B to happen at the same time. And I don't mean to imply an order, because you can describe this swapping A and B: I would need B to happen first, and then A to happen while B had already happened. We can write this as the probability of A times, and this second term for those not familiar is, the probability of B given that A already happened. We can also write this, as I was explaining, because ordering is not important, as the probability of B times the probability of A given B. It's entirely symmetric.

So what we have discovered is a rule for finding out what is ordinarily more interesting, which is: we usually want to know the probability of one event happening given that another event already happened. That's usually what we're interested in in the real world, right? What's the probability that John is going to make the Olympic team, given that I already know he put the practice in and got the invitation to the Olympic trials? So we said the probability of A and B is the probability of A times the probability of B given A. You can also reverse this: we can isolate either the probability of A given B or the probability of B given A, and compute it from simpler probabilities. It will always depend on what we're interested in, what we know, and what's easy to compute. Most of you have already seen this, but for those who haven't, this is a very important theorem in statistics, Bayes' theorem: the probability of A given B can be expressed via the probability of B given A and the probabilities of A and B separately. And again, you can write it in both directions.

And a final bit of statistics before we take a look at the models. I want to remind you, slash, teach you if you're not aware, that there are different kinds of distributions. A distribution describes where the probability of an event happening sits, right? When I roll the die, is it going to be one, two, three, four? And these can be either discrete or continuous. There are different kinds once you start thinking about many variables. One is called the joint probability distribution. This is when you are fully describing how multiple variables, putatively related to one another, will all happen together. So this would be something like: what is the probability of A, good genetics, and B, bad practice, and C, mediocre performance at the Olympic trials, and D, getting an offer? You'd have to specify everything to get an actual probability out of it. On the other hand, you might want something simpler, like a conditional probability distribution. That looks at how the probabilities of A are distributed given a certain value, say, for B.
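Written out, since the slide formulas don't survive in a transcript, the two identities are

P(A and B) = P(A) · P(B | A) = P(B) · P(A | B)

and dividing the right-hand equality by P(B) gives Bayes' theorem:

P(A | B) = P(B | A) · P(A) / P(B)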
So this would be something like what I talked about earlier: what's the probability of getting a spot on the Olympic team given a mediocre performance at the trials? And then, finally, the last idea is a marginal probability distribution. This is when you start with your joint probability distribution, but in one way or another you integrate out one of your variables. So this is like: I might ask you, well, what's just the probability of getting a spot on the Olympic team? In the model I've presented to you, I haven't actually indicated that, right? I told you, given what happens upstream, here's the probability. But I could also ask just: what's the probability? Average over all those other variables and give me one probability, taking everything into account and not having any information.

[Audience question] Yes, and you can do the same thing with discrete variables, right? It's either an integration or a summation. [Audience follow-up] Yes, correct. And actually, I should have written a summation to be clearer, because we are dealing with categorical variables today, but everything I say, including the graphical models, can be applied to continuous distributions.

Okay. So where are the graphical models already? Let's take a look at a Bayes network. Here's a Bayes network. It's a structure that can be represented as a directed acyclic graph. The advantages are twofold. You get a compact representation of the joint distribution. This is especially useful for categorical variables, where I might otherwise specify my joint distribution by writing out every combination of the four variables I've been talking about. As you can imagine, that gets really long and doesn't actually give you an intuitive picture of anything; it's just a table of numbers. And I can also observe conditional independence relationships between those vertices, the random variables.

So here's an example of a Bayes network that hasn't really been specified yet. I've only given you the structure. What else do I need to tell you? I have to tell you how the variables relate to each other, right, the distributions. For each of these nodes, I either need to give a distribution depending on its parent input, or, if it doesn't have a parent input, I need to give you an unconditional distribution. So let's start with our top inputs, the things we're going to say just come out of the sky. They just happen. Probably not true, right? I could keep adding layers further and further back onto this graph, and that's what you can do in real applications. So let's say your probability of good genes is about 20% and bad genes is about 80%; the probability that you practiced is about 70%, and that you didn't is about 30%.

Okay, now we're going to specify a conditional probability distribution, this time for a node that has parents. Now I need to take into account the parents of Olympic trials to think about what its distribution looks like. In particular, I need to account for four cases, right? Because there are two cases for the genetics input, and then, independently of that, two cases for practice. So it's two times two: four cases I need to specify. And Olympic trials I've given three outcomes. So for each of the four inputs I need to specify those three outcomes. And this is what I've said: we've got bad performance, average performance, amazing performance, and then we look at the probabilities of each of those for each case of inputs.
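Laid out as a table, the conditional probability distribution for the trials has one column per combination of parent values, and every column sums to one. The specific numbers below are a reconstruction rather than a transcription, chosen to be consistent with the figures quoted later in the talk (the roughly 11% overall offer rate, 9.8% versus 15.8% for bad versus good genes, and the 33% posterior on good genes after an amazing trial):

                        good genes,   good genes,   bad genes,   bad genes,
                        practiced     no practice   practiced    no practice
  bad performance         0.50          0.80          0.80         0.90
  average performance     0.30          0.15          0.10         0.08
  amazing performance     0.20          0.05          0.10         0.02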
And now you see where the dependencies are beginning to build up, because for each of these inputs there's also a separate probability of that input even getting there, and then once that input does get there, I have a probability of a given output. But none of it is deterministic, right? At least as I have specified it, nothing is zero or one, so nothing is guaranteed. And then, finally, the outcome of the Olympic trials is the input to my offer node, and I get another conditional probability distribution. And here we see it. Now I have two options: you either get an offer or you don't get an offer to join the team. And like I was saying earlier, this is Markovian, in the sense that the only thing I need to know to know the probability that you'll get an offer is how you did at the Olympic trials; I'm not looking further back in your history. So we see, based on the three possible outcomes from the Olympic trials, your respective probabilities of getting no offer or getting an offer. That's what we would input into our model.

Let's also try to build a bit of intuition about how things work before we look at the code. So: does an offer depend on genetics? Raise your hand if you think yes. Not when I know the outcome of the Olympic trials. But just in general, does having good genes predict, to some extent, getting an offer? Absolutely, right? Yes. Does an offer depend on genetics if you know how much someone practiced? If I know somebody practiced a lot, am I still interested in knowing their genetics, or is that fully explained? No, it's not fully explained, right? These are independent inputs, as I have specified them; not necessarily true in reality, but as I have specified them. So both of them are informative about an offer. What about this final question: does an offer depend on genetics if you know the Olympic trials performance? No, right? As I've been emphasizing, this is Markovian, so once I have that more recent data point, it wipes out whatever information was provided by the earlier data point.

Okay, so let's actually see how it works. Each node in your Bayesian network is going to have a CPD associated with it, a conditional probability distribution. If a node has parents, we use a CPD conditioned on the parents, and if it doesn't, we use that unconditional distribution. We're going to use a package called pgmpy. You first define your network structure, then you add your network parameters, and then you get all the info that would be a pain in the butt to work out by hand, very simply.

So I'm going to switch over to my handy Python notebook. Raise your hand if this text is not big enough. Okay, let me try to make it a little bit bigger. How's that? We're okay, okay. First, you see I'm going to import my Bayesian model and my tabular conditional probability distribution classes from pgmpy. Here's how I set up my structure: I just tell it which nodes are connected, and this is actually directional, remember, Bayesian networks are directed. So I go from genetics to Olympic trials, I go from practice to Olympic trials, and so on. And see how intuitive this is: I can call the nodes whatever I want, and it's really easy to input as many vertices as I want. You can do all sorts of things, like add unconnected edges; it won't break. Okay, so I have put in my Olympic model. Now I'm going to set up the relationships, all of the conditional probability distributions. Take a look at this first one and see what you notice.
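A sketch of those first cells, assuming pgmpy's discrete API; the model class was called BayesianModel around the time of this talk and BayesianNetwork in newer releases, and the node names here are my own:

```python
from pgmpy.models import BayesianModel          # BayesianNetwork in newer pgmpy
from pgmpy.factors.discrete import TabularCPD

# Edges are directed, parent -> child, matching the arrows in the graph.
olympic_model = BayesianModel([('Genetics', 'OlympicTrials'),
                               ('Practice', 'OlympicTrials'),
                               ('OlympicTrials', 'Offer')])

# The first CPD: Genetics is a root node, so there is no evidence argument,
# just an unconditional distribution over its states (0 = good, 1 = bad).
genetics_cpd = TabularCPD(variable='Genetics', variable_card=2,
                          values=[[0.2], [0.8]])
```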
What you'll see is I am handily specifying the variable name. The variable card is the number of possible values the variable can take; you can see this is intrinsically meant for the discrete case, and you would have to specify things differently for the continuous case. And then I give it the values. I'm going to do the same thing for practice, because these two just have those unconditional probability distributions I specified; you can think of them like priors, if you're more of a Bayesian in general. Then I get to my offer. I thought that was the next easiest to specify; you can specify them in any order you want. Again, I've got my variable and my number of outcomes, but now it gets a bit more complicated, because now I have what's called evidence, which is another way of saying a parent. This is saying something like: what would be immediately prior in the chain that would give me information? In this case, it's those Olympic trials. The evidence card is like the variable card: it's the number of values the evidence can take. What could I possibly expect? For the Olympic trials, we had three possible outcomes. And then what we're writing here is our conditional probability distribution, matching up the inputs to the outputs via their probabilities. Again, it's never deterministic. And you can see why I saved the Olympic trials conditional probability distribution for last: it's quite a bit more complicated, because now I have two input parents, right? Each of those has two possibilities, and then I have three possibilities for my Olympic trials, which gives me these 12 numbers. For each set of inputs I have three outputs, and I have four sets of inputs, hence 12 numbers to describe that distribution.

So once I've set up those relationships, and let me actually make sure I've run them, I'm going to add them to my model. What you're seeing about the way pgmpy specifies models is that everything is highly modular, so you can add your CPDs and take them away very easily to try out different possibilities, which I find really nice and intuitive. Okay, that's because I ran it twice, apologies. So, Olympic model: let's first get those conditional probability distributions, and there you can see exactly what I put in. It's not going to cough all of that input back at me, but it has it.

Now I'm going to do a couple of common tasks that come up with probabilistic graphical models. I'm going to look for what are called active trail nodes. Another way to think about that is as a path of influence: what can give you information about something else? So if I want the active trail nodes of genetics, it's going to be genetics, offer, and Olympic trials, and let me just take us back to the graph for a second so we can see that that makes sense: genetics, Olympic trials, and offer give information about each other. Practice shouldn't be on that list, because practice doesn't tell me anything about genetics. Similarly, if I look for the active trail nodes for Olympic trials, now I'm going to get everything, because actually everything is informative about performance at the Olympic trials. I can also find local independencies. I'm too lazy, I don't want to read my graph; or, more realistically, a real graph is not going to be as nice as mine. It's going to be pretty messy, so it's not always going to be easy to spot those independencies. What if I want the independencies for genetics? Boom, there it is: genetics is independent of practice.
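Continuing in the same hedged spirit, the remaining cells might look like this; the Olympic trials numbers are the reconstruction from the table above, and the offer probabilities are likewise chosen to reproduce the figures quoted below:

```python
# The other root node, also unconditional: 0 = practiced, 1 = didn't.
practice_cpd = TabularCPD(variable='Practice', variable_card=2,
                          values=[[0.7], [0.3]])

# Offer has one parent with three states, so values is 2 rows x 3 columns,
# one column per trials outcome (0 = bad, 1 = average, 2 = amazing).
offer_cpd = TabularCPD(variable='Offer', variable_card=2,
                       values=[[0.95, 0.80, 0.50],   # row 0: no offer
                               [0.05, 0.20, 0.50]],  # row 1: offer
                       evidence=['OlympicTrials'],
                       evidence_card=[3])

# Trials has two binary parents: 2 x 2 = 4 columns and 3 rows, hence the
# 12 numbers. Columns run (good, practiced), (good, didn't),
# (bad, practiced), (bad, didn't).
olympic_trials_cpd = TabularCPD(variable='OlympicTrials', variable_card=3,
                                values=[[0.50, 0.80, 0.80, 0.90],
                                        [0.30, 0.15, 0.10, 0.08],
                                        [0.20, 0.05, 0.10, 0.02]],
                                evidence=['Genetics', 'Practice'],
                                evidence_card=[2, 2])

# CPDs attach to (and detach from) the model freely; that's the modularity
# mentioned above.
olympic_model.add_cpds(genetics_cpd, practice_cpd,
                       offer_cpd, olympic_trials_cpd)
olympic_model.check_model()   # columns sum to one, CPDs match the graph

olympic_model.get_cpds()                         # the tables we just put in
olympic_model.active_trail_nodes('Genetics')     # paths of influence
olympic_model.local_independencies('Genetics')   # (Genetics _|_ Practice)
```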
It's not actually independent of anything else here. In a real-world application, there could be hundreds of variables I don't know about that are independent. And again, for Olympic trials, nothing is independent: everything is telling me something about that variable. If I just want all the independencies, I can do that too, with get independencies, super straightforward. And here you see it now starts telling me about conditional independencies. So genetics is independent of practice altogether, but then we have information like: genetics is independent of an offer when you know the Olympic trials. So you're also getting conditional independencies out of this library.

You can also do super useful inference. Again, all of this you can do by hand, right? Even after one year of statistics class, you can sit down and spend hours working out all these conditional probability distributions. The beauty here is it's all symbolic, and it happens in one line of code rather than two pages of algebra. That's the whole point. So if I want to do variable elimination, I can do that too. Thank you. Let's say I want to know: what does the distribution of an offer look like? That's something I actually didn't specify, but it's probably interesting, right? Because you might say, well, how many people get an offer anyway? What's the likelihood you'll get one? I actually don't know from the information I put in, but I can get that information out. You'd have to backtrack to know the encoding, but offer equals zero is no offer, and offer equals one is getting an offer: it's about 11% in my model. This information was intrinsically in the relationships I specified, but I actually didn't know it when I made up my model. I just found it out, and I didn't have to do all that nasty algebra.

We can also get conditional probabilities that take into account what we already know. Here's an example of that; actually, I need to go down here. Let's say I want to know: what's the probability of getting an Olympic offer when I have bad genes? Well, I'm going to look at my variable again, offer, but now I'm also going to supply evidence. Bad genes was actually the second value, so it's a one, not a zero, here. So what's the probability of that offer if I have bad genes? Well, it goes down, but it's actually not that bad, right? It seems like maybe practice can help me overcome it, or some other variable I'm not looking at. It's about 9.8%. What's the probability if I have good genes? Again, I put in my variable, offer, and my evidence; now my genetics has the value of zero, because that was the good value. And if I look at this, well, now it's 15.8%, so it's quite a bit better. So, at least within the limits of my model, I have found out that you nearly double your chances of getting to the Olympics, all else being equal, if you have those good genes.

I can also put in more than one piece of evidence, right? And again, all of this you can do by hand, but it will sometimes be pages and pages of algebra. So now let's say you did practice and you have good genes. And wow, it goes up even more, but actually, in my model, not as much as just the good genes: good genes versus bad genes gets you from 9.8% to 15.8%, and adding that practice in only gets you a little bit more. (The calls for all of these queries, and the ones coming next, are sketched below.)
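The inference cells might look roughly like this; query signatures have shifted a little across pgmpy versions, so treat it as a sketch against the current API rather than a transcription of the notebook:

```python
from pgmpy.inference import VariableElimination

olympic_model.get_independencies()   # every (conditional) independence at once

inference = VariableElimination(olympic_model)

# Marginal distribution of Offer, with the upstream variables summed out:
# about 11% of athletes in this model get an offer.
print(inference.query(variables=['Offer']))

# Forward inference, conditioning on evidence at a parent:
print(inference.query(variables=['Offer'], evidence={'Genetics': 1}))  # bad genes, ~9.8%
print(inference.query(variables=['Offer'], evidence={'Genetics': 0}))  # good genes, ~15.8%
print(inference.query(variables=['Offer'],
                      evidence={'Genetics': 0, 'Practice': 0}))        # both, a bit higher

# Backward inference, reasoning upstream from an observed child: given an
# amazing trials performance, the chance of good genes rises to about 33%.
print(inference.query(variables=['Genetics'], evidence={'OlympicTrials': 2}))

# Most probable single state for a variable, everything else marginalized out.
print(inference.map_query(variables=['Genetics']))       # bad genes
print(inference.map_query(variables=['Offer']))          # no offer
print(inference.map_query(variables=['OlympicTrials']))  # bad performance
```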
And this might explain why some of the more famous Olympic athletes are not actually so well known for being dedicated to their practice, right? So I would argue my model is not entirely unrealistic.

You can also go upstream logically, not just downstream, and we talked about this a bit: if I know you got an offer, I can infer something about the probability distribution of your genetics. That's what I want to do here. I'm going to say: well, if you did amazing at your Olympic trials, that's a two, what does your genetic distribution probably look like? And I see that you've actually got a 33% chance of having really nice genes, rather than just the 20% chance I had assigned as my prior. Some variables are only informative about others given a third variable. We talked about that a bit with genetics and practice: my genetics by themselves tell me nothing about practice. But if I stick in information about the Olympic trials, that information about practice does actually become useful. If I know how you practiced and how you did at the trials, I can get more information about what your genetics look like. And here's one example of that.

We can also find out the most probable state for a variable, period. This is more like that marginal distribution: just tell me, what's the most likely genetic status? It's one, which is bad. What's the most likely offer? It's zero, no offer; that's the most likely, integrating over all of my cases. What's my most likely performance at the Olympic trials? Pretty bad.

Okay, so I will conclude with that. I will just say there are other graphical models out there, and these libraries are mostly under very active development, so I would encourage you both to use them to explore causal theories and also to get involved, should you be interested. Thank you.