Data science as a term is still getting standardized by the day. What we are going to present is probabilistic machine learning, not deep learning. Can deep learning work with this? Yes, it can, but is it going to be in the same paradigm as that of a neural network? Probably not. So let's see what probabilistic graphical models are. Before that, a short introduction on why we are doing this: at Mysore Consulting Group we are a group of folks who enjoy machine learning; we have been doing this for fun and also helping clients in these areas. Our team is here, and to give you an idea of some of the work done so far, I will mention a few landmarks related to this workshop. You might have heard of LDA, latent Dirichlet allocation, in text or natural language processing. Rhea is a contributor to that, to the original LDA from David Blei's lab; she has added methods to it, and we have been using that to automate some of the existing recognition tasks within natural language processing. I myself have been doing this for fun, and for my career as well, for a long time; it is getting better and better with newer algorithms being added, and we see lots of use cases. So now let's get into probabilistic graphical models. Let's take a scenario and go from the scenario to explain the importance of graphical models, and later Rhea will take over with hands-on exercises. We are giving away everything for free: it will be online, it is open source, you can access the notebooks, feel free to copy and share them, just keep the logo in place. The diagrams are also generated using LaTeX, and we have Praveen who has done that as well, using visualization tools that we have in house.
So let's take a scenario. The head of a top restaurant wants to solve a problem. Let's say a new dish has been introduced, maybe a new dosa at Arthibhavan if you are from Bangalore, or a new dish introduced at some nice restaurant, and people love it. This has changed the number of people who come into the restaurant. How would you think about this problem? The idea is that you want to find out why. Are people coming in because of the new dish that was introduced in the restaurant? Or is it because there were no other restaurants in the vicinity, or perhaps the road traffic that day was so clear that people felt they could go to this restaurant? Basically, many, many factors influence a decision, so let's see what kind of problem we are trying to solve here. As I said, we want to make decisions, and specifically to find out how much the dish itself is causing the traffic into the restaurant, the reason people are happy about it. So how would you try to solve it? The head of the restaurant would call in a machine learning engineer or a data scientist and say: I have this problem, I want to find out why, and not just get predictions. Of course everybody wants predictions, but along with them we want to know why.
So what data would a data scientist demand? The amount of traffic, historical data sets, anything that you think would be useful in modeling, and probably along with that, things you don't know about but have the data for, which you would let your model discover. This can include the type of traffic and city events such as games that could be happening. Now, with all these data sets, can you model how busy it would be? One answer: apply a deep neural net with all features and a target vector, train it, increase the number of layers until you get the highest prediction, make sure you don't overfit, and you're done. Great, you've successfully predicted. Now what happens? Let's look at this: here is the deep neural net, with all the inputs we think will be useful in getting the predictions. Now, with a deep neural net, what happens if there is little data for training? What if we don't have enough information, can you still train your deep net? And if it is busy today, how likely is that to be because of the game, or due to the new dish? It may not be due to the new dish; perhaps because a game was happening nearby, people were hungry and this was the decent restaurant close at hand. How would you hike the pay of the new chef? He or she is going to demand more: "it's because of my dish everybody's coming here, I want more pay." And what about the weather? Suppose it's raining, it's the monsoon season, not everybody wants to go out. And what about pedestrians, people who like to walk? These are real business problems. You need a large number of data sets, it's never enough, and the models are not always interpretable. There are studies saying interpretation can be done at the level of neurons,
where you can think of hyperplanes and sigmoid activation functions; there is work being done there, but I don't think we are there yet. And it doesn't account for future unseen parameters: suppose a new parameter or a new data set comes along the way, you have to retrain your net. And many firms don't pay for models they don't understand. If you go to a business and say, "I'm getting 90% or 95% accuracy, but I cannot tell you why, can you pay?", I don't think businesses want to invest in that. Of course, what I'm saying is not entirely true; they are investing in areas where you see accuracies, but being able to give reasons would make a stronger case. And Geoff Hinton himself, who contributed to backpropagation back in 1986, has said, I think it's in the MIT Technology Review, I don't know how many of you read it, that the methods we are all working with are three decades old, that they should be dispensed with, and that we need to find newer ways to solve these problems, a new path to AI. Now, instead of the kind of neurons we had, say one for the new dish, one for an accident, let's move from that scenario to something like this. We know that an accident and a game affect the traffic; we know this happens. Instead of having everything in a single layer as we saw earlier, what if we had a dependency structure, where the accident and the game affect the traffic, and there's a new dish whose effect we don't yet know, it may be independent; or suppose we know that it being sunny or rainy can affect the pedestrian traffic. We know these things. So how about something different, a hierarchy that matches reality: the accident and the game affecting traffic, sunny and rainy conditions affecting pedestrians, and the new dish directly
affects the bookings, and all of this leads to how busy my restaurant is going to be. This leads to some mathematical questions. Suppose you toss a coin and you see heads: what would be your prediction for the next toss? Any answers? No, it's not 50%: you don't know whether the coin is biased or unbiased; you've only seen a single toss, and all you saw was heads. So what would be your next prediction? You have only one piece of information, a single sample, and it showed heads, so your prediction would be heads again, at 100%, and you don't even know whether there is a tails on the other side. Suppose you toss again and still see heads. You might say the coin is biased, but is it actually? You can't tell, because you don't have enough samples. So your model should account for the uncertainty that comes from the number of samples you have seen, and the answer to that is Bayesian modeling. Bayesian modeling gives you not just the prediction but also an estimate of the variance, of how uncertain your prediction about the coin toss is. Suppose I toss the coin five times and see three heads out of five, or even two out of five; I might still suspect bias, but my uncertainty should reduce as the number of tosses increases. That's where Bayesian modeling is extremely useful: you have a paradigm where the uncertainty depends on the number of samples and gets smaller and smaller, and the variance itself can be modeled. And what if a sample is missing: can conventional supervised learning, a neural net or deep learning, account for that? These are questions that can be answered with probabilistic graphical models. And can it explain causality? Causality is a affecting b, b affecting c, and c affecting the result; that is called causality.
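The coin-toss argument above can be sketched with a conjugate Beta-Bernoulli update; this is a minimal illustration with invented toss counts, not code from the workshop notebooks:

```python
# A minimal sketch of Bayesian updating for a coin toss, using a
# Beta(1, 1) (uniform) prior over the heads probability. The counts
# below are illustrative, not real data.

def beta_posterior(heads, tails, a=1.0, b=1.0):
    """Return (mean, variance) of the Beta(a + heads, b + tails) posterior."""
    a, b = a + heads, b + tails
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var

# One toss, one head: the point estimate leans toward heads,
# but the variance is still large -- we are very uncertain.
m1, v1 = beta_posterior(1, 0)

# Fifty tosses, thirty heads: a similar point estimate,
# far smaller variance -- uncertainty shrinks with more samples.
m2, v2 = beta_posterior(30, 20)

print(m1, v1)
print(m2, v2)
```

Note that the one-toss posterior mean is 2/3, not 1: the prior keeps the model from claiming certainty after a single sample, which is exactly the behaviour argued for above.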
Explaining causality is a tough problem, and it is useful; you'll see more when Rhea gives a hands-on example. So the task was to predict busyness. Given that the restaurant is busy, can you tell whether the road to the restaurant had traffic jams? Can you go the other way round: I know my restaurant is very busy today, could you tell me how likely the pedestrian traffic is to be heavy? These are questions we need to answer. We need to go inside every single causal model; we should be able to get inferences on what is actually leading to the result we are seeing today, with quantified results: not qualitative but quantitative results on how much and why, so that I can hike the pay of my chef by the amount I think the person deserves. And what about dependencies in explanatory variables? A affecting b and c is fine, but what about a, b, c, and d affecting each other? Correlations have been done, but has causality been done with conventional methods? These are questions we try to answer today, and we'll talk about explainable AI. But before I get there, I'll let Rhea take over; this is the short background I've given you. We are going to go through some examples and learn what these things are, and then you can do some exercises on the notebooks: download them from GitHub using your Jio or any of the networks, and she'll go through the examples from the notebook that's out there. It's available for you, and you'll understand how to analyze and infer from a sample network. After that we'll summarize the whole talk and close the session. Does that sound good to everyone? Any questions? The team that added the content is here, so if you have any questions we are around. Good morning everyone. Let me start by introducing myself: I'm Rhea
Agarwal, and currently I'm working at an AI-based research startup, Mysore Consulting Group. We will be talking about probabilistic graphical models, and what I'm going to cover is the basic building blocks, so that you can create your own Bayesian models and ask questions of them. Let's start by defining what these models are. What exactly is a probabilistic graphical model? It is a mathematical framework combining the already existing graph theory and probability theory, so that we can take into account complex interactions happening between various random variables. The prerequisites: you need to know basic probability theory and some statistics, and machine learning is essential. Let me reiterate some very basic concepts so that we are all on the same page. What is probability? The number of favorable outcomes over the total number of possible outcomes. So if you roll a die, and I tell you it's an unbiased one, the probability of you getting a one is 1/6. Then, what are random variables? A random variable is a value which is the numerical outcome of a random phenomenon. A random variable can be either discrete or continuous. For a discrete random variable we define the probability function as a probability mass function, which you can draw as a bar chart, but for a continuous one it will be a continuous curve, and to calculate a probability we have to take the area under the curve. That's the difference between a discrete and a continuous random variable. Then come Bayesian methods and conditional probability. What is conditional probability? If you are given evidence that a certain event happened, what is the probability of another event happening? So if you are told that event y happened, what is the
probability of x happening now? It is given by the equation P(x | y) = P(x, y) / P(y): the joint probability of both events happening at the same time, divided by the marginal probability of event y alone. Then there is also the chain rule for Bayesian models: if you have a probability distribution over many random variables, it tells you how to break it into conditional probabilities. These are very important equations which I think you already know; I'm just reiterating them. Then comes a very important concept, marginalization. Suppose you have a probability distribution over two variables and you want to reduce it to a single variable: you sum over the variable you want to eliminate, taking all of its values into account. Now, there are basically two types of graphical models: the Bayesian model and the Markov model. A Bayesian model is a directed acyclic graph; there are no cycles. The nodes represent random variables, and the edges represent causality. For example, the weather outlook will affect whether you are going to play cricket, not the other way around, so there is a direction, there is causality. In Markov networks, by contrast, you don't have causality, only correlation. The nodes again represent random variables, and the graph is undirected, but it can have cycles, so we can model dependencies with cycles, which a Bayesian network cannot. Next come the two toy examples we will be dealing with in today's workshop. The first is a Bayesian network with two random variables, humidity and wind, which affect what the weather outlook will be.
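The marginalization and conditioning rules above can be checked with a small joint table; this is a sketch with invented numbers, not from the workshop notebooks:

```python
import numpy as np

# A small sketch of marginalization and conditioning on a joint
# distribution over two binary variables X and Y. The numbers are
# made up for illustration.

# joint[x, y] = P(X = x, Y = y)
joint = np.array([[0.3, 0.1],
                  [0.2, 0.4]])

p_y = joint.sum(axis=0)          # marginalize out X: P(Y)
p_x = joint.sum(axis=1)          # marginalize out Y: P(X)

# Conditioning: P(X | Y = 0) = P(X, Y = 0) / P(Y = 0)
p_x_given_y0 = joint[:, 0] / p_y[0]

print(p_y)            # [0.5 0.5]
print(p_x_given_y0)   # [0.6 0.4]
```

Summing an axis of the joint table is exactly the marginalization described above, and dividing a slice by a marginal is exactly the conditional-probability equation.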
So the weather on that particular day in turn affects whether you can go out and play that day or not. That is a very toy example. Then there is a Markov network in which we have four debaters debating with each other, but with some correlations between them: they tend to agree with some people more and to disagree with others more. We will take these two examples into account in today's workshop. (In response to questions from the audience:) I think I will cover that in later slides. Yes, there is a flow of influence, a can affect b, but the edges are not directed, there is no causality; they are related to each other, they definitely affect each other. And you are talking about hidden Markov models; I am going to cover those, they come under Bayesian models, where we talk about transitions from one state to another. Yes, these here are Markov networks; I know the names are a bit confusing, but hidden Markov models come under Bayesian networks, because one state affects the other state, a direct causality. I will come to it. So, Bayesian networks: as we know, each node represents a random variable, and each node has a CPD, a conditional probability distribution. To calculate the joint distribution we take the product over all the CPDs in the network. We already discussed this example; now we will see the maths, how to calculate various conditional probabilities and how to go about exploring this model further. These are the conditional probability distributions attached to each node. As you see, humidity and wind have no parents, so their probabilities are found directly from the data itself;
nothing is affecting them, there is no causality on them. But the weather outlook does have a cause: it depends on the humidity that day and the wind that day. So the probability of a weather outlook is conditional: given the humidity and the wind, what is the probability of it being a sunny day, a cloudy day, or a rainy day. And the weather outlook in turn affects whether you are going to play or not play that day. To calculate the joint distribution over the three variables humidity, wind, and weather outlook, we have three CPDs; we take the product of the three CPDs, and it gives us a joint distribution. The joint distribution will sum to one, and we can get some useful insights from it. Again, remember marginalization: if I want to remove the weather outlook from the joint distribution, I know there are three possible outlooks, so I sum over the probabilities of all three of them and obtain the probability distribution over humidity and wind alone. Next comes causal reasoning: how does knowing one variable affect another? Suppose we know that the probability of you playing cricket on a certain day is 0.5, a 50-50 chance. Now you get new evidence: the winds are extremely strong that day. Will it affect your probability of playing cricket? Yes. How do we give that a structure? Wind affects the weather outlook, so there is a flow of influence from wind to weather outlook to playing cricket, and because of that, the probability of playing cricket given strong winds is less than the probability of playing cricket given no evidence at all. This is what is called causal reasoning.
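As a sketch of the CPD product just described, here is a small numeric version of the humidity/wind/outlook joint; all the probability tables are invented placeholders, not the values from the slides:

```python
import numpy as np

# Building the joint from CPDs in the humidity/wind/outlook example.
# Humidity and wind are root nodes; the outlook depends on both.

p_h = np.array([0.7, 0.3])              # P(Humidity): low, high
p_w = np.array([0.6, 0.4])              # P(Wind): weak, strong

# P(Outlook | Humidity, Wind), outlook in {sunny, cloudy, rainy};
# shape (humidity, wind, outlook), each row sums to 1
p_o_given_hw = np.array([
    [[0.7, 0.2, 0.1], [0.4, 0.4, 0.2]],
    [[0.3, 0.4, 0.3], [0.1, 0.3, 0.6]],
])

# Joint P(H, W, O) = P(H) * P(W) * P(O | H, W), via broadcasting
joint = p_h[:, None, None] * p_w[None, :, None] * p_o_given_hw

# The joint sums to one, and marginalizing out the outlook
# recovers the product of the two root priors.
p_hw = joint.sum(axis=2)
print(joint.sum())
print(p_hw)
```

Marginalizing the outlook (summing `axis=2`) is the same "sum over all three outlooks" step described in the talk.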
Next comes evidential reasoning. Suppose we know the probability of humidity being high is 0.25, and then we make an observation: it is raining today. Will this affect the probability of humidity being high that day? Given the evidence that it is raining, you know intuitively that the humidity is high, but how do you get to that inference by looking at the model? Because humidity and weather are directly connected to each other, the weather outlook can directly influence our belief about the humidity. Now let's come to the interesting case: we know that it is raining, and we also know that the humidity is high. Will that affect the probability of the wind being strong? It should, right? There is a flow of influence from humidity through weather outlook to wind. But suppose you know only that the humidity is high and you do not know the weather that day: will it affect the probability of the winds being strong? No, it will not, because the flow of influence is obstructed; you have not observed the weather outlook. This is one major difference: by looking at the structures you get to know which variable will affect which, whether they will affect each other or not. These are the six basic kinds of graphs which can exist; there is no other kind of relation in a Bayesian network, and from them we know in which cases influence can flow from A to B. If A and B are directly connected, it flows (the first two cases). If they are connected through another random variable C, the influence flows. But if they are connected in a V structure, the influence will not flow; for the influence to flow there, you need to observe the C variable in the V structure. So, in the first two cases, yes, there is a direct connection and A will affect B; but if C is observed in the bottom two cases, there is an obstruction in the flow of influence, A can no longer affect B: you have blocked the trail.
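The V-structure behaviour described above can be verified numerically: two parents that are marginally independent become dependent once their common child is observed. All numbers here are invented for illustration:

```python
import numpy as np

# Two independent parents H and W with a common child O.
p_h = np.array([0.5, 0.5])
p_w = np.array([0.5, 0.5])
# An XOR-ish child CPD P(O = 1 | H, W), so the effect is visible
p_o1 = np.array([[0.1, 0.9],
                 [0.9, 0.1]])

# Joint over (H, W, O), with O in {0, 1} on the last axis
joint = np.stack([p_h[:, None] * p_w[None, :] * (1 - p_o1),
                  p_h[:, None] * p_w[None, :] * p_o1], axis=2)

# Marginally, P(H, W) factorizes: no influence flows through O.
p_hw = joint.sum(axis=2)
print(np.allclose(p_hw, np.outer(p_h, p_w)))   # True

# Conditioned on O = 1, it no longer factorizes: the trail is active.
p_hw_given_o1 = joint[:, :, 1] / joint[:, :, 1].sum()
marg_h = p_hw_given_o1.sum(axis=1)
marg_w = p_hw_given_o1.sum(axis=0)
print(np.allclose(p_hw_given_o1, np.outer(marg_h, marg_w)))  # False
```

This is exactly "activating the trail": before observing the child, knowing one parent tells you nothing about the other; after observing it, the parents compete to explain the evidence.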
But if you have observed C in the V structure, you have actually activated the trail, and in that case A can affect B. So what are active trails? Active trails are trails where influence can flow: if we have an active trail from A to B, we know influence can flow from A to B. For that to happen we need to activate all the V structures, and no other node in the trail should be observed, because observing a node obstructs the flow of influence. Active trails give rise to a very important phenomenon: independence. If two random variables A and B are independent, then the joint distribution over A and B is just the product of the two probabilities, and the probability of A given B is just the probability of A; by the chain rule we can likewise write down the probability of A and B given C when A and B are independent. So, in a Bayesian structure, if a trail is activated there is no independence, but if the trail is not activated there is independence. In particular, if you have observed the parents of a given node, then the node is independent of all its non-descendants. Naive Bayes is the most commonly used Bayesian network out there, and in a Naive Bayes model we make a very strong independence assumption: all the features are independent of each other. Using this very strong assumption, we come to the conclusion that the joint probability is given by a product in which no random variable or feature has any influence over another. Then comes the Markov assumption: the future is independent of the past if you know the present. To give it a graphical structure, it is a straight line where each random variable is connected to the next. For example, if you know x2,
then will x3 be affected by x0 and x1? No, because you have obstructed the trail. This is the independence assumption we are making, given graphical form, and hence the probability distribution is given by the chain rule combined with this independence. Next come dynamic Bayesian networks. In a dynamic Bayesian network we can define a network over a long period of time using just two slices of time, a slice t and a slice t+1. It behaves almost exactly like a Bayesian network; you just unroll it, define it over the two time slices, and infer things from it, and most of the time it follows the Markov assumption that given the present state, the past cannot affect the future. Here is the hidden Markov model, which comes under the Bayesian family. We have a finite state automaton: if you know your present state, it doesn't matter what your previous states were; the future state depends only on the present one. So if s0 is state a, then the state at s1 depends only on state a, through the CPD that governs the transition. Then comes the plate model, which helps in reusing the structure and parameters of a Bayesian network. For example, if we are rolling a die n times, we don't want to write n random variables; we just draw a plate around the variable and say all n copies are affected by one parameter. Plate models can be of two types: overlapping plates and nested plates. In the overlapping-plates example here, location is repeated m times and quality is repeated p times, so the variable inside both plates is repeated m times p times.
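The Markov assumption above can be sketched as a simple state-distribution update, where the next distribution depends only on the current one and a transition CPD; the matrix below is an invented example:

```python
import numpy as np

# T[i, j] = P(next state = j | current state = i):
# the transition CPD of a two-state Markov chain.
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])

state = np.array([1.0, 0.0])   # start in state 0 with certainty
for _ in range(3):
    state = state @ T          # one time slice forward

print(state)  # [0.804 0.196]
```

Nothing about x0 or x1 is needed to take the step from x2 to x3: multiplying the current distribution by the same transition matrix is the whole "unrolled" dynamics.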
For the nested plates, the only difference is that quality is repeated m times p times. So plates help you reuse a lot of structure and understand the model better. Now we jump to Markov networks, which are undirected graphs whose nodes represent random variables. (In response to a question:) True, that is actually structure learning; it falls under structure learning. One of the major advantages of Bayesian networks is that they help you use your prior knowledge: if you have a lot of prior knowledge about a particular task and you want to deploy it, you can use Bayesian networks very efficiently. But basically what you do is look at the different structures and pick the one that fits the data best; that is how you can learn the structure. You don't have to know it; if you do know it, you can model it directly, and there are algorithms to do the learning. So, a Markov network can be cyclic in nature, so it can represent certain dependencies which a Bayesian network cannot, and instead of CPDs, in a Markov network we have factors. Factors can be viewed as potentials; their values are not probabilities. In this particular case, suppose Amy and Bob are debating with each other: the potential that they are both for the topic, a0 and b0, is high, 0.8, while them disagreeing with each other has a much lower potential, so they are mostly on the same side whenever they debate. The same goes for Bob and Sam, who also like to agree with each other all the time, whether they are for the topic or against it. Sam and Tom, on the other hand, are mostly not in agreement with each other; they want to fight a lot. The potential values are not probabilities, they need not lie between 0 and 1, and
they each have their own scope, and to get a potential distribution over a larger set of random variables we multiply the factors together. Suppose now we take four people: Amy and Bob love to agree with each other, Bob and Sam also love to agree, Sam and Tom love to fight, and Tom and Amy again love to agree. So somewhere the cycle has to break, because two people are always in disagreement; they cannot all agree with each other all the time. How do you calculate the probabilities of Amy and Bob agreeing with each other and being for or against the topic? To calculate the unnormalized probability we take the product of all four factors, and because it is unnormalized we need a normalizing factor; to calculate it we sum over all combinations of values that these random variables can take. We will see that with the help of an example. (In response to a question:) Yes, Tom and Bob are not directly connected, but as you can see they are connected through Amy, so they are related; there is a flow of influence through Amy, exactly. So the unnormalized values are calculated by taking the product over all the factors; we sum over all of them to get the normalizing value, and we divide to get the normalized probability values. Now we marginalize Sam and Tom out of the picture, and you can see the difference between what the potentials tell you and what the probability values tell you. The potentials told you that Amy and Bob always love to be in agreement, but when you look at the
probability values, you can see that's not the case: it holds only about half the time, because we have two other people in the picture, and if you put all of them in the same room it will not happen that Amy and Bob are always in agreement. This just gives that intuition a mathematical framework. Then comes the Gibbs distribution. The only difference is that in a Gibbs distribution your factors can have more than two random variables in their scope, unlike pairwise Markov networks, where the scope of each factor is exactly two random variables; here you can have three or four or as many as you like, depending on the application at hand. Again, the unnormalized probability is calculated by taking a product, the normalizing constant by summing over all combinations of your random variables, and then you divide. Now, when you have a graph, you can actually choose different sorts of factors: you can go for a Gibbs distribution and say this graph factorizes into groups of three, or you can go for a pairwise Markov network and take groups of two. Which one is correct and which is wrong? Actually both are correct. By looking at the graph alone you cannot determine the factorization; a graph can only tell you that there is a flow of influence between these random variables, and that observing a node obstructs that flow. So if you observe x2 in this case, then x1 can no longer affect x3 and x5; that is all the information the graph gives you. Now comes the interesting part, conditional random fields. A CRF is a sort of Markov network, used when you have a large number of highly correlated features; taking all those correlations into account in a graph would be very difficult. So how do you use that information to actually predict something?
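The four-debater computation described above can be sketched by brute-force enumeration: multiply the pairwise factors, sum over all assignments to get the normalizing constant, and marginalize out Sam and Tom. The potential values here are invented, so the resulting numbers differ from the slides:

```python
import numpy as np
from itertools import product

# Pairwise factors over the cycle Amy-Bob, Bob-Sam, Sam-Tom, Tom-Amy.
# "agree" pairs put high potential on equal assignments,
# the "disagree" pair on unequal ones.
agree = np.array([[10.0, 1.0],
                  [1.0, 10.0]])
disagree = np.array([[1.0, 10.0],
                     [10.0, 1.0]])

def unnorm(a, b, s, t):
    # product of the four pairwise factors around the cycle
    return agree[a, b] * agree[b, s] * disagree[s, t] * agree[t, a]

# Partition function: sum over every joint assignment
Z = sum(unnorm(a, b, s, t) for a, b, s, t in product(range(2), repeat=4))

# Marginal P(Amy, Bob): sum out Sam and Tom, then normalize
p_ab = np.zeros((2, 2))
for a, b, s, t in product(range(2), repeat=4):
    p_ab[a, b] += unnorm(a, b, s, t) / Z

print(p_ab)
```

With these numbers, agreement between Amy and Bob is still more likely than disagreement but far less lopsided than the raw 10:1 potentials suggest; because the cycle must break somewhere, normalization pulls the agreement probability down.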
Here is what we do: we calculate the unnormalized probability in the same way, as a product over all the factors, but the difference is that to calculate the normalizing constant we do not sum over all the random variables; we sum only over the target variable. The distribution over the input, the correlated features, doesn't matter to us; we already have that information as evidence, and we don't want to model the distribution over the correlated features. We are only calculating the distribution over the target variable: how, given these features, the target is affected. We will see that with the help of a very basic example. Suppose you have just one feature, which can take the binary values x0 or x1, and you are trying to predict a target variable which can be y0 or y1, and you have calculated the unnormalized probabilities, which I call a, b, c, and d. We want the conditional distribution: what is the probability of y given x? If this were a Gibbs distribution, the normalizing factor would be a + b + c + d, but because this is a conditional random field we sum only over y0 and y1, our target variable; so we divide by a + b in the first two cases and by c + d in the next two. What we have done is not take the distribution over the input variables into account at all; we treat the evidence as given, and we are not concerned with the probability distribution that exists over the input features. (In response to questions:) Log-linear models, yes, I will come to those; and this example is just for explanation.
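The a, b, c, d example above can be written out directly; the scores are arbitrary placeholders, and the point is that a CRF normalizes over the target y for each value of x separately:

```python
import numpy as np

# Unnormalized scores over (x, y) pairs, as in the a/b/c/d example.
a, b, c, d = 2.0, 6.0, 3.0, 1.0
scores = np.array([[a, b],    # x = 0: scores for y = 0, y = 1
                   [c, d]])   # x = 1: scores for y = 0, y = 1

# A Gibbs-style normalization would divide everything by
# a + b + c + d; a CRF instead normalizes each row over y alone:
p_y_given_x = scores / scores.sum(axis=1, keepdims=True)

print(p_y_given_x)  # [[0.25 0.75], [0.75 0.25]]
```

Each row sums to one on its own: we have a conditional distribution P(y | x) for every x, and the distribution over x itself is never modeled.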
Yes, you're right, that example was just for explanation; you would want at least two features to show correlation. But the example is not wrong, because we are calculating a conditional probability, the probability of y given x; it is a toy example rather than a real-life one. A real example would be image segmentation: you have an image, you want to segment it into superpixels, and CRFs are widely used there. Say you have a cow and something in the background, and you want to label a superpixel as cow. All the pixels inside the cow are highly correlated in texture, colour, everything, so if you use a simple model, that correlation will give you a very biased probability value, because you are counting the same texture and the same feature many times over, maybe five or ten times. So instead we take the textures and colours as given and ask only: does this superpixel belong to a cow or not? (To the question from the audience: yes, that is the logistic regression comparison. In logistic regression you can eliminate a feature, but here you can't eliminate a pixel; you need a value that reflects all of the pixels, and it depends on the use case.) Next, log-linear models. In a log-linear model we define the factors differently: we take an exponential function of a weighted combination of the features, the weights being the coefficients. Log-linear models are heavily employed in the field of NLP.
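As a quick sketch of that definition (the weights and feature values below are invented for illustration), a log-linear factor is the exponential of a weighted sum of features, so its negative log, the energy, is linear in those features:

```python
import math

# Log-linear factor: phi(x) = exp(sum_i w_i * f_i(x)).
def phi(features, weights):
    return math.exp(sum(w * f for w, f in zip(weights, features)))

def energy(features, weights):
    """Negative log of the factor: a linear function of the features."""
    return -math.log(phi(features, weights))

w = [0.5, -1.2, 2.0]   # hypothetical coefficients
f = [1.0, 0.0, 1.0]    # hypothetical feature values
print(energy(f, w))    # approximately -(0.5 + 2.0) = -2.5
```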
Some state-of-the-art algorithms are actually based on log-linear models, because you can define the features in so many ways, capture complex interactions, and combine them in any number of ways. I would also like to tell you that log-linear models are inspired by the Ising model in physics, which is used for studying ferromagnetism, where you have the dipole moments of atoms. In log-linear models an energy function is used in place of the unnormalized probability: you take the negative log of the unnormalized probability, and because you are taking the negative log of an exponential function, it gives rise to a linear function. That is why they are called log-linear models. Now, after you have your Bayesian network in place, you want to make decisions based on it. How do you do that? There is something called a utility function, widely deployed in decision making, and you can define your own. As an example, suppose a student wants a job offer and a manager wants to know whether to extend the offer. The utility function might say: if the student is a poor performer and you don't give him a job, your utility is zero; if he is a poor performer and you do give him a job, your utility is negative; and if he is a great performer and you give him a job, your utility is very high. This can help you determine which action to follow. I think we will now get to the exercises; the idea of presenting some of these advanced topics, even if you don't get all of it, is to give you a breadth of what exists out there. But now let's get on to the problem-solving part.
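The job-offer decision can be sketched as an expected-utility calculation; the utility values and the belief over the student's performance below are hypothetical stand-ins for what a Bayesian network query would produce:

```python
# Hypothetical utility table U(performance, action) for the job-offer
# example: zero for not hiring, negative for hiring a poor performer,
# high for hiring a great performer.
utility = {
    ("poor", "no_offer"): 0.0,
    ("poor", "offer"): -50.0,
    ("great", "offer"): 100.0,
    ("great", "no_offer"): 0.0,
}

# Assumed belief about the candidate (e.g. from a network query).
p_performance = {"poor": 0.3, "great": 0.7}

def expected_utility(action):
    return sum(p * utility[(perf, action)]
               for perf, p in p_performance.items())

best = max(["offer", "no_offer"], key=expected_utility)
print(best, expected_utility("offer"))  # offer 55.0
```

Picking the action with the highest expected utility is the standard decision rule on top of a Bayesian network.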
Go ahead and download these notebooks from GitHub and start experimenting with them; we'll go through these examples and show you how they work in practice. If you go to mcg-ai/notebooks on GitHub you should be able to download them, and we can help you with that. Here it is: github.com/mcg-ai/notebooks. Is this visible to everyone, or should I increase the font? Is this better? You don't technically have to run this; the notebooks are available, so you can watch while we run through the examples and get an idea of how to work with a Bayesian network, and if you face problems we have a team that can help you fix them. pgmpy is an open-source library that lets you enter conditional probability distributions and then get inferences from them. So what we are going to do is import a Bayesian model. One thing you have to do is specify the nodes, the random variables, in the order of influence: for example, weather outlook is a child node of humidity and wind, and so is playing cricket. How do we build this model we saw in the example? The idea is that you define a Bayesian model and instantiate it by providing the edges from the parent nodes: humidity H affects weather outlook WO, wind W also affects WO, and WO in turn affects playing cricket PC. Is this clear to everyone? We instantiate the model and then import the tabular CPDs: from pgmpy.factors.discrete we import TabularCPD and create a humidity CPD, where we call the variable H and the variable_card is two, which means there are two possible humidity states, at 75% and 25%, and they must add up to one. Similarly we create a wind CPD with values 0.4 and 0.6; you could define 0.4 as the chance that it is going to be breezy and 60% as the chance that it is not. That is the idea of a probability distribution. Likewise with humidity: think of it as less humid 75% of the time and more humid 25% of the time; the probabilities sum to one. Now let's look at the evidence. What is evidence here? We are saying that the weather outlook depends on the evidence of humidity and wind, so you define the WO conditional probability distribution as dependent on those two, which means you are looking at a matrix of all the different possibilities: what happens if the weather is less breezy and the humidity is high, and so on. We need this table to map out all the combinations, and that is what you see here: 0.8, 0.6, 0.3, 0.1, which tells you, for the possibilities of humidity combined with the wind states, the corresponding probabilities of the weather outlook. That is why it is called a conditional probability distribution: it is conditioned on the states of the parents. Then we define another CPD for playing cricket: for each value of weather outlook, what is the conditional distribution over playing cricket? Finally we add these CPDs to the model, and they need not necessarily be added in order.
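The tables being described can be sketched in plain Python (the weather-outlook and playing-cricket numbers beyond 0.75/0.25 and 0.4/0.6 are hypothetical, since the full tables are not shown in the talk); the one invariant every CPD must satisfy is that each column, i.e. each parent configuration, sums to one:

```python
# Hypothetical CPD tables for the weather/cricket example.
humidity = {(): [0.75, 0.25]}                   # P(H): low, high
wind = {(): [0.4, 0.6]}                         # P(W): breezy, calm
weather_outlook = {                             # P(WO | H, W): good, bad
    ("low", "breezy"):  [0.8, 0.2],
    ("low", "calm"):    [0.6, 0.4],
    ("high", "breezy"): [0.3, 0.7],
    ("high", "calm"):   [0.1, 0.9],
}

def check_cpd(table):
    """Every conditional distribution (column) must sum to one."""
    return all(abs(sum(col) - 1.0) < 1e-9 for col in table.values())

assert all(check_cpd(t) for t in (humidity, wind, weather_outlook))
```

This is essentially the consistency check that pgmpy's model validation performs for you.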
If you have entered all your conditional probability distributions in the right order with their evidences, then you should be able to get all the CPDs back, and you can also validate the model and make sure it is correct. In short, we print out the CPDs using get_cpds, which returns them in an iterable manner, so you can look at the entire set of CPDs you entered. As you can see, humidity low or high and wind in its different states map to the conditional probability distribution of the weather outlook. Is this clear to everyone? The same goes for playing cricket: a good outlook doesn't mean you will definitely go and play; there are dependencies, and there could be reasons we choose not to, so it is a probabilistic mapping. You are looking at the different states of weather outlook, and whether you play cricket depends on that outlook. Then you can compute probabilities given evidence. Variable elimination is an algorithm for exactly this: you compute the probabilities of the nodes you care about after specifying evidence. So you can run a query: query the variable PC, playing cricket, given the evidence that humidity is low. The nice thing here is that you don't always have to have every single distribution or data point available in order to query your model. Now that we have built the model with its conditional probability distributions, we need to get inferences from it, to harness it. Say we only have evidence that the humidity is low, so we assign it state zero, and then we find the different possibilities for playing cricket. You can see that, with no other evidence about the wind or anything else, given only the humidity, the probability of playing cricket changes: 0.36 versus 0.63 is the chance that you would play. These are examples of gathering information from your model. What happens if there is no evidence at all? In a normal state, with no additional information, what is the probability of playing cricket? You simply don't specify any evidence, and using the chain rule the model computes these probabilities; naturally they change, because there is a flow of influence, and specifying or not specifying a value changes your probability. You can also do something called evidential reasoning, which is bottom-up: what is the probability of the day being windy given that we play cricket? This is where you really harness the model. Think of how powerful this can be: suppose you have a large model, you know the end result, and you want to find out what affected it. You go backwards through your Bayesian model: you know that you actually played cricket, and you want the probability that the day was windy, and you can query the model for that. These computations are all based on very simple Bayesian principles and the chain rule, but they are very powerful. So we can say, for example, that it was probably less breezy, because we did actually play cricket.
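A from-scratch sketch of both query directions, causal (top-down) and evidential (bottom-up), looks like this; it uses brute-force enumeration rather than pgmpy's variable elimination, and the CPD numbers are hypothetical since the talk's full tables are not shown:

```python
import itertools

# A sketch of the weather/cricket network with hypothetical numbers.
states = {
    "H": ["low", "high"],        # humidity
    "W": ["breezy", "calm"],     # wind
    "WO": ["good", "bad"],       # weather outlook
    "PC": ["yes", "no"],         # playing cricket
}
parents = {"H": [], "W": [], "WO": ["H", "W"], "PC": ["WO"]}
cpd = {
    "H": {(): {"low": 0.75, "high": 0.25}},
    "W": {(): {"breezy": 0.4, "calm": 0.6}},
    "WO": {("low", "breezy"): {"good": 0.8, "bad": 0.2},
           ("low", "calm"):   {"good": 0.6, "bad": 0.4},
           ("high", "breezy"): {"good": 0.3, "bad": 0.7},
           ("high", "calm"):   {"good": 0.1, "bad": 0.9}},
    "PC": {("good",): {"yes": 0.9, "no": 0.1},
           ("bad",):  {"yes": 0.2, "no": 0.8}},
}

def joint(assignment):
    """Chain rule: P(h, w, wo, pc) = P(h) P(w) P(wo|h,w) P(pc|wo)."""
    p = 1.0
    for var in states:
        key = tuple(assignment[pa] for pa in parents[var])
        p *= cpd[var][key][assignment[var]]
    return p

def query(target, evidence):
    """Inference by enumeration: P(target | evidence)."""
    dist = {s: 0.0 for s in states[target]}
    free = [v for v in states if v != target and v not in evidence]
    for combo in itertools.product(*(states[v] for v in free)):
        a = dict(evidence, **dict(zip(free, combo)))
        for s in states[target]:
            a[target] = s
            dist[s] += joint(a)
    z = sum(dist.values())
    return {s: p / z for s, p in dist.items()}

print(query("PC", {"H": "low"}))   # causal: P(PC=yes | H=low) is about 0.676
print(query("W", {"PC": "yes"}))   # evidential: P(W | we played cricket)
```

With real pgmpy the same two queries would go through its VariableElimination inference class; the point here is that both directions come from the same joint distribution via the chain rule.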
And what is the probability of low humidity given that it was a sunny day? That is another form of inference. You can see that all these different queries are possible, and these are exactly the questions you would have when you run a model but were afraid to ask. That is the power of Bayesian networks. Then there is conditional independence, which we talked about: conditional independence is a nice criterion. A lot of you use Naive Bayes, which, as we were saying, is a graphical model built on conditional independence: once you know the class, your features are independent of each other; if you don't know the class, they are dependent. In a Bayesian structure, say we have three random variables: humidity and playing cricket become independent once the weather outlook is given. Markov chains, which someone was asking about, fall into the same paradigm. These are the things pgmpy gives you; it is the most powerful open-source library we know of in this space, we are contributing to it ourselves, and we invite everyone else to contribute too. One thing we would like all of you to do with this is to run different possibilities and see what happens to your probabilities: how do they change if you change the inputs, and what happens to the inputs if you fix the output? These are the kinds of queries you can ask of your network. What about all the independences in the model? There are many more independences that emerge as additional variables are observed. To make clear what conditional independence is: if you have sufficient information, that information alone renders the conditioned variables independent of each other; if that information is missing, there is a flow of influence. Active trails fall into this as well: you can find out whether the model has an active trail between any node and any other. Those are the different features of this open-source library; it is not proprietary, it is out there, and researchers are contributing to it. You can also look at v-structures, which Rhea was explaining: a v-structure is one where observing a value activates the trail. What does that mean? It is a little confusing compared with the other structures. In a v-structure, observing the common child reduces the sample space of possibilities. For example, does wind influence humidity? They are independent of each other; wind and humidity are not related. However, if I know my weather outlook is bad or good, and I also have evidence about the wind, then I can say something about the humidity, for example that it is unlikely to be low. That is the idea of a v-structure: given the additional information, only certain samples from the sample space remain consistent, which raises or lowers the chance of falling into one bucket or the other. How much time do we have? Fifteen minutes. Okay, so I will not talk more about this; the idea was to give you a fair sense of the power of the network through an example. Let's get back to the slides to summarize everything we have seen so far, and look at what can be done with Bayesian networks as well as Markov random fields and other more advanced areas.
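The v-structure behaviour can be checked numerically. With the hypothetical outlook table used earlier, humidity and wind are marginally independent, but once the outlook is observed, learning the wind changes our belief about the humidity (explaining away):

```python
# Explaining away at the v-structure H -> WO <- W, hypothetical numbers.
p_h = {"low": 0.75, "high": 0.25}
p_good = {("low", "breezy"): 0.8, ("high", "breezy"): 0.3,
          ("low", "calm"): 0.6, ("high", "calm"): 0.1}

def p_h_given(wind):
    """P(H = low | WO = good, W = wind), by Bayes' rule."""
    num = p_h["low"] * p_good[("low", wind)]
    den = num + p_h["high"] * p_good[("high", wind)]
    return num / den

# Unconditionally, P(H = low) = 0.75 whatever the wind is; but once the
# outlook is observed, the wind evidence shifts the humidity belief:
print(p_h_given("breezy"))  # about 0.889
print(p_h_given("calm"))    # about 0.947
```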
This falls into something we think of as explainable AI, and that is becoming a standard concern. Explainable AI covers many areas, but when does your model actually become explainable? It becomes explainable when you can ask questions of it and get quantitative answers that you can actually measure. PGMs let you visualize these structured models and Markov networks, you can identify anomalies with them, and you can build models out of sparse data sets. You can predict from a single sample and build models from very few samples; you don't have to have all the samples. The only caveat is that your model will be uncertain to the extent your data allows, and if your data changes over time, your variance will change over time, which is expected. So it allows you to build models from sparse data, and the applications are huge; we'll talk about a few use cases ourselves. What we saw is that we can learn how busy a restaurant is, given the data. You can take the same model and build it out: build a Bayesian network with actual values, feed in values from real-time data collected in a supervised fashion, and then fit the data to learn the conditional probability distributions. The question comes down to: how do I learn these CPDs if I don't know them? You can collect data and then learn the values. Someone also asked a very interesting question, which is structure learning: what if I don't know the structure itself? That is a genuinely hard problem; it comes down to searching over many possible structures, and that is where efficient algorithms become useful, because as the number of nodes increases you need methods that can quickly detect the structure.
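In the maximum-likelihood case, learning a CPD from supervised data is just normalized counting. Here is a sketch on a made-up humidity/cricket sample (pgmpy offers the equivalent through its estimator machinery):

```python
from collections import Counter

# Hypothetical supervised records of (humidity, played_cricket).
data = [("low", "yes"), ("low", "yes"), ("low", "no"),
        ("high", "no"), ("high", "no"), ("high", "yes"),
        ("low", "yes"), ("high", "no")]

def fit_cpd(rows):
    """Estimate P(play | humidity) by counting co-occurrences."""
    pair = Counter(rows)
    parent = Counter(h for h, _ in rows)
    return {(h, p): c / parent[h] for (h, p), c in pair.items()}

cpd = fit_cpd(data)
print(cpd[("low", "yes")])  # 3 of the 4 low-humidity days were played: 0.75
```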
There are algorithms for structure learning, which is the field this falls into. We also saw that we can tell whether a restaurant is busy, whether a game is going on, or whether the road is clear: given evidence, you can tell how likely these events are. Does the conditional probability of the game change if we observe traffic or an accident? These are questions we should be asking, and we can ask them of new data and new evidence. Suppose you get new data, or there is another factor that affects your whole network: you can easily incorporate it and build on it. Finally, you can tell how likely the introduction of a new dish was, with a probability and a variance. We saw what these CPDs are: they can be entered with different states, so sunny falls into two states, and rainy likewise falls into two states, for the simple reason that it will rain or not rain with some probability, and your table then gives all the different possibilities for whether a pedestrian will be walking or not. You can answer complex questions through flows of influence, and detecting the flow of influence is very useful: given certain data in a large network, you can tell what led to your final result. Now some case studies. Fraud models: this is one of the famous fraud models from a paper by David Heckerman, and it was part of our work for refactored.ai, who was our client. The idea is: suppose a fraud occurs, what does a thief do? Perhaps steal a car, buy gas and flee the scene, and buy a lot of jewelry. For age and sex, say men and women are equally likely to be thieves, 50% each, and say the data shows people aged 30 to 50 are more likely to be thieves; this can be learned from the data. So we have CPDs for age, for sex, and for committing fraud, and notice that the relationship runs the other way for the purchases: committing a fraud influences whether someone buys gas, so knowing whether a fraud has occurred influences whether a person bought gas or jewelry. You can ask questions such as: given that we know a fraud has occurred, was it more likely done by a male or a female? This is very useful for backward analysis, and if you have additional information about the age, you can refine the probability of who likely did it. Then credit risk models: given payment history and outstanding loans, can you predict interest rates, with probabilities, and reason about causality, as we discussed earlier? Most importantly, something we use ourselves and you can too: record linkage and fuzzy deduplication of records. This can be done with graphical models, for example Poisson-gamma models; suppose two names differ slightly, and that error rate changes over time; you can merge such records using graphical models. And one question nobody asked but you might: what about continuous cases? We have only seen discrete probability distributions. There are more continuous models than these, but linear Gaussian models are one crucial family, also known as linear dynamical systems or Gaussian Bayesian networks, and they are used for object tracking and even robotics. That, in short, covers pretty much everything we had to tell you.
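A miniature version of the backward query on the fraud network, with invented numbers (the paper's actual CPDs are not reproduced here), shows how "given a fraud occurred, was it a male or a female?" is just Bayes' rule over the Age/Sex parents:

```python
# Hypothetical priors and fraud CPD for a Heckerman-style fraud network:
# Age and Sex influence Fraud; Fraud in turn influences purchases.
p_sex = {"male": 0.5, "female": 0.5}
p_age = {"30-50": 0.4, "other": 0.6}
p_fraud = {  # P(fraud = yes | age, sex), invented small probabilities
    ("30-50", "male"): 0.0010, ("30-50", "female"): 0.0008,
    ("other", "male"): 0.0004, ("other", "female"): 0.0003,
}

def p_sex_given_fraud(sex):
    """Backward query P(Sex | Fraud = yes) via Bayes' rule."""
    def joint(s):
        return sum(p_age[a] * p_sex[s] * p_fraud[(a, s)] for a in p_age)
    return joint(sex) / (joint("male") + joint("female"))

print(p_sex_given_fraud("male"))  # about 0.561 with these numbers
```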
This is also motivated by the one-shot learning paper by Brenden Lake: his claim is that if children can learn effortlessly, why do we need a million samples to train? That paper is very renowned, and graphical models, hierarchical models, are one way to approach this problem. Now I will take questions. That's a good question; we have not done rigorous experiments on that ourselves. We have done some experiments on coin tosses for illustration that you can look at, and in the coin toss you can see how the variance reduces as the number of samples increases; but once you have a fair estimate, adding more samples will not help, because you have already learned what there is to learn. With deep learning, what helps is that additional variance in the data can be captured with additional neurons; but if more data is not going to change your model much, you may already be happy with the result. That is essentially what I am trying to say: it depends on the use case and on how your data changes. If you have streaming data, it is a different question: this happens in record linkage, where you might get a new data dump and need to link the records up, and your error rate could change, which changes your model. A second, related question: in natural language processing there are rules in place for grammar, so if you imagine a system that combines something like an expert system with probability, could you get better accuracy with less data than deep learning? Well, I wouldn't compare accuracies with deep learning directly, because I haven't applied both to the same example myself to claim anything. But one thing I can tell you is that graphical models are already used in LDA: if you have done topic modeling and used LDA, that is a graphical model, and it gives you far more accurate results than traditional term frequency-inverse document frequency, because it is more honest about the structure you are drawing samples from. Most importantly, something I didn't talk about is that this falls into the area of generative models, which means that once you have the model, you can draw data from it. You can create instances, not exactly fake data, but samples: from your fraud model you can draw samples and say this particular instance is a fraud, something that might never have occurred in your data. You can draw samples from your distribution; that is why it is called generative. So, generative models: say you have a generative model for fraud detection; if you want to create instances showing what kinds of credit-card transactions would look fraudulent, you can randomly sample and generate those cases. Is this supervised? That term is flexible: you could call it supervised or semi-supervised, because the CPDs can be learned from data, so in that sense it is supervised. If you compare with neural networks, a neural network is a black box and this is a white box; that is the major difference. That is a very good question, and it splits into two things: is your data actually biased, or have you simply collected too little data, which happens to be biased? Those are different situations. This is exactly where the approach is useful, because when you have little data you can make a prediction with an uncertainty: you will have a biased prediction, but with an uncertainty attached, and as new data comes along the estimate shifts toward the right value. I might be able to show you an experiment on that: refactored.ai is open, there are more examples there, and they are now also part of the SUNY Buffalo graduate school curriculum. You can handle it both ways: if you think your data is imbalanced, you can rebalance it and train in the conventional fashion; or you can feed your data in as it is, let the bias be reflected in the prediction, and report an uncertainty score that says: here is my prediction, but I am not very certain; as more data arrives, the mean shifts to the right value. And in many cases you do want that bias: if your data says 60/40, you want the model to say the probability of being male is 60%, so that if you randomly pick a sample the proportions come out right. Why wouldn't you want that, unless the data you collected is genuinely unrepresentative? If 70% of the time I see males in my results, I want to assign a probability of 0.7 to that; that is exactly what Bayesian reasoning depends on. If the data you collected is wrong, that is a different issue, and you would do what you would do in any other case: balancing, under-sampling, over-sampling, all independent of the model. But if your model is to account for the bias, you need that bias in the data. I hope that is clear. Correct, yes. And what would a supervised model do in that case? Let's take the coin toss example.
Suppose you have taken 100 samples, but the person supervising sometimes never counts the tails; you could say that is a bias this person has. For that kind of bias there is something called a prior in Bayesian networks: you can choose a prior and assign it values that account for such a scenario, that the supervisor is biased and under-reports tails, so you give more weight to a tail whenever it occurs than to a head. You set the prior before taking the observed probability values into account; yes, you can work with the weights, and these are called conjugate priors, precisely. Here is an example: with a single sample, your uncertainty over the coin's bias is close to a uniform distribution. As you collect more samples, your estimate sharpens: if you see four heads out of ten tosses, you will end up somewhere around 0.4, but with high uncertainty around it; as the number of samples increases, for an unbiased coin you converge on 0.5 with much less uncertainty. That is the advantage of Bayesian modeling: you use all the previous samples, your prediction gets better and better, and your uncertainty reduces. This is the coin-toss example; Stanford has run experiments with just coin-toss data, and you can use it to build Bayesian models. Absolutely, this falls into online learning, and one of the best ways of doing the online learning you are describing is a graphical model: the advantage is that you don't need to train on everything again; you can throw your previous samples away and train only on the new ones. Yes.
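The coin-toss behaviour just described can be sketched with a Beta-Binomial model; a uniform Beta(1, 1) prior is assumed here, since the talk does not specify one. The posterior mean moves toward the coin's bias while the posterior spread shrinks as tosses accumulate:

```python
# Conjugate-prior update for a coin: Beta(a, b) prior + binomial data
# gives a Beta(a + heads, b + tails) posterior in closed form.
def beta_posterior(heads, tosses, a_prior=1.0, b_prior=1.0):
    a = a_prior + heads
    b = b_prior + (tosses - heads)
    mean = a / (a + b)
    var = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, var

m10, v10 = beta_posterior(heads=4, tosses=10)          # few samples
m1000, v1000 = beta_posterior(heads=498, tosses=1000)  # many samples
print(m10, v10)      # mean near 0.4, high variance
print(m1000, v1000)  # mean near 0.5, much lower variance
```

A biased observer who under-reports tails could be modeled exactly as discussed above, by choosing a prior (a_prior, b_prior) that gives extra weight to tails.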
I think you are repeating the same question, which is structure learning. When you do not have the structure, there are approximate inference and structure-learning algorithms to learn the dependencies. In short, you try the different combinations and fit CPDs; the one that fits best is what you pick. There are MAP algorithms and MLE methods: structure learning is usually based either on maximum a posteriori (again a Bayesian method) or on maximum likelihood estimation. I can send you papers on that, and the current pgmpy already supports it; we have examples there as well that you can take a look at. Yes, pgmpy does support a prior-based implementation. I do not know that; it depends on the use case, I would say, and I do not have a general answer. I do not understand your question, so we can talk about it later; I think it is more involved, there is a paper on it, it is complex. No, in fact deep learning can be used in this paradigm too: if you have 5,000 features, you need GPUs to learn. When I said deep learning I was talking about deep neural nets specifically, not deep learning as such; in fact the one-shot learning work, I think, uses deep nets. Yes, correct. And that is a good question: it gets into hidden Markov models, where what exists underneath is different from what you observe, with noise added on top; it falls into the paradigm of a hidden Markov model. Thank you everyone.