So, during the course of this talk, I will explain how we can use sampling algorithms to extract meaning from data, and I will show some use cases where we've been able to apply them to the problems we work on. Okay. Let me first start with a quick problem. Suppose we have to find the area under this curve, denoted by A. How many of you in this room think that you can find the area? Okay. Only a few hands. That's okay. How about this? How many of you can solve this integral? Okay. I see a few of you deciding to leave the room. Don't worry. I know we are all computer science engineers, which means we don't really know what we are strong at, right? However, there is a saving grace, which is that we can try to solve it with a computer. But how? Any guesses? Sorry? I actually can't hear you. Numerical integration? Okay. Sure. So let me quickly show you an idea of how we can do it on a computer while knowing very little math. Suppose we can put this curve in a rectangle, that is, take a rectangle that encloses the curve. Then you randomly sample points from the rectangle: use a uniform random distribution and pick points within the rectangle. Some points are going to lie within the curve, and some points are going to lie outside it. You will know which points lie within the curve by taking the x value, applying f of x, and checking whether the y that you randomly sampled is less than f of x: if it is, the point lies within the curve, or else it lies outside. I have denoted the points that lie within the curve in green and the ones outside in red. So now, who can tell me how you will find the area? Exactly. The ratio. The area is nothing but the number of green points divided by the total number of points, multiplied by the area of the rectangle, which is c times (b minus a). Essentially, if you sample more points, the result you get is expected to be more accurate. So this is an example of how you can use sampling to solve a rather complex-looking problem, and here is a quick sketch of the idea in code.
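A minimal sketch of this hit-or-miss Monte Carlo estimate, assuming f is the curve, [a, b] is the x-range, and c is the height of the enclosing rectangle (these names follow the formula above; the x-squared example at the end is mine):

```python
# Hit-or-miss Monte Carlo: estimate the area under f on [a, b],
# given a rectangle of height c that encloses the curve.
import random

def estimate_area(f, a, b, c, n_samples=100_000):
    hits = 0
    for _ in range(n_samples):
        x = random.uniform(a, b)  # sample uniformly inside the rectangle
        y = random.uniform(0, c)
        if y <= f(x):             # a "green" point, under the curve
            hits += 1
    # area ~ (fraction of green points) * (area of the rectangle)
    return (hits / n_samples) * c * (b - a)

# Example: area under x^2 on [0, 1]; the true value is 1/3.
print(estimate_area(lambda x: x * x, a=0, b=1, c=1))
```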
So I'll also give you one more example. These are simple examples to motivate how sampling as a concept is very powerful, and then we'll step into some real-world applications where we apply the same technique and see some really good results. Now assume that we are in a two-dimensional space and we have to sample points (x, y), where (x, y) has a certain probability distribution that looks like this. As you can see, more points lie somewhere around the center, which is (0, 0), and fewer points are distributed out at the periphery. So how do we sample points (x, y) from this distribution? Any thoughts? Okay, let me get to the next step. I'm going to introduce a very powerful algorithm called Gibbs sampling, and once you get to know Gibbs sampling, you'll realize that you can apply it in almost every context, including your own life. Let me explain what this algorithm is all about. Suppose that, instead of knowing p(x, y), which is the joint distribution across both the random variables x and y, we know the conditionals, which are the probability of x given y and the probability of y given x. Then we can sample in an easy manner. Now you may wonder why computing these conditionals is easier than computing the joint distribution, but for the moment I'll ask you to bear with me and take it from me that in most cases knowing the conditionals is much, much easier than being able to compute the joint distribution. You will see that in the real-world examples later on, but for the moment let's assume that we know the conditionals p(x given y) and p(y given x). The algorithm goes like this. You just pick a random point (x0, y0); its probability might even be close to zero, but don't worry about that. You sample a random point, and then you change the x value to x1 using the conditional probability p(x1 given y0). So you use the conditional distribution to change the value of x. Now x0 has become x1, and in the next step you change the value of y, again using the conditional distribution p(y given x), and you keep iterating, changing one of the random variables at every step. Once you start doing this, a few iterations later you will be at (xi, yi), and you apply the same thing again: you get to yi+1 conditioned on xi. What this algorithm guarantees is that after some time, which is called the burn-in period, the values (xi, yi) that you get are actually guaranteed to be samples from the probability distribution p(x, y). That's the beauty of it. Let's now fit this into the context of the joint probability distribution that we saw, because the (x0, y0) that we picked happened to lie almost outside the distribution; this point, as we all know, is less likely to belong to the joint distribution. In the next step, when you apply the conditional distribution, the value of y might remain the same but the value of x changes; then you keep the value of x the same and change the value of y, because you are picking from that conditional distribution, and it keeps changing like this. Within moments you've reached a point that is far more likely to belong to the distribution. So this is a sampling technique where, even if you know very little about the joint distribution, you can start at random, fix one variable at a time, and a few moments later you'll actually be where you want to be. That's the power of Gibbs sampling: when you have unknown parameters, you start at random, fix one parameter, then the next, and keep doing that, and a few iterations later you'll be where you want to be. Here is what that looks like in code, and then let's see how we can apply it in the context of some real problems.
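A minimal sketch on a toy target of my choosing: a bivariate normal with correlation rho, picked because both conditionals are known in closed form (x given y is normal with mean rho*y and variance 1 minus rho squared, and symmetrically for y given x). The starting point and burn-in length are illustrative:

```python
# Gibbs sampling for a bivariate normal with correlation rho.
import math
import random

def gibbs_bivariate_normal(rho=0.8, n_iters=10_000, burn_in=1_000):
    x, y = 5.0, 5.0  # start anywhere, even in a low-probability region
    sd = math.sqrt(1 - rho * rho)
    samples = []
    for i in range(n_iters):
        x = random.gauss(rho * y, sd)  # sample x from p(x | y)
        y = random.gauss(rho * x, sd)  # sample y from p(y | x)
        if i >= burn_in:               # keep samples only after burn-in
            samples.append((x, y))
    return samples

print(gibbs_bivariate_normal()[:3])
```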
So I had a video here that's missing, but I'm sure all of you know the K-means clustering algorithm. Given a bunch of points like this, it first picks a few random points as centroids, then assigns the rest of the points based on their distance from those randomly picked centroids, then recomputes the centroids, reassigns the points to the nearest clusters, and so on, until you reach convergence. That's the K-means clustering algorithm. I'm sure all of you have seen it, but what is the problem with K-means? Sorry? It assumes convex clusters, yes, what else? A very basic issue with K-means. It assumes a round covariance, yes. You need to know K. Absolutely, you need to know K, and you choose some value of K that may or may not be right. The determination of K is always a challenge with K-means. So what we'll do is see how we can cluster these points without knowing K, using some of the sampling concepts we learned just now. Let's assume that this is the final state of the clusters when you apply K-means. What these clusters represent is that each centroid has some sort of circle of influence around it. The points nearest to the centroid are more likely to belong to that cluster than points further away from it; there is a waning circle of influence around each centroid. In a sense, this is nothing but a normal distribution around the centroid. Assume that each centroid is trying to pull points towards itself and, just like a Gaussian distribution, its circle of influence wanes as the distance increases. When the points are two-dimensional, you get this kind of 3D representation of the normal distribution around the centroid. So I want to quickly explain this algorithm. It's very interesting, and I'm sure you will be able to apply it to some of the problems you are working on. But before that, a quick brush-up of probability distributions. I'm sure all of you have heard of the Bernoulli and binomial distributions, which model a coin toss: there are two outcomes, and binomial is tossing the coin multiple times while Bernoulli is tossing it once. Then you have the categorical or multinomial distribution, where the number of outcomes is not just two; it's not just heads or tails, you have multiple outcomes. A classic example is drawing balls from a bag, where the bag has differently colored balls. In this case there are red, green, and blue balls, so there are more than two outcomes. That's an example of a multinomial distribution. Now I want to introduce one more interesting distribution, called the Dirichlet distribution. How many of you have heard of the Dirichlet distribution before? Yeah. The name sounds a little complex, and sometimes the treatment of it also looks complex, but it's actually quite simple. If the multinomial distribution allows you to pick a particular colored ball from a bag of differently colored balls, what the Dirichlet distribution does is generate the bag itself. How should that bag look? Should it have a uniform representation of all colors, or should it have a skewed representation? What kind of bags should be available for you to pick a ball from? That's what the Dirichlet distribution does: in essence, it generates a multinomial distribution. As an example, if you set the alpha parameter of the Dirichlet distribution to 100, you get a bag that looks like this. On the other hand, with an alpha of 0.5, the bag looks more like this. You can see the difference between the two bags: one is a little skewed, with more orange colored balls, while in the other the balls are more uniformly distributed. You can see the same thing directly in code.
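A minimal sketch of the "Dirichlet generates the bag" idea, assuming three ball colors as in the example (the seed and the sample outputs in the comments are illustrative):

```python
# Each Dirichlet draw is itself a multinomial: a probability vector
# over the ball colors. Large alpha -> near-uniform bags; small
# alpha -> skewed bags.
import numpy as np

rng = np.random.default_rng(0)

print(rng.dirichlet([100, 100, 100]))  # e.g. roughly [0.33, 0.34, 0.33]
print(rng.dirichlet([0.5, 0.5, 0.5]))  # e.g. something like [0.90, 0.02, 0.08]

# Draw 60 balls from one such bag:
bag = rng.dirichlet([0.5, 0.5, 0.5])
print(rng.multinomial(60, bag))        # counts per color
```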
So how are we going to use this to implement our K-less means clustering algorithm? Whenever you encounter such problems, the first step is to come up with a reasonable generative model, and I'm going to define the generative model to be something like this. Let's say I have a bunch of points that I want to cluster, and let's make some assumptions about how we got this set of points. The typical generative model looks like this: you choose how many clusters to generate; once you have decided the number of clusters, for each cluster you decide a mu, which is the mean, and the variance, or the covariance if it's more than one dimension; for each cluster, you also decide the number of points that should belong to it; and then you generate the points. So that is how the points were generated. Now the problem is that you don't have any of these parameters with you. As a data scientist, you are given only the points, and you have to infer what went into generating that data. Let's assume that three clusters were generated, like this. As you can see, the clusters have different centroids: the orange one has a smaller variance, the blue one has a bigger variance, and the green one has a skewed sort of covariance. The next step in solving these problems is typically called the inference step. We know that this data was generated in this fashion; now we have to infer things, that is, compute the parameters with which the data was generated. The inference problem is as follows. Assume that there are no more than M clusters, a maximum. You may say that M looks a lot like K. Well, for solving most of these problems you need some sort of prior. In the worst case, M could be as large as the number of points, but in most cases you have some sort of hunch, and that is much better than having to guess the exact value of K. So choose M. Then, for each of these clusters, we have to infer the number of points that belong to the cluster, which is essentially a multinomial: how many points are there in each cluster, which, if you recollect, is like that bag. And for each of these clusters, we have to infer the centroid, which is the mean, and the variance. So let's see how we can get this information, which is the next step: inferring the parameters. As I said, pick a sufficiently high value for the maximum number of clusters. Now, why did I introduce the Dirichlet distribution? There's one more parameter that you have to set, which is alpha. Alpha decides how the clusters are likely to be distributed, whether sparse or uniform. It is definitely better than having to choose one value of K, because alpha just decides whether all clusters have points or only a few of them do. Here is the generative story as code.
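A minimal sketch of that generative story; every parameter choice here (the ranges, the spherical covariances, the seed) is illustrative:

```python
# Generative model: Dirichlet -> cluster weights; each cluster gets a
# mean and covariance; points are drawn from the resulting mixture.
import numpy as np

rng = np.random.default_rng(0)

def generate_points(n_points=300, max_clusters=10, alpha=0.5, dim=2):
    # A small alpha makes the weights sparse, so only a few of the
    # max_clusters get an appreciable share of the points.
    weights = rng.dirichlet([alpha] * max_clusters)
    means = rng.uniform(-10, 10, size=(max_clusters, dim))
    covs = [np.eye(dim) * rng.uniform(0.5, 2.0) for _ in range(max_clusters)]
    assignments = rng.choice(max_clusters, size=n_points, p=weights)
    points = np.array([rng.multivariate_normal(means[z], covs[z])
                       for z in assignments])
    return points, assignments

points, true_assignments = generate_points()
```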
And then, what do we need to find? We need to find all of these values: mu1, sigma1, n1; mu2, sigma2, n2; and so on for all of the max clusters, correct? This is essentially a joint distribution, and as you can see, sampling from this joint distribution to find all of these values is a hard problem. So what can you do? Instead of sampling from the joint distribution, we can sample from the conditional distributions, and you shall see that the conditional distributions are much, much easier. So we're going to apply the Gibbs sampling algorithm. First, use the Dirichlet distribution to sample a multinomial. Again, this is a random multinomial; it may or may not be right. Then assign random points as the centroids of the clusters and a random covariance to each cluster, very similar to the way we started at a random point when sampling from the joint distribution earlier. Now, based on this configuration, assume that it is right and assign points to the clusters; remember that you're doing this assignment, or sampling, based on the conditional distribution. Similarly, once you have assigned points to the clusters, recompute the mu and the covariance of each cluster, which is again sampling conditioned on the current assignment of points to clusters. Then recompute the multinomial, based on how many points are in each cluster, taking the Dirichlet prior into account; again, this is an assignment, or sampling, based on a conditional distribution. You just do this enough times, and pretty soon you will see that it converges and you have the split of the points into clusters; I'll show a compact sketch of this loop in code at the end of this example. So I had a video here, but I'm not sure it will play, so I'll go through this quickly. As you can see, at this earlier step the number of clusters has already come down: my max clusters was set to 10, and it has gone down to four already. We know that there were three clusters, if you remember the example that I showed. You can see that the fourth cluster, the orange one, is slowly vanishing, and you're left with basically three clusters, with the right covariances and means. So how can you apply this? We did this in our company to identify similar companies by looking at the skill sets of employees within those companies. You take the skill sets of employees within a company and represent them as a vector, and if you then run this K-less means clustering (I've done this on a small sample set, just for illustration), you can see that similar companies come together: Goldman Sachs, Morgan Stanley, and Tower Research come together; Cisco, Juniper Networks, Tejas Networks, et cetera, come together. This is pretty powerful, because it's very hard to know how many different kinds of companies exist and to set the right value of K. So this is one example. For those who are interested, I have uploaded the code that I wrote for this at bit.do slash KLS means, so you can look it up as well. So I'll quickly explain, how many more minutes do I have? I know I've run out of time. Ten more minutes. Thank you.
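Before moving on, here is the promised sketch of the inference loop. It is simplified for brevity, not the exact code from the talk: covariances are kept diagonal, clusters with too few points simply keep their old parameters, and it reuses the points from the generative sketch above:

```python
# Gibbs sampling for clustering without a fixed K: alternate the three
# conditional updates described above.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

def gibbs_cluster(points, max_clusters=10, alpha=0.5, n_iters=100):
    n, dim = points.shape
    weights = rng.dirichlet([alpha] * max_clusters)              # random multinomial
    means = points[rng.choice(n, max_clusters, replace=False)]   # random centroids
    covs = np.array([np.eye(dim)] * max_clusters)                # initial covariances

    for _ in range(n_iters):
        # 1. Assign points given the current parameters: p(z | weights, mu, sigma).
        likes = np.stack(
            [weights[k] * multivariate_normal.pdf(points, means[k], covs[k])
             for k in range(max_clusters)], axis=1)
        probs = likes / likes.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(max_clusters, p=p) for p in probs])

        # 2. Recompute each cluster's mean and covariance given the assignments.
        counts = np.bincount(z, minlength=max_clusters)
        for k in range(max_clusters):
            if counts[k] > dim:  # enough members to estimate parameters
                members = points[z == k]
                means[k] = members.mean(axis=0)
                covs[k] = np.diag(members.var(axis=0) + 1e-3)

        # 3. Resample the weights, folding in the Dirichlet prior.
        weights = rng.dirichlet(alpha + counts)

    return z, means, counts

z, means, counts = gibbs_cluster(points)
print(counts)  # most of the 10 slots should end up empty or nearly so
```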
So we already saw one application. Now let's see how we can apply this to a slightly more complex problem. What is this? It's a resume. The problem is the resume parsing problem: understanding the contents of a resume. Why is it a challenging problem? Because it's a semi-structured document. It has text in it, but there is also a notion of structure, characterized by the visual and logical layout: the organization of the content, the different fields, the font size, the color, the indentation. All of this converts what would otherwise be an unstructured document into a semi-structured one. Okay, so now, as before, what do we need to do? If we are going to apply a sampling algorithm to extract meaning from this data, we have to define the generative model. The generative model here is, how do people write their resumes? We'll try to get into the minds of people who write resumes, which is all of us in this room, and come up with a model that fairly matches how people do it in real life. A resume is nothing but a bunch of text blocks. Each text block could be of heading type or it could be actual content that you write, and it has a certain styling. You decide what it should be nested under, which means that every text block has some sort of parent. And then, what are you going to write there? Say you're in the work experience section; within that you put the heading of the current company you're working at, and within that you write what exactly you do at that company. So there is some structure there. All of these parameters are things that you think through, and then you build your resume. This is the generative model we need to infer from, so we look at the properties, or regularities, with which people write resumes. Typically, text blocks of the same class tend to have similar styling. For example, Google and Microsoft are companies, and you would tend to write both of them in a very similar font and styling; similarly, you tend to write your role within these companies in the same format. We also see that, in most editors, text blocks tend to have the same styling at similar nesting levels. Even though these two sections are different, "B.E. Computer Science", which is a degree, looks similar to "Software Developer", which is a role, because the nesting level within the semi-structured document, the resume, is the same; at nesting level three, most editors give you the same font styling. Another interesting thing: Google and Stanford University have similarly styled parents, so if two nodes have similar styling, they tend to have similarly styled parents. We take these properties and, based on them, try to infer the parameters that went into the making of the resume, so that we can recover the tree structure of how the resume was written. Essentially, the resume is a bunch of text blocks, and the i-th text block is generated by sampling from a joint probability distribution: deciding whether it's a heading or a content block, what the styling is, and what the text is, based on certain parameters. A sketch of this view of a text block follows.
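A minimal sketch of the text-block view of a resume; the field names and the example values are mine, not the actual model from the paper:

```python
# A resume as a tree of text blocks: each block has a type (heading vs.
# content), styling features, and a parent it is nested under.
from dataclasses import dataclass, field

@dataclass
class TextBlock:
    text: str
    is_heading: bool
    styling: dict                      # e.g. {"font_size": 14, "bold": True}
    parent: "TextBlock | None" = None  # nesting: what this block sits under
    children: list = field(default_factory=list)

# e.g. "Work Experience" -> "Google" -> a content block describing the role
work = TextBlock("Work Experience", True, {"font_size": 16, "bold": True})
google = TextBlock("Google", True, {"font_size": 13, "bold": True}, parent=work)
role = TextBlock("Software developer working on search infrastructure", False,
                 {"font_size": 11, "bold": False}, parent=google)
work.children.append(google)
google.children.append(role)
```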
So what is the inference problem here? You're given the resume, and now you have to identify the structure and what each of these text blocks means. Essentially, given these blocks, we have to determine what the text means: is it talking about the institute you studied at, or the company you worked at? So, all of those parameters. As you can see, you're trying to sample from a pretty complex joint posterior probability distribution. How do you solve this problem? Instead of sampling from the joint distribution, you sample from the conditional distributions, which again, in this case, are much simpler if you look at the properties I listed. This is an example of the output that you get when you parse a resume: a structure describing what each of these blocks means. For more details, we presented our work at ICDAR, so you can Google it and get into the details of the inference step, because we don't have much time left. So, thank you all. Happy to take any questions.