My name is Uncle Martin, I work for 227 Innovation Lands. Today I will be talking about some of the critical aspects behind predictive analytics; in particular, I will be focusing on the text mining side of this scenario. So what is predictive analytics? Most of you are familiar with the problem: we want to draw some insight, some inference, out of the data we have, and we want to make a prediction from it. Now in many of these circumstances, the data we deal with is essentially text data, which is unstructured. We have log data and other kinds of more structured data as well, but I will focus more on the unstructured part of the data.

Text data is very high dimensional and it is sparse. What do we mean by this? It is high dimensional because a particular corpus contains a lot of words, while each document only has so many words out of that dictionary. So if we treat each word as a dimension, each document lives in a space with a very large number of dimensions. And it is likewise sparse, because only a few of these dimensions are actually filled in any given document. That is what I mean when I say it is high dimensional and sparse.

So how do we capture the text data, how do we represent it? First of all, I would like to point out that the usual first step whenever we start dealing with text data is some kind of preprocessing. In the preprocessing phase we typically remove the stop words, which are nothing but the very commonly occurring words, like articles and prepositions and that sort of thing; this is called stop word removal. Then there is something called stemming, which means that we cut a word down to its root, so for example "raining" becomes "rain", and so on. This is a standard preprocessing step we do in most cases, not all cases, but in most cases.

Now text data can be represented in a variety of ways. One of them is bag of words. When I say bag of words, I mean that we don't really take into account the relative positioning of the words in a particular document; that is why it is called a bag of words. Out of these, a particularly famous model is the vector space model. The vector space model, as I indicated in the last slide, basically represents your document as a vector which has one dimension for every word in the entire dictionary you have. I'll show you a visual representation next.

This matrix is called the term-document matrix. You can see that each row represents a particular document, and we have n documents in this particular corpus, which is nothing but a collection of documents. Each document has, in this case, k dimensions, and these w11, w21 and so on denote the elements of this particular matrix. Let me just go back to the previous slide once again, where I have written the measure of these elements. So what value do we give to these elements? The value can be given in a variety of ways. One popular way is to simply count how many times a particular word occurs in the document, which is nothing but the term frequency. But a slightly better way to capture the distinctive information of your document is to use something called TF-IDF, which is term frequency multiplied by inverse document frequency.
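Just to make this concrete, here is a minimal Python sketch of the preprocessing and the bag-of-words term-document matrix described above; the stop word list and the crude suffix-stripping "stemming" are toy assumptions of mine, not what a production system (with a proper stemmer) would use:

```python
# Toy stop word list; real pipelines use much larger ones.
stop_words = {"it", "is", "the", "a", "an", "and", "of"}

def preprocess(text):
    # Stop word removal, then a crude stand-in for stemming (strip a trailing "ing").
    tokens = [w.lower() for w in text.split() if w.lower() not in stop_words]
    return [w[:-3] if w.endswith("ing") else w for w in tokens]

corpus = ["It is raining today", "Raining and raining again", "The sun is out today"]
docs = [preprocess(t) for t in corpus]

vocab = sorted({w for d in docs for w in d})
# Term-document matrix: one row per document, one column (dimension) per dictionary word,
# holding plain term-frequency counts. Most entries are zero, which is the sparsity point.
term_document_matrix = [[d.count(w) for w in vocab] for d in docs]
```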
What is the intuition? If you have a word which occurs a lot of times in a particular document, it will have a high term frequency, right? But suppose that word is an article like "the". This article also occurs in a lot of other documents, so its document frequency, which is the number of documents in which this particular word occurs, is also high, which means the IDF is low, and so TF multiplied by IDF will be low. Essentially this TF-IDF measure captures the distinctiveness of a particular word in a document. The words occurring most commonly throughout the corpus will have a low TF-IDF, and the words which occur very infrequently will also have a low TF-IDF. The words with a high TF-IDF are typically the words which occur many times in one or only a few documents, but very few times in the other documents. As you can intuitively guess, this captures a lot of information about a particular document in a given corpus.

Having said that, in this very slide, assume that the values are TF-IDFs now. So each document can be represented, as you can see on the right side, by a vector, right? Now how do we measure the similarity between two documents? Again, there can be other ways of measuring similarity, but one straightforward way is to measure the cosine, cos theta, of the angle between the two document vectors. If the documents are very similar, their vectors will be close to each other and the cosine will be high. This is a very intuitive thing as well; there are, as I mentioned, other measures also.

As I already said, this particular bag-of-words model does not capture the positioning of words. For example, suppose each document is a single sentence: one document is "it is raining" and the other is "is it raining". These two documents will not be differentiated in the bag-of-words model, because the relative positions are not being measured here. This we can address with something called the n-gram model. Admittedly the example I gave is not the greatest, but the idea of the n-gram model is that instead of capturing one word at a time, we capture a particular group of words, a sequence of n consecutive words. I gave this just as an example to demonstrate that there are other ways to represent our text data.
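As a rough sketch, here is one way the TF-IDF weighting, the cosine similarity and the n-gram idea could look in Python; the toy corpus and this particular TF-IDF variant (relative term frequency times the log of inverse document frequency) are assumptions on my part, since several variants are in common use:

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """TF-IDF weights for a list of tokenised documents (one common variant among several)."""
    vocab = sorted({w for d in docs for w in d})
    n_docs = len(docs)
    df = {w: sum(1 for d in docs if w in d) for w in vocab}   # document frequency
    rows = []
    for d in docs:
        counts = Counter(d)
        rows.append([(counts[w] / len(d)) * math.log(n_docs / df[w])   # tf * idf
                     for w in vocab])
    return rows

def cosine_similarity(u, v):
    """cos(theta) between two document vectors; values near 1.0 mean very similar documents."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def ngrams(tokens, n=2):
    """Sequences of n consecutive words; unlike a bag of words, these keep some word order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

docs = [["rain", "heavy", "rain"], ["sunny", "day"], ["rain", "sunny", "interval"]]
m = tfidf_matrix(docs)
print(cosine_similarity(m[0], m[2]))
print(ngrams(["it", "is", "raining"]))   # [('it', 'is'), ('is', 'raining')]
```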
Now the more interesting representations are the semantic models, because they are able to somehow capture the semantic information which a text carries; they are not as shallow as the previous models. One of them could be parsing. Parsing, as most of you will be familiar, typically means tagging the sentences in our given document into subjects, objects, verbs and so on, right? So along with our text data we then have additional information about the grammatical role of those particular words, which gives some more semantic information about the text. But other computational techniques also exist to capture this semantic information. One of them could be dimensionality reduction. As the name suggests, dimensionality reduction means that out of the huge number of dimensions a document has, because, as I pointed out, a document carries one dimension for every word in the dictionary, we want to come down to something much smaller, since a typical document does not contain all the words from the dictionary, right? So we would want to reduce that.

One technique to do this is called latent semantic analysis, or latent semantic indexing as it is also called in a different context. The next slide tells something about it. The idea is this: X here is again a term-document matrix. Without going too much into mathematical detail, I will just tell you what this machinery does. Once you have the representation in this matrix form, the term-document matrix I pointed out previously, this particular matrix can be expressed as a product of three matrices, U, Sigma and V transpose. Forget for a moment what U and V are; the important point to remember is that Sigma is a diagonal matrix, meaning it only has values on its diagonal. This factorisation is called the singular value decomposition. Now, what we do to reduce the dimension is to truncate Sigma. Suppose Sigma is an n by n matrix; we just keep, say, the first k values on this very diagonal, form another diagonal matrix from them, and by the same product operation recover an approximation of the original X. So essentially, as you can see in the lower line, X has been decomposed into U, Sigma and V; from Sigma we make a starred Sigma, which is a truncated form of it, and from that we recover a starred X. This starred X will have a lower rank than the original X, and it will, somewhat magically, be a representation of the same matrix in a dimensionally reduced form. This is the essential mathematical machinery behind LSA. There are other dimensionality reduction techniques as well, but this was just an illustrative example.
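A minimal sketch of this truncated-SVD machinery, assuming NumPy and a small made-up matrix X (in practice X would be the TF-IDF term-document matrix from before):

```python
import numpy as np

# X: documents x vocabulary matrix, e.g. holding TF-IDF weights.
X = np.array([
    [0.0, 0.3, 0.1, 0.0],
    [0.2, 0.0, 0.0, 0.4],
    [0.1, 0.3, 0.0, 0.0],
])

# Singular value decomposition: X = U * diag(s) * Vt, with s sorted in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                                           # keep only the top-k singular values
X_star = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # the "starred X": a rank-k reconstruction

# Low-dimensional document representations: k coordinates per document instead of |vocab|.
doc_embeddings = U[:, :k] @ np.diag(s[:k])
```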
Then there is a different class of methods called topic models. Specifically I will tell you something about LDA, which stands for latent Dirichlet allocation. Again, the name might sound scary, but the essential idea behind LDA is this. The standard assumption we make is that the words in the documents of the corpus, the corpus being the collection of documents, are distributed as per some probability distribution; forget for now exactly which distribution. Next, you assume that each document is itself a distribution over some topics. The idea behind LDA is to figure out what these topics are; up to this point we do not really know which topics we are going to talk about, we just assume they exist, which is a very intuitive guess, right? Whatever you are talking about, each document you have will be a mixture of some particular topics. Say your document is talking about sports some of the time, but about politics at some other time; you can say it is 90% sports and 10% politics, and so on. This is a very intuitive assumption. The last assumption is that each topic is in turn a distribution over words. What LDA does is to work backward from these assumptions. They might look like obvious assumptions, but they basically let you decompose your documents into certain topics, so you end up with the topics which describe your whole corpus. So far, within the semantic models, we have seen parsing, dimensionality reduction and topic models; again, there are other kinds of topic models as well.

Now, what can we leverage out of these topic models? One straightforward application is grouping. For example, if you have two different documents and both are talking a lot about sports, we can merge them, merge in the sense that we can group them together: we can say these particular documents are talking a lot about sports, so let's form a group of these; other documents are talking a lot about politics, others about food, and so on. So there is an essential grouping involved in these topic models, and this grouping is technically called clustering. There are other clustering approaches also; the idea was just to demonstrate that topic modelling is in itself a clustering approach. As you might be familiar, clustering is nothing but the grouping of documents into certain understandable groups.

Let me demonstrate a popular clustering method called K-means clustering. In K-means clustering, if you look at the diagram in the illustration, we have somehow represented our documents as these various points. Suppose I want to group this entire collection of documents into, say, three groups. As a standard procedure, what I can do, and this can vary, but one of the approaches, is to select any three random points, which are represented here by red, green and blue. Now what we do is see which points are nearest to each of these. One nearness measure could be the cosine similarity, but there are other nearness measures also. For the sake of explanation, assume that the documents which are nearer to a point are clubbed together with it. You can see in the second figure that we have coloured several documents to indicate that these particular documents are clubbed together. Now, these three groups may not be the final groups we want, they may not be the final answer, so what we do next is figure out the centroid of each particular group. You can see in the third picture how we are figuring out the centroids; after figuring them out we have three new points, and we repeat the same procedure. The procedure converges after a point, and we have a stable clustering. As you can see in the last picture, a natural clustering has emerged out of the many documents we had in the original picture.
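Here is a rough, self-contained sketch of the K-means procedure just described (random initial centroids, assignment, centroid update, repeat); the squared Euclidean distance and the toy 2-D points are my own assumptions, a cosine-based distance would work the same way:

```python
import random

def kmeans(points, k=3, n_iter=20, seed=0):
    """Plain K-means over equal-length numeric vectors (e.g. document vectors)."""
    random.seed(seed)
    centroids = [list(p) for p in random.sample(points, k)]   # k random points as initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid (squared Euclidean here).
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move every non-empty centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = [sum(dim) / len(cluster) for dim in zip(*cluster)]
    return centroids, clusters

docs_2d = [[0.1, 0.2], [0.15, 0.22], [0.9, 0.8], [0.85, 0.9], [0.5, 0.1], [0.52, 0.12]]
centroids, clusters = kmeans(docs_2d, k=3)
```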
Again, another usual problem we face is called classification. The difference between clustering and classification is that in clustering we don't really know beforehand which groups our data is going to fall into; it is in that sense unsupervised. In classification we know the groups we want: we know, for example, that our data is about, say, food, and whether that food is of good or bad quality, so we know the group definitions before we start, and we want to figure out which document in our corpus belongs to which group in the closest possible sense. This is the definition of the classification problem.

The classification problem can be approached in a few distinctive ways; one of them is rule based, and there are others. Let me talk about rule-based classifiers first. With a rule-based classifier, for example, if you are dealing with text data, suppose your text data is customer data. You can say, okay, the particular documents or paragraphs which talk about buying or selling, these documents are mostly about buying something; and the documents which talk about customer help or something like that, let's put those into the help-required category. I am just giving you an example. So we can have a lot of rules: if this word is present, or if this particular phrase is present, then do this, and so on. This is really a rule-based machinery.

An example of this could be the decision tree approach; see this example. Here the decision tree is predicting the survival of the passengers on the Titanic ship, and the information we have is based on observations that were collected, so essentially we ask questions of the data, we apply rules to the data. Here we are asking: is the sex male? If yes, is the age greater than 9.5 years? Then there is some probability that the passenger died and some probability that he survived; then a further question about siblings or spouses aboard gives the next split, with its own probabilities. All of this is based on our past data. This is just an illustrative example: say you run some website which has a lot of traffic from everywhere; one strategy could be to employ such a decision tree learning framework, figure out a lot of things from your previous data, and then, for the new data which is presented to you, try to see where it fits in. This is the essential model of the decision tree learning approach.
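As a sketch of how such a decision tree could be learned in practice, here is a small example using scikit-learn; the passenger rows and labels are invented purely for illustration and only loosely follow the Titanic slide:

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up rows loosely following the Titanic illustration: [is_male, age, siblings_or_spouses]
X = [
    [1, 30.0, 0], [1, 8.0, 3], [0, 25.0, 0],
    [0, 40.0, 1], [1, 50.0, 0], [0, 6.0, 2],
]
y = [0, 0, 1, 1, 0, 1]   # 1 = survived, 0 = died (invented labels, for illustration only)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

print(tree.predict([[1, 9.0, 1]]))         # predicted class for a new passenger
print(tree.predict_proba([[1, 9.0, 1]]))   # class probabilities, like the fractions on the slide
```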
Then we have another class of approaches, the linear classifiers; you might be familiar with something called logistic regression. What linear classifiers do is frame the output as a linear algebraic combination of our document vector and a coefficient vector. But let me not go deeper into that; instead I'll show you two cases where this is actually used in industry.

This first model is called verbatim categorization. If you look into this data, this is typically the customer data you have; you can see the labels, agent and customer. We have figured out that these are the lines regarding ratings, these are the lines regarding issues, and so on; these are again the standard classification approaches we have used. Now, having figured out which line relates to a rating and which line relates to an issue, we do a second level of categorization among these very lines. For example, among the cancellation lines, the lines typed by the customer which say something about cancelling a particular product, we again categorize into a second level of classification, which you can see here in the lower layer. This is the standard approach we apply on a day-to-day basis to our data, and it can give you a lot of insight, starting with simple numerical measures of how many customers are rejecting or accepting and so on, which you read off from these categorized lines.

This is a good example, but not a particularly sophisticated one. A more sophisticated example of this whole machinery could be something called ontology learning. Now what is an ontology? An ontology is basically a kind of dictionary, or rather something more than that, which captures a lot of information about the terms you have in your data: the kind of groupings these terms form, the kind of relatedness one concept has to another concept, and so on, and also the rules you can figure out from your data. An example of such a rule could be: suppose the customer says no in three consecutive lines, then he is not going to convert into a prospective customer at all. These kinds of rules can be figured out from your existing data.

This is again a very high-level picture, and building it employs a lot of machinery. To extract the important terms from the data we require a lot of preprocessing; then we have to extract the terms, parse them, tag them as nouns and so on. Then, to form concepts out of these terms, we have standard clustering approaches. Once we have found the concepts, there is a hierarchy, so we can do something called hierarchical clustering, which is not a plain clustering; by that I mean it is not just grouping the data into flat clusters, but clustering at increasing or decreasing levels of granularity, so you have one big cluster which contains many sub-clusters, and so on. In this very fashion you can collect all your concepts and figure out the relations.
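Going back to the verbatim categorization above, here is a hedged sketch of what the first-level line classifier could look like with a linear model, using scikit-learn's TF-IDF features and logistic regression; the customer lines and the category names are hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical customer lines and their first-level categories.
lines = [
    "I want to cancel my subscription",
    "How do I cancel this order",
    "The product rating is excellent",
    "I would rate this five stars",
    "I need help with my account",
    "Please help me reset my password",
]
labels = ["cancellation", "cancellation", "rating", "rating", "help", "help"]

# TF-IDF features feeding a linear (logistic regression) classifier.
model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
model.fit(lines, labels)

print(model.predict(["please cancel the product I ordered"]))   # e.g. ['cancellation']
```

A second-level model of the same shape could then be trained within each first-level category, for example on just the cancellation lines.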
One sort of straightforward way to figure out the relations in your data is to extract the relational words, like "has" and certain prepositions; you can extract "has" and then see what subjects and objects attach to it. This was just to give you a snapshot of what goes on in this machinery. Alright, so these are a few good references I have noted down; if you are any more interested you can look them up. And again, depending on your data, cosine is not the only similarity measure; this was just because I could only talk about the kind of things we had time for. But at least as far as the storage is concerned, if the clustering is more efficient like this, that can be understood.