What I am going to do is talk about a subject that not many people talk about: the devil in the detail. What makes a great data scientist, as opposed to a knower of all the tools and algorithms? What does that transition take? I will show you some of the things I have learned over a period of time, put together as what I call the art of data mining. Do not think of this as one big talk; it is more like seven small talks, so do not look for a coherent theme. I am going to talk about many different kinds of things we have learned, and hopefully some of them will be useful to you in whatever you are doing.

If I have to summarize what I am trying to do, it is to make this quote a reality: in theory, theory and practice are the same, but in practice they are not. In practice things are different, so we are going to see how things look in practice when we apply all this stuff.

We are all here because we believe we are drowning in data. We saw a lot of data of different kinds today, for example genomics data, and you can imagine people coming from different verticals, each with its own data. We are drowning in information and starving for knowledge, and the goal is to extract more meaning and insight from the data and take better decisions.

So let us look at the life of data. The first thing we all do is clean, clean, clean; it is a never-ending process. Then we do basic things: we summarize, we report. Then come more interesting things: the dawn of the database system, indexing, searching, the internet. This is where we are today, and we are very efficient at it. The next stage is to understand what the data actually is: can machines understand data the way we humans do, can they visualize it? Then we analyze and build models, then we generalize and make predictions, and once we have made predictions we make decisions out of those predictions. So really we are making decisions out of data. We are not just storing it, not just analyzing it, not just writing MapReduce jobs on it; at the end of the day we are making decisions from it.

Let us look at the stages. We generate some insights, we create some features, we build models on those features, we make predictions, and then, using those predictions and some business objectives, we make decisions. Based on the decisions we get feedback, which we add to the data, and the cycle goes on. You can apply this loop in all of your domains; obviously the kinds of features you create and the business objectives you have will lead to different kinds of decisions, so we will not go into that.

But let me ask a question first: what do you think makes us intelligent? Pardon? Applying what we know, okay. Identifying patterns and predicting things, great. So we all agree that there is a clear definition of what makes us intelligent, right? Dogs are less intelligent, plants are even less intelligent, and maybe future humans will be more intelligent. So what makes us intelligent?
Let me show you, in one slide, part of what makes us intelligent. How many of you can recognize the letter on this slide? They are all the same letter. What letter is that? A, right. Has any one of you seen any of these particular A's before? No, and still we do this, and that is the transition between a database and an intelligent system. We learn, and this is one of the important qualities of data mining: generalization, the ability to predict and assign a label to a new observation. These are all new observations, and we label them based on a model. We have a model of A, a model of B, a model of a BMW car, a model of our grandmother, so we can recognize all of those, and that is part of what makes us intelligent.

So how do we make machines intelligent? We know humans are intelligent, most of them. For machines there is a whole process, and there is an art and a science to it. We collect data: there are sensors all over the world of different kinds, remote sensing, genomic sensors, all kinds of sensors; a click is also a sensor. We also collect the ground truth, which is the value you should predict: if you are measuring something on the ground, somebody has to go to that place and actually see what is there, and that becomes the output. Then we engineer features, we pick a model type and train the model, we evaluate it, and we deploy it. Once the model is deployed we get predictions on unseen data. We all understand that part. There is some art, some science, and some engineering in all this: the engineering is the scale and so on, the science is in the modeling techniques, and the art is what we are going to talk about.

So now we start with the short talks; think of these as flash talks, and I am going to give each one a title. The first one is called "tame the distribution chaos".

What I did this morning was go to Kaggle and download the Higgs Boson data, and I looked at the distributions of the features. There are 30 dimensions in this data; here are 4 of the distributions, and I can show you more. This is what I call distribution chaos: all your features have a very different character. Now imagine what happens if this is the case. If none of the distributions are even remotely similar, you are going to get a weird Euclidean distance. You cannot just take a Euclidean distance like that; if you do not make the distributions more well behaved, you are going to have difficulty learning a model. So before we jump in and push the data into some algorithm, let us look at the distributions and deal with this chaos. We do a lot of things here, like transforming the data into different distributions and scaling it, and one of the most interesting transformations is the following.
Generally you think of the normal distribution as being normal, meaning it is normal to have a normal distribution, but that is what the textbook says. In reality, how many of you have seen a normal distribution in your data? Nobody. That is the reality; a lot of real distributions are log-normal or exponential and things like that. So there is a lot we can do to transform these into better features first. You can also discover interesting things, and these plots are all from the same Higgs boson data: for this feature, if I take the log, it turns out to be bimodal. Understanding your features and their distributions is what makes you a better data scientist: you do not build a model first, you understand your features, and then you can treat each feature differently.

Now, what about distributions that are not normal at all, or that are multimodal? Can there be a universal way to transform all these types of distributions, so that you do not have to worry about the variations? My favorite thing to do here is what I call a cumulative density transformation. Take any distribution, with x on the x-axis and p(x) on the y-axis. Build the cumulative density function, which you can do for any density, and map each value of x to the number that says what fraction of the data is below me. That is a very universal thing, because you can apply it to any kind of distribution, and it brings all of your features into the same 0 to 1 range; now you can compute your Euclidean distances and everything else. It works on all kinds of distributions, and you can take a log before you do it, but that would not change anything, because the log is a monotonic function.

So that is one thing you can try. There is another very common kind of distribution, Zipf's law, and it occurs all the time: if you rank all the words in descending order of frequency and plot the log of the rank against the log of the frequency, it looks like a line. It is a power law, and it occurs in many places, not just in words: these are cities in descending order of population, these are websites by number of unique visitors. It is a very common phenomenon in a lot of domains, and again the log comes to our rescue. We all know about IDF, and we take a log there; the reason is to tame this exponential behavior. Sometimes we also take a log of the TF in TF-IDF, because there is a fractal nature here: what is true of the whole corpus is also true of a single document, and therefore on a single document we also take the log.

The point of all this is: look at the distributions before you start building models, and make them homogeneous before you start computing distance functions.
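As a concrete illustration of the cumulative density transformation, here is a minimal sketch, assuming we simply use the empirical CDF (the rank of each value divided by the sample size); the exponential and Zipf-like features below are made up just to show two very different distributions landing in the same range.

```python
import numpy as np
from scipy.stats import rankdata

def cdf_transform(x):
    """Map each value to the fraction of points at or below it (empirical CDF).
    Works for any marginal distribution; every feature lands in (0, 1]."""
    return rankdata(x, method="average") / len(x)

# Two toy features with very different characters: exponential and Zipf-like.
rng = np.random.default_rng(0)
f1 = rng.exponential(scale=3.0, size=1000)
f2 = rng.zipf(a=2.0, size=1000).astype(float)

X = np.column_stack([cdf_transform(f1), cdf_transform(f2)])
print(X.min(axis=0), X.max(axis=0))   # both columns now live in (0, 1]
```

Note that applying a log before the transform leaves the result unchanged, exactly as said above, because ranks do not change under a monotonic map.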
All right, the next flash talk is about why the independence assumption is an unnecessary evil. We have been told: assume independence and your life will be simple; build a Naive Bayes classifier or a vector space model and you are done. But just as there is really no such thing as a normal distribution, there is also no such thing as an independence assumption.

One way to look at the independence assumption is to ask: how important is a word in a collection? A collection could be a document or a tag set. There are two kinds of word weights we already have: the IDF weight says the rarer the word, the more important it is, and TF says the more frequent the word is in a document, the more important it is there. But I believe these two by themselves are not enough; what we really need is what I call a contextual weight.

Here is what I mean. Look at these two collections of words: on top you have rain, thunder, umbrella, lightning, chocolate; at the bottom you have kids, birthday, candies, chocolate, cake, candles. Now look at the word chocolate. According to IDF weighting it gets the same weight in both places, but in our mind the word chocolate does not belong in the first set as much as it belongs in the second. We are giving a different weight to the word chocolate based on the context, and that is what is missing. If we can solve that problem, then we have overcome the independence assumption; the idea that you can weight a word independently of the other words is a wrong assumption, so let us fix it systematically.

Let me give you an example from Flickr. This is an image, and these are the keywords or tags that were given to it on Flickr. If you use IDF weighting, these are the weights: the rarer the word in the Flickr corpus, the higher the weight. One problem is that each word, say the word finger, gets exactly the same weight no matter where it occurs, and that is a big problem; it is the first thing we need to address if we want to do anything meaningful with text data.

So how do we do it? Let us look at the question again: how important is a word in the collection? Let me define it in a very simple way, and then we will do the math. A word is important if it is strongly connected to other important words in the collection. Simple. Imagine you are part of a team and you are closely connected to your manager: you are important now, and a third person who is connected to you is also important, because they are connected to another important person, and you can imagine this propagating.

So let me show you a beautiful use case of the PageRank algorithm for doing this. This is the PageRank equation; let me break it down. x_i is the weight of the i-th word in the collection, think of the word chocolate, and we are asking whether chocolate is important in the next iteration. The word is important if it is strongly connected to another word j, where j ranges over the other words: cake, candy, kids. So the x_j are those words and i is chocolate. If the word cake is important and cake is strongly connected to chocolate, then chocolate becomes important; it is a recursive definition. And this is where you can still plug in the independence assumption: the prior of the PageRank can be your TF and IDF, so p_0 is proportional to how rare the word is in the corpus and how common the word is in the document. That is what takes care of the independence part, and you can mix the prior and the graph with the theta parameter; you get an application of PageRank, a random walk, for computing word weights.
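The exact equation on the slide is not in the transcript, so here is a minimal sketch assuming the standard personalized PageRank form x_i ← θ·p0_i + (1−θ)·Σ_j w_ij·x_j, with a co-occurrence graph over the tag set and a TF·IDF-style prior; the tags, association strengths, and prior values below are hypothetical.

```python
import numpy as np

def contextual_weights(tags, cooccur, prior, theta=0.3, iters=50):
    """Personalized PageRank over a tag co-occurrence graph: a word is important
    if it is strongly connected to other important words; the prior carries the
    old independence-assumption weight (e.g. TF*IDF)."""
    n = len(tags)
    W = np.zeros((n, n))
    for i, ti in enumerate(tags):
        for j, tj in enumerate(tags):
            if i != j:
                W[i, j] = cooccur.get((ti, tj), cooccur.get((tj, ti), 0.0))
    col = W.sum(axis=0)
    W = np.divide(W, col, out=np.zeros_like(W), where=col > 0)  # column-stochastic

    p0 = np.array([prior.get(t, 1e-6) for t in tags])
    p0 = p0 / p0.sum()
    x = p0.copy()
    for _ in range(iters):
        x = theta * p0 + (1 - theta) * W @ x   # "important if connected to important"
    return dict(zip(tags, x))

# Hypothetical tag set, association strengths, and IDF-style prior.
tags = ["chocolate", "cake", "candles", "kids", "finger"]
cooccur = {("chocolate", "cake"): 5, ("cake", "candles"): 4,
           ("candles", "kids"): 3, ("chocolate", "kids"): 2}
prior = {"finger": 0.9, "chocolate": 0.4, "cake": 0.4, "candles": 0.3, "kids": 0.2}
print(sorted(contextual_weights(tags, cooccur, prior).items(), key=lambda kv: -kv[1]))
```

Words that are well connected to the rest of the tag set bubble up even if their prior (rarity) weight was modest, which is the reordering effect described next.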
If I do that, this is what I get: the same words, sorted in a different order. Do not look at the picture; the picture has nothing to do with it, the computer is not looking at the picture. We are just looking at these words, and the algorithm says: child, smile, happy, mother, woman, family, this is really the theme of the collection. Those words bubble up automatically because they are strongly connected to each other, and the other words, like Kerala, orange, Hindu, finger, all go down. You see the transition. Another example: the words on this picture are jumbo, peanut, alligator, ball, and if I do the contextual weighting I get these weights. This is the beginning of all text mining, how you weight a word in a document, and if that is wrong then everything else you do on top is not really helping. So think of this as one way to deal with the independence assumption: do not just think Naive Bayes versus something else, think about PageRank for this too.

All right. This next quote actually comes from a discussion I had with Alex Smola when we were at Yahoo Research. We had a long argument and I said: it is not really how you count, but how you normalize that matters. The key message here is: model the variance that matters. Let me tell you what I mean.

Say I want to build an information retrieval system, and I have a feature that matches a field of the document, say the title, to the query. Say I come up with this simple function: every word in the query has a weight, it could be the IDF weight or whatever, and I sum, over the query words, the weight of the word times whether the token matches the field. What is wrong with this matching function? The more query words that match the field, the higher the match, so it makes perfect sense. But what variance does it introduce? Longer queries will get higher scores and shorter queries lower scores, so the model has to deal with variability that is not real. When you score you are doing a relative scoring, but longer queries will still tend toward higher values, so the model is effectively working in separate spaces for one-word queries, two-word queries, three-word queries. Why make the life of the model so miserable? What we can do is simply normalize by the sum of the token weights, as in the formula at the bottom. Very simple things like this can make a huge difference in the kind of models you get. And what about the length of the field, does it matter? Query length matters, and the query and the field are the two parts of the match, so what about field length? I will let you think about that question.
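Here is one plausible reading of that normalized match feature as code; it is a sketch, and the idf dictionary and weighting are assumptions, not the exact formula from the slide.

```python
def field_match(query_terms, field_tokens, idf):
    """Weighted fraction of query terms that appear in the field.
    Dividing by the total query weight removes the 'longer queries score higher'
    variance, so one-word and five-word queries live on the same scale."""
    field = set(field_tokens)
    total = sum(idf.get(t, 1.0) for t in query_terms)
    matched = sum(idf.get(t, 1.0) for t in query_terms if t in field)
    return matched / total if total else 0.0

idf = {"quick": 2.0, "brown": 2.5, "fox": 3.0}
print(field_match(["quick", "fox"], "the quick brown fox".split(), idf))            # 1.0
print(field_match(["quick", "fox", "cat", "dog"], "the quick brown fox".split(), idf))
```

The raw sum would have rewarded the longer query simply for being longer; the normalized version only rewards what fraction of the query's weight was actually found in the field.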
Let us look at another domain: credit rating. Say I am a bank and I want to decide on your credit rating, and I am looking at a feature like your total card balance: you have taken loans and credit card loans, and I look at this raw number. What is the problem with that raw feature? Rich people will have a high number, poor people a low number, new graduate students a very low number, all kinds of things. There is variance in it that is not useful to the model, so how do I remove that variance? Exactly: we look at your credit limit and ask how much of it you are using. If I have a 5 lakh credit limit and I carry 3 or 4 lakhs all the time, and somebody else has a 5,000 limit and carries 3,000 or 4,000, the feature value should be the same. Whenever you use absolute numbers, that is the most dangerous thing to do; always ask whether there is variance here that does not matter to the model, and how to remove it.

Total card payment is the same story: it is not about how much you pay in absolute terms. A better feature is what fraction of the balance you pay: do you pay in full, do you pay the minimum, what fraction of the balance do you carry? Total debt by itself is also a very dangerous feature, because again rich people have a lot of debt and poor people have less (and are not happier for it); you want to divide it by something like your annual income. And now you realize: total debt is roughly exponentially distributed, annual income is also roughly exponentially distributed, Zipf again, so maybe an even better feature is the log of that ratio. I am just connecting this back to the earlier talk: you combine multiple such ideas to come up with better and better features.

So look at these features, look at the variability you create, look at the correlation with the class, and check that the feature models only the variability that matters and not the rest: query length does not matter, total annual income does not matter, so normalize all of that away.
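A minimal sketch of that kind of normalization, assuming hypothetical raw columns card_balance, credit_limit, card_payment, total_debt, and annual_income:

```python
import numpy as np
import pandas as pd

def credit_features(df: pd.DataFrame) -> pd.DataFrame:
    """Replace dangerous absolute amounts with ratios and logs."""
    out = pd.DataFrame(index=df.index)
    # utilization: fraction of the limit in use, comparable across rich and poor
    out["utilization"] = df["card_balance"] / df["credit_limit"].clip(lower=1)
    # payment behaviour: paying in full vs. paying the minimum
    out["payment_ratio"] = df["card_payment"] / df["card_balance"].clip(lower=1)
    # debt burden: both debt and income are roughly exponential, so log the ratio
    out["log_debt_to_income"] = np.log1p(df["total_debt"]) - np.log1p(df["annual_income"])
    return out
```

The person with a 4 lakh balance on a 5 lakh limit and the person with a 4,000 balance on a 5,000 limit now get the same utilization, which is exactly the variance we want the model to see.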
All right, this is another interesting one: model the deviation from the expected, not the actual value. Let me give an example from forecasting. Say I want to forecast the total sales of, let us say, iPhones in some context, where the context could be this place, this time, this segment of the population. Total sales is an absolute number, which is a very dangerous thing to model, because total sales in Mumbai will have a completely different distribution than total sales in, say, Bhopal; the variability is going to be huge. So instead of modeling total sales, we ask: what are the sales in this context, and are they higher or lower than the expected sales for that context? On average in Mumbai I sell this many iPhones in the summer, that is the expected; then I take the actual and look at the ratio. That is a much better quantity to use, either as the thing you want to predict or as an input feature.

Click prediction is the same. Say I want to predict the CTR for a given query and a given URL at a certain position: for this query I show this URL at the third position and I measure the click-through rate. I can measure it, but by itself that is again a dangerous raw attribute. So instead we ask: in general, what CTR do we see at the third position, and are we doing better or worse than that for this query-URL pair? That is what people use to bump the page up or down, not the raw CTR values.

So far everybody is with me? We are doing well; I think we will finish early. Yes, a question about position. You can also normalize by, say, device: on a phone nobody looks beyond the fifth position or so, versus a bigger screen, so you can layer these additional things on top. That is a great question: whatever we discussed earlier applies to the expected value, computed for a context like device, location, and so on. Position is just one dimension of the context; you can condition on position, then country, then time of day, then device, compute the expected CTR over all of that, and then compare the actual. Great.
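A minimal sketch of that actual-versus-expected idea for CTR, with hypothetical columns and using position and device as the context:

```python
import pandas as pd

def ctr_lift(df: pd.DataFrame) -> pd.DataFrame:
    """df has hypothetical columns: query, url, position, device, impressions, clicks.
    Adds the actual CTR divided by the expected CTR for that (position, device) context."""
    df = df.copy()
    df["ctr"] = df["clicks"] / df["impressions"].clip(lower=1)

    # expected CTR per context, pooled over all queries and URLs
    ctx = df.groupby(["position", "device"], as_index=False).agg(
        imp=("impressions", "sum"), clk=("clicks", "sum"))
    ctx["expected_ctr"] = ctx["clk"] / ctx["imp"].clip(lower=1)

    df = df.merge(ctx[["position", "device", "expected_ctr"]], on=["position", "device"])
    df["ctr_lift"] = df["ctr"] / df["expected_ctr"].clip(lower=1e-6)
    return df
```

A lift above 1 means this query-URL pair is beating what an average result gets in the same slot on the same kind of device, which is the signal used to move it up or down.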
All right: do not take your defaults for granted. This is very important, and I learned it at one of the three search engine companies I worked for; I will not tell you which one, it is kind of embarrassing for them, but you can imagine. This is also an exercise in how you debug a machine learning model. We all know how to debug software: we know what the behavior should be, we write unit tests, they fail, or we do not write them and we are miserable with a segmentation fault. That is the life of debugging a software program; the software shows you a wrong observation, the test fails. But how does a machine learning program tell you that something is wrong? There is no test for it.

What we noticed was that one of the models was unexpectedly complex, way more complex than we wanted it to be. Imagine a decision tree or a gradient boosted decision tree, something that automatically adapts its complexity to the data. It was very complex and we did not know what was going on, and that is the tell: the model is compensating for some bug. This does not happen in software; there is no such thing as code compensating for a bug. You can have a bad design and a huge code base, you fix the design and you get a smaller code base, and that is it. In machine learning, unexpectedly complex models are an indicator: if you have an SVM classifier with a huge number of support vectors, something is wrong; if your decision tree is unexpectedly big, something is wrong.

Here is what the problem was, and I want you to tell me what is wrong. Say I am again doing a matching problem, and the field, imagine the title of a URL, is "the quick brown fox jumped over a lazy dog". I create one feature, the term frequency of each query term in the field: quick occurs once, brown occurs once, dog occurs once, cat does not occur, so I can compute those values. Here is another feature, first occurrence, which says the earlier the word occurs, the better: counting positions 0, 1, 2, 3, the word quick occurs at position 1, and so on, and cat does not occur at all. Now, can anybody tell me what is wrong here? Cat's value should be what? It should be infinity, and that was exactly the problem, and we had no idea what was going on.

So everybody sees it: the wrong defaults. We love zero defaults everywhere, but if you put in defaults without thinking about each feature, something can go wrong and the model will try to compensate. Here the model is being told: if the query term is present as the first token, it is a great match, because the value is 0; but if the term is absent, the value is again 0, which equals the best possible match. The model is completely confused; it is like telling your children two opposite things, they get confused about what is right, and they build elaborate reasoning around it that should never have been necessary.

So would minus 1 be a good default? I know people love minus 1 when the legal values are supposed to be positive; first occurrence ranges from 0 to infinity, so let us use a default outside that range. No, it is even worse: now you are saying that when the term is absent, the match is even better than the best possible match. What we actually did was fix it properly: the default became the field length plus a constant, not infinity, and the model became nice and compact again. So when you debug a machine learning model, think of the default values as one of the potential sources of the problem.
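Here is a small sketch of that feature with the safe default; the penalty constant is an assumption, but the idea is exactly "missing is worse than the worst real match":

```python
def first_occurrence(query_terms, field_tokens, penalty=1):
    """Position of each query term's first occurrence in the field.
    A missing term gets len(field) + penalty, i.e. worse than any real match,
    instead of 0 (the best possible match) or -1 (better than the best)."""
    positions = {}
    for term in query_terms:
        try:
            positions[term] = field_tokens.index(term)
        except ValueError:
            positions[term] = len(field_tokens) + penalty
    return positions

field = "the quick brown fox jumped over a lazy dog".split()
print(first_occurrence(["quick", "brown", "dog", "cat"], field))
# {'quick': 1, 'brown': 2, 'dog': 8, 'cat': 10}
```

With this default the model no longer has to invent extra structure to tell "matched at the start" apart from "not matched at all".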
All right, the next flash talk: why should models do all the work? Poor things. Take an SVM, throw everything at it, let it run overnight on a 1000-node Hadoop cluster, and let it figure things out; machines are great, machine learning is great, I am very hands-off.

These are really the two mindsets in machine learning, and it is an attitude problem. One is very model centric: we love our algorithms, I am an expert in something, it is a universal approximation engine, back-propagation or whatever, therefore I can throw anything at it and it should figure it out. With that mindset we create very basic, simple features, the absolute values, without normalizing, without taking care of the distributions, and the model has to go through hell. The other mindset is what I call feature centric: you spend a lot of time on each feature, you decide how to normalize, whether you need a log, all these things, and you come up with rich features that lead to a simple model. People will say, oh, your model is just logistic regression, and laugh at you for using a basic modeling technique, but that is not the point: you have done the work on the feature side. Each feature is a little model in itself, modeling one aspect of the problem, and logistic regression is just combining all the aspects into a final score. That mindset is what makes a great data scientist.

Here is a very simple example you must have seen in textbooks: a two-class problem where one class forms a ring around the other. If I say I do not care about feature engineering and I want my model to learn everything, a neural network is going to need many hyperplanes to carve this up. But if I look at the data and notice the structure, I can create a new feature, something like (x1 minus a) squared plus (x2 minus b) squared, the equation of a circle whose center is learned, and suddenly the data separates into two nice parts and my model is very simple. There is no shame in using a simple modeling technique as long as you are doing a great job of feature engineering.

Let us look at a real example, simplified: credit card fraud detection. Say I give you four raw inputs: the time and place of your previous credit card transaction, and the time and place of the current transaction, and I want to know whether the current transaction is a fraud. If I am a model-centric data scientist who does not care about features and I throw these four numbers into the model, will it ever learn anything useful? Time can be anything, place can be anything. So what can we do with these four inputs? Right: the difference in time and the distance between the places. Can we do even better, relative to the person's usual transaction patterns? Because if a shopaholic is shopping, the time between transactions is small and the distance is small too, and that is fine. But I can go further and compute a velocity of transaction, and I know that this velocity cannot be an arbitrarily high number. So I have taken four raw features, and as a data scientist who thinks about feature engineering I have used the domain knowledge that two transactions cannot happen in Eastern Europe and South Africa within a very short time. Think of velocity as a single monotonic feature that is very highly correlated with fraud.
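A minimal sketch of that derived feature, assuming timestamps in seconds and latitude/longitude for the two transactions; the haversine formula and the example coordinates are just for illustration:

```python
import math

def implied_velocity(prev_time, prev_lat, prev_lon, curr_time, curr_lat, curr_lon):
    """Great-circle distance between consecutive transactions divided by elapsed time.
    A physically impossible speed is a strong fraud signal."""
    r = 6371.0                                   # earth radius, km
    p1, p2 = math.radians(prev_lat), math.radians(curr_lat)
    dp = math.radians(curr_lat - prev_lat)
    dl = math.radians(curr_lon - prev_lon)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    dist_km = 2 * r * math.asin(math.sqrt(a))
    hours = max((curr_time - prev_time) / 3600.0, 1e-6)
    return dist_km / hours                       # km/h

# Two transactions 30 minutes apart on different continents: an impossible speed.
print(implied_velocity(0, 19.07, 72.88, 1800, -26.20, 28.05))
```

Four raw columns become one feature whose meaning the model does not have to rediscover on its own.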
These are the kinds of things that make a better model: the neural network does not know anything about your domain, it does not know velocity, it does not know human nature, so we had better put that knowledge in ourselves.

The same thing happens in text. You can use unigrams all day long and you will get some level of accuracy in your classification model. My unigrams might be "United", "Nations", "Security", "Council", "U.N.". Or I can improve the quality of my features: I can go to bigrams and trigrams, but the problem is that "nations security" becomes a bigram, and a blind bigram is more dangerous than a unigram. So think about it; do not just say I have a Python bigram detector, it runs on Hadoop, let me generate them all. Ask whether bigrams are really what you want. What you actually want, more than bigrams and trigrams, are real phrases: can I detect phrases, can I match my text to Wikipedia or use some other way of finding real phrases, and build models on top of that? Now you have gone from unigrams to phrases; you are no longer thinking bag of words but bag of words and phrases. Can we do better than that? Synonyms, exactly, and between phrases and synonyms there is one more stage, the disambiguated phrase: if I see "UNSC", is it the same thing as "United Nations Security Council" or is it some other abbreviation? I could match one to the other, but it can go wrong, so it has to be disambiguated. And above that we have concepts. The idea is to grow your feature capability from simple unigrams all the way to disambiguated phrases and concepts.

The same thing happens in images: these are features we created from face data, higher-level features than low-level SIFT features, and the same in other data; using these higher-level features we do a better job. Feature engineering is a very profound idea, and that is where I think we should spend most of our time: creating these higher-level, more meaningful features, rather than worrying about why my SVM is taking so long to train. This is where the meat is.

Let us talk about one more thing: it is not what you say, but what you mean that matters. I am not talking about life and personal communication, I am talking about data mining. So let us go back to that question: what makes us intelligent?
Let me show you a picture; you may have seen it before. Tell me what you see. Wow, you see dogs all the time? What do the rest of you see? Nothing yet, right? How many people see the dog? If you have seen this picture before it is easy, but most people are not able to see the dog at first. Now, if you do see the dog, what is going on in your brain? The dog is eating something, there is a tree behind it, and so on. If you give this picture to a 5 or 7 year old they will take longer to figure it out. This is also something that makes us intelligent: semantics, and by semantics I mean the semantics of the parts and the semantics of the whole. The meaning of the picture depends on the meaning of its parts, and the parts by themselves have no meaning: this blob here is actually the ear of the dog, but on its own it means nothing; only in conjunction with the rest of the picture do I start to assign meaning to it, and the context itself is made of parts that have no meaning yet either. So it is again a recursive problem, like I said before: a word is important if it is connected to other important words, and these recursive definitions are all over the place in machine learning. The meaning of this part depends on the meaning of the other parts, which we do not know yet, so the brain runs this loop of "this must be an ear, therefore that is a head, no, that does not fit", and in a moment of flash the whole picture suddenly makes sense and you see the dog. Understanding meaning is a very important thing that intelligent beings do.

But indexing systems are very dumb; no matter how big they scale, they are doing keyword matching and they do not understand. If I search for the word "program" I can get results about a computer program, a NASA program, a radio program, America's program, a learning program; the word program has many meanings, and the word rock has many meanings, the movie, the stone, and so on. And the dumb search engines, sorry, I am from Google and I am calling them dumb search engines, but that is where they are: they do not understand, so they give you a list of ten blue links and we have to do something about it. So semantics is very important, and that is what we need when we work on text data.

We already saw one example with phrasing: phrases are important. If I match these two sentences and the word "time" occurs here and here, and the word "new" occurs here and here, I will say it is a great match, but you have to phrase first, before you match. Same thing here: "I was right to avoid a suit against Apple" versus "the man in the red suit on my right was drinking apple juice". You cannot say these are similar sentences. A machine that does not understand meaning will say they are; humans will not.
We are not in the keyword matching era anymore; we are in the 21st century, not the last one, so we should be able to understand text before we do anything on top of it. And again: "filed a suit charging Orange"; we know that Orange is also a company, not just a fruit, and the same goes for "Apple's suit against Orange". I will show you an example where we actually handle this.

Let us look at it from a parsing point of view. There are two big communities in text understanding, and the NLP community loves its parsers; that is what they go by. This is the Stanford parser on the sentence "New York Times reported construction of a new international airport in New York City". The word "new" occurs in three places. I give the sentence to this nice Stanford parser that everybody in the world uses; it starts from the verb "reported" and builds the whole parse tree. Now look at this word "new" here: it is treated as an adjective. But how does my brain interpret it? It is not a single word, it is part of a whole phrase, so please do not separate that word and give it its own meaning; do phrasing before you do parsing. If you do that, then I can believe the parse: the "new" in "a new international airport" is fine as an adjective, but not the "New" in "New York Times". So we talked about phrases as building blocks; even before parsing, you should be doing proper phrasing.

And then semantics. Now tell me, what is this word? Spelling correction is important, right? Now the same word, easier to read. Now read this whole passage: I know you can read it, even though the spellings are not correct, and it is self-explanatory why; because of the context you are able to recover the meaning of the words. That is a very important philosophical point: the meaning of a word lives in its context.

So let us look at the disambiguation problem; what is it we really want to do? There are two parts. The first is to take a raw word, the lemma, the thing that appears, and discover how many meanings it has in the corpus, and that has nothing to do with the dictionary. That is the worst habit in text mining, relying on WordNet or a thesaurus, when real-world data is very, very different: there are far more meanings out there than any dictionary lists, tweets have their own language, look at what hashtags mean. The notion of word sense is not something you look up in a dictionary or WordNet; it has to be discovered. In the pharma world, in the medical domain, the word "pain" mostly means one kind of thing; in philosophy it means something else. So the first part is: how many senses does a word have in this corpus? The second part is: given the context, assign the right sense first, and only then do anything further. We should not be working at the level of raw tokens; we should be working at this level, which is what I call a semantic tokenization of the corpus.
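As a toy illustration of that second stage, here is a sketch of assigning a sense id to an ambiguous token from its context; the sense inventory here is a hand-written stand-in for whatever senses you have discovered from the corpus, and a real system would disambiguate all the words jointly rather than one at a time.

```python
def tag_senses(tokens, sense_inventory, window=5):
    """Assign each ambiguous token a sense id based on overlap between its context
    window and the context words associated with each discovered sense.
    sense_inventory: {word: {sense_id: set_of_context_words}} (hypothetical,
    discovered from the corpus, not taken from a dictionary)."""
    tagged = []
    for i, tok in enumerate(tokens):
        senses = sense_inventory.get(tok)
        if not senses:
            tagged.append(tok)
            continue
        context = set(tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window])
        best = max(senses, key=lambda s: len(senses[s] & context))
        tagged.append(f"{tok}#{best}")          # e.g. apple#1 (fruit) vs apple#2 (company)
    return tagged

inventory = {"apple": {1: {"fruit", "juice", "orange", "eat"},
                       2: {"suit", "filed", "company", "iphone"}}}
print(tag_senses("orange filed suit against apple".split(), inventory))
# ['orange', 'filed', 'suit', 'against', 'apple#2']
```

Downstream steps like LDA then operate on sense-tagged tokens such as apple#1 and apple#2 instead of a single overloaded token apple.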
So what we have done is phrasing properly, and then disambiguation, two stages, and now I can say apple#1, as in the fruit, is equal to orange#3, as in the fruit. If instead I just say apple equals orange, I am relying on the system, the LDA or whatever, to figure out that apple has so many meanings and orange has so many meanings and that one meaning of this equals one meaning of that. The mess of doing that equivalencing gets pushed onto the modeling technique; LDA does not take a long time because it is LDA, it takes a long time because we do not disambiguate before we run LDA. So think about the stages: phrasing, disambiguation, and then LDA or whatever your favorite synonym technique is.

There is another interesting problem here, and it is related to the dot picture; I want you to see the connection. If I look at the sentence "Orange filed suit against Apple", these are all ambiguous words, just like all the parts of the dot picture were ambiguous. You are not solving the problem of one ambiguous word given a fixed context; you are solving a joint problem: given that each of these words can have several meanings, which meaning of this goes with which meaning of that goes with which meaning of the next. It is a joint disambiguation problem, not a one-off problem like the word "rock" I showed before. It is very similar to the contextual weighting story: the weight of this depends on the weight of that. If I assign a different meaning to the word "orange" here, the meaning of the word "filed" could change; if orange were the fruit, "filed" could even mean filing the peel of the orange. Meaning is not an obvious, local thing, and what makes us intelligent is that we do joint inference over the whole disambiguation space.

All right. Have you heard of "too many cooks spoil the broth"? I just borrowed it: too many classes foil the classifier. Most of us work on binary classification problems: spam or not spam, fraud or not fraud, give credit or not. But a lot of problems have many, many classes. What happens when you build a complex model, say a neural network, for 20 classes? You have some number of hidden units, and the problem is that those hidden units are forced to learn the boundaries across all the classes simultaneously. Imagine the pain of the neural network: we are sitting there with our SSDs and our Hadoop clusters, and the poor thing has to figure out how to distinguish 20 classes at once and come up with the right hyperplanes. Instead of doing that, one thing we could do, and I will show you an example on this data and then on another data set, is build a hierarchy of classes. Not all classes are equally difficult to distinguish, and we know how to solve a two-class problem very, very well. So why not use that to our advantage and solve pairwise problems; think of it as agglomerative clustering of classes.
So what happens? We build a classifier between every pair of classes and look at, say, the validation accuracy. The pair with the lowest validation accuracy is the hardest to distinguish, which means those two classes are the most similar to each other. We can still separate them with some classifier, whatever its quality, and now we combine them into a meta-class. The next pairs we found were 1 versus 7 and 3 versus 8, and you can see why those are difficult pairs. So we build the bottom of an agglomerative tree out of pairs of classes, and then we keep agglomerating on top. Each node is a two-class problem, and we know how to solve two-class problems; here, for example, we have 6 versus the meta-class containing samples from both 1 and 7, and that is again a two-class problem.

What are the benefits? If I am looking at one class versus another, my feature space can be very small. If I ask you to distinguish just 4 from 9, the set of useful features is simple; if I ask you to distinguish all 10 digits simultaneously, you are looking for a very complex feature space. For 1 versus 7, all you are looking for is the top bar. So you simplify the feature space, you simplify each pairwise classifier, and you can actually interpret what is going on: here, three features out of the 784 pixels (28 by 28) are enough to distinguish two ways of writing a 7, with a cross and without a cross. You can discover these hidden structures too, because you are only focusing on two classes at a time. Here are results on the letters and digits data, 26 classes and 10 classes, and you get all these advantages: you only need localized features, I versus J does not need a very complex classifier, the models are simpler, the features are interpretable, the accuracy goes up, and you can train everything in parallel. You have C choose 2 classifiers, we know how to parallelize, and each classifier only needs its own slice of the data in memory.

So whenever you are dealing with lots of classes, think about breaking the problem down in the class space, and that structure is discovered automatically, which is an additional advantage: imagine agglomerative clustering where, instead of clustering data points, you are clustering classes, and the distance between two classes is defined by the validation accuracy of the pairwise classifier. You merge the pair with the lowest accuracy first and keep going. And no, you do not reuse the old features: you rediscover the features for each pairwise classifier, and when you add a meta-class you update the matrix, build the classifier of that meta-class against the remaining n minus 1 classes, and continue exactly like agglomerative clustering. At every level you learn a new set of features; you do not depend on the previous ones, because 1 versus 7 needed one set of features, but against 6 it will be a completely different set.
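Here is a minimal sketch of one agglomeration step with scikit-learn, using logistic regression for the pairwise classifiers and cross-validated accuracy as the confusability measure; the model choice and merge bookkeeping are assumptions, not the exact recipe from the talk.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def most_confusable_pair(X, y):
    """Train a classifier for every pair of classes; the pair with the lowest
    cross-validated accuracy is the hardest to separate, so merge it first."""
    scores = {}
    for a, b in combinations(np.unique(y), 2):
        mask = np.isin(y, [a, b])
        clf = LogisticRegression(max_iter=1000)
        scores[(a, b)] = cross_val_score(clf, X[mask], y[mask], cv=3).mean()
    return min(scores, key=scores.get), scores

def merge_step(y, pair, new_label):
    """Relabel the two merged classes as one meta-class; the next round then
    re-learns features and classifiers from scratch on n-1 classes."""
    y = y.copy()
    y[np.isin(y, list(pair))] = new_label
    return y
```

Repeating most_confusable_pair and merge_step until two classes remain gives the agglomerative tree of classes, and every pairwise model along the way stays small enough to inspect.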
You have that choice, and the total complexity is small compared to trying to distinguish all the letters simultaneously, because the complexity depends on the manifold of the decision boundary you are trying to learn. By focusing on two classes at a time we have simplified the manifold, we learn many such simple pieces, and we combine them at the next level. I think that is also how we do things: I distinguish chairs from tables at some level of the hierarchy and furniture from cars at another level. Our brain learns hierarchies of objects the same way: these are all furniture, some meta-class; these are all cars, and within cars there are sub-classes, used cars, new cars, and so on. When you think of an object here, it can be a class or a meta-class: all characters with a circle, like C and B, form one group; all letters made of vertical and horizontal lines form another. You can imagine building these hierarchies automatically, and I do not think we learn the way a traditional neural network does, solving the whole thing at once.

Next: class labels are very precious, but take them with a pinch of salt. Even generating the thing you want to predict is an art; it is not always given. Take credit card fraud. You make millions of transactions and most of them are normal: you do not call the bank, so the bank assumes those transactions are fine, and that gives a lot of negative, non-fraud data. Then say you lost your card or somebody stole it: you immediately call the bank, and the bank knows the next transaction on that card is fraud and it gets a label. That is one way labels are collected. But it can also happen that you do not know somebody else is using your card; you see the bill after a month, say "I did not make these transactions", and the labels arrive late. Or you never look carefully at your statement and some fraudulent transactions are never reported at all. So how labels are obtained is domain specific, and labeled data is not a holy grail written in stone; it has its own problems. If you try to beat the data into submission and insist the model must fit these class labels exactly, you are probably learning the noise in the labels.

Another example: credit risk. Here you want to predict whether the customer will default, and this is a different "default" than the default values we discussed earlier; that is disambiguation right there, I am just trying to connect the pieces. Say the target is "default in the next 6 months". How do you create label data for that? You cannot sit here today and wait six months, so you look at historical data. Here is where you stand in the historical record, and you know whether this customer eventually defaulted or not. What you do is go back to t minus 180 days and treat everything up to that point as the observed data. You could in fact observe the later data as well, but the futuristic prediction you are simulating demands that you not look at the future: you use only the earlier data as input and try to predict whether the customer defaulted in the window after it, treating the outcome as your label. So it is an art, creating label data out of an existing database; depending on how you define the problem, you can lay the data out in temporal order like this.
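A minimal sketch of that construction, assuming a hypothetical long table of customer events with a datetime column; the feature aggregates are placeholders, and the point is that features only see data up to the snapshot date while the label only comes from the window after it:

```python
import pandas as pd

def build_training_frame(events, snapshot_date, horizon_days=180):
    """events: hypothetical table with columns customer_id, date, amount, is_default.
    Features use only rows on or before snapshot_date; the label says whether a
    default event occurs in the following horizon_days. The future window never
    leaks into the features."""
    snapshot_date = pd.Timestamp(snapshot_date)
    horizon_end = snapshot_date + pd.Timedelta(days=horizon_days)

    past = events[events["date"] <= snapshot_date]
    future = events[(events["date"] > snapshot_date) & (events["date"] <= horizon_end)]

    features = past.groupby("customer_id").agg(
        n_txn=("amount", "size"), total_spend=("amount", "sum"))
    labels = future.groupby("customer_id")["is_default"].max().rename("default_next_6m")

    return features.join(labels, how="left").fillna({"default_next_6m": 0})
```

Sliding the snapshot date across the history gives you many such training frames, each one an honest simulation of "predict the next six months from what was known at the time".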
All right, let us look at retrieval. We said that as soon as you click on a result we get feedback from the user, but there is a lot of noise there. Sometimes you click accidentally: your daughter is playing with your iPhone, you did a search, she clicked on something, and Google says, oh, there was a click; but it was accidental. How do we distinguish an accidental click? There are also spam clicks. I do not know if you have seen the case study where Google suspected Bing was copying some of its results; what happened was a lot of planted clicks were made on Bing, Bing took them as real and started building models on them, just to test whether that hypothesis was correct. So sometimes one search engine can try to mess with another.

Then there are short clicks: you read the snippet, you click, you land on the page, you find it is actually spam or not what you wanted, and you come straight back. How long you spent on that page tells you a lot; a short click and a long click are different signals. And there is position bias: if you click on the third link and not the first two, what does that tell me? The first two were not relevant to you and the third was; so a non-click is also a signal. The art of collecting labeled data is domain specific and very interesting.

Ad clicks have the same problem, with a lot of spam clicks. If you are retailer one and I am retailer two, and I want to exhaust your advertising budget on, say, Adidas shoes, I can hire a bunch of students to click on your ads until your budget runs out, and then my ads show and I sell my stuff. This is a real fraud; we were shocked to find that whatever we do, there is fraud: credit cards have fraud, ads have click fraud, email has spam, web pages have spam. Human nature is this weird, and it makes our lives harder every day. Then there are accidental clicks and curiosity clicks: why is this ad here, you wonder, and you click just to find out; that is what we call a curiosity click. And how do you decide whether a click matters, or whether you should count a conversion instead, and how do you define a conversion: did the user click and then buy something, or just click and leave?
So how do you define what the feedback is, what the class label is, in this domain? All right: labels are very precious. If you talk to people in genomics and cancer research, to collect one positive label of a patient who died of cancer, somebody has to die of cancer; sorry, but that is how it is, and that is part of why curing cancer matters, but it also means the labels are costly. It is not as if doctors are waiting around saying I need one more sample; the cost is real. Astronomy is similar: if you are observing a star in a certain direction, it takes a whole year for it to come around again, and if you miss your observation on that precise day you wait another year; astronomy PhDs probably take ten years partly for this reason. For them labeled data is even more costly, and beyond cost there is noise, which we already talked about.

Unlabeled data, on the other hand, is very cheap. Satellites are going around all the time collecting enormous amounts of unlabeled data about the earth, but the people who have to travel to remote places and record "this is marsh, this is sand, this is wheat" suffer through the collection process. So we have lots of input data and very little labeled data. What do we do? There is a whole body of research on semi-supervised learning and active learning.

Here is a semi-supervised example. Take two newsgroups from the 20 newsgroups data, a pair that is very easy to distinguish; I am talking about pairwise classifiers again. We label only five examples and build a classifier, and we get some accuracy. Then we use that classifier to label the remaining thousand or so examples, pick a few of them, call them labeled, build another classifier, and the accuracy goes up. So it is fine to have very little labeled data if you know how to bring the unlabeled data into the mix and keep improving the classifier. We reach 100 percent on that pair because the classes are very simple; on comp.os versus graphics we start with 10 labels, five from each class, and still get to 99 percent. So do not worry if your labeling cost is very high; you can use these techniques to compensate.
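Here is a minimal sketch of that self-training loop with scikit-learn; the TF-IDF features, logistic regression, and the number of pseudo-labels added per round are assumptions, not the exact setup used in the talk.

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def self_train(texts_l, y_l, texts_u, rounds=3, per_round=50):
    """Start from a handful of labeled documents, repeatedly pseudo-label the
    unlabeled documents the current model is most confident about, and retrain."""
    vec = TfidfVectorizer()
    X_all = vec.fit_transform(list(texts_l) + list(texts_u))
    X_l, X_u = X_all[:len(texts_l)], X_all[len(texts_l):]
    y_l = np.asarray(y_l)

    clf = LogisticRegression(max_iter=1000).fit(X_l, y_l)
    for _ in range(rounds):
        if X_u.shape[0] == 0:
            break
        proba = clf.predict_proba(X_u)
        conf = proba.max(axis=1)
        pick = np.argsort(-conf)[:per_round]                  # most confident unlabeled docs
        X_l = vstack([X_l, X_u[pick]])
        y_l = np.concatenate([y_l, clf.classes_[proba[pick].argmax(axis=1)]])
        keep = np.ones(X_u.shape[0], dtype=bool)
        keep[pick] = False
        X_u = X_u[keep]
        clf = LogisticRegression(max_iter=1000).fit(X_l, y_l)  # retrain on the grown set
    return clf, vec
```

With an easy pair of classes, a handful of seed labels plus a few rounds of confident pseudo-labels is often enough to get close to fully supervised accuracy, which is the effect described above.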
So generally we say that the points near the current decision boundary are the ones you should take to a human expert for labeling. There is no point labeling these other points; the model is already very confident about them. So what we do is: we build a model with whatever labeled data we have, we apply it to all the data, we look at the confidence, that is, how close the unlabeled points are to the decision boundary, and we sample in proportion to that. Now there is one more ingredient. Say I have already got a label for this point; then there is no need to get a label for its near neighbor, because remember, we are trying to be efficient about the labeling process. So: redundancy, if I already have a label for a nearby data point, do not buy another one; confidence, if the previous classifier is already very confident about a data point, do not get a new label, only ask when the point is near the boundary; and density, if a point is an outlier that does not fit into the cloud anywhere, then even if it matches the other criteria, even if it is close to the boundary and far from any labeled point, you are not going to get good model confidence in that region anyway, so why spend a costly label on it? So how you optimize your labeling process should depend on all of these: confidence, redundancy, and density. Active learning is another interesting area, okay.

So, this is the last flash talk: there is more to a decision than a prediction. Remember we went through that whole cycle of feature engineering and so on; at the end, we make a decision from a prediction. The prediction is the score that comes out: the score is high, or the probability of default is such-and-such. The decision could be to increase the credit limit, decrease the credit limit, cancel the account, or send a letter. Those are decisions, decisions based on predictions, and ultimately we want to cross that last mile. So let us look at this example; it is also part of my visualization talk. Here we have collection notes from a bank. The bank is saying, hey, you did not pay your last credit card statement, and you get these annoying calls, sometimes in the middle of a presentation, and these guys talk to you and you come up with an excuse: oh, I forgot, or I do not have enough money, or I am going through a divorce. Not that I am going through one, just kidding. People come up with very nice excuses; "the dog ate my bill" would be a good one. So, collection notes. It is very dirty data, because some uninterested agent who does not really care is typing down whatever you are saying, and he knows you are lying, so there is a lot of noise; he is using his own slang and probably adding a few curse words of his own. There is a lot of that in the data, but it still has some structure. So we learned that structure and found the concepts in it.
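Returning to the active learning criteria above, here is one possible sketch of that selection rule, assuming a fitted binary classifier `clf` with `predict_proba`, a labeled matrix `X_lab`, and an unlabeled pool `X_pool`; the particular way confidence, redundancy, and density are combined is illustrative, not a prescribed recipe.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def pick_points_to_label(clf, X_pool, X_lab, k=10):
    """Rank unlabeled points by (closeness to boundary) x (not redundant) x (density)."""
    p = clf.predict_proba(X_pool)[:, 1]
    uncertainty = 1.0 - 2.0 * np.abs(p - 0.5)                   # 1 at the boundary, 0 far away
    d_lab = pairwise_distances(X_pool, X_lab).min(axis=1)
    redundancy = np.exp(-d_lab)                                  # high if a labeled point is nearby
    density = np.exp(-pairwise_distances(X_pool).mean(axis=1))   # low for outliers
    score = uncertainty * (1.0 - redundancy) * density
    return np.argsort(-score)[:k]                                # send these to the human expert
```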
So I can read them for you: bankruptcy, not sufficient funds, forgot, out of town, waiting for funds, a mix of accounting problems. These are all concepts that we discovered. And then we also have a model on the side which has not used the text data at all; the data points have text, but they also have other features, and using those other features we built a scoring model that says whether he will actually pay or not. If no matter how many times you call him he will just not pay, there is no point wasting your time; hand it to another agency and they will take care of it. So you have to take decisions on that, and here is the difference between a score, which is a one-dimensional set of numbers, and a decision. If I look at this visual, I see hotspots in different places. This one is a hotspot because you are unable to reach him: the guy is not picking up your phone, so he is not going to pay. This one is a hotspot for a different reason, bankruptcy and so on: he is not going to pay either, but for a different reason. This guy has payment disputes; he is saying the charge is not right, so he is not paying for yet another reason, and maybe you can resolve that one by talking it through, but that calls for a different treatment. And this guy is not paying because he has insufficient funds. So the story is: scores only tell you a probability, they do not tell you the reason, unless you generate reason codes. Sometimes your model should also produce the reason why it is giving a high or a low score. Here, by putting these two pictures together, we can actually say, oh, there are five or six groups of people who are not paying, so let me take different decisions on different groups. Let us not just say: score above this, make a phone call; score above that, send a policeman; and so on. Do not do a linear thing; understand the reason. And by the way, this is not what I meant by "the art of data mining", although it is a pretty picture. So a prediction score is not everything; you have to understand the reason behind it, and the step from prediction to decision has to go through that extra layer of business constraints, understanding why the score is high and then taking the appropriate decision. In the same vein, if I put in the query "sending reaffirm to our attorney, mailing reaffirm to attorney, court case still pending", it lands right here. So this is both a visualization of the text data, the concepts, and the model, and of a query on top of it. Here is another one: "thought she was caught up already, she transferred the money, checking on Monday". So she is going to pay, and it falls here in the blue region. Just from the text you can see what they are saying and therefore whether they will pay or not; it is a very live thing you can do. Alright, let us do the last example, which is on recommendation. We love recommendation engines, and some of us hate them; I get the wrong movie all the time. So this was a recommendation engine problem. Let us say the guy bought two products, a car deck and the part that goes beneath it.
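Before the recommendation example, a purely illustrative sketch of that score-plus-reason step: instead of a single score threshold, the action depends on both the model score and the discovered reason. The reason labels and thresholds below are hypothetical, loosely echoing the hotspots described above.

```python
def collection_action(pay_probability, reason):
    """Combine the model score with the discovered reason, rather than using one threshold."""
    if reason == "bankruptcy":
        return "stop calling; route to write-off / legal process"
    if reason == "unable_to_reach":
        return "hand the account to an outside agency"
    if reason == "payment_dispute":
        return "route to dispute resolution before any collection call"
    if reason == "insufficient_funds" and pay_probability > 0.5:
        return "offer a payment plan"
    if pay_probability > 0.8:
        return "reminder letter only"
    return "agent phone call"
```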
So this is like a very old car deck, and here is what we recommended. Normally when we build recommendation engines we are very tempted to recommend the top item: oh, this one has a high recommendation score, let us send that. That is the obvious thing to do; the prediction translates directly into the decision. But let us see what we can do on top of it. Say I look at these products and the graph among them, a graph that says if he buys this, he will also buy that. Now the black nodes are what he already has, and this red one is the most highly ranked product. If I recommend that product, he may buy it and go away; you are happy, he is happy. But if I instead recommend this other one, a lower-ranked product that sits in the middle of many other products, then even though he buys it with a lower probability, the chances are he will end up buying more of the surrounding stuff. So the overall utility might be higher although the conversion rate is lower, just like in the bidding model we saw: the probability of a click might be lower, but the utility, the chance that I get him to buy all these other things, is higher. So again, with recommendation scores, do not just jump in and say these are the top 5 products; build a utility function on top of the recommendation score that says what the overall utility is to me, whether I want to maximize loyalty or cross-sell, and how this joint, second-order recommendation helps me make a decision. Okay, alright.

So that is all; let me just summarize. Understand and tame the data nuances, the distributions and all that. Preserve the variances that matter, query length versus not. Do not think of modeling as just a modeling exercise; think of it as feature engineering plus modeling. Do not beat the data into submission, as in "I need 90 percent accuracy no matter what"; maybe your labels are wrong, maybe your defaults are wrong, there can be reasons for that. Generating class labels has to be done thoughtfully and creatively. And decisions are more than just predictions: you have a score, you have a cause, and you have a business objective, and from those you come up with a decision. Okay, alright, I will stop there with this quote; it is the closing talk, so I need a nice quote. Ray Kurzweil, I do not know how many of you know him, but look him up. He says our technology, our machines, are part of our humanity. Dogs have not created Hadoop and birds have not created planes; humans have, and we have done this because we want to extend ourselves. We cannot hold so much data in our brains, so we create database systems; we cannot see far into space, so we create telescopes; we cannot see things at the microscopic level, so we create microscopes. We are really extending ourselves and our relationship with the world, and that is very human. We create these things to extend ourselves, and that is what is unique about human beings. So whenever you feel like you are doing something cold and non-human, just go back to this and say, no, we are extending ourselves, and hopefully that will give you some peace. Okay, questions.
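As a concrete aside on the second-order recommendation idea discussed above, here is a hedged sketch of ranking candidates by expected downstream utility over a co-purchase graph rather than by the raw recommendation score alone. The graph structure, `p_buy`, and `margin` values are illustrative assumptions, not a prescribed implementation.

```python
import networkx as nx

def expected_utility(G, candidate, p_buy, margin):
    """G: directed co-purchase graph, edge weight = P(also buys v | bought u);
    p_buy[c]: probability the user buys c if recommended; margin[x]: value of selling x."""
    direct = p_buy[candidate] * margin[candidate]
    downstream = sum(G[candidate][v].get("weight", 0.0) * margin[v]
                     for v in G.successors(candidate))
    # Downstream value only materializes if the candidate itself is bought.
    return direct + p_buy[candidate] * downstream

# Instead of recommending the raw top-scored product, pick the one maximizing utility:
# best = max(candidates, key=lambda c: expected_utility(G, c, p_buy, margin))
```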
So, I am Sunil, and my question is about the segment where you said we need to model the deviation from the expected. In both of the examples you gave, there are two things: the actual quantity we are trying to model and an expected value, like total sales versus expected sales, and the expected is basically a dummy model, an obvious baseline. Yes: if you did not have a model, what would you say? The sale of candles goes up at Christmas time; you are not adding any value by predicting that, it is a seasonal thing, the expected value is already high, so what is the big deal? Right, but given that we are able to expect something, that means we know the attribute that drives the outcome. So why not add that attribute to the modeling attributes themselves, instead of modeling the deviation? Sure, but the problem is this: if you are building a linear classifier, it does not know how to take ratios. If total sales and expected sales are two features going into a linear model, it cannot form the ratio. So you have to compensate for the inability of the model through feature engineering: you could take the log of this, the log of that, and the ratio comes out as a difference. That is the mindset; you can be creative this way or that way. Thank you. Yes. But what about velocity? That is a more complicated thing, so sometimes it can be beyond the scope of the model.

Okay, we have time for a couple more questions. And make sure that you stick around: after this we are going to have an awesome, star-studded panel discussion, and we will be giving out tons of free goodies from plenty of sponsors, so you do not want to miss it. Thank you very much. Let's get back to the questions.

Hi sir. My question is about contextual weighting. How do we define the context? Is it a sentence, a paragraph, a document, or the whole collection? It depends on the data. If it is a tweet, the tweet is the context. If it is a YouTube video, you could say all the videos watched by the same person in the same session, or all the videos watched by the same person in the last three months. So the context is again part of the art: how you pick it is up to you, and you will see more noise or more signal depending on how coarse or fine-grained the context is. Sir, in that Flickr example you gave, was the PageRank-style weighting computed over the other important terms tagged on the same image, or over the importance of those terms across the whole collection? No, no, it is contextual, within this image; it is not a global thing. Say I bring a motorcycle into this room: its importance will be low, you will say "what the hell" and make some noise about it. But if I bring in another microphone, that fits, because the context matters. If I take the motorcycle to the parking lot, that is where it belongs, and the chair does not belong there. So it is always about what the weight is within the context, and that is why it has to be done per data point: you run PageRank on the small graph around that data point, not a global PageRank on the whole thing. That is the difference. To your left.
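On the contextual-weighting answer just above, here is a hedged sketch of the per-data-point idea: weight the tags of one image by running PageRank on the small graph of just that image's tags, not a global PageRank over the whole collection. The co-occurrence-based edge weighting is an assumption made for illustration.

```python
import networkx as nx

def contextual_tag_weights(image_tags, cooccurrence_counts):
    """image_tags: tags of one image; cooccurrence_counts[(a, b)]: how often a and b
    co-occur across the collection (used only to weight edges within this context)."""
    G = nx.Graph()
    G.add_nodes_from(image_tags)
    for i, a in enumerate(image_tags):
        for b in image_tags[i + 1:]:
            w = cooccurrence_counts.get((a, b), 0) + cooccurrence_counts.get((b, a), 0)
            if w > 0:
                G.add_edge(a, b, weight=w)
    # High weight = tag fits this image's context; an unconnected "motorcycle" tag scores low.
    return nx.pagerank(G, weight="weight")
```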
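And on the first question above, about modeling the deviation from the expected: a tiny illustration of the feature-engineering answer, handing a linear model the logs of actual and expected values plus their difference (the log-ratio), since the model cannot form ratios on its own. The function and column layout are purely illustrative.

```python
import numpy as np

def deviation_features(actual, expected, eps=1e-6):
    """Give a linear model the pieces it needs to 'see' the ratio actual/expected."""
    log_actual = np.log(np.asarray(actual, dtype=float) + eps)
    log_expected = np.log(np.asarray(expected, dtype=float) + eps)
    # Columns: log(actual), log(expected), and their difference, i.e. the log-ratio.
    return np.column_stack([log_actual, log_expected, log_actual - log_expected])
```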
Yeah, yeah. Thank you. I really enjoyed your talk, particularly the part about active learning. Is there a way to do it without the human in the loop? Can you touch upon active learning and some alternative ways of getting the ground truth and labels? Yeah, so that is a good point. If we did not need a human in the loop, we would not need machine learning; think about that. See, all labeled data has a human somewhere: whether you are clicking on a search result, whether it is crowdsourcing, whether an expert is saying cancer versus not, whether it is a lawyer doing legal discovery work. Everywhere there is a human label, and where there is no human, there is some other complex process; if you take your DNA and put it through a sequencing machine, there is a cost to doing that too. So do not think of it as human versus not human; think of the cost. Sometimes the human cost is low, as with crowdsourcing on Mechanical Turk, and sometimes it is high. Think of the cost. And think about this: if you do not need a human, do you need a machine?

Yeah, so thanks for your talk, I really enjoyed it. I do not have a background in data science, so I hope this does not sound naive. I was just wondering, how far can statistical solutions to semantic questions go? What happens when you are dealing with complex language, say the law? How is the Google Scholar search engine different from a normal Google search? Is it possible for a machine to understand a joke? It does not seem possible for all of these answers to be purely generated by statistics, right? So how would you approach this? Yeah, so that is where we need to go beyond pure statistics. Think of parsing as one NLP building block. My belief is that unless you put multiple such building blocks together, one block by itself is not going to cut it. So you have to be something of a breadth person as well, to see that you need five different building blocks. Maybe your company has a separate team for parsing and a separate team for disambiguation and they are not talking to each other, but you need to put all of that together to get real value out of your data. So yes, data goes through a sequence of embellishments, improvements in its quality and its semantics, and you have to recognize in what order, and with what steps, you can make it better and better over time. Sticking with one technique and saying this is my favorite, this will do everything, is the wrong attitude; recognizing that you need all kinds of things is very important. The next wave of machine learning is not about improving SVMs; it is about using an SVM together with agglomerative clustering in a hierarchy. That is the new kind of innovation that will happen in machine learning. Salish. Yes. Over here. Yeah. To your left. Okay. Very much to your left; I am sitting at the monitor. Okay. So you showed an example, this was in the section where you were putting zero as a default: there was one feature that indicated the presence or absence of a keyword, and a separate feature for its position. Is the first feature not sort of redundant in this case, since the second feature carries it as well? Good question. Right, so the issue is whether the model is looking at them independently.
So if they sit in different branches of the decision tree, then it is a big problem; but if these two features appear along the same path in the tree, the model will figure it out. So interaction between variables is a very important part, beyond feature engineering: you have created the features, but what about the interactions among them? That is what the model is supposed to learn. If your model can learn that interaction, say, whenever this one is zero, ignore that one, and whenever it is non-zero, use it, that is great. But you do not want to take that chance. So I mean, adding that will increase your, you know... Yes. I mean, letting the model do most of the work is like deploying child labor: the poor model is not really a grown-up yet, it is still growing up, it does not know many things, so you give it little tasks that it can chew on. Yeah. Okay. Unfortunately, that is all the time we have for questions today. I know that you have many, many more, but we have to continue on with the next panel discussion. So everybody, give him a huge round of applause. This was fantastic.