So, good afternoon everybody. I know it is going to be a bit difficult right after lunch, but I will try my best to keep you all active. Before getting into it I also want to set the context of this presentation, so that nobody is disappointed; if you find something more interesting in the other room, please do go ahead.

What this talk covers is a specific use case, similar entity detection: where it is applicable, how you go about the process, how it is traditionally done at a high level, and how you can solve it using machine learning techniques. It also gives some tips on how to choose an approach or an algorithm; not the entire gamut, but some indications based on our experience. What it does not include is the use of a specific product or tool. Many presentations get into architecture and internals; this one is about an approach: given a use case, how do you apply machine learning to it.

Having said that, why did I feel this topic was relevant? Machine learning has been around for a while, but now I would term it machine learning for the masses. Earlier it was a rather exclusive field: you needed high compute power and a detailed understanding of the algorithms to get into it. Now it is not restricted to the high-cost options; anybody with a little passion and a little understanding can get into this domain. The cost of storage has gone down, you have on-demand processing, so any company, big or small, can go ahead and experiment with it. There is also the availability of multiple machine learning libraries, so you can work at your comfort level: if you are comfortable in Python you have libraries in Python, if you are comfortable in Java you use Java, and if you have a statistician's background you can go ahead and use R. These things have been done before, and there are commercial tools which do them, but today any person with some interest can run these kinds of experiments.

So, why similar entity detection? What is the need for it? I put this picture up because it represents the way data is today: you have these different blobs, some understanding of the data, but the connectivity, whether this blob is the same as that blob, is very difficult to get. How do you connect them? What is the similarity between them? That is what we want to address as part of this presentation.
So, coming down to the individual level: you have these four figures in the picture. Looking at them, what do you see? From the face of it, as a human you can immediately tell that all four are the same person; they just differ in small details, one is turning one way, another is facing a different way. But for a machine to know the same thing, given four customers in your system, recognizing that all four are the same person is a much more challenging task. What features, what attributes, do you feed into the machine so that it recognizes that individual? That is something we have to decide, and we have to program the machine to understand it.

As a sample, assume we have three customers in the system: customer ID 1, customer ID 2 and customer ID 3, three distinct customer records. For whatever reason they have been captured as different in your system: it could be that the person intentionally gave something wrong, or the call centre agent who made the entry forgot something, or it could just be a miscommunication, where instead of 1987 somebody types in 1988. If you actually look at the records, it is essentially the same person: the middle name is missing in the first entry, it is Harry in the second, and Herald in the third; the date of birth on the last record is 1988 while the other two say 1987. And then the addresses: as a human you know immediately that 1 MG Road, Bangalore is the same as 1 Mahatma Gandhi Road, Bangalore, but how does a machine know those two are the same? These are the kinds of challenges you face with this kind of entity detection. Is person one the same as person two?
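To make that concrete, here is a minimal sketch (not from the talk) of the kind of near-duplicate records being described; the names, IDs and exact dates are hypothetical:

```python
# Hypothetical customer records illustrating the near-duplicates described above;
# the names, IDs and exact dates are made up for illustration.
customers = [
    {"id": 1, "first_name": "John", "middle_name": None,
     "last_name": "Smith", "dob": "1987-07-23", "address": "1 MG Road, Bangalore"},
    {"id": 2, "first_name": "John", "middle_name": "Harry",
     "last_name": "Smith", "dob": "1987-07-23", "address": "1 Mahatma Gandhi Road, Bangalore"},
    {"id": 3, "first_name": "John", "middle_name": "Herald",
     "last_name": "Smith", "dob": "1988-07-23", "address": "1 Mahatma Gandhi Road, Bangalore"},
]
```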
So, what are the potential use cases for this? One is fraud detection: the three records could essentially be one person, and the way he represents himself, he wants the system to believe that he is three different people. If each of those identities makes a claim, one on policy one, another on policy two and the third on policy three, individually the three claims look very innocent because none of them is a big value; but cumulatively, when you have three claims and you know it is the same person making them, the organization has to wake up to the fact that something is wrong here. So that is one potential use case: if you tag these records together it becomes a red flag, and you immediately go and find out what is happening and whether you need to take action.

On the positive side, you may have three different records, each with amounts of money that are not too small, but still not enough that you would give them individual attention. It could so happen that when you cumulate the three, you immediately see that this customer is very valuable to you; he has a lot of business with you. That is when you start concentrating on him, focus on him and do something personalized. So finding similar entities and tagging them together is very helpful there too.

So, what are the challenges you face? We saw some of them already. The most essential challenge is the data itself: incompleteness, where some entries are totally missing; incorrectness, where an 87 is entered as an 88; synonyms, where two things mean the same, like a street and a road; and abbreviations, like MG Road versus Mahatma Gandhi Road. How does the machine know these are the same? Then there is the huge number of comparisons: you are talking about your entire customer base, lakhs of customers, and it is a pairwise comparison, every customer has to be compared with every other, so the number of comparisons is very large. The volume of data is also very large, and there are different kinds of entities. Even though in this presentation I am only talking about some very simple features like name, address and date of birth, it could expand: you could have pictures, you could have voice recordings, and all of these could help identify a person. These are the kinds of challenges you face in the similar entity detection domain.

Has it been solved traditionally? Yes, it has been attempted in different ways; it is not something totally new, and in fact there was a concern that since it is already there, what is different in this presentation. Traditionally what you do is data extraction: first you get the data from your different systems. Then you augment the data, maybe using gazetteers, a city list, or geocoding. Then you use pre-configured rules: a set of rules that say up front, if this, this and this then the customers are the same; if something else, then they are probably the same. These kinds of rules are pre-configured into the system, and then there is always a manual verification phase, because you cannot totally trust the match: records are mostly marked as probably same, and then you have to get back to the customer, call him up and validate whether what you are assuming is correct or not. Those are some of the traditional approaches.

Just as a sample rule in a traditional approach: for a customer pair, if the name is the same, the date of birth is the same, the address is different and the postcode is the same, there is a chance that the customer is the same; it may not be, but it could be.
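As a minimal sketch, pre-configured rules of this kind could look something like the following; this is illustrative only, the field names and rules are simplified versions of the examples in the talk, not the actual rule engine:

```python
def classify_pair(a: dict, b: dict) -> str:
    """Illustrative pre-configured rules for a customer pair.

    Each record is assumed to carry name, dob, address and postcode fields;
    the rules mirror the examples in the talk, not a real rule set.
    """
    name_same = a["name"] == b["name"]
    dob_same = a["dob"] == b["dob"]
    address_same = a["address"] == b["address"]
    postcode_same = a["postcode"] == b["postcode"]

    if name_same and dob_same and address_same and postcode_same:
        return "same"            # all four fields match exactly
    if name_same and dob_same and not address_same and postcode_same:
        return "probably_same"   # route to manual verification
    return "different"
```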
So, you mark it as probably same and then you manually verify it. There could be another rule where all four fields are exactly the same; in that case you immediately say yes, these are the same person, and you mark it in the system.

So, what are the limitations; why look for another approach? Typically there is no self-learning capability, because the rules have to be created up front: a detailed understanding of your domain is essential for creating them. It is also difficult to add new features, because everything is hard-coded; you have a set of maybe five or seven features, the rules are already in the system, and adding a new feature means the entire rule list has to be modified. These systems also deal with very restricted data types; there is no obvious way to include, say, an image in this kind of rule. And the ranges for features are restricted: rules are very point-based, either same or different, or maybe a few levels like 0.25 and 0.5; you cannot have a continuum where the distance runs all the way from totally same to totally different. These things are very difficult to configure in a traditional rule-based system.

So, what is the solution? Until now we have only seen the problem space and how it is handled traditionally; now we get to the solution approach. One idea is to try applying machine learning. What is machine learning? Many of you will know this already, but this is a definition, attributed to Avrim Blum, that I found very relevant to what we are talking about today: machine learning deals with the design of programs that can automatically learn rules from data, adapt to changes, and improve performance with experience. That, if it really works, is exactly what we want: the system gets a new set of data, automatically learns the rules, starts performing better, and if some feedback is not in line with what it is assuming, the model gets updated. That is the ideal we want to aim for.

So, what are the algorithms? There are of course different ways of categorizing them; for this presentation I thought it made sense to put them in three levels. One is supervised learning.
So, supervised learning is for when you have a lot of training data. By training data I mean you have a set of features that you have decided on; for example, for a customer you might say a person is identified by his first name, his last name, his middle name, his date of birth, his gender, maybe his salary. These are the features you decide up front, and then you already have examples that say: for two people with this kind of distance between their first names, this kind of distance between their last names, and so on, the pair is same, different or probably same. When you have a set of labelled data like that up front, the best approach is a supervised learning algorithm. Supervised learning with labelled data will in any case give you much higher precision and recall, so that is the right way to go when you have a representative training set.

If not, you go for unsupervised learning, where you have a set of data but you do not know which group each item belongs to. It could be that the pairs are same or different, but all you have is: these are your customers, these are the distances between them, what kind of pattern can you detect? In that case you let the machine decide: you feed the data in, let it cluster, and at the end of it you will hopefully get clearly distinct groups where you can identify that all the pairs in one group look probably the same, this other group is exactly the same, and that group is different. It typically does work; we have tried some experiments along those lines.

Then there is continuous learning, for the case where you have some data but not enough. With continuous learning there is a feedback loop: as you progress you get feedback, the system learns from it and updates itself. Today it may predict with a lower precision, but as it gets more data it gets better and better. I think that is the ideal system we are aiming for.

So how do you choose between some of these algorithms? There are many ways, but I have put down some pointers from our experience.
So, first, availability of labelled data: if you have it, you can straight away try supervised learning, and it will give you much better scores. Second, data characteristics. For example, in our data almost 90 percent of the pairs were negatives; less than 10 percent were actually similar. When we went with the normal training approach, just taking the training set we had, splitting it into a training set, a test set and a validation set, and training on the whole thing, we did not get good performance. What we figured out is that the number of positives in the training set simply reflected the data characteristics: only a small fraction of the entries carried a positive label, so the model could not learn enough about the positive class. What helps in that situation is random undersampling: you make sure the data from the positive set is represented equally with the data from the negative set, and that equal representation improves the precision and recall you get from the model.
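As a rough sketch, random undersampling of this kind could be done like this; this assumes pandas and a hypothetical `is_same` label column, and is illustrative rather than the project's actual code:

```python
import pandas as pd

def random_undersample(pairs: pd.DataFrame, label_col: str = "is_same",
                       seed: int = 42) -> pd.DataFrame:
    """Keep all positive pairs and sample an equal number of negatives,
    so both classes are equally represented in the training set."""
    positives = pairs[pairs[label_col] == 1]
    negatives = pairs[pairs[label_col] == 0]
    sampled_negatives = negatives.sample(n=len(positives), random_state=seed)
    balanced = pd.concat([positives, sampled_negatives])
    return balanced.sample(frac=1, random_state=seed)  # shuffle the balanced set
```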
The feature characteristics also matter. There are different kinds of algorithms; under supervised learning alone you have logistic regression, artificial neural networks and several others. What we found, for example, is that with continuous-valued features something like a multi-layer perceptron, an artificial neural network, performed much better than logistic regression. Logistic regression worked well when you had categorical data with very clear levels, say a feature with 0, 1 and 2 as its three levels; the multi-layer perceptron worked better when you had a whole range of values, 0.01, 0.02 and so on. There can be different characteristics like that, and you have to choose an algorithm accordingly and work with it.

Result sensitivity is another key parameter. You choose an algorithm, run the model and get a set of outputs: an F-score, a precision, a recall. Based on those you take a call on what you are more sensitive to. There are cases where, at all costs, you need to catch every positive, even if a hundred negatives come along with them; in that case every positive entry has to be retrieved, so you need a very high recall. Or it could be that getting some wrong results is acceptable as long as you get a sufficient number of results back. You have to decide what sensitivity you are looking for.

Having said that, all of this works best in an iterative manner. You choose one algorithm, you get something; do not be satisfied with it, try one more. It is usually the case that out of three algorithms, one of them will perform better for your data set, so always experiment a few times and choose the one that works most closely with your data. That worked quite well for us.

Now, how do you validate a model? People who have done machine learning will find this very simple. For supervised learning algorithms you use precision, recall and the F-score; these are critical for validating a model. Precision is the fraction of the retrieved set that is relevant; recall is the fraction of the relevant items that were actually retrieved, out of the total number of relevant items; and the F-score is just the harmonic mean of the two, so that rather than monitoring two scores you can use a single score to validate your model.

Just for understanding, for people to whom these terms are new: assume you have a labelled data set of 200 items, of which 30 are positives and 170 are negatives. Suppose you run your model and it predicts 40 items as positive, and when you validate you see that only 30 of those are true positives while the other 10 are falsely marked as positive. Then the precision is calculated as the 30 true positives divided by the total number of retrieved results, true positives plus false positives, which gives a precision of 0.75. The recall here, interestingly, is perfect: all of the true positives were retrieved, so from a recall point of view this is an ideal system. If I want to track all my positives, I could go for a model that gives me this kind of recall. The F-score you see is just the computation on top of those two.
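For reference, a small sketch of how those numbers are computed:

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, recall and F-score from true positives, false positives
    and false negatives."""
    precision = tp / (tp + fp)   # fraction of retrieved pairs that are relevant
    recall = tp / (tp + fn)      # fraction of relevant pairs that were retrieved
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# The worked example above: 30 true positives, 10 false positives, no misses.
print(precision_recall_f1(tp=30, fp=10, fn=0))  # -> (0.75, 1.0, ~0.857)
```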
Now, assume you have decided that you have labelled data and you are going for a supervised learning algorithm; what is the typical flow? First you choose an algorithm. Then you identify the relevant features: which characteristics of your data make sense. As I said, for a person it could be his name, his date of birth, his salary, his address; these are the typical things that identify a person. Once you have the features, you write some simple logic to extract their values. And then comes the heart of the whole system: how do you find distances between the features? This is the core part; how intelligently you find those distances is where the strength of the approach lies. For example, for something like a name you could use the Levenshtein distance: if one record has "harry" and another has it misspelled as "hary", the Levenshtein distance tells you they are not that far apart, probably just one 'r' missed by mistake, so they are very close. That is the simple string case; you can go to more complex cases, say two picture clips, how similar or dissimilar are they, how far apart? For each of the features that will drive the model, you have to create a distance algorithm, and that is where the heart of the whole story is.

Once you have the features and the distances, you feed them in to create the model, assuming of course that you have training data; we will see further on how this is done for one particular algorithm. Once the model is built, what do you do next? You validate it. Typically, if you have about a hundred labelled pairs, you use 75 for training and keep 25 aside to validate how well the model is doing, and you calculate the scores we talked about. If you are happy with the performance, great, accept the model; if not, you go back to the drawing board, try to get more data if you can, or try another algorithm. As I said, it is an iterative process until you are comfortable with the scores you are getting. Once you are done, you can start using the model to classify future customer pairs.

So what did we choose? We tried a range of things, and in this case we are talking about a neural network, a multi-layer perceptron. What I am doing now is mapping the flow I just gave you onto a specific algorithm, and that is what this slide covers. The first step was choosing the algorithm. The second step was choosing the features, which in this case were first name, last name, address, date of birth and gender. Then we retrieved the values: we wrote simple extraction routines to get all the relevant values. Then we wrote the custom distance algorithms for each feature, and that is in fact where a significant amount of the time was spent. Up front it looks trivial, but it has to be tuned to your data characteristics. For example, with date of birth we saw cases like 23/7/1980 where instead of the 7 there was a 9, or instead of 1980 it was 1990. How do you calculate that? In the very simple case you could say the dates of birth are different, so reject the pair, but that is not how the data behaves; what we figured out is that these are typing mistakes, so we cannot take the values at face value and we have to find distances between the numbers. For instance, if the year is off by ten, a slip in the tens place, we still accept it with some weight. Those kinds of custom changes, based on the kinds of errors in your data, are something you have to spend time on to customize to your data.
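A sketch of the kind of custom distance functions being described; the normalisation and the tens-place tolerance are illustrative choices, not the exact weights used in the project:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def name_distance(a: str, b: str) -> float:
    """Edit distance normalised to [0, 1]; 'harry' vs 'hary' comes out small."""
    if not a and not b:
        return 0.0
    return levenshtein(a.lower(), b.lower()) / max(len(a), len(b))

def dob_year_distance(year_a: int, year_b: int) -> float:
    """Illustrative year-of-birth distance that tolerates a slip in the tens
    digit (e.g. 1980 vs 1990) instead of rejecting the pair outright."""
    if year_a == year_b:
        return 0.0
    if abs(year_a - year_b) == 10:   # likely a typo in the tens place
        return 0.5                   # partial weightage, tuned iteratively
    return 1.0
```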
Then the pairwise distances are calculated for all fields across all customer pairs: once the distance algorithms are chosen, you start the actual process of computing these pairwise distances for the entire customer set.

With the pairwise distances in hand, how do you actually train the model? This is the model training phase. What you see on the slide is a simple multi-layer perceptron network. I do not know whether you can see it at the back, but these are the input nodes: since we chose gender, date of birth, address, last name and first name, the distances for those fields are what is fed into the input feature nodes, and there is also a bias term. Internally the whole multi-layer perceptron works on the concept of hidden nodes: there is a set of hidden nodes, and at the start of the algorithm some default weights are assigned to them. The network predicts based on those default values and produces some output; here there are two outputs, same and different, and in general you have one output node per output label. It predicts something, figures out that what it is predicting is slightly different from what your labelled data says it should be, and then it feeds that back: there is a backward propagation loop that goes back towards the input nodes, and the weights on all those connections are learnt again. This entire process is repeated, typically hundreds of times, whatever number of iterations you specify, until it stabilizes and gives you reasonable predictions.
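A rough sketch of this kind of training step using scikit-learn's MLPClassifier; the library choice, layer sizes and the random placeholder data are assumptions for illustration, not the project's actual implementation:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import precision_score, recall_score, f1_score

# One row per customer pair, one column per feature distance
# (first name, last name, address, date of birth, gender); label 1 = same, 0 = different.
# Random placeholders stand in for the real pairwise distances and labels.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = MLPClassifier(hidden_layer_sizes=(100, 100),   # two hidden layers, sizes illustrative
                      max_iter=500,                    # hundreds of backprop iterations
                      learning_rate_init=0.01,
                      random_state=0)
model.fit(X_train, y_train)            # weights are learnt by backpropagation

pred = model.predict(X_test)           # validate on the held-out 25 percent
print("precision", precision_score(y_test, pred),
      "recall", recall_score(y_test, pred),
      "f1", f1_score(y_test, pred))
```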
Once that is done, you take the model to the next step: you see how it predicts on the held-out test set, you validate on that test set, and you calculate the precision, recall and F-score. If you are comfortable with them, if they are above your expected thresholds, you accept the model and start using it for the actual prediction, and based on what it predicts you can link the records in your data set. What we saw earlier as disconnected blobs, we can now start linking together. Taken at a slightly higher level, this is similar customer entity detection, but you could also take it beyond exact duplicates: not the same customer, but customers with similar tastes or similar attributes. So this approach can be expanded to support multiple different kinds of use cases.

So what result did we see? With the neural network, the recall was very good, 89 percent, and we were quite happy with that: it means only 11 percent of similar entities were going undetected. On the negative side, the precision was only 65 percent. What this implied was that a large set of pairs was coming back as similar even though they were false positives. We were getting a lot of false positives, but we still felt there was a benefit, in the sense that at least more than 60 percent of the non-relevant data was being eliminated, so the focus could now be on a much smaller subset that we could concentrate on. That was the advantage of using this algorithm. Of course, we will also look at the end at what else we could do.

In between I would like to talk about the big data aspect. As I said, the whole challenge was that the data was huge, and when we initially ran the comparisons, the feature creation alone was taking more than four days. That is when we had to find alternate mechanisms. Why did it take so much time? Because it is a pairwise operation: if there are n customers, each one has to be compared with all the other n minus 1 customers, so you are immediately talking about a huge scale. And every comparison is not a single comparison; there is one comparison per feature. So what we are talking about is a huge amount of computation, and each of these operations is very time-consuming. These are some of the ideas we used. One was map-reduce style techniques: we split the comparisons across m different machines, and on every machine we compared that machine's share of customers against all the others, so you get roughly an m-times speedup; obviously not exactly m times, but only slightly less. Another was incremental addition of new customer pairs: we designed it so that when a new customer comes in, it is just an incremental addition and you do not have to recompute everything. And most of this was handled as batch comparisons and tagging, typically an overnight job rather than a real-time job. These are some of the ideas we used to make sure the performance requirements of the system were met.
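A simplified sketch of fanning the pairwise comparisons out over worker processes, in the spirit of the map-reduce style splitting described above; Python's multiprocessing stands in for the actual cluster setup, and the per-pair feature function is a placeholder:

```python
from itertools import combinations
from multiprocessing import Pool

def pair_features(pair):
    """Stand-in for the real per-feature distance functions (name, address,
    date of birth, ...) applied to one customer pair."""
    a, b = pair
    return (a["id"], b["id"],
            float(a["name"] != b["name"]),
            float(a["dob"] != b["dob"]))

def all_pair_features(customers, workers=8, chunksize=1000):
    """Fan the n*(n-1)/2 comparisons out over a pool of worker processes."""
    with Pool(workers) as pool:
        return pool.map(pair_features, combinations(customers, 2), chunksize=chunksize)

if __name__ == "__main__":
    customers = [{"id": i, "name": f"cust{i}", "dob": "1980"} for i in range(100)]
    features = all_pair_features(customers, workers=4)
    print(len(features))   # 100 * 99 / 2 = 4950 pairs
```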
So what is the way forward? We used the neural network as a filtering algorithm, but as you saw, the performance was not all that great; as I said, our precision level was only about 65 percent, and we were not so happy about that. What we could do is use it as an initial algorithm and plug additional algorithms in further down the pipeline. We did not try random undersampling, even though it was very relevant; our data characteristics were exactly the case for random undersampling, so we should try it, and it should give a better model. And we could experiment further with unsupervised learning. Some of these experiments we have done, and some more remain: for example k-means, mini-batch k-means, BIRCH and RBMs. With k-means, because it was such a huge data set, it was taking more time, and mini-batch k-means gave similar performance in much shorter time, which was good. BIRCH is a very good hierarchical clustering algorithm; it could be used as a first stage, for the initial clustering, and then that output could feed other models for further prediction. We have tried continuous learning algorithms a little as well, but that is still very much in its infancy. And, very importantly, other entity types: right now we handle very simple strings and numbers, and we want to bring other entity types into the whole system. With that I come to the end of the presentation, and I am open for questions.

Thank you for that. When you showed us the graphical view of the neural network you had multiple hidden layers; at what point did you decide to add hidden layers to increase your accuracy, and what was the methodology for doing that?

Right, so typically, the more non-linear the data is, the more hidden layers you need. Luckily for us our data was not very non-linear; it was pretty much continuous values, so two hidden layers were more than enough.

Hi, you talked about F-scores for supervised learning algorithms, to understand how good your model is. If I am using an unsupervised learning algorithm, how do I go about it?

Correct, so you cannot use those kinds of traditional metrics. There are some specialized metrics, which we could discuss offline; I do not have them at hand, but yes, we have applied some other kinds of metrics. The thing is, with unsupervised learning you do not have labelled data, and precision and F-scores by default assume that you have labelled data, so they are very specific to supervised learning. There are other metrics, which we could discuss offline.

Hello, has the manual verification been totally eliminated, or has only some part of that work been reduced?

Manual verification was not totally eliminated, because what happens is that you have probabilities. You have three kinds of outputs: one is hundred percent same, where you can completely eliminate verification; then you have something called probably same, and that is the category where you need to manually verify.

One possible source of error: if we have nearby keys, say I enter 'a' and the key next to it gets entered wrongly, do you consider those kinds of errors when you model, and if so, what else do you do?

Yes. This is exactly the reason why we cannot do a simple string comparison. For the string case, if you have a "harry" and a "hary", there is one 'r' of difference; a simple string comparison would say these are different. So we used Levenshtein distance algorithms. There are different distance algorithms; what we used was Levenshtein, but you could use cosine distance or others; you compute the distance and then figure it out. Again, for numbers, as I said, there was the very specific case of the dates of birth, where for example instead of 1974 the year was entered as 1984. If you do a simple comparison you would straight away say these two people were born in different years, so they are different; but actually that is not the case.
There was a digit slip in the tens place, so we had to write custom code to handle it: even if the tens digit is off, we give the pair some weightage, say 0.5 or 0.6, and that has to be tuned iteratively; that is where the effort goes. As for whether there is a ready-made package for this: unfortunately I have not come across one. What we did was built completely in-house; we worked on it and created it ourselves.

Hi, this is Vinisha. You were talking about unsupervised learning. If I am not wrong, we have corpora that determine how similar one word is to another, something like the Brown corpus. Now consider a case: "Sachin Tendulkar hit a century", and a similar article, "the master blaster hit a century". As human beings we get the point that these mean the same thing, but in terms of a corpus, "master" and "blaster" will be separate words while "Sachin Tendulkar" will be a proper noun, if I am not wrong. So the problem with respect to the corpus is how you compare the two; how do you improve your corpus intelligence?

Correct, so this is exactly one of the cases we saw: the case of synonyms, where the words are not literally the same but mean the same thing, and here specifically proper-noun synonyms. Typically we handle that with gazetteers. One of the cases is nicknames: in many countries you have the case of Rob and Bob, which mean the same person. That is when you use a country-specific nickname gazetteer and augment your data. So if a Rob and a Bob come in, we will not use the common distance algorithm; rather, we apply the gazetteer and then give a much higher weightage, probably not a full one, but much higher.

I mean, the question is really how you train the data: the corpus itself should grow, right? Every day it should train itself and improve its intelligence.

Yes. So what we did, although it is not covered as part of this presentation, is a write-back mechanism at the tagging phase. The output of all this goes to a UI, which goes to a call-centre agent, and he sees whether the pairs are the same or different. If the system predicts probably same and he tags them as different, that data goes back into your system and you learn from it. So there is a write-back mechanism at the user-interface stage: even if your model predicts probably same but the customer is sure they are different, or the model predicts different but the customer is sure they are the same, those corrections are written back to the system, and then we retrain at regular intervals. That is how we try to handle this. Thank you.

Hi, Shijana here, two questions. The first one is, you spoke about how, when adding new customers to the system, you are able to minimize the number of comparisons; could you elaborate on that?
And just to clarify: you mentioned that when you add new customers to the database you are not doing a pairwise comparison with the entire set of records, but are able to reduce that number; that was the first question I had. The second one is: given that there will be a huge amount of redundancy in the database, you could essentially write a set of heuristics that take this redundancy into account and weight certain features much more heavily at the cost of others. Did you test or benchmark your algorithms against any such heuristics?

Right, so for the first question: when I said incrementally adding, it is not that a new customer is compared with only some of the customers. What was done is that only for the new set do you have to compare against all the customers; you do not recompute the existing pairs. The second is that we did not actually experiment with any such heuristics; it is on the to-do list, but we have not done it, so it was more of a simplistic implementation.

I just wanted to add a small comment to the interesting question he asked. One of the things we do when we see chat data, or anything like it, because we know that somebody is typing it, is exploit the characteristics of the QWERTY keyboard: adjacent keys are, in some sense, closer to each other in distance, so a mistyped character is likely to be one of its neighbours. Along the same lines, with the number keys, 1990 in your case would be similar to 1970 or something along those lines; I think that is one kind of distance that can be looked at.

Yes, that is right, that is correct.

Hello, ma'am, I just want to know how you tuned the neural network. Once the model is ready and you are getting some outputs, how did you tweak the parameters, and what changes did you make in the neural network to get it right?

Right, so in this neural network the only thing we tried changing was the number of hidden nodes: we first tried 100 nodes, then we tried 200 nodes, and then we were happy with the performance. Beyond that we did not do too much, because with the supervised training set we gave it, it was automatically doing the backward propagation. We also tried adjusting the learning rate; you might see a small number at the bottom of the slide. We changed the learning rate and tried it at different levels: first a faster learning rate, and then we reduced it to make sure it stabilized at the right level. Those are some of the few parameters we experimented with.

Just one more thing, ma'am: was the optimization algorithm parallelized, or was it a single instance that was running?

The training algorithm was a single instance; the comparison algorithm was parallelized. For the training we did not use a very large data set; in fact, one of our issues was that we had a small training set, and the performance we got was related to that. But for actually using the model, that part was parallelized: the distance calculation was split across multiple groups of machines. Thank you.

Hello, how do you handle outliers?

Outliers? In this case, how do you find outliers?
Handle them; how do you handle outliers?

Yes, I got the question. As such, we did not see any significant outliers. As I said, one of the limitations of our training set was that it was small; we had something like less than a hundred comparisons, and we had to manage with those, create the distances and work with that. So we did not really get the chance to see too much variation; it was distance-based, and whatever we got was within that set.

If that is the case, the model you built will be overfit, isn't it?

Yes, and that is one of the reasons we did not stop with this. The training set, as I said, was not too good, and in fact that is why the accuracy we got on the prediction set was pretty low, only around 64 percent. We did try to get more data, but unfortunately this was all we had and we had to live with it. That is why we started experimenting with the alternate models, the clustering and so on. But we still felt it was useful as a first-stage elimination: you can use it as the first stage and then filter, or cluster, on whatever output comes out of it.

Hi, have you deployed this for any use case, in any industry vertical?

Yes, this was actually adapted for a customer site, so it has been delivered, but as such it has not gone into production yet; we have delivered it to the customer but they have to start using it. And what we told them is: do not start using it directly. They already have a very costly commercial MDM-based solution, and this solution runs in parallel. For about six months, use this in parallel: use this for your prediction, use your MDM alongside, keep comparing what this predicts with what that predicts, keep feeding back from the MDM system into this, and at the end of six to eight months, when we retrain the model, we are confident it will perform much better. So as such it has not gone into production.

Just another question. One of the key use cases for this in the banking industry is watchlist filtering, where you need to match your customer list against an OFAC list or a terrorist list that you keep getting from the regulators. In such cases you may not have the labelled data, so it may be a tough scenario for supervised learning algorithms; how do you tackle the problem in those cases?

Up front, I have not actually worked on that use case, but thinking out loud: you have a set of features for the entries on the list and you have your customer features, so you could find the distance between them. One thing is clear, you can get at least the basic features of the person on the list, like his name; taking it further, if you have a facial image on the list and you have some customer with their face on record,
you could do a similar entity detection comparing those two faces, how closely they match, and use that as well for tagging. These kinds of things could all go into it, and then you see how similar that person is to some of the existing records. We could try that; of course I have not tried it, I am speaking off the top of my head.

What is the distance method you are using?

We are using the Levenshtein distance, customized slightly. Levenshtein purely on its own did not work, so we had to do some custom, data-specific hacks, but Levenshtein is the core.

And the last question, please.

You said you could have done a logistic regression, but you didn't, right?

We have used logistic regression as well; yes, that is one of the algorithms we tried, but we froze on this one. As part of the experiments, as I said, logistic regression was working better with categorical data, but in this case, when you see the kind of range of distances you get, anything from 0.01 to 0.99, you have the whole range for those distances, and there the neural network was performing much better. That is why we chose the neural network.

But the outcome variables are still categorical, right?

The outcome variables, yes, are categorical. In this case there are actually only two levels, same or different; that is all it is.

Okay, that is all we have time for today. If you have any other questions, please follow up afterwards. Thank you.