Okay everyone, so the next talk is "Machine Learning Techniques for Building a Large-Scale Production-Ready Prediction System Using Python" by Arthi. Arthi is a senior architect in the CTO office of Wipro Technologies; she has a BE from UVCE Bangalore and an MBA from IIM Bangalore.

Just to give a brief: I think he's just covered it. That bio is six months old, so since then I have moved to a new group, which is the cognitive automation team at Wipro. My areas of interest are machine learning, semantics, natural language processing and automation, and I like coding in Python.

So what's the agenda for today's talk? For somebody who is very advanced it may be a repeat, but at the end of this session I would like everybody to be able to go and create a classifier on their own using Python. That would be my objective, and if people achieve that, I will be happy.

What are we trying to do? Let's step back: when we say classifier, what are we trying to do? In today's world there is a lot of data coming at us from all sources: Twitter feeds, news articles, Facebook, everything just keeps coming. How can I see only what is relevant to me? When there is so much happening, it is very difficult to scan all of it and decide what is worth reading. Is there a way the machine can automatically figure out what is relevant to me and send only those things? The focus here is specifically on free, unstructured text. One example is news articles: when there is so much news, which of it is relevant to me? The other scenario is closer to home, because it is something we actually implemented in our company.

In an enterprise, people face all kinds of problems: it could be a problem with my laptop, an infra problem such as feeling too cold or the restrooms not being clean, or a problem with my payroll. There are many different kinds of problems, handled by many separate functions. Until about a year and a half back, people had to go to a system with a huge menu-driven interface, figure out where in a hierarchy of drop-downs, at least 10 to 20 levels deep, the right category was, select that category, enter the problem statement, and the system would then assign it to some agent in that particular team.

The issue with this approach was that people did not really know against which group they had to raise the problem, because most people are not aware that, say, a restroom problem is handled by group A while a payroll problem is handled by group B. So these problems used to get logged into the wrong buckets, and before a ticket finally reached the right resolver it went through multiple circuits. That meant a lot of delay and a lot of worry for the end user, who had to answer calls from multiple agents, each one saying "sorry, but I can't solve this problem" and reassigning it to some other team. So the question became: can we develop a system where a person just states their problem in a single line of English and the system automatically assigns it to the right person in the right category?

Summarizing, we are trying to do automatic classification of text, and we have just seen a few examples where that would be helpful. Now that we know what we are trying to solve, where do we start? The first step is the availability of a lot of historical data. For people who are not familiar, the term corpus just means a large collection of data in which every line item is correctly classified into the right bucket. For news articles, each article would be correctly tagged as, say, a baseball article or a hockey article; for a problem ticket, each ticket would be tied to whether it is a payroll issue or a facilities issue, and so on. So the base premise is: to build such a system, I need a large corpus of prior problem statements that are correctly labelled.

Fortunately, such data is already available to experiment with. There is the very popular 20 Newsgroups dataset; many people might already have seen it. You can download it and quickly implement such a system. Once you download it, you will see it is split into around 20 categories: hockey, baseball, hardware, software and so on. Each of these directories contains a list of articles; in our case each article is a training item and the directory name is its training category. Just to see how the data looks, we can open one sample article: it is a block of free text, a news article belonging to that particular category. So this is how the data looked for that case.
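Loading that dataset through scikit-learn looks roughly like this (a minimal sketch, not the talk's actual notebook):

```python
# A minimal sketch: load 20 Newsgroups via scikit-learn and peek at one
# article and its category.
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(subset='train')    # downloads the corpus on first use
print(train.target_names)                     # the ~20 categories: hockey, baseball, ...
print(train.data[0][:300])                    # first 300 characters of one article
print(train.target_names[train.target[0]])    # the category that article belongs to
```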
Now coming back to our presentation: we are trying to classify text, we have a huge corpus as a prerequisite, and we are using 20 Newsgroups as a possible sample set. Assuming we have the data, next comes the actual solution. To build it you need these major blocks. The first is the preprocessing block, where data cleansing and all the data-related activities happen. That is followed by natural language processing, where you process the text into a form the system can understand, and then the actual model building. Once the training model is ready, you can start using the system for real. The live process is similar: a person enters a problem statement, it goes through the same NLP processing, and the predictor assigns it to the right category. That is the high-level block diagram of such a system.

What is the prerequisite to build it? scikit-learn (sklearn) is a very good Python library; it has a host of algorithms and complete support to build such a system within the one library, and that is what we used.

Now, going into each of the steps in detail. As I said, the first is preprocessing, and there are a lot of data-related challenges here. This is an intense activity in its own right, and we would need a separate lecture to cover how each could be handled, but I'll just touch upon them.

The first challenge is unclean data. What I mean is that the text contains a lot of non-value-adding terms. There could be names or phone numbers which, as such, add no meaning; the same name could appear across all the different classes, and such terms pull down the quality and accuracy of the classifier. So we need things like named entity recognition to identify, tag and remove those items.

The second is wrongly labelled data: a ticket which actually belongs to category A but is tied to something else. I could have a payroll ticket classified as a facilities ticket because the agent was too lazy to close it in the right category, or because of some system error. If I have such data, the system is going to learn wrongly, because all it knows is what it sees, and this causes a lot of accuracy issues.

The third problem is unbalanced data. When you have a huge set of classes, certain classes may have thousands of training samples while others have only five or ten items. How do you handle that? Typically the classes with a very large amount of data totally take over the smaller classes, so the smaller classes never get predicted. This was a very acute problem we faced in real life.

The fourth is a large number of classes: in our case there were more than 4,000 to 5,000 classes. How do you handle so many? These, then, are the typical data-related challenges you can face while building such a system in real life.

How do you deal with them? Again, each of these is a separate topic on its own, but I'll touch upon it. The first is eliminating the less informative records: can I find out, by some automated technique, which records add value and which don't, and automatically drop the ones that don't? The second is active learning: I take a very small corpus which I manually ensure is very clean, maybe 25 or 50 records per class as the base, and from that I use a technique like k-nearest neighbours to expand into the other training records, adding whichever records are near to a particular class to that class. This way I can grow my training data set. The third is random oversampling. This did not work that well for us, but it can be tried: for the classes with very little data, randomly oversample and duplicate certain records so that more records are available. And finally, merging of similar classes: when there are too many classes, many of them may not even make sense standing in isolation, so is there a way to automatically find which classes are nearest to each other and merge them? This is something we did to reduce the number of classes from the original size.

Now, assuming we have done the complete clean-up and have a comparatively cleaner data set, the next big step is to read this data into our system. I'll quickly show you the IPython notebook for this. The first cell, of course, imports all the dependent libraries, and in the second section we do the task of reading the data. In our case there was already a pre-existing function for reading the 20 Newsgroups articles, fetch_20newsgroups, which we used to read these data sets into the Python code. Of course you could do the same for any other data; to show how it works if you have your own data, here is a small snippet of code.
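A minimal sketch of that snippet (the file name and column names are assumptions):

```python
# Read your own training data from a hypothetical tickets.csv with
# 'description' and 'category' columns.
import csv
import random

texts, labels = [], []                     # the two lists described below
with open('tickets.csv', newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        texts.append(row['description'])   # the training text
        labels.append(row['category'])     # the class it belongs to

# Shuffle so the records don't follow any pattern before the train/test split.
pairs = list(zip(texts, labels))
random.shuffle(pairs)
texts, labels = map(list, zip(*pairs))
```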
All you do is assume you have a CSV file in which the description column is your training text and the class column is the category to which it is classified, and use csv.DictReader to read the training data into your code. At the end of it you have two lists, which is all that is required: one list holds the entire set of training texts, and the second holds the class corresponding to each. Each item in the first list is a training sample, and each item in the second list is the class that sample belongs to. This is just to show how you could read your own data into your Python environment.

Once this is done, as you ingest you also split the data into training and test sets. Typically it would be a 70/30 or 80/20 split: you take 80 percent of your records for training and the remaining 20 percent for testing. If you are using the csv.DictReader method, you should do a random shuffle first, to ensure the records don't follow any pattern, and then do the split. That is the data ingestion step.

So now we have cleaned the data and ingested it, and the next major step is the natural language processing. For the code to understand the text, it has to come in a format the system can work with. For that we do something called vectorization: we apply TF-IDF vectorization, which is term frequency–inverse document frequency. The term looks big, but it is nothing more than this: for every unique token or word, it weighs how many times the token appears in a particular document against how many documents it appears in across the corpus, which gives a weight per term per document. At the end of the TF-IDF process, for visualization purposes (it is actually stored in a sparse-matrix format), you can picture a very large NumPy-style array in which the columns are the different words, the rows are your training samples, and each cell holds the weight of that word in that sample. This matrix is very large, because the number of unique terms is huge when the training set is large, and it is also very sparse: for a particular training item only some of those tokens appear, so only some cells have weights and most are zeros. That is why a sparse-matrix format is used to store this data.
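In the classic formulation, for a term t and document d in a corpus of N documents (scikit-learn's default variant adds smoothing terms and L2-normalizes each row, but the idea is the same):

$$ \mathrm{tfidf}(t,d) \;=\; \mathrm{tf}(t,d)\times\log\frac{N}{\mathrm{df}(t)} $$

where tf(t, d) is the number of times t appears in d and df(t) is the number of documents containing t.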
That is the vectorization part of it, and again there is nothing to worry about here: all you have to do is call the TfidfVectorizer available in sklearn. It has a lot of very good parameters to play around with, and I'll touch upon a few of them.

One is how you handle decode errors: you can say you want to ignore them, replace them, or be strict about them, so if any special characters appear in your training set they are handled automatically by this flag. The second is the n-gram range. When I was talking about terms, you don't have to look only at single terms; you can also look at bigrams, trigrams, four-grams. To give an example, "big" and "data" can each stand in isolation, but they also carry a lot of meaning together, since people nowadays talk about "big data"; the same goes for something like "Barack Obama". These words come together, so it makes more sense to take them as connected grams. You can set the range you are interested in, from a minimum gram to a maximum, saying for instance that I am interested in one to five, as part of the TfidfVectorizer initialization. The third is the stop_words parameter: if you set it, it automatically removes the noise words, the stop words of English, from the system. There is one more called lowercase: if you set it, the entire text is converted to lower case, so that two tokens differing only by case are still considered as one. These are just a few of the parameters; if you check the TfidfVectorizer API, there is a host of other very useful parameters to experiment with.

Once you have done the vectorization, as I said, you have a very large matrix, so can we do something to compress it? There is an API called SelectKBest with which you can. SelectKBest accepts different kinds of scoring functions, and we specifically used chi2, which is nothing but the chi-square test. What it does is retain only those features which are really correlated with their class; all the features which are not that correlated are dropped, which ensures good compression of the feature set. This is something we applied.

Coming back to how this looks in the code: we have done the cleaning and read the data into the Python code, and this is where the actual vectorization happens. The first step is to initialize the vectorizer with the parameters you are interested in, followed by fit_transform, which fits the vectorizer to the training set.
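A sketch of the vectorization and compression steps (parameter values here are illustrative, not the production settings; train_texts, test_texts and train_labels are assumed to come from the ingestion step):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

vectorizer = TfidfVectorizer(decode_error='replace',  # handle stray special characters
                             ngram_range=(1, 2),      # unigrams and bigrams
                             stop_words='english',    # drop English noise words
                             lowercase=True)          # fold case together
X_train = vectorizer.fit_transform(train_texts)       # fit on training text only
X_test = vectorizer.transform(test_texts)             # reuse the same vocabulary

selector = SelectKBest(chi2, k=10000)                 # keep the k features most
X_train = selector.fit_transform(X_train, train_labels)  # correlated with the labels
X_test = selector.transform(X_test)
```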
The output is the sparse array I was talking about. You then do the same process for the test set, transforming it with the same vectorizer, and after that you do the compression using SelectKBest, which is the piece of code you see here. Just for our own understanding I also did an inverse transform of the terms, taking one sample to figure out what is actually being stored. If you look, X_train is an array covering 11,314 training records with 101,320 unique tokens, and when you do an inverse transform you can see the kinds of terms that are part of a particular record. So this is how it looks after the vectorization and TF-IDF transformation have been applied.

Once you have done the vectorization and compression, we come to the core, which is building the classifier. Here again scikit-learn has a huge set of algorithms to choose from. We did a lot of experimentation across different ones, and in the end we stuck with RidgeClassifier, because of everything we tried it gave us the best performance. What I understand from reading about it is that the ridge classifier works better in high-dimensional spaces because of the way it constructs its separating planes, which keeps it from overfitting the data. That is our understanding of why it probably did much better than the other algorithms for us.

The next step, once you have shortlisted the classifier you are interested in, is to follow the same pattern: initialize the classifier with its parameters, then fit it to the training data. If you go back to the code, it is as simple as that. This classifier in turn has a lot of good parameters: you can set different weights for different classes, vary the solver used, and so on, so that flexibility is available. And if I want to switch to some other algorithm, all I have to do is change that one line of code: I replace RidgeClassifier with whatever I want, an SVM or naive Bayes, whichever I feel like, and the rest of the code still works with the same workflow. So I create the classifier and fit it to the training data: here the processed X_train is the output of vectorization and compression, and y_train holds the labels associated with each training sample. At the end of this you have a trained classifier.
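As a sketch, with X_train and train_labels from the previous snippet; swapping algorithms really is a one-line change here:

```python
from sklearn.linear_model import RidgeClassifier

clf = RidgeClassifier()          # e.g. swap in LinearSVC() or MultinomialNB() instead
clf.fit(X_train, train_labels)   # fit to the vectorized, compressed training data
```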
Once you have completed the training process, the next step is the actual prediction, which again is very simple because it is just one line: you use the same classifier that was trained and run predict on the transformed test set. The output, pred, is the list of class labels predicted for each item in your test list.

So now we have trained the classifier and completed the prediction, and we come to the very heart of the system, which is evaluating it. Just training and predicting is not enough: how do we know it has done well? Again we don't have to do much here, because there are well-established APIs. There are metrics called precision and recall, and a combination of both called the F-score. If I can get the F-score of the system and I have a benchmark, supposing I say I am happy with an F-score of 75 percent and above, then all I have to do is run this and check whether the system is good enough. Coming back to the code: there is a metric called f1_score, and you just pass it the actual values from your test set and the predicted values corresponding to them, and it tells you how good the classifier is. In this case it is around 69 percent; of course we have not done any cleaning here, we just applied what is available out of the box on the raw data.

There is also one more very useful thing called the classification report. Beyond the overall F-score, the classification report gives the F-score for every class. The advantage is that there could be certain classes doing well and certain classes not doing so well, and this gives me a trigger: a class like alt.atheism, the first one, is at around 56 percent, so I need to put a lot of effort into cleaning it and focusing on it, whereas a class like comp.windows.x at 82 percent precision looks comparatively okay. So if I run the classification report, it tells me which classes need more effort, more cleaning and preprocessing, or perhaps even elimination if they are very bad; it gives me that kind of control. So we have come to the end of the pipeline and completed the evaluation of the complete system.
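A sketch of the prediction and evaluation steps (test_labels is assumed from the earlier split; for multi-class data f1_score needs an averaging mode):

```python
from sklearn.metrics import f1_score, classification_report

pred = clf.predict(X_test)                               # one label per test record
print(f1_score(test_labels, pred, average='weighted'))   # overall F-score
print(classification_report(test_labels, pred))          # per-class precision/recall/F-score
```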
So what do we do next? The whole idea is that once you evaluate the system, the iterations start: you keep iterating until you reach the F-score, the precision and recall, you are targeting. For example, when we first built our system we were not even at 60 or 65 percent, and a lot of effort went into fine-tuning the parameters, cleaning the data, eliminating classes, asking for more data. Getting this far is not even 20 percent of the job; after this the hard work starts, where you get into the data and start tuning it: varying the parameters, applying the different data-cleansing techniques we discussed, maybe even varying the algorithms. In fact, here we have talked about only a single classifier, but what we did in practice was an ensembling of algorithms: we did not use one, we used a combination of three or four different algorithms and combined their outputs to get the final prediction. That is not something we cover here, but these are the things you try, iterating until you reach an F-score you are comfortable with.

Once everything is good and you have a good classifier, there comes the actual challenge of deployment. Things work fine on a single system, but how do you do a large-scale deployment? What we did was build this as a horizontally scalable system: we ran multiple instances of the same Python web endpoint we developed, scaled them out horizontally, and used a load balancer in front. For our live deployment we had thousands of requests in parallel, because this was for our entire organization, more than a lakh of people across the globe. The load balancer ensured that at no point was any particular instance overloaded, and if an instance went down, requests would no longer be directed to it. That gave us large-scale horizontal scalability, along with sizing the instances to ensure the number of parallel instances was sufficient, with enough backup.

As I told you at the start, we used this in our internal system to build a single-line interface for raising all the problem tickets in our organization, where the system automatically classifies the ticket. And what did we see? Very good results: a 50 percent improvement in the reassignment index, and because the whole business of redirecting tickets was reduced, we saved almost 100 person-months of effort in just three months. We also managed to make the first complete production release in under three months, which was something really good.

What we concluded is that Python has excellent libraries for handling machine learning problems and can be considered for live production use. This system is in production right now, and we were able to achieve the needed scalability and performance using Python; and the language itself, as all of you know, is very easy to learn and lets us write maintainable code. With that I come to the end of the talk. Thank you all for your time, and of course thank you to Python and the wonderful community for giving us such wonderful libraries that everybody can build on. We can start with the question session.
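The endpoint code itself is not shown in the talk; one plausible shape for a stateless prediction service of the kind described above, run as N identical instances behind a load balancer (Flask, the pickle layout and the model path are all assumptions, not the talk's actual implementation):

```python
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

# Assumes the vectorizer, feature selector and classifier were pickled together.
with open('model.pkl', 'rb') as f:
    vectorizer, selector, clf = pickle.load(f)

@app.route('/predict', methods=['POST'])
def predict():
    text = request.get_json()['description']    # the one-line problem statement
    features = selector.transform(vectorizer.transform([text]))
    return jsonify({'category': str(clf.predict(features)[0])})

if __name__ == '__main__':
    app.run()  # in production, run under a proper WSGI server instead
```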
What is the frequency of updating those models, if any?

Right now it is a trained model; we are not doing the automatic learn-back yet. That is something in progress, where we are trying to do automatic learn-back at regular intervals to rebuild those models. Next question.

Hi, this is Shashank here. I have two questions. One is: how do you smartly choose the F-score threshold? And the other is: how do you iterate smartly, so that you are not just changing things randomly?

The F-score threshold is based on your tolerance. For example, when this problem first came to us and we did the PoC, we said up front: we are seeing only 60 percent now, and however much we do, you will not see more than 75 to 80 percent in the short timelines. The entire group collectively agreed that this made sense, because even at 80 percent accuracy the system is much better than the manual process, where we were facing much lower accuracy levels. So it is a call which is a business decision.

Could you explain a bit more what exactly a 60 percent F-score means?

Yes. There are two things, precision and recall. One is how accurate the predictions are, and the second is, assuming something is there in the data set, how well the system is able to pick it out. So it is a combination of your false positives and true positives, and it is a well-defined formula: the F-score is the harmonic mean of precision and recall, and it gives you a single score for evaluating such systems. We just used this metric. Beyond that score, how we evaluated the system was on performance itself, because it is live. In fact we had a problem there: we used an ensemble of almost four or five algorithms, and it was taking more than a threshold of almost a minute, so we had to drop some of the items from the ensemble. For us the response time was very important, and of course the accuracy was very important, so that is how we evaluated, though there might be other standards we have not looked into yet.
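For reference, with TP, FP and FN the true-positive, false-positive and false-negative counts, precision, recall and their harmonic mean F1 are:

$$ P=\frac{TP}{TP+FP},\qquad R=\frac{TP}{TP+FN},\qquad F_1=\frac{2PR}{P+R} $$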
Hello ma'am, I have a question. We also did a production deployment with the same APIs, and one of the major issues we faced is that if the training set is pretty huge, making the system live takes a lot of time. To save that, what we did was do the computation up front and put it behind RPC, so that whenever we change the algorithms or any API changes, the same RPC calls are reused and getting the system live again takes a few milliseconds rather than minutes. What kind of things did you do in your system?

One good thing for us is that we already had a running model, so this was not life-or-death; all we did was build the new model in the off-peak hours. The old model keeps running, and once the new model is built, deployment as such takes very little time: model building happens on a separate system and we just push the deployed model, so the downtime is very low for us. As for learning while live, if you face that problem you could do something like batch learning, which again scikit-learn supports, where you learn the model batch-wise and iteratively (a rough sketch of that appears a little further below).

Is there any performance degradation with the batch learning? Because I see a performance degradation in our case; did it happen to you?

We have not faced that, actually.

Okay, cool. Thanks.

Hello, mic here. How do you iterate smartly? You did not answer that one.

Sorry, can you give me the whole context again? Right, so as I said: you go through all the steps and then it is a business call. At 80 percent, are you happy? What is the number you are happy and comfortable with to go live? Until you reach that comfort number you keep iterating. And where do you put your effort? As I said, the detailed class-wise report is where you put your effort, because you know which classes are doing badly: you can go directly to those classes, do your clean-up, do your merging, and spend more effort on those classes. And of course you can play around with the algorithm itself, trying some other algorithm.

Hello ma'am, my question is: whenever you retrain your classifier, how do you deploy it? When there is a classifier in prod and we want to retrain it, how do you do that?

Our training system is completely separate: we have separate training infra, and then we have something called one-click deployment. Once the training is done, we use a one-click deployment interface which pushes the models across all the different VMs we have. That is how it works.

Do you actually retrain your classifier from scratch, or just update the previous training?

Right now it is a complete retrain; we are not doing incremental training. That is also how scikit-learn works most easily.

You talked about iterating until you get the required F-score and so on. Can there be a situation where, instead of converging to the ideal score, it starts diverging and going away?

That is very possible, which is why the entire iteration process has to be done properly, using common techniques like version control. You don't randomly change things, because then you can't go back to your previous step. Be very clear, document each of your steps, and measure at every step; you don't just blindly make hundreds of changes and then measure. You measure each change, so if anything goes wrong you can immediately pull those changes back.
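A rough sketch of the batch-wise learning mentioned above (an assumption about how it could look, not the talk's code; RidgeClassifier has no partial_fit, so this swaps in SGDClassifier, one of the scikit-learn estimators that does):

```python
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
all_classes = sorted(set(all_labels))   # every class must be declared up front;
                                        # `all_labels` is a hypothetical full label list
for X_batch, y_batch in batches:        # `batches` is a hypothetical iterable of chunks
    clf.partial_fit(X_batch, y_batch, classes=all_classes)
```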
Right. And you also talked about dealing with a large number of classes, where you said one way could be to merge similar classes. But in real life that may not work: in the ticket problem, for example, you can't route a ticket to one department just because it is similar to another department, it has to go to the exact department. So what do you do? A hierarchical thing, where you first get the broad category and then within that do one more classification?

Very true. We do a clustering within that particular function. For example, for facilities we had some 25 groups, and at the end of this process we reduced them to 8. It was wonderful, because even after we examined the merges, the function themselves said "oh yes, these actually do the same thing, I don't know why we had 20 classes". So it can lead to immediate feedback into the actual business process.

Okay, great, thanks.

So my question is: did you try any other libraries apart from scikit-learn, things like NLTK or TextBlob, and what were the deficiencies, if any, you found in those libraries?

In fact, what we did initially was an ensemble which included some Java libraries as well, so we were ensembling across both Java and Python. Just before deployment we had a lot of problems because of the performance, so we completely dropped the Java part, because it was taking too much time.

And NLTK is in Python itself...

Yes, but we did not use NLTK here. In fact, there is one interesting thing, not related to this but to the same problem: we were using NLTK for the named entity recognition and we found it failed miserably, because it does not work properly for Indian names. So what we did instead, luckily because it is our own company, was to take our list of known employee names, use that for tagging, and eliminate them, more like a stop-word removal. NLTK's named entity recognition did not work for the Indian names in our data.

So you prefer scikit-learn rather than NLTK?

Yes, scikit-learn at least worked. We did some of the machine learning initially in NLTK and it is also extremely slow; that should be benchmarked, but it takes too much time and it didn't work for us.

And how about the polarity of content? Say we want to gauge whether a statement is negative or positive, because I want to see only the positive feeds and not the negative ones. How can I gauge that from the textual content, and what did you do on the basis of polarity?

What you are asking about is sentiment analysis, and we built a separate sentiment analysis system; in fact we did a six-level sentiment analysis. That was a different requirement: again on incident tickets, we wanted to find out whether the person who raised a ticket is happy, sad or angry, so we did a six-level thing. The techniques there are a bit different, because you need to bring in rules: if somebody says "I'm happy but...", they are actually not happy, so just because the word "happy" occurs I cannot give a high score. I need to consider the kinds of things that change the polarity. So the techniques used are slightly different, but we have done it, just for a different problem, not this one. Thank you.
You talked about using a lot of ensemble algorithms and combining them here. Could you talk a little about how you did it, what the other things were that you combined, and how the scoring was done?

Right. We did not go for any fancy scoring mechanism; what we used was majority voting, which is very simple. Assuming I have four algorithms in my ensemble: if two of them say it is class A, another says class B and the fourth says class C, then class A gets the maximum weight, while class B and class C get the least, and the same as each other, because only one algorithm voted for each. So we just used a very simple majority-voting scheme.

TextBlob, have you used it?

No, we haven't used TextBlob. We have tried pattern and such things, but not TextBlob. And I think oversampling did not work for us: even though we mentioned it, and the literature says it will work, it doesn't, because you are just creating duplicate tickets. That is a problem we still have to solve; all we did was brute force, going back to the functions and saying "give us more data". I think there should be some technical way to do it, but we have not cracked that yet.

For a scalable machine-learning production system, are you just using sklearn, or have you tried custom-made Python libraries, or something like GraphLab? Because what I have seen is that sklearn doesn't scale well.

So what was your data set size where it did not scale well, I am asking?

It is over a billion records.

Over a billion... we have probably not worked at that size. In our case the largest was about one million tickets, one to two million records, not more; it is not Google-size, we have not come to that. At one or two million it worked very comfortably, and beyond that, if you have more classes, you might have to use other techniques like vertical partitioning, but we did not have to reach that because it was comfortable enough for our needs. In fact this worked very well on a very normal, standard 8 GB RAM machine.
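A minimal sketch of the simple majority-voting scheme described in the ensemble answer above (clf_a through clf_d stand for four already-trained classifiers; the names are hypothetical, and ties are broken arbitrarily here):

```python
from collections import Counter

def majority_vote(classifiers, X):
    votes = [clf.predict(X) for clf in classifiers]   # one prediction array per model
    return [Counter(column).most_common(1)[0][0]      # most-voted class per sample
            for column in zip(*votes)]

pred = majority_vote([clf_a, clf_b, clf_c, clf_d], X_test)
```

scikit-learn also ships a built-in VotingClassifier with voting='hard' that implements essentially this scheme.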
I just want to ask about splitting the data, say separating it into 80 percent for training and 20 percent for testing. How do you decide how much should go for testing and how much for training? Do you do it randomly, or do you define particular parameters for it, like a k-fold, where if I give five folds it is divided into five pieces of 90 and 10 percent? How do you decide that?

Okay, there are two parts to that question. One is: can I use cross-fold validation? Ten-fold cross-validation is much more effective. I have not mentioned it here, but it is more effective because even though each fold uses a 90/10 split, you iterate across all the data. So using k-fold cross-validation is much better; what I showed is just a simple split to illustrate the idea.

In your ticketing system, I assume you have a hierarchy of levels that the user has to choose from. Did you train different classifiers per level, or how was it done?

We tried both. One approach was to put everything at one level and classify directly. The second was hierarchical, because we knew we had very clear groups: IMG, infra, facilities and so on. So we also tried building models at each of those levels, one model for IMG, one for infra, and so on. What happens is that when a person raises a ticket, the system first identifies which of these buckets it falls into, and then within that bucket it runs one more prediction. Of course there were challenges, because you are splitting across so many functions, but in the end we went with the hierarchical approach.

So it would predict the top-level hierarchy and then the user would still have to drill down?

No, the user doesn't have to do anything; everything is automated. The models are built such that the system first identifies which level a ticket goes to, and within that level it runs one more model, so it is a total black box for the user.

Okay, thank you.
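A hedged sketch of the hierarchical approach just described; all names are illustrative, and the talk does not show this code:

```python
# A top-level model picks the function (e.g. infra, facilities), then a
# per-function model picks the final category.
def predict_hierarchical(text, vectorize, top_model, sub_models):
    features = vectorize(text)                # shared vectorization step (a simplification;
                                              # each level could have its own vectorizer)
    bucket = top_model.predict(features)[0]   # e.g. 'facilities'
    category = sub_models[bucket].predict(features)[0]
    return bucket, category
```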
So what was the volume of ticket generation, and the number of users using the system?

The entire company works on this, so it is almost one and a half lakh employees, across the globe; requests come both from within the corporate environment and from outside. There are normally around 12,000 to 14,000 requests on any given day, and we actually planned for even 500 to 1,000 parallel requests at any time, because when there are widespread problems it can go to that kind of peak.

Okay, any questions there? What is the training data size you have in your production service?

We used, I think, more than half a million tickets; half a million to one million was the range. We used a historical corpus, and we could have gone for more, but we used around half a million to one million records.

Is that the terms you find, like the n-grams, or the complete data set, given you are coming out of a million records?

We used the complete data set; we did not do any partitioning or anything like that.

There would be a lot of noise in that...

Yes, and that is why we did the cleaning; we did not use it directly. As I said, the first step was the cleaning part of it, where we did a lot of work offline: we figured out what did not make sense and what was less value-adding, and it was all handled in that first step.

In n-gram models there are a few combinations which are never used because of the noise, and there is a high chance they consume a lot of memory. How do you plan to remove those from the system you are running right now?

Sorry, I can't hear you, can you repeat the question? ... Okay, I think your question is how many grams we take. In our production system we did not go for higher multi-grams; we went for two-grams only, nothing more than that.

Any questions? Last one. You told us about the classifiers; could you please tell us which classifier or algorithm you used, like gradient descent or something?

We used the RidgeClassifier; that was the best. Of course we used other things too, but the RidgeClassifier was the best, and it is available directly in scikit-learn, so you don't need to write anything: just search for RidgeClassifier and complete documentation is available.

Okay, thank you. Hello, just two questions. The ridge classifier needs to build on the order of nC2 different classifiers, right, one for all the classes? Does it work well, without taking too much time?

We were worried it would take more time, but it actually worked very well; there were no worries there.

The next question: did you put this learned model directly on a Hadoop distribution inside the company? Is that what you showed?

We did not use any Hadoop. In fact we just used a normal Windows machine; it could be Windows or Linux, a normal 8 GB VM. It runs fine on both, a normal standard 8 GB machine, that's all.

Okay, the last question I would like to ask: I have been using Apache Spark for machine learning, so have you compared scikit-learn with Apache Spark?

No, we haven't done that, but Spark is high on the agenda; we want to try it, but we have not tried it yet.

Have you done any regression, logistic or linear, using this?

Yes, we tried. If you look at our iterations, we tried logistic regression, we tried naive Bayes, we tried a host of things, and in the end only the ridge classifier worked. Certain things, like random forests, did not even complete, they took too much time, and things like logistic regression were not giving as good accuracy. So in the end ridge was the most accurate.

How much accuracy were you able to achieve?

As of now we have touched 80-plus percent. It is still a journey, it is still happening, and we are trying continuously, but 80 percent is something we are okay to live with, and we have gone ahead with that.

Is there any open-source community you are working with, so that I or someone else can contribute to the project?

We would love to start something, but we have not done that yet. scikit-learn is the place, and deep learning is something we would all like to work on, at least my group and myself, but we have not started. We would really like to contribute to a project where many people can learn; that would be great, we should start something.

Yeah, thank you guys. Anyone else? We have one minute left. Okay, fine. Thanks, thanks a lot.