Good afternoon everyone, welcome back; settle in right after lunch. Today we will talk about continuous online learning. It is a very old learning method, but it has been resurfacing with the bells and whistles of the new techniques, optimizations, and computing speed we have come across in the last four or five years.

Let me give you a quick overview of the problem I am going to walk through throughout the talk. Online learning is applicable in various domains; today we are going to look at one of the more challenging ones, Twitter, which is essentially a very noisy medium. Here is a quick chart: around 70 percent of the traffic is already noise when consumers are engaging with businesses, and close to 100 percent noise when businesses communicate to their consumers. So we chose to work on a customer service problem: we are listening on behalf of brands and trying to figure out what the brands can act on.

Identifying spam could be modeled as a classification problem. You set it up as classification, acquire a good quality data set, select a learning algorithm, then train, test, and tune until you reach benchmark accuracies, or better than baseline; you go beyond 80 percent, and you can start using the model in your production systems. However, as time passes, you would see the accuracy of what you built start deteriorating; the performance degrades. This is where I want to introduce the concept of non-stationary distributions. Twitter, as a language and as a data set, is essentially a non-stationary distribution. A stationary process assumes that everything is time independent, which means the averages of all the measures you have counted remain constant over time.

Here I have brought out four important orthographic features of this data set: how emoticons, URLs, hashtags, and handles are used over time. On the horizontal axis we have months, and we ran the analysis over 24 months. The chart shows the average number of emoticons found in a single tweet, versus the averages for URLs, hashtags, and handles. These averages change drastically over time, and that is a characteristic of a non-stationary distribution. So our whole game today is going to be catching up with those non-stationary distributions.

Here is another chart, where we split the data into two classes, spam and not spam. At times there is a two-times difference in the averages between one class and the other, which helps in generalizing the problem from a modeling perspective, but there are inherent non-stationary distributions flowing within each class as well.

So look at Twitter as a new language, and you realize that this language evolves significantly faster than the languages in which we otherwise communicate, even though Twitter is also a communication medium. This puts pressure on linguists, and on machine learning researchers who are building solutions on top of Twitter, to understand this data set much better.
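As an aside, the kind of orthographic tracking behind those charts is easy to reproduce. Here is a minimal sketch, where the regexes and the `tweets_by_month` structure are hypothetical stand-ins rather than anything from the talk:

```python
import re
from collections import defaultdict

# Hypothetical patterns; real emoticon detection would need a much richer set.
PATTERNS = {
    "emoticon": re.compile(r"[:;=8][\-o\*']?[\)\]\(\[dDpP/]"),
    "url": re.compile(r"https?://\S+"),
    "hashtag": re.compile(r"#\w+"),
    "handle": re.compile(r"@\w+"),
}

def monthly_feature_averages(tweets_by_month):
    """tweets_by_month: dict mapping 'YYYY-MM' -> list of tweet texts.
    Returns per-month average occurrences of each feature per tweet."""
    averages = defaultdict(dict)
    for month, tweets in tweets_by_month.items():
        for name, pattern in PATTERNS.items():
            total = sum(len(pattern.findall(t)) for t in tweets)
            averages[month][name] = total / max(len(tweets), 1)
    return dict(averages)
```

If these per-month averages drift visibly across the 24 months, the distribution is non-stationary in exactly the sense described above.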
Now let us look at this from a language perspective. This is a Zipfian fit of the Twitter language. Zipf's law has been studied quite a lot, and it could even become a philosophical point that everything in this world is Zipfian; but the practical way to look at it is that anything which follows a Zipfian fit is perhaps normal. What is not normal? I will show you another chart, from essentially the same corpus, but look at how it violates the Zipfian fit, and this is simply because the pre-processing was not handled carefully. So, a quick takeaway: careful pre-processing can greatly increase the accuracy you get from your models.

Let us also understand this idea of evolving vocabulary, evolving dictionaries, on Twitter. Begin with the chart on the right: these are the new unseen tokens each month as a fraction of the new tokens seen that month. In the middle are the new unseen tokens observed each month versus the cumulative growth of the vocabulary; this comes out around 8 to 12 percent between months, sometimes shooting up to 20 to 25 percent. This means that if you have a batch learner trained on an accumulated, high-quality, pre-acquired data set, there is a real possibility you will see a lot of misfires, or your features will simply not get activated, because you have never seen those vocabulary words before. On the extreme left is the same percentage as a 0-1 score: it stays in the bracket of just under 10 percent, stabilizing over the last two years, but still a roughly 10 percent vocabulary change each month. That is outrageous. This is a hard problem.

So there is an experiment. I could not find the time to build a nice chart for it, but you can run it yourself. The experiment encourages us to think about what continuous learning would mean. Build a classifier on a predetermined, pre-acquired data set and run the evaluations; in our case, we ran the evaluations on quarterly snapshots of the two-year period this talk's analysis was built on. Compare this with consistently rebuilding on the new data: as we move through the quarters, we also take the new data, build a new classifier, and check whether it outperforms the batch classifier built on the earlier data. We consistently see that including the new data samples increases accuracy scores easily by 10 percent. So again, this captures the notion: what if we were able to do continuous learning all the time?

There are some possible solution approaches. I am going to briefly talk about them and bring out some of the trade-offs as well. Reinforcement learning has been championed; we have been hearing recently how popular it has become and how the Googles and Facebooks of the world are capitalizing on it. Here is the problem: for binary classification, the MDP, the Markov decision process, is too small; with just two classes to consider, it does not learn much. In case you have worked on this, please let us know.
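Going back to the vocabulary-growth numbers for a moment: they are easy to reproduce. Here is a minimal sketch, assuming a hypothetical `tokens_by_month` mapping of 'YYYY-MM' to that month's token stream:

```python
def monthly_oov_rates(tokens_by_month):
    """tokens_by_month: ordered dict 'YYYY-MM' -> iterable of tokens.
    Returns month -> fraction of that month's distinct tokens never seen before."""
    seen = set()
    rates = {}
    for month, tokens in tokens_by_month.items():
        distinct = set(tokens)
        unseen = distinct - seen
        rates[month] = len(unseen) / max(len(distinct), 1)
        seen |= distinct  # grow the cumulative vocabulary
    return rates
```

If this rate hovers near 10 percent a month, a batch learner's feature set decays exactly as described above.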
There is another approach called mini-batches. With mini-batches, especially in deep learning, you can reduce the batch size to as little as 30 or 40 samples; you can tune it anywhere from a couple of dozen up to hundreds, depending on your problem. (Sorry, is the microphone rubbing against my skin? I look like a cyborg now.) So, mini-batches are a way to reduce the batch size and keep partially fitting the existing model. The assumption is that the batch size is large enough that you are not overfitting, not skewing the model towards particular data samples. But by assuming you need a minimum set of documents to update your model, you are essentially waiting for documents to come in. This does not really work in an online setting where either you have no estimate of when the tweets will arrive, or it is simply too late to mark a conversation as spam by the time the batch fills up.

The third one, let me do it better justice this time: if you are teaching the model with only one data point at a time, you are definitely going to skew the model, and it will stop learning over time. So that does not work either. I invented a name for it: tiny batches.

The last one is drift, which is also very interesting for us, and I am going to talk more about it. The idea with drift is: can we periodically identify that there has been a distribution change, and reset our learning from some point? The problem is that detecting drift is itself a hard problem, and using the drift to update your models only builds on top of that.

So here, I will give you a flavor of what worked for us. But let me also quickly step back and restate the problem we were working on: identifying spam. On social networks, on Twitter, spam can simply arrive with a tweet that went viral: a conversation where a brand is getting flooded with tens of thousands of mentions they do not really want to act upon. It is the brand's, or the user's, decision whether to call something important or not. This forces us to build per-user statistical models, and that is what you are seeing here.

So this is the setting that worked for us. We have a series of local models: as new users onboard, a local model starts working for each of them. There is also a global model, which understands the knowledge of this language across the whole data set, whereas the local model is localized to the data that a particular user observes. The local model is a very fast learner: within a single feedback action, it should be able to learn whether something is spam or not. It also accommodates instant feedback, which is where the mini-batch and tiny-batch discussion comes in: can we quickly update the model with the new learning? On top of this, there is a drift detection step, which we use to adjust how aggressively or how conservatively we want to update. All of this comes together in a meta classifier which produces the final result. I am going to give you a quick view of what each component is.
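Before going component by component, here is a minimal sketch of the incremental-update primitive this all relies on: scikit-learn's `partial_fit`, which updates an existing model with a mini-batch, or even a single example. The feature extraction here is a hypothetical stand-in, not the talk's actual pipeline:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# HashingVectorizer is stateless, so it never needs refitting
# even as the vocabulary grows ~10% a month.
vectorizer = HashingVectorizer(n_features=2**18)
model = SGDClassifier(loss="hinge")  # incremental linear classifier

CLASSES = [0, 1]  # must be declared up front for partial_fit

def update(texts, labels):
    """Partially fit the existing model on a mini-batch (or one example)."""
    X = vectorizer.transform(texts)
    model.partial_fit(X, labels, classes=CLASSES)
```

The stateless hashing trick is one plausible way to sidestep the evolving-vocabulary problem discussed earlier, since there is no fixed dictionary to go stale.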
The global model is a deep learning model, a CNN. We have tried both 1D and 2D nets. An off-the-shelf architecture gives us 86% cross-validation accuracy, and it is built on a huge corpus, so we have no intention of updating this model often.

We built the feedback loop by modeling this in an online setting, and this is how it works. The data is modeled as a stream; it keeps coming in. The model makes a prediction, saying whether this is the label or not. The environment, which in our case is a human oracle, the person working on the system, gives us feedback on whether the prediction was right, because doing so helps their workflow. The moment the feedback comes, that is, the moment the environment reveals the correct label, we incorporate it and update the model.

These are the objectives we wanted the local models to meet. If you are going to experiment with online models, there are plenty of them, but you need to set objectives and desired properties up front; this is what helps you choose one model over another. Quick objectives: first, we want to strictly improve with feedback; the whole idea is to improve. Second, we want higher retention of the various concepts learned in the past. You could call it memory, but calling a model's behavior "memory" is a different thing; it is not really the mathematics of it.

So here are the desired properties of a local model. One, it works in an online setting, which means feedback keeps arriving. Two, it is a fast learner: there is much more aggression in the updates, and it can learn instantaneously from a single feedback point (I will show you some charts of how this works). Three, it has recency of concepts. We use these properties to build our test cases and our evaluation strategies. When you see the last one, "don't forget recent (i.e., recently seen) data points," it means we go back to the last N data points the model has already seen and ask it again: hey, by the way, do you remember all of these?

There are various online models out there, and you can put them to the test; there is a nice, nifty scikit-learn script you can run against your data set to see which one immediately performs better. Here is what we chose: Crammer's passive-aggressive algorithm, the PA-II variant, where we have more control over the aggressiveness parameter and the fine-tuning. It works directly with the losses, and it can do not only classification but regression and ranking as well.

Quick results. Remember the architecture we built? Here are the results, and by the way, these are online-learning-only results: how the model performs if you run only the online learner on the data set. We ran it on 100K data points with a test set of 50K points, and it came in at a base classifier accuracy of 73%. Then we changed the approach: we ran the model in a truly online fashion, meaning that with each mistake we update immediately, and in another experiment with a certain delay, which accommodates the human environment as well.
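Here is a minimal sketch of that mistake-driven loop with scikit-learn's `PassiveAggressiveClassifier`, where `loss="squared_hinge"` corresponds to the PA-II variant; the stream, vectorizer, and oracle are hypothetical stand-ins:

```python
from sklearn.linear_model import PassiveAggressiveClassifier

# C controls aggressiveness: a larger C makes each corrective update bigger.
local_model = PassiveAggressiveClassifier(C=0.5, loss="squared_hinge")  # PA-II

def run_online(stream, vectorize, reveal_label, classes=(0, 1)):
    """stream yields tweet texts; reveal_label plays the human oracle."""
    mistakes = 0
    seeded = False
    for text in stream:
        X = vectorize([text])
        y = reveal_label(text)  # the environment reveals the correct label
        if not seeded:
            local_model.partial_fit(X, [y], classes=list(classes))
            seeded = True
        elif local_model.predict(X)[0] != y:
            mistakes += 1
            local_model.partial_fit(X, [y])  # single-point corrective update
    return mistakes
```

Tuning `C` is the knob the talk refers to: crank it up and the model learns a concept from one feedback point, but at the risk of forgetting older concepts, which is exactly the retention trade-off shown next.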
So overall, across the 50K data points, it went and made 9,028 mistakes, and these mistakes were fed back into the local system. Purely by running in an online fashion, we achieved an accuracy of almost 81.9%, which is a gain of about 9 percentage points just by incorporating online feedback.

So here it is. The top chart shows how many feedbacks we received, overall around 7,000 plus, under different parameters; this is where you can tune a lot of parameters based on what your data looks like. The bottom chart shows each update as it happens: a flag of one means that, the moment the feedback was given and the model was asked, "do you remember what the label of that data point was?", it remembered. In the first configuration it had a retention success rate of 10%; with some fine-tuning, we got a success rate of 100%. But there is a catch to aggressiveness as well.

So what happens to the retention score if we look back not just one point, but the last 400 points? (You can optimize this window for your own case; maybe 100 points.) Looking 400 points back, we see that 74% of the time we are able to remember the different concepts. Note that it is not about how many documents the model remembers, but how many concepts. The score can go down if there happen to be many concepts in the last 400 documents. But this is decent. This is good.

What the meta classifier does on top is that it is another online model, and intuitively so: if you are in an online setting but your final output comes from a batch classifier, then it is not an online setting at all. You have to put an online classifier on top. So we use online stacking. Behind the stacking there is the local classifier, which is an online classifier, and the deep learning classifier, the global one, which learns over a long time. Their outputs are fed into an online SVM. We wanted an online SVM because it brings more stability on top of the weaker classifier beneath it. We train the ensemble offline in batch, but once it goes live, it stays in an online fashion. The cross-validation accuracy of the online ensemble is around 82% here as well, which is pretty good for the kind of data set we have.

Here is what we see over time: the accuracy increases and the running error rate drops. It is a nice, steadily progressive curve, and if you increase the number of test points, take it to a few million, there is a good possibility you end up sitting on a classifier that meets a really high accuracy requirement.

Quick summary. I think this is the fastest 30-minute talk; is it 2:34 already? In summary, this is a learning method you can use to build continuous learners. There is a lot of infrastructure you need to support if you are running such systems in production.
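A minimal sketch of the online stacking just described: the two base scores are hypothetical stand-ins for the local and global models' outputs, and `SGDClassifier` with hinge loss acts as the online linear SVM. As in the talk, the stacker is assumed to have been seeded with an offline batch fit before going live:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Hinge loss makes SGDClassifier behave as an online linear SVM.
# Assumed seeded offline (batch partial_fit on historical data) before use.
meta = SGDClassifier(loss="hinge")

def stack_features(local_score, global_score):
    # Meta-features are just the two base models' scores for this tweet.
    return np.array([[local_score, global_score]])

def meta_predict(local_score, global_score):
    return meta.predict(stack_features(local_score, global_score))[0]

def meta_update(local_score, global_score, true_label):
    # Online update of the stacker itself whenever feedback arrives.
    meta.partial_fit(stack_features(local_score, global_score),
                     [true_label], classes=[0, 1])
```

Keeping the stacker online preserves the end-to-end property the talk insists on: every layer of the final decision can move the moment feedback lands.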
This is also applicable to other domains: for instance, monitoring and anomaly detection, where you perhaps do not know what the second class is and you need one-class classifiers, that is, one-versus-all; and recommendations, where a per-user statistical model captures the importance of each user's behavior. It can also work on stock market prediction, if somebody wants to bet money there and is able to understand stock market drifts as well; a hard problem.

There is a lot more work you can build on top of this. Think of using RNNs instead of the CNNs. Think of using different variations of online learners, including AROW. (Somebody is sleeping; see, I will take the blame for that.) Use character embeddings: look at Twitter, there are acronyms, morphological changes, induced spelling mistakes; your vocabularies are not necessarily going to match, and even running phonemes on top becomes added complexity. Character embeddings look promising; we should check those out. There are also more sophisticated methods of handling drift. We were using error-based drift detection, DDM as we call it; there are more sophisticated approaches we have not spent much time on, such as DDD, which is built for ensembles: if you have a large ensembling system, DDD can capture different distributions and trigger the right ensemble member. It is an active area of research.

Thank you. I worked on this with Anuj, so I am going to call Anuj up as well; in case you have questions, we can answer together. We work at Freshdesk, which we recently joined; we were working at a company called Airwoot, which Freshdesk acquired earlier this year. Thank you.

Audience: Hi, can you hear me? It is actually a two-part question. First: since you are using cross-validation, I am curious whether you maintained causality in your train versus test sets. That is, your test set should be temporally later than your training set. Second: the feedback you provided, where you supply your response back to the system. Did you use teacher feedback, or did you try to see what happens if you use a teacher feedback signal as an additional signal?

Speaker: Can you repeat the last part again?

Audience: Have you tried to analyze the impact of teacher feedback? Teacher feedback would be: imagine that your classifier's correct response should be one for a particular input sample, but your classifier predicted zero. You supply not only the zero, which is the classifier's prediction, but also a teacher feedback signal, which is the correct signal: at this point in time, it should have been one. It is a supervised feedback signal of what the prediction should actually have been. The reason you do that is that you need to model the distribution of the noise; feedback is a noisy signal, right? So you have to have some component to model the noise in your feedback. Did you try that out?

Speaker: Yeah, you are absolutely right; there are a lot of ways you can synthetically create noise inside your data sets. We were blessed with Twitter as a language: Twitter is a very noisy medium, so even the large corpus we were using had a lot of noisy samples in it, and we captured them.

Audience: I do not mean the noise in the data, I mean the noise in your classifier's response. The Twitter data is noisy, fine, given. But your classifier is going to have noisy predictions. Just by the fact that you have noisy labels, the model is also going to capture that noise, because the data is noisy.
Speaker: So when I say the data is noisy, I mean the labels are noisy too. And on the causality part: the whole data set is sorted in time, so causality is built into it. We trained on, I think, the first half of the year and tested on the second half. To answer the second question: what we provide as feedback is the right label for that text. In our environment, it is the user who is the ultimate decision maker on whether something is noise or actionable, because what is noise for you may not be noise for me, and vice versa. So when the classifier outputs zero for a text and the user says, "hey, this is not zero, this is one," what goes back into the feedback system is the data point together with the correct label provided by the user. That is the teacher feedback. We also incorporate the delay that would occur in real life, because otherwise the feedback is not going to come instantaneously.

Audience: Thank you.

Speaker: Any other questions?

Audience: Can you shed some light on the architecture of how you pass the feedback back to the model?

Speaker: Let us look at this slide; actually, I wanted to have one more slide for this. Here is the thing: the meta classifier is in an online setting too, which means whenever the feedback comes, it flows all the way down to the local models as well. If you visualize the data flow, the data comes straight into the meta classifier for the prediction. If the predictions are right, we make no updates at all; we only make updates when the predictions are wrong. And the meta classifier chooses which local model to interact with, because these are per-user models: we look up our HDF stores and figure out which models are going to get into the ensembling.

Audience: So you are maintaining identifiers for each of the models?

Speaker: Absolutely. Absolutely. All right. Hey, thank you so much.
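To make that final answer concrete, here is a minimal sketch of the mistake-driven feedback routing it describes; the in-memory per-user store and the model choice are hypothetical stand-ins, not the actual production code:

```python
from sklearn.linear_model import PassiveAggressiveClassifier

local_models = {}  # hypothetical in-memory stand-in for the per-user model store

def get_local_model(user_id):
    # Lazily create a fresh local learner when a new user onboards.
    if user_id not in local_models:
        local_models[user_id] = PassiveAggressiveClassifier(
            C=0.5, loss="squared_hinge")
    return local_models[user_id]

def on_feedback(user_id, X, predicted, true_label):
    """Called when the human reveals the correct label for one tweet."""
    if predicted == true_label:
        return  # correct predictions trigger no update at all
    # Wrong prediction: push the corrective update to this user's local model.
    get_local_model(user_id).partial_fit(X, [true_label], classes=[0, 1])
```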