And one more thing here: understand this curve. The accuracy you need versus the effort it requires is roughly exponential. The numbers aren't meant to be exact, they're just an approximation, but the idea is that if getting to 95% takes x months, getting to the next level takes far more than twice x. It's a very steep curve, so you need to understand where your problem sits on it and how much effort you want to put in.

The last important thing I'll say is: understand the machine learning software lifecycle. What do I mean by that? The initial phase of any machine learning project is a business problem definition: say, face recognition under very restricted conditions. Then you gather a bunch of images, train the algorithm, test it in the real field, and keep iterating until you can narrow the business problem very, very precisely. Why do it this way? Another way would be: I prototype it and immediately take it to production. That is probably a bad idea. The moment you turn a prototype into production, it requires a lot of engineering effort, and by then the definition of the problem itself may have changed, or the technology you're using may change, or the prototype may already be solving the problem well enough for you. So don't turn your prototype into production too quickly. You can still take it out to the field.

The next anti-pattern is what I'll call crap in, crap out. You can just have a look at that. The idea is that people are hoping they can simply throw data at the system and it will solve the problem for them. Thanks to XKCD for the cartoon. In general, this is the setting I'm talking about: most of it is the supervised machine learning setting, but it applies to other cases too. The supervised setting is as follows. You have a bunch of images and their labels: here, an image of a cat together with the information that it is a cat. You train the network, the model, to understand that this is a cat and this is a dog. So you give it a bunch of labelled images and the network learns from them; that is what's called the training procedure. At test time you give it a new image of a cat and ask: what is this? You don't know it's a cat; it's just one image, and the model needs to figure out that it's a cat.
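To make the supervised setting concrete, here is a minimal, hypothetical sketch (not the speaker's actual pipeline): small made-up feature vectors stand in for images, and a scikit-learn classifier is trained on labelled examples and then asked to label a new one.

```python
# Minimal sketch of the supervised setting described above (hypothetical data,
# not the speaker's pipeline): images are stood in for by small feature vectors,
# labels are "cat" / "dog".
from sklearn.linear_model import LogisticRegression

# Training data: feature vectors with their known labels.
X_train = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.9], [0.1, 0.8]]
y_train = ["cat", "cat", "dog", "dog"]

model = LogisticRegression()
model.fit(X_train, y_train)           # the "training procedure"

# Test: a new, unlabelled example; the model must figure out what it is.
print(model.predict([[0.85, 0.15]]))  # -> ['cat']
```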
Now, the symptoms of crap in, crap out. One symptom is that when you add any new data, the system suddenly falls apart. Another is that your model is on a knife edge: you make a small tweak and it suddenly gives completely different results from before. And the third is long, repeated iteration loops: you developed a model for one problem, and now, for the same problem with more data, you need to spend the same amount of time you took originally, if not more, just to get back to where you were. These are all symptoms of crap in, crap out: you threw data at the model, it gave some output, and it may not be the right thing.

So how do you make sure this doesn't happen? First, look at your data. Use tools to understand what is actually happening. Make sure the data is kosher and nothing is off in the input or the output: look at samples of inputs, samples of outputs, and do a simple walkthrough of each stage. In this segmentation example, for instance, I'm going through each stage and showing what is happening. Step through and make sure each stage is actually doing what it's supposed to do. That's the very basic set of tests. Once you've done that, check whether you can fit the model at all: can you at least overfit the training data? Your algorithm should be able to predict with very high accuracy, and when I say very high I mean 99-plus percent, at least on the training data. If you cannot overfit, you cannot generalize. So first make sure your model is capable of learning what you're trying to teach it. The next thing to ensure is that your training set and test set are balanced across classes. For example, if I'm training on cats and dogs and I have 10 images of cats and one million of dogs, the model probably cannot learn about cats, and a similar issue will show up at test time.

At this point you've probably already got a lot of data, and the problem becomes: how do you debug with a large amount of data? You have to build tools to do that effectively. In our case we were working on handwritten characters, so we built a tool where I can look at a large number of ones very, very quickly. In this particular case we figured out that these kinds of ones (it's very small here, but what we call continental ones, with the top stroke) were not in our dataset at all, so we had to augment the data to make sure they were in the system. So build tools, whatever they may be: maybe a simple tool like this, or a more sophisticated one like the next example. Here we are doing what's called t-SNE, t-distributed stochastic neighbor embedding, which embeds the original data into two dimensions. This, for example, is the cluster of twos, and this is the cluster of fives. It's a JavaScript-based tool where you can zoom into an area, look at a small region, and see what those characters are, and then you can see why a particular region is having a problem.
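As a rough sketch of the kind of t-SNE view described above, the snippet below uses scikit-learn's small digits dataset as a stand-in for the handwritten-character data; the real tool was a JavaScript web app, so this is only an approximation of the idea.

```python
# Rough sketch of the t-SNE view described above, using scikit-learn's small
# digits dataset as a stand-in for the handwritten-character data.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()                        # 8x8 digit images and their labels
X_2d = TSNE(n_components=2, random_state=0).fit_transform(digits.data)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=digits.target, cmap="tab10", s=5)
plt.colorbar(label="digit class")
plt.title("t-SNE of digits: overlapping clusters hint at confusable classes")
plt.show()
```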
A view like this gives you an intuition about where your model might be having difficulty. So build tools where necessary; in this particular case we built a web app to do it. Another thing I'd say is: do a decent job of debugging your system. When the network is not predicting correctly, find out why. One useful technique is class activation maps. These tell you, for example, which regions of the image were responsible for a prediction. In this particular case a three is being confused with a two, and the map shows which parts of the image make the model believe it's a two here and a three there, and similarly for the bottom row. There are other techniques too, like LIME, locally interpretable model-agnostic explanations. Both of these, I think, will be covered in the talks this afternoon.

The last one is: synthesize or augment your data. If you don't have enough data, get as much as possible and make sure it's representative. If you can't get more, and you believe your data is at least representative, meaning you can interpolate between the samples, then you can use GANs or other techniques to generate more. Here's an example of us using GANs to generate data, and in autonomous driving, for instance, they use simulators quite a bit for creating data.

One very important slide is this one; any key slide has a star on it. The idea here is that you have to do what is called disciplined machine learning, and this is again from Andrew Ng. Usually you have your training set and your test set: the training set is what you train the algorithm on, the test set is what you actually test against, and those two are the data you use on a regular basis. But you also need to keep two other datasets; you need to divide your data into two more pieces, and these two are used much less frequently. The verify set you might use once every few weeks, and the gold set once in a blue moon. The whole point of having them is that every time you train against one of these datasets, your test data for example, it becomes part of the feedback loop and you end up overfitting to it. To avoid those biases, you need to keep some data outside the loop. For example, when we are consulting, we keep the gold data with the customer. We never see it. We split it off, give it to the customer, and only the customer tests against it and tells us whether our model is doing what we expect it to do.
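A minimal sketch of the four-way split from the starred slide, assuming the data is already shuffled and the examples are independent; the split fractions and the helper name `four_way_split` are illustrative, not prescriptive.

```python
# Sketch of the train / test / verify / gold split described above.
# Proportions are illustrative; "verify" is touched every few weeks and
# "gold" almost never (ideally it stays with the customer).
import numpy as np

def four_way_split(data, seed=0, fracs=(0.70, 0.15, 0.10, 0.05)):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    n = len(data)
    cut1 = int(fracs[0] * n)
    cut2 = cut1 + int(fracs[1] * n)
    cut3 = cut2 + int(fracs[2] * n)
    return (data[idx[:cut1]],      # train:  used constantly
            data[idx[cut1:cut2]],  # test:   used regularly
            data[idx[cut2:cut3]],  # verify: used once every few weeks
            data[idx[cut3:]])      # gold:   used once in a blue moon

train, test, verify, gold = four_way_split(np.arange(1000))
print(len(train), len(test), len(verify), len(gold))
```

In practice the gold slice would live with the customer rather than in your own code, as described above.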
Now, how do you decide, in a disciplined fashion, whether you should change your architecture, train for longer, and so on? This flowchart will probably help you. The common problems are what's called bias and variance. High bias is where your model is too simple to learn the complexity in the data. High variance is the opposite: you're using a very complicated model to learn the same data, and as you can see here it's overfitting, fitting a very high-degree polynomial. How can you know which one is happening? Plot training set size versus error, for both the training error and the test error. If your training error is also very high, you have high bias. If your training error is low and your test error is very high, you are probably overfitting; that's the high variance case. So what do you do? In the first case you need a bigger model, you need to train longer, or you need a new model architecture, because your model is not even learning what it needs to learn. In the second case you need to add regularization, or get more representative data, or more data in general, or again a new model architecture. The last two branches of the flowchart apply when you have the verify and gold sets. Here I'm just showing the idea: if there's a 10% gap between human-level performance and the training set, your model is not learning; that's bias. Or here, between human-level performance and the test set: the training error is only 1% but the test error is about 10%, and that's probably a variance problem. And so on and so forth. So do have a disciplined mechanism by which you go and train models.
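As a rough sketch of the flowchart logic, assuming you have already measured human-level, training-set and test-set error; the threshold and the recommendations below are illustrative only.

```python
# Rough sketch of the bias/variance diagnosis described above, assuming you
# have measured human-level, training and test error. Threshold is illustrative.
def diagnose(human_err, train_err, test_err, tol=0.02):
    if train_err - human_err > tol:
        # Can't even fit the training data well: high bias.
        return "high bias -> bigger model, train longer, or new architecture"
    if test_err - train_err > tol:
        # Fits training data but not unseen data: high variance.
        return "high variance -> regularization, more (representative) data"
    return "looks fine -> move on to verify/gold checks"

print(diagnose(human_err=0.01, train_err=0.10, test_err=0.11))  # bias case
print(diagnose(human_err=0.01, train_err=0.01, test_err=0.10))  # variance case
```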
The next anti-pattern is about metrics. The first issue is using incorrect metrics. For example, say I want to detect fruits, and what I'm actually interested in is the number of fruits in a picture. If my metric is something like the total area covered, measuring the area of each fruit and summing it, that will throw me off; it's the wrong thing to measure. What would the right metric be here? The cardinality, the number of distinct fruits detected. So be careful about metrics; Harsh will talk more about metrics this afternoon, so this is a pointer to that.

The next one is bad loss functions, loss functions that don't allow you to train the model effectively. The particular problem here is photo aesthetics scoring. Given a bunch of pictures, I want to say: is this a great picture, a bad picture, or maybe just a decent one? For example, a four might mean a wow, a great picture, and a one might mean a bad picture. The training is as we discussed earlier: I give a bunch of images, each with its label saying this is a one, this is a four, and so on, and I train the algorithm; given a new image, it has to make a prediction. Now, when I do that, one of the important parameters is the loss function, which is used to optimize your model. Let's look at three example loss functions and ask which one is better for this particular case. Say the actual image is a wow image, this one for example, and I predicted bad. One possible loss says: whatever my prediction is, bad or nice or good, since I wanted wow, I penalize equally; I just give a one. Another says: no, if I wanted wow and got bad, which is really far off, I penalize more. And I could even use a squared loss, which pushes even harder: the further off you are, the more you are penalized. The choice of loss function decides how fast your model learns and how well it learns, so it's very important to choose the right one. In this particular case this might be a good loss function; in another case it might not be. Take a similar example where class four is "image of a cat", class one is "image of a dog", two is "image of a pig" and three is "image of a man". Here it doesn't make sense to say that four and one are far apart as classes and should be penalized more: whether you predict dog, pig or man for a cat, it's all the same, you got it wrong. So you probably want a different loss function. That's the intuition. There are other required properties too, for example that it has to be differentiable, but I won't go into them now.

The next one is bad distribution of data. This is a typical example we have seen in many cases: the test data is non-uniform, with one particular class being very frequent. Imagine your test data is 70% class two. You run the algorithm and you immediately get 70% accuracy. Why? Because the model is just predicting class two. It is not doing anything at all; it says class two every time and it's already at 70%. If the classes were equal, how much would that get you? Only 25%. So be very careful about data distribution. Look at the data distribution.
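To make the 70% trap concrete, here is a tiny illustration with made-up label counts: a "model" that always predicts the majority class scores 70% on the skewed test set and only 25% on a balanced four-class one, while learning nothing.

```python
# A "model" that always predicts the majority class, evaluated on a skewed vs.
# a balanced test set (illustrative counts): accuracy alone hides that it
# learned nothing at all.
from collections import Counter

def majority_accuracy(labels):
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

skewed   = [2] * 70 + [0] * 10 + [1] * 10 + [3] * 10   # 70% of class 2
balanced = [0] * 25 + [1] * 25 + [2] * 25 + [3] * 25

print(majority_accuracy(skewed))    # 0.7  -> looks impressive, means nothing
print(majority_accuracy(balanced))  # 0.25 -> chance level for four classes
```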
The next one is a common one; I face it with a lot of clients. They will say: Facebook has implemented face recognition with 98% accuracy, and I want that too. But a lot of the nuances of what the problem actually is get lost. Let me give you three nuances of the problem. The first is face detection and localization: is there a face in the image, and where are the faces, respectively. The next problem is verification, or authentication: given a picture of, say, Madhuri Dixit, and the claim that this picture is Madhuri Dixit, I want to say, is this her or not? Given a table of reference data, all you're doing is asking whether this picture is close to any of the images in the first row. And the last one is actual recognition, where, given an image, I have to decide, among all the people in the database, whether this is any of them. That is a much harder problem. So there are a lot of nuances in the problem itself, and that's why I said that when you come up with a business problem, you have to make sure it is tight with respect to the computational problem you're solving. Solving the harder problem takes much, much more time, so make it as tight as you can.

The other issue is conditions: the original images might be collected under good lighting, and then the algorithm is expected to work at, say, night, or on an image that is very blurred, like this one here. The algorithm cannot predict well if you give it very bad image quality. Similarly, in face recognition one of the common problems is 3D rotation: with a lot of 3D rotation the algorithm really cannot work well. So the kinds of constraints your system has to satisfy have a direct effect on how accurate it can be; you need to limit the conditions you have to work under. A few other issues are implementation problems, like incorrect math or an incorrect implementation of the math, and things like information leaks.

Let me move on to anti-pattern three, which is a good technique that sometimes causes problems: divide and conquer. It's a very commonly used algorithmic technique. You take a particular problem and divide it into pieces. For face recognition, for example, I'll preprocess the image, do face detection, segment eyes, mouth and nose, and finally do face recognition. Why is this good? It gives better interpretability and it's easier to debug: if something goes wrong you can go back and see what caused it. It's easier to improve, and easier to develop too, because each person can develop one of the modules. So it's really great. But what are the problems with divide and conquer? One is that the error accumulates. What do I mean by that? Say I want an overall accuracy of 80%. Then each of these stages needs at least about 95% accuracy, because the total accuracy is the product of the stage accuracies. Because it's a multiplication, as you can see here, you have to worry about it.
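A quick back-of-the-envelope version of the accumulation argument, assuming stage errors are independent: four stages at 95% each already take you down to roughly 81% end to end.

```python
# Error accumulation in a divide-and-conquer pipeline: end-to-end accuracy is
# (roughly) the product of per-stage accuracies, assuming independent errors.
stage_accuracies = [0.95, 0.95, 0.95, 0.95]  # e.g. preprocess, detect, segment, recognize

overall = 1.0
for acc in stage_accuracies:
    overall *= acc

print(round(overall, 3))  # 0.815 -> four "95%" stages give only ~81% overall
```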
The next problem is that, many a time, the stages are not independent. When you detect a face and hand it to the next stage, and later you improve the face detection, that change can affect each of the downstream components. You might assume they're independent, but they might not be. The third problem is what I'll call relative independence, which is that improvements cascade. For example, say I've got 10,000-odd images, I train the algorithm, and I do what's called ablative analysis, to find out which part of the pipeline has the most errors. Here, for example, 600 images failed at face detection, while all the other stages have much smaller numbers. What will the manager's first intuition be? I should go and fix this stage, because this is where most of the errors are. But what actually happens is that those 600 images failed because they were 600 very hard images, so the moment you improve this stage, you also have to improve the stages after it, because those same hard images now flow downstream. So the improvements cascade as well. These problems happen not only in machine learning pipelines but in any dependent pipeline: wherever you have divide and conquer with multiple stages, for example two-stage detectors, a CRF on top of a deep network, or a CNN-RNN combination, these problems occur.

The next anti-pattern is what I call general enough versus over-general. One of the reasons this happens is the sales people. An important metric for them is the conversion rate: you go to someone, you demo the product, you want a high conversion rate, and typical sales conversion is around 5%. So the more general the system is, the more conditions it can work under, the better for them. But they also want it to be fast and accurate. Now you have a problem: you need the most general solution, and it also has to be fast and accurate. Why is that a problem? Because the moment you want general and fast and accurate, the development time needed is very, very high. So one of the things people should really understand is what kinds of variation they actually need to handle. In computer vision, for example, it's things like scale and rotation. Scale: do you care about detecting objects at different sizes? The object could be 10 pixels or 100 pixels, and that difference makes a computational difference. Rotation: will it always be a frontal, straight-on face, or do you need to handle 3D rotation? The amount of such variation you need to handle determines how fast you can develop and how good an algorithm you can get quickly. The same goes for low-light versus strong-light conditions, like the strong light here, or occlusion. The more you can constrain all of these, the better it is.
Now, a war story about this. We developed an application to read bank forms, like the one you see here. To do that, you first have to align the image. What do I mean? The image might come in like this, not aligned at all, and I have a template of how I actually want it aligned, so I want to align it to that. So I run an alignment process where I detect key points and then align. Now, we assumed our algorithm had to work over the full 360 degrees, that is, the original image could come in at any angle. We spent about four to five months, came up with a particular kind of marker that works well with that algorithm, and got about 98.8% accuracy. What actually happened in the field? In the field they were going to use a scanner like this one. So what changes? What is the change in condition? The input only ever arrives at 0 plus or minus 10 degrees, or 180 plus or minus 10 degrees. In that case I could have used the old algorithm and still got 99.9% accuracy. Which means those four months of effort were essentially thrown away. Had we done the homework up front of understanding the actual product use case, we would have avoided all that wasted, unnecessary effort. The general point: generality is expensive, choose wisely. And coming back to sales, how do you resolve that tension? You can tweak your demos for that particular application; you don't necessarily need to solve the problem in all its glorious generality. Once you have a sales conversion, once you have a customer, you can go and generalize where needed.

To wrap up, a few more things. Managers need to understand the nuances of the problem. Talk with your engineers, understand exactly what tight problem they are solving; there is no magic sparkle here. Another very important thing is site visits: make sure your machine learning team actually goes on site to understand how the data is being generated. For example, in this particular case we were building a model to understand how students learn, and the students would enter particular data themselves. The problem was that we were using words like "intermediate" and "exceptional", and these were sixth and seventh graders who did not know the meaning of those words, and that affected how they entered the data. So actually being in the field is very important for anyone doing machine learning or deep learning or whatever it is. Another important thing is good UI and UX: if your UI/UX is bad, that alone can be the reason you're getting bad or unexpected data. Human in the loop is also very important: many a time the initial algorithm won't be that great, so you need humans in the loop to give you the initial improvements. And the last one, which I am probably guilty of myself more than anything else: don't over-theorize. It's very easy to go and build a very sophisticated model when the business use doesn't need any of it, or when the problem is never going to be solvable that way. It's very important not to do that.

So, to wrap up, takeaways for managers: there is no magic sparkle; do systematic, disciplined development.
Don't trivialize the importance of clean data and representative data. Make sure the data is representative: if I don't have all the corner cases of the data, I can't interpolate at all, and all of these techniques are, for the most part, interpolation techniques. Trust your engineers and their instincts, but make sure you're solving the right problem. A prototype is not production; understand that difference as well. And general enough is what we need, not over-general. Site testing and the importance of UX matter a lot. Human in the loop is very important. For example, one of our customers works on photos, creating good albums, and initially they solved it by having a thousand-odd people in India and Bhutan doing it by hand; only now are we using machine learning for it. The previous company I worked for, in the Bay Area, tried to do the same thing directly with ML and did not succeed. So there is sometimes a lot of value in having humans in the loop. And don't over-theorize, and be careful about people who love the math and could over-theorize.

And for the deep learning engineers: crap in, crap out. Check your data, go step by step, understand what is happening. Systematic debugging is very important. Build the tools you need to look at your data, and know the existing tools. Simulate or augment where needed. Do disciplined machine learning; have a look at that. The good, bad and ugly of losses and metrics: be very careful about those; today's other talks, Harsh's among them, go into them. Divide and conquer and its problems. And keep up with the tools; I cannot stress this enough. Quite a number of times you will implement something only to learn that there is a tool that already does it much more efficiently than what you built. With that, I'm done.

Question from the audience: you mentioned the verify as well as the gold data. What strategies should we use to build an effective test set, or verify and gold data?

Effective gold data: make sure your data is representative, that's the most important thing. What I mean is that all the conditions you want to handle should actually be present in the data. If it's face recognition and you care about low-light conditions, make sure you have images from low-light conditions; data from other conditions probably won't help you. This is what we call keeping the distribution of the data as similar to the real world as possible.

Another question: if I want to build a test set, should I start with a subset of the training data, or should I account for new test data? I'll take that offline with you. There's one more question; I'll probably take that as well.

Hello. As you mentioned, tools are very important. When there's a large set of data, say images, which tools help effectively with curation or labelling of image data? Is there any commercial or generalized tool?

Okay. There are companies now who help create and augment data, and there are a few tools, so I can take it offline and discuss. Is that fine? Thank you, Samod. Thank you.

All right. During the talk I noticed, and some of you might have as well, that some phones were ringing.
Let's be respectful of each other and put them on silent, or better yet, just focus on the talks and switch them off. All right. The birds-of-a-feather session will be starting on the first floor; the topic of this one is hubs and spokes of AI. Birds-of-a-feather sessions are general discussions where you can participate and get your questions answered. Also remember to fill in your feedback forms; they were placed on your seats when you arrived. Thanks.

Hello, hello. So, should we start? You can introduce. Next up we have Rishikesh. He'll be talking about building and driving adoption for a robust semantic search system.

Yeah, thanks for the introduction. This is work we did at Intuit, and it was accepted as a paper at this year's NAACL. It's joint work with Vishwa, who is here today, and with my manager, Amineesh. The way I'll cover it: we'll talk about the problem context, then exploratory analysis, some previous work that people have done, and then what the business wants, because in a lot of ML applications that is more important than the purely theoretical side, as the previous speaker also mentioned. Then we'll go to the key idea, the results, some learnings, and Q&A.

A few things about Intuit. Intuit is a roughly 30-year-old company based primarily in the US, and roughly 50% of people in the US who pay taxes file them using Intuit products; TurboTax is used by a lot of people. And this is tax season in India, so that kind of resonates with me. Customers interact with Intuit through several channels: telephone, web chat, even chatbots. And you need to understand that when customers come to Intuit, they come with a very different mindset. It's not like doing a Google search, because compliance is not a luxury: you can land up in jail if you file the wrong taxes. So this is extremely important, and the kinds of questions people come to us with are also very complex. It also means we need to give correct answers; it's not okay to give an approximate answer, because if you tell me the wrong tax bracket and I go and file my taxes using your answer, the IRS will probably come knocking on my door the next day. So I'm just trying to highlight the importance of the customer questions, and of how accurately we answer them, especially in this domain.

So what we did, even before trying to build anything, was to look at the kinds of questions that come to us. This classification is just one way of splitting things; there could be multiple ways. But at a broad level you have two categories of questions. What typically happens in a chat with a customer is that in the initial phase the care agent ends up greeting the customer: hey, good morning, how was your day, and so on. And in the US, customers typically exchange pleasantries too: yeah, my day was good; sometimes we even see customers chatting at a personal level with care agents. So towards the beginning of the chat we have these meet-and-greet kinds of conversations.
Then in the middle of the chat you typically have the core problem the customer wants to talk about. And towards the end of the chat you have the meet-and-greet stuff again: if you ended up solving the customer's problem they'll thank you, and if you didn't they'll probably bash you and say they're going to complain to your manager, and so on. So the point is there's a lot of meet-and-greet or general conversation at the beginning and end of the chat, and in the middle you have the core, domain-related content. So the first-level split is whether the question is domain specific or non-domain specific, which is general chat.

If you go into domain specific, you can again split into two categories. The first is where the answer exists; when I say exists, I mean it exists in our database or knowledge base. We're a pretty old company, so we have a fairly exhaustive knowledge base, and most of the time the answer is found there. If it's there, you just retrieve it and show it to the customer. But there are still cases where the answer needs to be created, because there might not be a ready-made answer to the customer's question. The green boxes just highlight the different approaches that can be used. If the answer exists but cannot be found, the way to solve it is semantic search, because you need to understand the meaning of the question; essentially it's a search or retrieval problem, like finding a needle in a haystack: you know the needle is in there somewhere, you don't have to bring it from outside. In the second box, where the answer needs to be created, you basically need some reasoning over knowledge bases, because now you don't have a ready-made answer; you have to generate it using some logic. You might have to go to different sources, maybe even do a Google search or go to the government website. When I say "you", I mean this is how human agents typically solve these problems, so the algorithm also has to do similar things.

So let's look at an example of why an answer can exist but not be retrievable. Why would I not be able to retrieve the answer? Here's an example customer query: "How to receive payment in Bosnia and Herzegovina convertible mark", with a spelling mistake in it. If I use just keyword-based search, looking for the exact words in the documents, I'm not going to retrieve the correct document, because these misspelled words won't exist in any document. And actually this example is pretty conservative: if I typed this myself, I'd probably make a mistake in every word; maybe not Bosnia, but the second word is beyond my abilities. If you look at it from a semantic perspective, though, all the customer is asking, assuming they are US-based, is: how do I get paid in a foreign currency? That's the real meaning of what the customer wants, so ideally you should surface this answer, and this answer exists in your database.
This is a real example; this really is the answer the customer is looking for, and maybe they'll then drill down further into that particular currency. The point of this example is just to motivate why this can happen: why you can fail to retrieve an answer even though it exists. So that's what this talk will be about. I laid out the entire landscape of the different types of questions, but for today we'll focus on this particular sub-problem: retrieving answers that exist in your database but cannot be found using keyword-based search.

Look at the last bullet on this slide; I think it's the most important. Two things are required here. First, you need to be able to retrieve synonyms, words with similar meaning; there's a notion of semantics. Second, the retrieval should be robust to aberrations. For example, dollar might be a word very similar to euro, because both are currencies, but our problem is slightly more complicated, because I'm not typing EURO; maybe I'm typing EZRO. So you should not only be able to retrieve synonyms, you should be able to retrieve them when there are misspellings or aberrations in the original word.

Let's quickly look at what other people have done; this is not a new problem. There are three classes of approaches. The first is where you create word embeddings for the query, using something like word2vec or GloVe; there are lots of ways of creating embeddings. When I say word embeddings here, I mean embeddings of whole words: you create an embedding of "euro", an embedding of "dollar", and most likely they will be close together in the embedding space. The pros: these approaches are fast, because you're just doing a lookup, and with an approximate lookup you can get maybe a hundred milliseconds of latency. The cons: you cannot handle out-of-vocabulary words, because if I misspell euro, that misspelled word isn't even in your vocabulary; there's no way to handle misspellings, or a word hyphenated the wrong way, and so on. The second class of approaches directly modifies the incoming query: think of it as a machine that takes in a query and spits out a modified version, modified in such a way that it can be used for retrieval. Here you can use standard RNN sequence-to-sequence models, treat it as a translation problem, feed in the input sentence and get a corrected version of the query on the output side. The problem with those approaches is that at prediction time you are pushing the query through an ML model in real time, which is expensive; for RNNs in particular the complexity is linear in the number of steps.
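Going back to the first class of approaches for a moment, here is a tiny sketch of the out-of-vocabulary problem: a plain whole-word embedding table simply has no entry for a misspelled word (the vectors below are made up for illustration).

```python
# Why whole-word embedding lookup breaks on misspellings: the table only has
# entries for words seen in training (vectors here are made up for illustration).
embeddings = {
    "euro":   [0.70, 0.31, 0.12],
    "dollar": [0.68, 0.35, 0.10],
}

print(embeddings["euro"])        # fine: in vocabulary
print(embeddings.get("ezro"))    # None: misspelled word has no embedding at all
```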
The third approach is to do away with search completely. You pre-create embeddings of your answers: take your answers and create embeddings out of them, either whole-sentence embeddings by pushing them through RNNs, or word embeddings aggregated in some way. Then, when a question comes in, you create its embedding and do a nearest-neighbour lookup. Again, the problem is that you still have to create the question's embedding in real time. The point I'm trying to make is that you have to weigh the pros and cons from an industrial perspective. We are not in an academic environment; in academia people might not worry much about latency, because you have three months to get the results for a paper. All of this is in the light of those constraints.

So let's look at the constraints we had, because, like I said, we are an industrial ML team, not academia. Your partners are going to be product managers, engineering teams and so on, and these are their requirements. First, at least at Intuit, product managers want to understand everything. They might not understand the maths behind your models, but they need an intuitive sense of how they work; if you make it a complete black box, they'll say, fine, but I don't think I agree with this. The reason is that ultimately their necks are on the line if something goes wrong. So it doesn't matter how fancy your algorithm is, you have to be able to give at least some intuition about how it works. Second, latency: the engineering team might have a rather inefficient system of their own with a latency of 300 milliseconds, and now they'll tell you that your system has to come in at 50 milliseconds, because it adds up and the total cannot exceed 350 milliseconds; nobody mentions that they already use 300 of it. I'm not trying to make fun of anyone here; it's just reality, and you need to be aware of the latency requirements. And third, most importantly, nobody, at least among the big engineering teams, wants to change their systems, because changing systems means breaking something. They want something plug and play: the best option for them is, give us an API, we'll call your API and get the job done; don't ask us to change anything in our system. These are the three constraints we were working with.
Now let's look at the key idea. Because of these constraints, the main one being latency and the second being that whatever we develop should be easy to plug into the existing system, here is the core of what we did. We trained a skip-gram model, but with subwords as tokens. Training whole-word embeddings is a very standard thing that people have been doing for three or four years now; in addition to that, we created character n-grams out of the words. So for a word like "invoicing", the embedding is not only the embedding of the whole word, but also a function of the constituent subword embeddings: think of the word's embedding as a function of the whole-word embedding plus some function of its constituent character 2-gram, 3-gram, 4-gram embeddings. This allows you to create synonyms. For example, look at "einvoice" and "invoice": in terms of subwords there are a lot of n-grams in common; apart from the first two or three 2-grams, the rest are shared between "einvoice" and "invoice". So naturally these two will end up quite close together in the embedding space if you create the embeddings this way. And why will "bill" also be close to "invoice"? Because of the whole-word embeddings: bill and invoice are used interchangeably, they occur in similar contexts, so even without subword embeddings they would be close together; that's the basic property of whole-word embeddings.

So we created a list of synonyms for every word. If some of you know Elasticsearch or Solr, these tools provide a mechanism for injecting synonyms through a synonyms file, and this is the format of that file. What we're essentially saying is that we're creating equivalence classes: the words "einvoice", "bill" and "invoice" are all equivalent to each other, and they map to the canonical form, which is "invoice". Now, if a query comes in about sending an einvoice, it gets expanded to include the other words in the equivalence class as well: you'll retrieve not only documents containing "einvoice" but also those containing "invoice" and "bill".

Then, coming to the question: how do you handle out-of-vocabulary words? Say the word "invoice" itself comes in misspelled. You break it up into its character n-grams, and you already know the embeddings of those character n-grams, because you trained your model earlier. So you retrieve the embeddings of the character n-grams, and that lets you compute the embedding of the whole word as a function of them. Obviously this misspelled word has no whole-word embedding, because it's out of vocabulary, but it still has embeddings for its constituent character n-grams, its chunks. So when you see an out-of-vocabulary word, you retrieve the embeddings of its character n-grams, compute the embedding of that misspelled or out-of-vocabulary word, and then do a nearest-neighbour lookup on the fly to retrieve its neighbours. That process is still much faster than pushing things through an RNN model, because an approximate nearest-neighbour lookup can be quite fast.
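Here is a rough sketch of the subword idea, using gensim's FastText (4.x API) as an off-the-shelf stand-in for the skip-gram-with-subwords model described above; the toy corpus, the hyperparameters and the misspelling "invocie" are made up for illustration, and this is not the authors' actual training setup.

```python
# Sketch of the subword idea using gensim's FastText as a stand-in for the
# skip-gram-with-subwords model described above. Toy corpus, parameters and the
# misspelling "invocie" are made up for illustration.
from gensim.models import FastText

sentences = [
    ["send", "an", "einvoice", "to", "the", "customer"],
    ["send", "an", "invoice", "to", "the", "customer"],
    ["send", "a", "bill", "to", "the", "customer"],
] * 200  # tiny corpus, repeated so the model has something to fit

model = FastText(sentences, vector_size=50, window=3, min_count=1,
                 sg=1, min_n=2, max_n=4, epochs=10)  # skip-gram + char 2-4 grams

# Synonym candidates for the synonyms file: nearest neighbours of "invoice".
print(model.wv.most_similar("invoice", topn=3))

# Out-of-vocabulary word: its vector is built from its character n-grams,
# so a misspelling still lands near the intended word.
print(model.wv.most_similar("invocie", topn=3))
```

The nearest neighbours found this way could then be written into a Solr/Elasticsearch-style synonyms file, for example a line like `einvoice, bill => invoice`, mapping the equivalence class to its canonical form.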
Here is one example of how this works. This is the correct word and this is the incorrect one, with a misspelling in it. If you look at the 2-grams, most of them are common between the two words, except for two or three depending on where the misspelling is. The net result is that we got a latency of typically less than 100 milliseconds; 100 milliseconds was roughly the P98, so only in about 2% of cases do we exceed it. When you compare that with deep learning models, where you typically see latencies of at least 200 milliseconds, that's a clear win in terms of speed.

Now, the earlier speaker talked about pitfalls in ML, and overfitting is a very common problem in deep learning; this slide is just trying to illustrate that. We had a model with close to 100,000 parameters. It's well known that the VC dimension of a neural network is lower-bounded by roughly the number of parameters, and in most cases it can actually be much higher, so order-n is just the lower bound. In our case we found that 50,000 examples clearly weren't enough; even 500,000 weren't enough, because at that point the model was still overfitting. And remember, this is not a very complicated model; it's just a basic model where we train the embeddings. Only when we went to somewhere around one or two million examples did it stop overfitting. So this slide just illustrates that neural network models really do overfit, and this is one of the less complicated ones; with RNNs and the like, you'd probably need well over a million examples. I mention it because I got to see this phenomenon in practice, in addition to reading about it in theory.

Now let's look at the dataset and the ground truth, because all of this is fine, there's a way to handle out-of-vocabulary words, but how do we verify that it's working? For us to verify that, we felt that any test set we use has to satisfy three properties. The first is that the ground-truth answers must be known, because we need the absolute ground truth; we didn't want to rely on secondary feedback like clicks or browsing behaviour, which is a weaker notion of feedback. The second property is that the query should be related to the answer, which is obvious, but I'm stating it anyway.
And the third property we wanted is that there shouldn't be direct word overlap between the queries and the documents, because if there were, there would be no need for semantic search: basic keyword search would retrieve the answers. The point is that it's not easy to get a dataset matching all these criteria. So we thought a little, and then we looked at books. All of us have read technical books in college, and a typical page of a book looks like this: there's a section heading and some text underneath it, the body and the heading. Books are written by authors who are experts in their field, and the authors take care to ensure that the heading is a summary of the content; obviously you don't want the heading to be irrelevant to the content underneath it. So the headings are typically related to the content, they succinctly summarize it, but because they are very short summaries, headings rarely going beyond four or five words, there isn't much direct word overlap. For example, the heading could be "tax exemptions", and the text underneath might say "the following expenses are deductible", and so on; the word "exemption" might not even occur in the text, because the author is summarizing the content in the heading. We realized that books satisfy all three properties I mentioned: the ground-truth property is satisfied because these are expert authors who clearly know the heading is relevant to the text underneath it; the second property is satisfied because the heading is obviously related to the content under it; and because the headings are very short, there typically isn't much word overlap between heading and text.

So the problem now becomes: given the heading, can you retrieve the matching body? That's the first problem. The second problem is: given a perturbed version of the heading, where you make some mistakes while typing it, can you still retrieve the correct answer? We created a set of around 800 questions along with their ground-truth answers. Then we had to generate misspellings, because we also wanted to test whether we can retrieve the correct answer in the presence of misspellings. We identified a set of 200 keywords that are important for our domain; this was done in consultation with business teams, who again are experts, and we even consulted chartered accountants, since we are in the accounting domain. Then we asked users to type sentences containing these keywords at a pace of 33 words per minute. That's slightly above the average typing speed (I probably type at 20 or so), so it's not an unreasonable speed, it's a speed people will type at, but it's not a very comfortable one either.
Because, obviously, if you type one sentence in an hour you won't make any mistakes; we wanted a realistic scenario in which to gather statistics on the kinds of mistakes people actually commit. Then, for each word, we created a distribution over its variants. We looked at all the sentences typed by all these people, around 20 people, roughly a thousand sentences overall, and for each word we get a distribution over its misspelled variants. Different people misspell in different ways, so this is a way of naturally collecting statistics on how each word gets misspelled. Then we had to generate misspelled versions of the sentences, now that we had a distribution over the misspelled variants of each word. For each sentence, we toss a biased coin that has a probability of 0.3 of landing heads up. If it lands heads up, we choose one word from the sentence to perturb, and that choice is made uniformly at random. Having decided which word to perturb, we draw a sample from its distribution D_w, which contains the variants of that word. So this is just a generative process for creating misspelled, or otherwise aberrant, versions of sentences.
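A small sketch of that generative perturbation process, assuming the per-word variant distributions D_w have already been collected from the typing exercise; the example words, variants and probabilities below are made up.

```python
# Sketch of the perturbation process described above: with probability 0.3 pick
# one word uniformly at random and replace it with a sample from its variant
# distribution D_w. Example distributions are made up.
import random

D_w = {
    "invoice": (["invocie", "invoce", "invoice"], [0.4, 0.3, 0.3]),
    "euro":    (["ezro", "eurro", "euro"],        [0.5, 0.2, 0.3]),
}

def perturb(sentence, p=0.3, rng=random.Random(0)):
    words = sentence.split()
    if rng.random() < p:                       # biased coin: heads with prob 0.3
        i = rng.randrange(len(words))          # which word to perturb, uniformly
        variants, probs = D_w.get(words[i], ([words[i]], [1.0]))
        words[i] = rng.choices(variants, weights=probs, k=1)[0]  # sample from D_w
    return " ".join(words)

print(perturb("send an invoice in euro"))
```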
Then we tried to create a simulated A/B test. At the end of this, we had around 800 queries, we knew the ground truth answer for each of them, and we also had perturbed versions of the same 800 queries. So we created two Elasticsearch instances. The instance we called control used a synonym file built from basic word2vec embeddings; the treatment used a synonym file built with our method. What we measure is this: we send the same 800 questions to both instances, treatment and control. Ideal performance would be getting the right answer for all 800; obviously neither system is going to be ideal, so we compare how many of the 800 each instance answers correctly. That's what this graph shows, and the X axis indicates how many top answers I'm looking at. If I focus only on the top 10 answers, the blue bar is the control and the green bar is the treatment. For the baseline, if I send in 800 perturbed queries, I get the correct answer for only 436 of them, whereas the instance using our approach gets it for 526 — a lift of roughly 20% in the number of queries answered correctly. And we can be very confident about these numbers because we know the ground truth exactly: I know for sure what the right answer for each query is, so if I don't retrieve it, I count that query as not answered. You see similar trends if you look at the top 30 or the top 100 answers. But what's more interesting is that when we went from top 30 to top 100, the green bar still had not saturated. That means there are correct answers ranked somewhere between 30 and 100 which we are still able to retrieve — going beyond the top 30 still helps us. Whereas for the blue bar it does not help, probably because the answer is not even retrieved: even if you go down to the 99th-ranked result, you still won't find the correct answer, because it was never retrieved in the first place. So that's another important takeaway from this slide: our approach does not saturate quickly. It also motivates the next step, which is that we need a way to re-rank results. The basic ranking that Elasticsearch or Solr provides is not working perfectly for our case, so in addition to this model we need a downstream model that can re-rank the answers in the right order. But the purpose of this talk was to focus on recall: can you retrieve the answer at all? Whether it comes ranked 50th or 60th is a second-level question, because if you can't even retrieve it, ranking isn't going to help you. Now, a bit of intuition comparing this with other approaches like edit distance. We did not perform extensive experiments here, but intuitively the embedding-based approach seems superior because it has a notion of semantics, including sub-word semantics. The easiest way to see this is with an example: learnings and earnings. In terms of edit distance, a misspelled "learnings" could easily snap to "earnings" — the edit distance between them is tiny — yet these are completely different words with no relation between them. The problem with edit distance is that it has no notion of sub-word similarity. Similarly, arbitrate and barbiturate: barbiturate is a drug, in case some of you are wondering, and arbitrate is a legal term — when two parties have a disagreement, a third party sits in judgment and tries to resolve it. Again the edit distance is very small, so with pure edit distance you tend to zero in on these accidental "synonyms" which are really not synonyms at all. Similar things can happen with phonetic similarity, and a further challenge there is that you need a lot of handcrafted rules, because different languages have different morphology. In our case we don't have the luxury of handcrafted rules, because our language is quite dynamic and we want to adapt to what customers actually do. This slide has some tips for working with business; just read through it. The key message is that you will always have multiple stakeholders: your superiors — VPs or directors, who are business people — product managers, other data scientists, and each stakeholder will have a different set of requirements, which can also conflict with each other.
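On the edit-distance point above, a quick self-contained illustration — a standard Levenshtein implementation, not anything from the talk — shows how small the distances are between these unrelated word pairs:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("learnings", "earnings"))     # 1 -- yet the meanings are unrelated
print(levenshtein("arbitrate", "barbiturate"))  # 2 -- again small, again unrelated
```

A pure edit-distance matcher would happily treat each pair as near-synonyms, which is exactly the failure mode being described.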
So don't try to please everyone; just try to do the right thing. But you will get these kinds of questions, and it's important to have an answer, because it is fair for these people to ask them and you have to respond. Just don't get too bogged down in it. Your VP might say, hey, word2vec was done three years back — what's so great about you doing it now? Maybe they don't realize that we are doing something different. The point is that they read something in a business magazine, or they attend some five-minute lecture that Andrew Ng might have given, and then they come back and ask, why aren't we using the latest fancy technique? So just be aware of those things. Then, what we learned from a data science perspective: interestingly, given the current state of chatbots, people are smart. People quickly understand whether they're talking to a person or to a bot, and interesting things happen. We have seen people ask questions like "are you smart?" when they know they're talking to a chatbot — you wouldn't ask a human care agent that unless you were completely, thoroughly angry with them. And people invent words. "Doing a Neymar" is a famous phrase that appeared during the recent World Cup; it has nothing to do with football itself — it just means throwing a tantrum or putting on some drama. We need to keep up with these things, which is again why I'm saying we can't go with handcrafted rules: phonetics-based approaches and so on, where you need handcrafted rules, clearly cannot keep up with the pace at which language changes. And finally, we realized we need to be humble, because there's a whole category of queries that cannot be handled with current ML techniques. Here is one example: an Indian citizen who lived in Bangalore until a certain period, is now working in the US on an L1 visa, travelled to Germany for one month, and has a brother with a small enterprise in which they are a 20% partner — and the final question is, "am I allowable to pay income tax in India?" To answer this, the question really spreads across multiple domains: you need to understand immigration law, you need to understand the taxation laws of both the US and India, and you need to understand what happens if the person visited Germany for one month — does one month change anything, would three months have changed anything? We clearly cannot answer these kinds of questions, because it's not possible to have a ready-made answer; it's a very contextual, very specific question. ML is simply not at a stage where you can directly answer it, and that's where the humility aspect comes in — a lot more research needs to be done. These are the questions we call "the answer does not exist": the answer has to be created by your model. Thank you. So if you have any questions, I'm "allowable" for them. Yeah? So — the key idea, isn't it very similar to fastText? It is similar to fastText, yes. Okay. And the second thing, when you're adding character bigrams, I'm just curious — it must be losing some semantics compared to whole words, because you are...
So, okay — what we do is, the embedding of a word is a function of the bigram or trigram embeddings, whatever n-gram embeddings, as well as the whole-word embedding. The whole-word embedding is also there; it's not only a function of the n-grams. Last thing — did you explore trigrams instead of just bigrams? Yeah, bigram was just an example. We experimented with different n-gram sizes; we tried ranges from two to six, up to two to ten I think, and three to six worked the best. The reason is that if you go too fine, if you use single letters, you have a lot of support but that doesn't tell you anything; on the other hand, if your minimum n-gram size is five, you probably already lose out on a lot of words. So the sweet spot was somewhere between three and six. Yeah. Okay, I don't know who is next. Okay, yeah, please. So I believe word embeddings were used. My question is about misspelled words — you talked about using character-level embeddings, and that's where I'm confused. You used word embeddings, and the entire model is built and learned on the basis of word embeddings; how do you use the character embeddings? I didn't get that part. Sure. The way the model is built, it uses both word embeddings and character n-gram embeddings. The output of training is an embedding for each word that was in the vocabulary, as well as embeddings for the n-grams. Let's take a concrete example and focus only on two-gram embeddings. I split each word into its constituent n-grams, and I say that the embedding of the word is a function of the whole-word embedding as well as the n-gram embeddings. So for "invoice", there will be a separate embedding for the n-gram "in" — you already have embeddings for "in", "nv", "vo", and so on, created as part of the model-building process. Understood. So the function's inputs would be the word-level embedding and the character-level embeddings, and the final vector, whatever it is, would be used. And in the case of a misspelled word — with cosine similarity or whatever similarity you use — did the character-level embedding of the misspelled word and the intended word actually turn out similar? Yes, it did; that is why we are able to retrieve. So what was the function used to merge the two embeddings? We tried different ways. One was simple averaging. Another was that for out-of-vocabulary words, where clearly there is no embedding for the whole word, we gave more weight to the character-level embeddings. Those are the kinds of functions we used. And if your specific question is about cosine similarity — I don't have the exact numbers offhand, but in practice we did see that the synonyms made sense. In fact, that's why you see so many extra questions being retrieved in practice: if there were no real similarity, you wouldn't see a 20% difference in the number of extra answers retrieved. Hello — yeah, hello. I'm Rakhar, working at Niki.ai. One of the things you talked about was handling out-of-vocabulary words.
One of your slides had the example of learning and earning. Consider that "earning" was in the vocabulary but "learning" was not. The way you would make an embedding for "learning" is through the character-level skip-gram model you have already trained — correct. So basically "earning" is the common part, and only the "l" is missing when you build the full embedding for "learning". Now, you said you intuitively saw that Levenshtein distance won't work there, but only a single letter, the "l", differs and the rest of the word is common. So doesn't it follow that the "earning" embedding would again be similar to a large part of the "learning" embedding, and Levenshtein would give you the same result? Yeah, but the point is that it depends on what n-grams you have. In terms of edit distance, there's a difference of only one, probably. But at the word level, you might actually have the words "earn" and "learn" in your vocabulary even though you don't have "learning". Let's say you also have four-grams — I'm just giving a concrete example. The word vectors for "earn" and "learn" capture that the two are not semantically close; they have no relationship — in fact, sometimes if you learn more you earn less. Those word-level and sub-word-level signals all influence how close the final embeddings end up, because the composed embedding takes some information from two-grams, some from three-grams, four-grams, five-grams, and so on, and that helps separate the two words. I'm talking intuitively here. Whereas for edit distance, there is no notion of sub-word similarity or dissimilarity at all. Yeah. Next question: in the case of out-of-vocabulary words, for a given word, how do you decide whether it's actually a new word or a misspelling of an existing in-vocabulary word? We don't have a way of differentiating that, at least as of now. One way is to use an expanded dictionary and so on, but for now we don't really distinguish. We check whether the word exists in the vocabulary; if it does not, we create the embedding and map it to the closest word. So in the current approach we don't distinguish between a misspelled word and a word that is not in our vocabulary but might still be in the general English vocabulary. I think one way to solve this is to apply some pruning, like edit distance, or use the word's context itself. Yeah, you could do that — the point is there are different ways to do it. In this case we wanted to understand how sub-word embeddings work; this is a very focused piece of work exploring sub-word embeddings. The second motivation was that we wanted it done in a data-driven way. So yes, you could use a combination of edit distance and this approach, and in practice you would, but for this talk I wanted to focus on sub-word embeddings because it's a slightly new way of doing things — edit distance has been around for a very long time. You can always combine different approaches; this is just a different way of doing it. But you get what I'm trying to say, right?
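To ground this exchange, here is a rough sketch of how a word vector can be composed from character n-grams plus a whole-word vector. The boundary markers, the n-gram range, and plain averaging are assumptions in the style of fastText-like models, not the speaker's exact formulation:

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word, with boundary markers (fastText-style)."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def compose_vector(word, word_vec, ngram_vec, dim=100):
    """Average the whole-word vector (if the word is in vocabulary) with the
    vectors of its character n-grams. For an out-of-vocabulary or misspelled
    word, only the n-gram vectors contribute."""
    parts = [ngram_vec[g] for g in char_ngrams(word) if g in ngram_vec]
    if word in word_vec:
        parts.append(word_vec[word])
    return np.mean(parts, axis=0) if parts else np.zeros(dim)

# The two words share many n-grams, which is all edit distance "sees"...
shared = set(char_ngrams("learnings")) & set(char_ngrams("earnings"))
print(len(shared), sorted(shared)[:6])
# ...but n-grams such as "<lea" and "lear" occur only in "learnings", and in a
# trained model they sit near "learn"/"learning" rather than "earn"/"earnings".
```

With a scheme like this, the composed vector for a misspelled "learnings" is pulled toward "learn" and "learning" by its distinctive n-grams and by the whole-word vectors of related in-vocabulary words — the separation the speaker is describing, which raw edit distance cannot express.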
Just coming back to learning and earning: if I'm having a conversation with the chatbot, then even if I make a spelling mistake, based on the context you should be able to figure out whether I'm talking about learning or earning — that's what I mean to say. Yeah — again, it depends on how complicated you want your models to be. If you want to use a full-blown RNN kind of model at query time, you can do all those things, because that model will actually capture the context; you might even feed in the earlier utterances from the chat. So yes, you could do it in many ways, but you have to understand the trade-off between latency and accuracy. The reason we did not include the entire context is that we did not want to complicate things at this stage. If you use an RNN model, you can push in the entire chat history of that session up to that point, and that gives you even more context — context that might even let you correct the word: if you know the user is talking about admission to IISc, most likely they mean learning and not earning, whereas if they're talking about IIM, probably it's earning and not learning. Guys, we have a morning break in this auditorium now, so the next talk will start at 12.05. Since nothing else is scheduled, we can keep the questions going, but if any of you want to leave, you can. And thanks to Shikesh. Yeah, I'll be available here, and offline as well — you can always come by. Thank you. All volunteers, please come and see me. Welcome back from the break, guys. We have a couple of announcements. The t-shirt counter is open now; you can go and buy t-shirts, or if your ticket includes a t-shirt, you can use your ticket to collect it. Another reminder to fill in your feedback forms — they help us make the conference better. And again, phones were ringing during the last talk as well, so please silence them. Also, the BoF on ethics and privacy is starting on the first floor, if some of you want to attend that discussion. Next up we have Vineeth Balasubramanian. He's a professor at IIT Hyderabad, and he'll be talking about going beyond "what" and asking "why": explainability in machine learning and deep learning. Over to you. Good morning — rather, good afternoon now. Can you hear me at the back? Is it all right, people behind? Great. My name is Vineeth Balasubramanian, I'm a faculty member at IIT Hyderabad, and thanks to the organizers for inviting me here to share some of the background work we have done in explainable machine learning.
So I will give an overview of the work so far in the broader community and then talk about what we have done in this space. Generally, wherever I go, I get asked: is there an IIT in Hyderabad? So I think it's my duty to start by saying that there is an IIT in Hyderabad, and it's actually not too young — we are about 10 years old now and have graduated about seven batches of students. We have pretty much all the departments that any other IIT has, and we currently have about 2,200 students. Those are some of the buildings we currently have, at the top of the slide. I'll keep this introduction very brief because I want to focus on what I need to cover. The CS department at IIT Hyderabad, which is where I work, has about 20 faculty, which is close to steady state for most IITs; among the newer IITs that came up 10 years back, we are probably the first to reach that number. Our opening and closing JEE ranks, which is one way people measure how well IITs are doing, are improving each year — this year our opening rank was somewhere around 450 and our closing rank was 770 for the CS department. We have projects with government, academia, and industry, and several student and faculty awards. So if you've not been that side, please do visit — especially those of you who didn't know there was an IIT Hyderabad. Just a quick brief about the kind of work our group does, and then I'll get to the topic. Broadly we work on machine learning, deep learning, and computer vision; those are probably the keywords associated with our group. Within the group we do both algorithmic and applied work. On the algorithmic side, some of the things we do are non-convex optimization for deep learning — the focus of this entire day is deep learning, so: how do you train deep learning methods faster, and how do you prove convergence guarantees for the gradient descent methods used in deep learning? That's an important question we try to answer. Explainable machine learning, which I'll talk about more as we go. Deep generative models for settings where you don't have much supervision, and deep graph representations. Those are what I would call data-agnostic research — we're not looking at particular applications, just trying to solve fundamental problems in that space. On the applied side, most of our work is in computer vision: broadly, human behaviour understanding — recognition of expressions, poses, gestures, all of that. Of late, we have a couple of projects on vision with drones, with applications in agriculture, defence and security, and disaster management. We also have a long-term project with Japan on deep learning for agriculture, where we have experimental farms — a collaboration with an agricultural university — and we fly drones over farms in Hyderabad and work with the recorded videos. Some of the venues where we try to publish are CVPR, ICCV, ICML, and NIPS; those are the venues we typically target. With that quick background, let me step into the focus of today's session: explainability in ML.
Broadly speaking, I will cover three aspects in this talk. First, an overview of what explainability in ML means and what the efforts so far have been. Then a second segment on visual interpretability in CNNs, convolutional neural networks. And lastly, very briefly, future directions in this space. That's the outline of the talk. Some things to keep in mind: it's a semi-technical talk — there will be some technical details, but broadly it's a high-level talk, so even if you don't want math, you can still listen; you shouldn't miss much. It's an intermediate-level talk, so I'm going to assume you have a basic background in deep learning. Anyone who's completely new to deep learning here? Anyone? Okay. I do have a couple of slides where I'll breeze through the basics, but otherwise the assumption is that most of you know CNNs, RNNs, and at least the basics of neural networks — nothing more than that. And the focus will be computer vision, because we're going to talk about CNNs in this particular context. All right. I think all of us know — this is a redundant slide, but more to set the context — that machine learning has been very, very successful. There have been applications ranging from science to the web to marketing to manufacturing where machine learning is used on a daily basis, so I'm not going to go further into that. If you look at deep learning as a subset of machine learning in a broader sense, it is a group of algorithms comprising a subset of machine learning models, and very broadly, the successes of deep learning have been in the perception space — vision, text, and NLP. Those have been the domains in which deep learning has been most successful, and the number of papers in this space keeps increasing. And this was an interesting graphic I found to understand the proliferation of deep learning: a graph of the number of deep learning models on Google's servers. Apparently the number of deep learning models deployed on Google's servers across their various applications is also increasing exponentially — just one interesting way of looking at how deep learning has been growing. This is something perhaps all of you know. Now let's look at things a little more closely. If you look at where machine learning is today — at the machine learning applications we use in practice — you see problems like: which product is relevant to this user, what is the sentiment of this tweet, what are the objects in this image, depending on which domain and application you work on. The general abstraction of all of these problems is "what is X?" — that's the question being asked in all of these contexts. And if you look at the applications in which you and I use machine learning in today's world, some observations: the cost of making a bad decision is not much — a bad movie recommendation, you lose 500 rupees and three hours, and that's all right. Accuracy is usually the all-important metric you're looking for.
Variants of accuracy — F1 score, precision, recall — I'm going to group all of them into what I call "accuracy"; whichever you use, they are all one kind of metric you're optimizing for. Why a particular prediction was made doesn't matter at all: as long as revenue is optimized, as long as monetization is not affected, you really don't bother about why a particular prediction was made. That's where things stand at this point in time. And it's highly one-dimensional — you're trying to optimize one particular metric. Look at any machine learning work: the results presented are one particular metric, accuracy or F1 score or similar. That's the overall framework in which most machine learning applications operate today. Now, where is machine learning yet to fulfil its promise? Complex real-world systems. I'm not going to trust it if, say, I take one of my family members to a hospital and there's a reception desk running a machine learning algorithm that says, oh, your family member is free from cancer, you can go home now. I'm not going to accept that. I would insist on meeting somebody with the medical expertise, get their opinion, and only then go — why? Because I want an explanation for the decision being made. So if you abstract this out, the places where machine learning is yet to fulfil its promise are complex real-world systems; examples would be risk-sensitive systems — medical diagnosis, financial modelling and prediction. Those are systems where you don't directly use machine learning for the final decision. There could be components — in medical imaging, for example, machine learning is of course used to highlight particular regions in an MRI or a CT scan — so it's used in subsystems, but not for the final decision-making. That's what I'm trying to indicate here. A particular project I'm personally involved in is in safety-critical systems, for example cockpit decisions — something we're currently working on with Honeywell. A system like an aircraft has plenty of sensors, with humongous amounts of data coming in at every point in time, and it's often very difficult — I'm not sure how many of you have seen the interface in front of a pilot — it's extremely hard to parse; you have to be trained to parse that subsystem in front of you, and often only the important parts get highlighted to the pilot. So how do you make decisions on the fly? Say a pilot is asked to drop 1,000 feet. If that instruction came from a machine learning algorithm, should the pilot follow it or not? Would the pilot ask for an explanation of why it needs to be done? Those are some things to think about, and that's the focus of explainable ML: in these kinds of applications the cost of a bad decision can be very, very high — it can be life and death. Those are the situations in which we don't trust ML yet. Accuracy may not be the only objective; it's okay if the performance is not quite as good, as long as the explanation is something that can be rationalized.
That's something you definitely want, and it's what is good about humans. I can recognize the objects in front of me, and if you ask me why that is a monitor, I can give an explanation. How I give that explanation — we still don't fully understand how the human brain works and how those connections fire — but at least I can rationalize it to some extent, which is something existing ML systems cannot do. And there is a need for a multidimensional perspective: rather than looking at one performance metric like accuracy or F1 score, you probably need a more holistic way of assessing how an ML system performs. So the question is: what do we really need in ML, beyond all of these accuracy-style metrics we use today? Some items on the wish list. We want a human-understandable rationale in decision making — when a decision is made, some rationale that a human can understand. I thought I should definitely mention this because I'm sure many of you in industry have heard of the GDPR from the EU — am I correct? I think the EU has already passed it, effective May of this year. One of the important contentions in the GDPR documentation was: if data pertaining to a particular person is acted upon by an automated system — the decision is made by an automated system — how do you provide for data protection in those scenarios? That was a very important point, and the only way to handle such scenarios is if your prediction model can explain its decision rather than blindly take data and just give an output. Trust, or confidence in the system, is also something we want, in addition to rationalizing decisions. In any system where you use machine learning for a new product or service, upper-level management needs that confidence — in fact, I have a statistic: PwC recently conducted a survey, and apparently two-thirds of CEOs around the world felt that stakeholders will start losing confidence in companies using AI. It seems surprising, but mainly it's because when a new system comes up, you really cannot promise that the ML decisions coming out of it are useful or worth trusting. Compliance with ethical principles is another reason — you can look up this link; there's a statement from Gartner that by 2018, half of business ethics violations will occur through improper use of big data analytics, which is a pretty tall claim. I'm not sure exactly when that was published, or whether the statistic holds right now, but that was Gartner's claim. So compliance with ethical principles is another reason we need explainability: if you use machine learning as a black box, you can't be sure. This is a huge topic in machine learning now — bias — because machine learning today is completely dependent on the datasets that drive it, and if the datasets are Caucasian, maybe the model is not applicable to Indians. That kind of bias built into datasets is a huge problem.
And if you're going to rely on that to decide on people's lives, that just doesn't make sense. That's where ethics comes into the picture, and bias in machine learning is a huge topic at conferences like NIPS and ICML at this point in time. Then, enhanced control and robustness: if you understand how a system works, you know how to play with it, how to control it. If you know it doesn't work in a particular setting, and why, you know you probably have to give the system data from that setting to make it work better. And finally, openness of discovery and scientific research: if you understand the system, you know how to improve it. All of these are important for taking ML to the next stage. For those of you who have read more in this space, you'll have seen various terms used in this context — explainability, trust, interpretability. It's a relatively new topic in machine learning; it became popular when DARPA introduced a huge initiative a couple of years back called Explainable AI, and in the last two years it has taken off and there have been a lot of efforts. I have a slide that summarizes many of the efforts so far, to a reasonable extent at least. There's a recent piece of work, released just last month on arXiv, that tries to categorize all of these methods and define the terms more precisely — there have been two or three papers in this space — and I'm going to subscribe to this particular set of definitions. They define interpretability as trying to understand what a model is doing or has done: if a machine learning model gave you a prediction, what did it do to arrive at that decision? Explainability is a little more abstract: can your model give the reasons for the network's behaviour, can it gain the trust of users, can it give you insights into or causes for decisions? That's a slightly higher level of abstraction than just understanding what happened — did this neuron fire, did that neuron fire? That would be interpretability; giving a human-interpretable explanation for a decision would be explainability. That's the definition we'll go with. It's definitely debatable — this nomenclature is still new, the field is growing, and these terms and formalizations are still being defined; I'll come back to that in the open problems at the end. Now, if you look at today's machine learning models — and this is the statistic I was talking about — 67% of business leaders who took part in a CEO survey in 2017 said that AI and automation will negatively impact stakeholder trust levels in their industry in the next five years. One reason is that you don't know whether to believe the model will work or not: everybody wants to use ML in their products and services, but there's no robust way to check whether it will work, other than metrics that could themselves be biased. So if you look at existing machine learning models, this is a rough chart of how they sit on the accuracy-versus-interpretability trade-off.
Neural networks and deep learning are somewhere here: the most accurate today, having won most challenges, but not interpretable at all. Then you have support vector machines, random forests, k-nearest neighbours, decision trees, and linear regression along that spectrum. Decision trees may not be as accurate, but they are among the most interpretable models, for obvious reasons: with a decision tree you can actually look at the variables and say why a particular decision was made. In some sense a random forest is a middle ground: being an ensemble of decision trees, a random forest performs a little better than a single decision tree, but you cannot give as clean an explanation, because you have to aggregate across the trees to come up with one. So today's frontier of machine learning lies on this spectrum: if you want to do better on accuracy, you're going to lose on interpretability. What we want for tomorrow's machine learning models is to improve both simultaneously — models that perform well on accuracy as well as on explainability. That's where we want to go. So where do we stand today in terms of efforts in this space? In the last two or three years there have been plenty of efforts at explaining machine learning models in different ways. I should say there were efforts even in the 80s and 90s, for instance in converting neural networks to decision trees so that you could explain decisions; some of the keywords here refer to those kinds of models, and I'll briefly go over them. I'm again going with the categorization in this particular paper, which looks at all of this work and puts it into bins. There have been models in recent years that look at explainability in terms of processing — I'll explain what that means in a moment — a few efforts that try to directly produce explanations, and a few efforts that try to understand representations. I'll briefly go over all of them. But before that — is anyone here actively working in this space? For those of you who have used LIME — anyone who's used LIME here? Okay. LIME is pretty popular; I know it's popular in industry and I've heard of many industry use cases. In this categorization, it would be binned with the proxy kinds of methods. And for those of you who have looked at CAM, Grad-CAM, and things like that, which I'll talk about in more detail as we go, those would go into the saliency-maps bin. So let's look at some of these methods. Linear proxy models — and, to be clear, there could be many other ways of categorizing these explainable models too. One traditional division of the efforts so far is into model-agnostic methods and model-dependent methods. Model-agnostic methods are: use whatever you want for your machine learning model — SVM, neural nets, decision trees, whatever.
Then you use something as a meta-method to understand what the model did — or rather, what the relationships were between the input and output of that particular model, be it an SVM or a decision tree. Those are model-agnostic methods. Model-dependent methods are where, if you used an SVM for classification, you go deeper into that particular SVM and try to understand why it made a particular decision. That's one way of classifying explanation methods today. So, linear proxy methods are broadly model-agnostic methods where you have trained a machine learning model — a black box to you at this point, trained with whatever algorithm you wanted — and now you play with its input and output and try to find which part of the input contributed most to a particular output, and so on. In a broad sense these kinds of methods — there's also something called SHAP — are an extension, a formalization, of sensitivity analysis, because in traditional statistics the way you would get explanations was sensitivity analysis: perturb the input a little and see what happens to the output. That's what these kinds of methods do. Then decision trees — you already know that decision trees make interpretable decisions directly, but by decision trees here I'm also referring to methods, from the 90s as well as more recent ones, that convert neural networks into decision trees; you can look at the paper referenced below, which goes deeper into this classification. Any method that converts a machine learning model into a decision tree to produce an explanation falls into this bin. Saliency maps are used more in vision and text: you process the deep learning model on a particular image or text input and then try to find which part of the image or text the network was focusing on while making a particular decision. That's another way of doing it, and it's the kind of method I'm going to talk about for the remaining duration of the talk. Automatic rule extraction is, in some sense, related to decision trees, but there are also other kinds of methods that extract rules from a given model in various ways; again, I'm not going to step into each of them — if you're interested, I'll refer you to this particular paper. Then there are explanation-producing methods, where people try to come up with scripted explanations — these are also called generative approaches, where you train your model to generate an explanation. For those of you aware of GANs, generative adversarial networks: instead of generating an image, you generate an explanation for a given prediction. Assume you have a dataset for the problem you're working on along with corresponding textual explanations for the predictions; then you train a generative model to produce an explanation at inference time. That's another family of models.
Attention-based models are where you have an attention module within your neural network, and that attention module tells you which part of the input the model was attending to — it's slightly different from saliency maps. Image captioning is a good example of an attention-based system: if you're taking an image and training a deep learning model to produce a caption for it, you use an attention module inside the network, and that module tells you which part of the image the model was looking at while emitting a particular word or phrase of the caption. Those are attention models. Then, broadly, there are the disentangled-representation methods. How many of you are familiar with VAEs, variational autoencoders? There's also something called InfoGAN on the GAN side. These methods try to learn a latent space that the data lives in, and then go from the latent space to a prediction; the latent space can be considered a disentangled set of representations. The typical example used for VAEs and InfoGANs is: if you had a bunch of face images and wanted to generate new face images, you try to get to a low-dimensional latent space where one dimension corresponds to a beard on the face, another corresponds to glasses, and so on. Then, when a decision is made, you can see which latent variable fired and say: I classified this person as so-and-so because he or she had a beard, or had glasses on, and so forth. That's how disentangled-representation methods fit into the scope of explainability. And lastly, there are methods that study the roles of particular layers, particular neurons, or particular vector representations you get out of a network; all of those can also be considered ways of interpreting or explaining neural networks. I'm not going to step into all of them, but that's another broad space. Okay, I'll just talk about one model which is very popular: LIME. LIME stands for Local Interpretable Model-agnostic Explanations. Given a model and its prediction — irrespective of whether you used an SVM, a decision tree, a neural network, whatever — you give it to LIME, and LIME tells you which features could have led to that particular decision. The way it goes about it — and in their paper they show results with both text and images — is that you take a particular image, or any data point, you perturb it in the neighbourhood of that data point, you see what outputs you get, and then you regress on those perturbed instances and their outputs to find which features most likely led to a particular output. For those of you who don't know about it, the code is available, and it's very popularly used — I've heard of several industry use cases of people using LIME to explain decisions. All right, so that's the brief overview of explainability in ML as it stands today, and I'll come back to the open questions at the end.
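As a rough picture of the perturb-and-regress idea behind LIME, here is a toy sketch for tabular data. This is not the LIME library's actual API — the `black_box` interface, the Gaussian perturbation, and the kernel choice are all assumptions made for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge

def local_surrogate(black_box, x, n_samples=500, sigma=0.5, kernel_width=1.0):
    """Fit a weighted linear model around one instance `x` to approximate
    `black_box` locally, LIME-style. Returns per-feature importance weights.

    black_box: function mapping an (n, d) array to an (n,) array of scores
               (e.g. the probability of one class) -- an assumed interface.
    """
    # 1. Perturb the instance in its neighbourhood.
    X_pert = x + sigma * np.random.randn(n_samples, x.shape[0])
    # 2. Query the black box on the perturbed samples.
    y_pert = black_box(X_pert)
    # 3. Weight samples by proximity to x (an RBF kernel on distance).
    dists = np.linalg.norm(X_pert - x, axis=1)
    weights = np.exp(-(dists ** 2) / (kernel_width ** 2))
    # 4. Fit a simple, interpretable surrogate on the perturbed data.
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(X_pert, y_pert, sample_weight=weights)
    return surrogate.coef_   # large |coefficient| => locally influential feature
```

The coefficients of the weighted linear surrogate then serve as the local, feature-level explanation for that one prediction.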
Now we'll take a detour into visual interpretability in CNNs — that's some work we have done in this space as well. For those of you who are very new to deep learning, just a quick two-to-three-slide introduction. That's a neural network; it's trained using backpropagation and gradient descent: you have a loss function, you backpropagate the loss, and you use gradient descent to update the weights of the network. And what's a convolutional neural network, a CNN? You don't connect weights densely, and you don't use an inner product between one layer and the next; you use convolution instead, with the idea of sharing parameters — the main operation in a convolutional neural network is convolution. For those of you without a signal-processing background, a simple definition of 2D convolution: if your input is a matrix and your filter weights are also a matrix, convolution flips the filter both horizontally and vertically and then takes an inner product between that flipped filter and each window of the input; each such inner product becomes one value of the output. I'll leave it at that simple definition. So a convolutional layer takes an input image, which can have multiple channels — a colour image has R, G, and B, three channels. The receptive field is the size of the convolutional filter you use, and based on that you compute an output map. The output of a convolution is again an image: unlike a standard MLP, if you apply the weights to an image, you get an image as output, and you can have multiple feature maps, which gives depth to the output. Unlike traditional MLPs, where you have to decide the number of neurons in each layer, here the number of neurons is determined once you decide the number of filters and a few other parameters, like valid versus full convolution and so on. You also use activations like ReLUs — I'm mentioning this because we'll use these things as we go forward. In a CNN you also have a pooling layer, where you take a patch of a convolutional layer's output and aggregate it; the aggregation can be a max, an average, an L2 norm — any kind of pooling. For example, with max pooling you take a two-by-two region, take its maximum, put that in the output, and so on. Then, to put together your vanilla CNN: a convolutional layer, a pooling layer, another convolutional layer, another pooling layer — there can be blocks with only a convolutional layer and no pooling — and finally a fully connected layer with a softmax classifier at the end. That's the overall architecture of a vanilla convolutional neural network.
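For anyone who wants the two operations just described spelled out, here is a tiny NumPy sketch of a "valid" 2D convolution and non-overlapping max pooling; the example filter is arbitrary and only meant to show the mechanics:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D convolution of a single-channel image with a filter:
    flip the kernel both ways, then take inner products over sliding windows."""
    k = np.flip(kernel)                       # horizontal + vertical flip
    kh, kw = k.shape
    H, W = image.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * k)
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling over size x size patches."""
    H, W = feature_map.shape
    fm = feature_map[:H - H % size, :W - W % size]
    return fm.reshape(H // size, size, W // size, size).max(axis=(1, 3))

x = np.arange(36, dtype=float).reshape(6, 6)
edge = np.array([[1.0, -1.0]])                    # a tiny horizontal-difference filter
print(max_pool(np.maximum(conv2d(x, edge), 0)))   # conv -> ReLU -> pool
```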
Again, this is a simplistic picture — these days you have batch normalization and other normalization layers, skip connections, residual blocks, and many other variants — but that is the basic abstraction of a convolutional neural network. Okay, now to understanding CNNs. One simple way people have always tried to understand CNNs is to look at the weights the network has learned. Remember, the weights in a CNN are matrices too, so if you plot them as images, it gives you an idea of what the network is trying to do. It's well known that if you take the first layer of weights and visualize them, you get filters such as these — they typically look like the basic image-processing operations of your visual cortex: edges at different orientations, textures, and so on. Unfortunately, these weights are interpretable only in the first layer; if you go deeper into the network, you really cannot tell what's going on from the weights alone. To interpret CNNs, there have been a few different efforts over the last few years. One of them is backpropagation-based methods, where you treat your CNN as a black-box function, which means you can backpropagate straight from the output to the input: instead of backpropagating with respect to the weights, you compute ∂f/∂x instead of ∂f/∂w — a simple change in the chain rule. Once you do that, you can backprop with respect to a particular class and see where the gradient is most activated in a particular image. So you train an AlexNet-style model on the ImageNet database, take the "cat" class, and see where the gradient was most activated for this particular image — it highlights these regions, which gives an indication of what part of the image the network was looking at. There have been extensions of these methods: a couple of efforts called deconvolution networks and guided backpropagation, which build on the ReLU (rectified linear unit) activation. In a deconvolution network, when you backpropagate, wherever the inputs were negative you don't allow those gradients to flow back to the image. Then came an improvement called guided backpropagation, where both negative activations and negative gradients are blocked — remember, the ReLU acts on the forward activation while the gradient flows backwards; if either is negative, that gradient is not allowed to go back to the original image. And they showed that deconvolution or guided backpropagation lets you see what the network was looking at much more clearly. The main idea is: don't worry about the negative influences in your network; look only at the positive influences, and you get a better picture of what the network was attending to while making a prediction. That's the idea in this framework.
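The ∂f/∂x idea translates almost directly into code. Here is a minimal PyTorch sketch of vanilla gradient saliency, assuming a pretrained torchvision classifier and an already-preprocessed image tensor; the model and class index in the usage comment are just examples:

```python
import torch
from torchvision import models

def gradient_saliency(model, image, class_idx):
    """Backprop the class score w.r.t. the input pixels (i.e. compute
    d(score)/d(x), not d(score)/d(w)) and use the gradient magnitude
    as a per-pixel saliency map."""
    model.eval()
    x = image.clone().unsqueeze(0).requires_grad_(True)  # (1, 3, H, W)
    score = model(x)[0, class_idx]                       # scalar class score
    score.backward()                                     # fills x.grad
    # Max over colour channels gives one saliency value per pixel.
    return x.grad.abs().squeeze(0).max(dim=0).values     # (H, W)

# Usage sketch (assumes `img` is a normalized 3xHxW tensor, e.g. 3x224x224):
# net = models.resnet18(pretrained=True)
# sal = gradient_saliency(net, img, class_idx=281)   # 281 = "tabby cat" in ImageNet
```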
But guided backpropagation has limitations: whether you probe the cat class or the dog class, it still ends up focusing on the same regions, irrespective of which class you ask about. That led to the proposal of class activation maps (CAM), presented at CVPR 2016, where the argument was that when you explain, you need to explain with respect to a particular class. In that model, you have a convolutional neural network, and after all the convolutional layers you take the last convolutional layer and do global average pooling: you take each feature map in that last layer, average all of its values, and represent it by one value — the blue feature map here becomes one value here, and so on for all the feature maps. Then you train a linear classifier to learn the weights between each of these aggregated feature maps and each of the classes; if you had 1,000 classes, you would learn 1,000 regressors to get these weights. Once you have that, if you take a particular class, say Australian Terrier, you know how it was weighted with respect to each feature map, and if you combine those you get an idea of what the network was looking at when it output that class. Using this approach they showed that, for this image of a dome, if it's classified as a palace it looks at this part of the image, if it's classified as a dome it looks at this part, and if it's classified as a church it looks only at this part and not the entire facade, and so on — an indication of what the network looks at for each class prediction. Unfortunately, the problem with CAM is that after you train the CNN, you still have to train a thousand regressors, one per class in your model. That led to Grad-CAM, gradient-based CAM, where they realized that the weights you learn in CAM can be obtained directly from the gradients: the weight is essentially the derivative of the class score y with respect to the activation map. If you take ∂y with respect to the activation maps, that directly gives you the weights — you don't need to train any additional regressors between the activation maps and the output. That led to Grad-CAM, which is popularly used for saliency maps today. This is the overall architecture of Grad-CAM: you take the weights from the gradients, and you apply a ReLU to them, which means negative gradients are not allowed to pass — I'll wind up in a couple of minutes — and again, the idea of discarding the negative gradients is to look only at the positive influences. Why are negative gradients left out? Empirically, they saw better results — higher interpretability — when using only the positive gradients. That's about it.
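Here is a compact sketch of the Grad-CAM computation just described — hooking the last convolutional block of a torchvision ResNet, globally averaging the gradients, and applying a ReLU. It follows the description in the talk rather than the authors' reference implementation:

```python
import torch
import torch.nn.functional as F
from torchvision import models

def grad_cam(model, target_layer, image, class_idx):
    """Grad-CAM: weight each activation map of `target_layer` by the
    spatially averaged gradient of the class score, sum them, then ReLU."""
    store = {}

    def fwd_hook(module, inputs, output):
        output.retain_grad()          # keep gradients on this non-leaf tensor
        store["acts"] = output

    handle = target_layer.register_forward_hook(fwd_hook)
    try:
        model.eval()
        score = model(image.unsqueeze(0))[0, class_idx]   # scalar class score
        model.zero_grad()
        score.backward()
    finally:
        handle.remove()

    acts = store["acts"][0]               # (K, H, W) activation maps
    grads = store["acts"].grad[0]         # d(score)/d(activation), same shape
    weights = grads.mean(dim=(1, 2))      # global-average-pool the gradients
    cam = F.relu((weights[:, None, None] * acts).sum(dim=0))
    return (cam / (cam.max() + 1e-8)).detach()   # normalized (H, W) heatmap

# Usage sketch:
# net = models.resnet18(pretrained=True)
# cam = grad_cam(net, net.layer4, img, class_idx=281)   # img: 3x224x224 tensor
```

The resulting low-resolution heatmap is typically upsampled to the input size and overlaid on the image, which is what the saliency figures on these slides show.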
They also combined this with guided backpropagation, which we saw on an earlier slide, meaning gradients corresponding to a negative activation or a negative gradient are both set to zero on the way back. With this kind of approach, they showed you can get class-discriminative saliency maps, and these are some sample results of their work on image captioning models: as the model captions, it looks at different parts of the image as it emits different words at each step. One limitation we found with this model is that whenever there were multiple instances of an object in the scene, or the object covered a large area, the network ended up focusing on only part of the object or could not cover all the instances. That led us to a different formulation — I'll just give the conceptual overview; you can ignore the math. What we realized about the Grad-CAM approach is that it weights each feature map as a whole aggregate rather than pixel-wise, whereas we felt the explanation needed pixel-wise weights between an activation map and the output. So we changed the formulation a little so that the weights between an activation map and the output are themselves weighted per pixel. And remember, both Grad-CAM and the work I'm presenting can be computed with respect to any layer of the CNN — the activation maps can come from any layer; I'm just demonstrating it with the last layer. It turns out that if you work out the math, you get an easy closed-form solution, which means it's just a one-line piece of code: you evaluate that expression and get a better weighting of the activation maps with respect to your output. Using this, we found that if you have multiple objects in the scene, you get a more holistic saliency map. These are examples: here is a flamingo, and you can see that Grad-CAM highlights only part of the image, whereas Grad-CAM++, which is our work accepted at WACV this year, gives a more holistic saliency map in this context. Here are more examples — here is a grey whale peeking out of the ocean; Grad-CAM gives a pretty hazy picture of what the network is looking at, whereas our model gives better results. I'm not going to go through all of it — obviously, since the paper was accepted, we had some good results. If you're interested, we also had results on image captioning, and results on doing this for videos — how do you explain decisions on videos? I think ours was the first work in the video space. Our code is available, and the arXiv paper is available. We also have some ongoing work on using causality to explain the decisions of RNNs.
We also have some ongoing work where we are trying to use causality to explain the decisions of RNNs; that's ongoing work we currently have with time-series kinds of data. If you're interested, I can talk offline; I won't go too much into that. Let me spend the last couple of minutes on open directions. This is the space of methods we briefly saw over the last half an hour or so. So what are the open questions, the research directions, in this space?

One of the first problems is this: we all know that machine learning has a certain formalization. You have f going from X to Y, where X is the input data and Y is the output, and you try to learn a function f — it can be learned with a maximum-margin classifier, with a neural network, with a probabilistic framework, and so on. What is the corresponding formalization for explainable ML? We still don't know. There was a recent paper published last month that tried to do something in this space, but I think a lot more work is needed on how you formalize machine learning when explainability is required. Then, how do you balance accuracy and performance against interpretability? Is interpretability always required? That itself is a question, and maybe it's not required in several applications; perhaps we need to stratify applications and say that for these kinds of applications a certain level of interpretability is required. Those kinds of guidelines have to emerge. What kinds of data and what classes of problems are more amenable to explainable systems? Is data in the real (continuous) space better, or data in the discrete space? Multi-class problems or binary problems — where would explainable systems be needed, and where would they work better? Those are all open questions at this point. And an important question — I'm almost done — is how to evaluate explainable systems. What kind of metrics do you use? That is all open at this point in time. I think I'll stop there. These are some references and resources if any of you want to take them away. And I'm here for any questions and discussion.

We're out of time, but since we have a break, we can continue with questions. I'll make the announcements first. Those of you who don't want to ask questions can go for lunch. Walmart has a contest going on at their booth, so if you want to win a wireless headset, you can participate in that. The BoF on AI and product will start at 1:40 in the BoF area, which is on the first floor. And remember to fill in your feedback forms. Okay, we can continue with the questions. Anybody? Yeah, any questions?

Hi, nice talk. One question based on your previous slide: how do you evaluate the current explainable systems? You said that's an open problem. — How do you evaluate the current explanations? The second-to-last point, about evaluation? Yeah, I skimmed over it, but it's a very important point. How do you evaluate explainable systems? We still don't have good mechanisms; those have to be developed. There has been some recent work proposing metrics. In the work we have published so far, you typically use human subjects and see whether the explanations were good enough.
Ideally you use a broad range of subjects so you know that, across the population, the explanations were felt to be good enough. That's one way of doing it. Since we were doing saliency-map kinds of approaches, there are multiple things you can do: you can hide the rest of the image, keep only the part the saliency map highlighted, run it through the CNN and see how the prediction changes. Did the accuracy hold up or improve? If it did, then perhaps that really was the most relevant part of the image. So you can play around a little bit with those kinds of methods too. All of these are, in some sense, hacks; I wouldn't call them evaluation formalisms at this point in time, but those are some methods people use. Broadly, this space is application-driven right now, whichever application you have, but I think evaluation is an open question at this point in time.

Hi Vineeth, it was a very nice talk. What we see is that for most of the explainable models we have, we are trying to reproduce the same result using a simpler model — for example, piggybacking on decision trees. That gives the more directly explainable part, where we can read the result off. But what I'm trying to ask is: do you think this is a good solution? There are many non-linear relationships in the model that actually did the hard work, and then we try to distill all of that into a simpler decision tree. You mentioned it's a hack, but do you think this hack is something we should even pursue?

Got it, I think that's a good question. Firstly, decision trees are not the only way to explain; it's just that they are more human-interpretable. We all find it pretty easy to look at if-then-else rules and make decisions. That's the reason people prefer decision trees, but that is definitely not the only way. And the answer to your question is again related to the fact that we still don't have evaluation formalisms for explainability. Another way of asking your question is: are decision trees the right way to evaluate? Is that kind of output the right way to judge whether a particular explainable model was useful? We don't have an answer to that question at this point in time. All of this is very nascent; the entire space is growing as we speak.

Vineeth will be available offline, so you can continue the discussion there. — Okay, I don't know if this has to be done offline or right now; it's about the causality in RNNs that you just mentioned. Maybe due to shortage of time you couldn't talk about it; I just wanted to know more about that. — Okay. It's actually under review at this point in time, but the overall idea is that we use causality. There's a fundamental difference that machine learning today — deep learning, none of it — handles: segregating causality from correlation. All machine learning models today are correlation-driven. There's a very popular example of correlation versus causality.
I think there's a statistic about people buying ice cream in summer: you plot a graph over the months of the year, January to December. And then you take sightings of sharks appearing near the beach — I think this study was done somewhere on the US coast — and you plot that on the same chart. You find that the two curves look very, very similar. Does that mean you eat ice cream because a shark came up on the shore? Not really. The data is correlated, but it's not causal; one doesn't cause the other. But often in explainability, what we are looking for is what caused a particular output to happen, and that's where causal models come into the picture. So we've been trying to look at whether we can use causal models. There is an entire literature on that by Judea Pearl — he's a Turing Award winner, if you're not aware of that literature. It has actually been a niche area in machine learning for many years, but I think explainability is kind of reviving it at this point in time. Our idea there is to assume that each node of a neural network is like a random variable, so the network is a graphical model — what we call a structural causal model — where you can study the causal relationships between those random variables. And then we try to see which input random variable had the maximum causal effect. Maybe we can talk offline more about it; I don't want to spend too much time, but that's our overall approach to that particular problem.

Yeah, hi, here. So you have shown an example of Grad-CAM for a CNN on image data. But CNNs are used not only on image data; there are applications in text data and so on. Can Grad-CAM be generalized to CNNs used in different domains — maybe text, maybe some financial data representation? — Methodology-wise, there's nothing that restricts it. I know people use CNNs on speech as spectrograms, and on text as 1D CNNs, and so on. So methodology-wise nothing restricts it from being used. I know there have been some recent efforts on CNN saliency maps for text and things like that, but I'm not very sure whether Grad-CAM specifically has been applied there; I know there have been similar efforts, but not Grad-CAM per se.

Follow-ups offline, please. — Okay, sure. I wanted to ask: most of the time our models work, but some of the time they don't. Do you think we also have to think about that aspect, and can we use existing explainable models or solutions to explain those scenarios where the model is not working? For example, we have a stop sign, and if we change a few pixels of that image, the model sees it as a triangle; that is quite a big problem, right? It means we also have to explain why it did not work in that case. — Sure. I have two thoughts on that. Firstly, in some sense it's related to this: the moment you say a model does not work, you have to talk in terms of a performance metric, and often you talk about that in terms of accuracy. And that goes back to the accuracy-versus-interpretability trade-off and trying to see where the balance lies. That's one angle.
But the other angle, which I did not speak about in this talk, is the relationship of this entire discussion to adversarial data. Many of you who are active in deep learning have probably read about adversarial attacks and defenses and so on. That becomes an important relation to explainability, because if a small perturbation in the data is going to get a cat classified as a dog, then good luck to the neural network explaining that — it's not going to be easy. So explainability is going to be very hard when you have adversarial attacks. There's actually a very close relationship between that space of work in deep learning and explainability. Because say Grad-CAM works and you get an excellent explanation for a cat being classified as a cat; then tomorrow you just add simple Gaussian noise and it gets classified as a dog. Where did that explanation go? And the image looks almost the same to a human eye. So there's a close relationship which has not been studied; it's an open problem at this point.

I had a question about LIME. LIME usually deals with the local space around the sample we provide — it's an instance-specific explanation. Is there any research going on right now that is extrapolating it to the global space? — Yes, very much. I would again recommend looking at that particular paper that categorizes all of these methods; one of the categorizations they talk about is global methods. In fact, the method we proposed with the causal RNN work — I didn't get a chance to talk more about that — can do both local and global; that was our main USP of that work. Given a particular instance you want to explain, you also want the explanation to hold for your entire dataset. So there are efforts in that space. So far, though, I think we have not seen a single explanation method that does both local and global well. Simple feature selection methods are global — any feature selection method is global, and that space has been active in machine learning for decades — but yes, LIME is local. — Okay, thanks, Vineeth.

So we are actually cutting into lunch. I know this was a very well attended talk with a lot of enthusiasm; let's catch Vineeth offline. There'll be a talk here starting at 1:40 about machine learning in the browser, by Amit Kapoor. Please come back at 1:40.

Welcome back from lunch, everyone. A couple of announcements: the BoF on AI and product will start at 1:40 in the BoF area, which is on the first floor. We have Amit Kapoor up next. He'll be talking about deep learning in the browser: explorable explanations, model inference and rapid prototyping. Over to you.

Okay, good afternoon, everyone. Before I start — there is a birds-of-a-feather session on AI and products by Vijay. It'll start at what time? Sorry, again, Vijay — after the keynote, at 3:20, upstairs; there is a seating area there. Should we start? Yeah. Hi, everyone. My name is Amit Kapoor. The topic I've chosen today is deep learning in the browser, and I want to talk about it through the lens of what I do, which is working at the intersection of data, visuals and story. So I teach data visualization, I teach data science, I've been teaching machine learning.
I've been teaching storytelling with data over the last eight years, and I teach it in a number of different contexts — academia, industry, and people who are just starting off and want to enter this field we call data science, or in general what is now called AI. One of the challenges I really find in helping people learn is moving from a traditional, classical programming background, or a traditional data-analysis background, to what is a learning paradigm. What we're doing with machine learning or deep learning is essentially bringing in this learning paradigm: probabilistic thinking, dealing with uncertainty, newer algorithms. We want to build that knowledge in people and help them actually use it. So there is a big transition from how people have been doing things to when they start to use these machine learning and deep learning algorithms as users, as creators, as builders; there is a different set of requirements there. Some people refer to this as Software 2.0; I refer to it as just another module that goes into your overall stack. But the more challenging part is not the building — the challenging part is understanding what the learning paradigm is.

And when I think about moving to this learning paradigm, we are not really trying to remove humans from the loop. We're really trying to augment intelligence. We want people to use these new tools and techniques in a way that augments their current workflow, in a way that helps them make better decisions. The way I define AI — and I actually don't like using the word AI — is really augmenting intelligence. And if we want to augment intelligence, we have to think about the human in the loop: the person who is trying to use these deep learning or machine learning algorithms and who is actually going to be working with them. From my experience there are three kinds of audience. One is the users — the end users who are actually going to see the output of this, or a user who is interested in learning this whole technology; how do I help those users understand what's happening? Then there are creators, who want to take the output of these techniques and build tools on top of it. And then there are what I'm calling builders — we might call them coders, or deep learning engineers — people who are really trying to build something using deep learning. So there are three different types of use cases. And across all of these, people are trying to either learn — I really want to learn and understand, in my use case, whether it's business or my own education, how to do deep learning — or play — I want to take the output and actually do something with it — or create — I want to understand the problem space and create new things with it. So I either want to learn, I want to play, or I want to create.
How do we do that? Right now, what we do is really hard. I make this joke when I teach that I don't really teach anything that is less than five years old — or rather, anything that hasn't been around for at least five years — because until then people don't really know what they want to do with it. And trying to teach deep learning, or trying to get people to understand deep learning, has been hard. Setting up an environment to do this, setting up ways to take it to production, the tooling around it — all of this is really hard, and it's only now that we're reaching a stage where it has become easier, but it's still not easy enough. The reason I want to talk about the browser is: in the browser, is it possible to do some of this? Is it possible to learn, to play, to create things? And what I want to talk about is what the possibilities are right now, because we are still at the start of this. We can think of this as doing deep learning in JavaScript, but that's not really it — some of it could be done in JavaScript and some of it on the server. The idea is basically: can we move some computation, some part of this learning, into the browser and really make it possible? For a long time this was not really possible. But if we can make it happen now, we give people immediate access so they can actually do something with it. I can go to a browser and start doing any of these activities. It reduces the friction that everything currently takes. I've been running a two-day workshop on deep learning over the last two days, and just getting people to spin up machines and start learning has become easier, but it's still not easy enough. And it can help us reach a wider audience, not limited to deep learning engineers but a wider spectrum, helping to democratize this, or at least make it accessible to other people who are interested. Does that make sense?

So, for a long time we didn't have much in the browser to really do this. Most of the earlier libraries were CPU-based; they didn't have any kind of acceleration and were really slow. The bulk of deep learning is tensors — numerical operations on high-dimensional matrices — and there wasn't good numerical support for all of that. And it's not only those things; you also need a whole ecosystem around it: how to get data into the system, how to read these libraries, how to visualize things easily — not by doing hard stuff like D3, but more easily — and how to play with it in the kind of notebook environment we are used to, a Jupyter-type environment. So this was a really hard problem. Let's look at what has happened. We now have WebGL-accelerated learning frameworks. WebGL is the closest thing to GPU compute that we have at the moment in the browser; browsers do not have direct compute access.
To do computation in the browser, you have to translate your code into WebGL shaders — you need to write a little low-level WebGL code to translate your numerical operations into shaders — and then you can use whatever GPU acceleration your browser provides. So you get access to it. People are still talking about how to get direct compute access, and when those specifications go through, it will probably accelerate things a lot more; right now we are getting compute access through WebGL, which was originally designed for graphics rather than direct compute. So we now have WebGL-accelerated learning frameworks; let's see what the potential for doing this is. I'm going to mention four of them. There is TensorFlow.js, which has come out just in the last couple of months. When I first proposed this talk, there was a previous framework called deeplearn.js, which was much more academic; now TensorFlow.js has come out. It does not yet have feature parity with everything you see in Python or C, but it is improving, and you can start to use it today. The code you write is fairly similar to what you would write in a TF Keras style: if you want to use the Keras-style Layers API, you can do that; if you want to use the lower-level Core API, you can do that too. There's another WebGL-based project that has shown very good performance, but I don't think it's open source yet; that's potentially one of the ones that could come out. And the two others I would mention: WebDNN, which can even compile models down to WebAssembly (Wasm) — closer to writing C code, compiling it, and running it in the browser — so it gets much better performance. If you only need inference, you can use WebDNN, or there are wrappers on top of other libraries, such as Keras.js for Keras, and I think MXNet also has a JS implementation. But that lower tier is all inference-only, which means you can run them for model inference, but you can't really use them for training in the browser.

Okay, so yes, there is some possibility of doing deep learning in the browser. So let's go back and address — if you believe there is a possibility here, or even if many of you would rather just stick to the server side, that's fine — how can we do it? If we take the three aspects, learning, playing, creating, what is possible? When I think about learning, doing things in the browser has one really key aspect we want to harness: explorable explanations. How many people have heard of that term? Okay, a few of you have. The idea of an explorable explanation is that when I want to learn something, I want to be able to interact with it in a very active way. I want to discover how things work. I want to discover by active learning — and this active learning is not the same as active learning in the deep learning sense.
To be clear, this is active learning in the sense that I as a learner can interact with the medium and learn from it. That's basically the idea. This whole concept was articulated — or at least the term was coined — by Bret Victor about five years back, where he talked about whether, when we do programming or learn any system, we can make it interactive: really interactive, allowing people to learn and understand the system in a deep way. If people weren't aware of it, I would have put more slides on it, but the idea is: how do you discover by active learning? So what do we want to discover in deep learning, when we are trying to learn it? We want people to develop intuition about what's happening in the deep network, whether I'm an end user of it or one of the builders and creators, or a new student who is trying to build this. I want that intuition to form. And we want to build intuition on three levels. We want intuition at the algorithmic level: there are algorithms running — how are they running, what are their possibilities? We want intuition about the data that goes into these algorithms. And we want intuition about what we call the model, which is the interaction of the data and the algorithm — the space of possibilities when the data and the algorithm play together. You can think of that as simulation scenarios, but I want to interact with all three levels: algorithm, data, and model.

So let's look at a few examples of what is possible right now, or what we started with. This is a very simple, readable example of visualizing algorithms by Mike Bostock, showing a randomized depth-first search algorithm. The article is a long one, and this is a long tradition when people try to learn algorithms: how can I learn an algorithm by seeing what's really happening? This is a maze-generation algorithm, but in technical terms it's a randomized depth-first search, which is what the algorithm is really doing. If I want to apply this to deep learning concepts, this was the first one many of you may have seen, done by Andrej Karpathy — it accompanies the Stanford CS231n course, which also has a demo — it was ConvNetJS, which was written to really allow people to interact with it. I think the GIF is not playing, but it should be really interactive. Then in 2016 we got the option of looking at the algorithm and understanding what's really happening with the "tinker with a neural network" demo, the TensorFlow Playground. All of these were done with what are, in a sense, toy deep learning libraries written for the purpose — not really something you could take and use for your own work. But now, with some of these newer libraries, you can actually write an interactive model very easily with TensorFlow.js, with something like ten lines of code, and, using a few interactive runtime environments, make it really interactive for people to play with. I can put in the number of iterations, put in sliders, in an easy way for people to learn, interact, and see what's really happening in this algorithm. Writing this used to be really hard.
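The in-browser demos the speaker is describing would use TensorFlow.js; as a rough illustration of just how small such a model is, here is an equivalent sketch in plain Keras. The toy dataset, layer sizes, and training settings are made up for illustration and are not the speaker's actual demo.

```python
import numpy as np
import tensorflow as tf

# Tiny playground-style classifier in a handful of lines. The in-browser
# version would use the near-identical TensorFlow.js Layers API, with sliders
# re-running fit() as the learner tweaks settings.
x = np.random.randn(1000, 2).astype("float32")
y = (x[:, 0] * x[:, 1] > 0).astype("float32")      # simple non-linear target

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="tanh", input_shape=(2,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, y, epochs=20, batch_size=32, verbose=0)
print(model.evaluate(x, y, verbose=0))              # [loss, accuracy]
```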
We can now actually do this and build many more examples like it. So we can use JavaScript in the browser to build these explorable explanations that help people understand what an algorithm, or a set of algorithms, is doing. I'm focusing on deep learning algorithms here, but there are many other algorithms — including probability, not just the frequentist but the probabilistic approach, and other parts of learning and statistics that we don't really talk about much at this conference.

Okay, we also want to be able to explore the data. How do we explore the data really fast? I need to build intuition about the data that is going into these models, and two of the tools I really like for that are very powerful. This is Facets Dive. It's again a WebGL-accelerated visualization, but really focused on understanding the individual elements in my tabular data — and tabular data is still a key part of machine learning input. I can run this in the browser and feed data into it; it can load data as typed arrays, which is different from the typical JSON format that JavaScript data is written in. I can start to read it and interact with it, load up to around 10,000 data points, include images, see all the images displayed, segment them by category, and start to look at the data space I'm really working with. That part is really important. And if any of you have used dimensionality reduction — the t-SNE or PCA projections — with TensorFlow, there is also the Embedding Projector as a standalone tool where I can load my tensors and visualize the data really fast. One of the benefits here is that they've actually improved the t-SNE algorithm: they figured out a linear approximation, so it runs much faster in the browser. The challenge when we're teaching — and I teach this a lot — is that one reason I prefer to teach t-SNE is that it's easier to visualize and easier for people to interact with. There are many other algorithms, but I don't have an easy way of helping people understand what happens there; they'd have to go and write code. Here I can provide much easier access, and this is all running in the browser using JS.

We want to look at the data, we want to look at the model, but at the end of it we want people to interact with the output and learn something from it. So: what is really happening inside my neural network? This is the most common question I get — can you tell me how the neural network is really performing? There are concepts we talk about, like feature visualization and spatial activations, that can be done, but it's really hard to do that element by element, and there is no easy framework for it. But this is an article, which runs in the browser, on the building blocks of interpretability. It's a research article on Distill, but it is interactive all the way through, running TensorFlow.js; you can interact with different images, see feature visualizations, and understand this concept of what is happening with spatial activations.
What is happening when I'm doing feature visualization? Can I move from looking at a neural network at the level of layers to segmenting it differently, by channels? It builds a whole different framework for thinking about this. And you don't necessarily have to use the browser version: the framework behind it is called Lucid, for feature visualization, and you can also use it in your traditional environment if you want. Or you can build something with it so that you can actually understand and explain to people what's really happening when you use an image network. So we're moving from just algorithms, to data, and then to the interaction of the data and the model, which is what we really want to understand and explain to users and learners — whether they are people learning deep learning or people who are actually going to be users of it and have this question of "can you help me understand a little more of this?"

I'm going to touch on one more topic, on the end-user side. This is again about explaining what's happening in my model in terms of my business outcome. I want to build a model, and there are lots of questions around fairness: how do I make my model more fair? I try to do this kind of visualization to help people understand classification in plain ML, and to get them to build something like this in their own Jupyter notebook explorations or in the work they're doing. But if we can get something like this out of the box, or make it easy to create — something that actually allows me to tune the model, run it, and experiment with fairness or with the different strategies I want to use to check whether my model is biased — that's something I, as a user, as a business, can actually use and interact with. It also lets my end user see what happens when I make different decisions. Because the deep learning part is not just about creating the model; it's about how the model is going to be used in the end. And making it possible for the end user to interact with that matters. This example is a loan decision, but you could apply it to any kind of scenario, and with a neural network available you could run a very simple MLP, probably in the browser itself, to let them interact with it. So the possibilities there are really high.

Building these visual explorations — explorable explanations — actually requires multiple skills. First, they're visual in nature, which is probably my bias as an instructor: I use them as a way of helping people build intuition, but they are visual. They are reactive: I can play with them and understand what's happening, and they give me a way of interacting with the system and getting output back. And they're not only reactive — I can go to my browser and understand what's really happening, whether I want to understand algorithms, my data, or the data-model interaction and its outcomes and impacts. The challenge here, obviously, is that none of us may be fully equipped to do all of this alone.
It's a multi-disciplinary skill. But if you are doing this in JS, there is a possibility: because so much custom visualization capability has been built for the browser, it is easier to integrate what you're doing there — and I'll talk later about how to do this, because that's the other important piece. It's possible to build this much faster now than we could earlier. But it still requires us to think beyond just being people who train deep learning networks and do hyperparameter optimization; we have to expand the scope of what we're trying to do and use this as a tool.

Okay, how do we create? Creating really means model inference: I want to be able to run whatever neural network I've created somewhere and let people access it easily, in the browser. It flips the whole question: instead of sending my data to you, I bring the model to the browser. That's actually a very pertinent question these days, because there is a lot of discussion about privacy. Is it possible to just send the model to the browser and push the compute there? The compute for inference is orders of magnitude lower than for training, so the browser can very well run it — though we do need to make some adjustments, and we'll talk about that. I'm going to talk only about abstract data, but the really interesting deep learning in the browser is happening with perceptual data. Abstract data is the tabular data we usually think about; perceptual data — video, images — is actually a very active area, and I didn't want to go too deep into it given the focus of this conference is not creative coding, but I'll mention it at the end.

So, model inference on text: there was a recent competition on this too. Using Keras.js, you could deploy a comment-tagging system that flags comments as toxic or not toxic. You build your model, deploy it, and people can start to experiment: you can give them immediate feedback as they type on whether their text is helping the conversation or not, what its sentiment is. It's possible to flip it around: instead of saying "send this information and run the model on the server," in many cases you can run it in the browser itself. Then there's the standard MNIST example — I really thought I should drop one MNIST example into a deep learning conference — where you can do inference the same way: you sketch a digit and you start to see the output. And there are many use cases of this. Airbnb has a product in which, as you talk through a user design, you sketch it out; they have a very small neural network trained on about 150 components, and on the right you can see the component design — a fully built-up UI in React, a working prototype — coming up.
So my sketches — my ideas of what I think the UI should look like — can be prototyped in real time, because I have a component architecture and I can train a very small neural network on it. That experience of going from a sketch to a working prototype immediately helps me build very different products from what we have today. Neural style transfers are obviously far easier to execute now. There is a library called ML5.js, which is based on P5.js, the creative coding library, and a lot of the standard examples of loading data and running it in the browser are even simpler there than in TensorFlow.js, because they're designed for a very different audience doing creative coding. It has a very easy way to start building these kinds of products if you want.

Two thoughts I'll add, which haven't really been done yet. Can I do data augmentation in the browser itself, and use that as a way to train? And even more interestingly, can I collect data — which is a very hard problem we all face? You build a simpler, semi-supervised model and — I think the image hasn't rendered correctly here — this is how the Quick, Draw! data was collected: they got people to draw interactively, gave them feedback as they drew on whether the drawing was recognized, and asked them afterwards to provide their own input on it. That really helps collect data. So can I run model inference not just for its own sake, but as a way of running semi-supervised learning in the browser, letting people re-segment data as it runs? That's a very important use case: data collection and data labeling, which are huge problems, can be addressed this way. We were doing a workshop yesterday on idli-dosa-vada image classification, and we had to go and collect 300 images ourselves and label them, just for a one-day workshop. In real-life products the challenge is even bigger. So labeling, which is a real problem — can I run this, and run the same thing for text, using semi-supervised models as a way of getting information from the user?

I mention these because the far more exciting work in model inference is happening on the art side. If you're interested in variational autoencoders and GANs for music and art, and in what people are creating, check out Magenta.js and ML5.js, where people much more on the creative coding side are doing this. There are two issues here. One is data versus model privacy: when I emphasize data privacy, I'm basically de-emphasizing model privacy. People may not be willing to send their model to the browser, and that's a big issue if it's a real problem for you. It helps with data privacy, but model privacy is not maintained, and there are no easy solutions for that. You also have to think about making the model smaller: we won't run a huge image model, we'll run smaller models, and we may have to quantize them.
For example, instead of using full word vectors, I might quantize them down to a few bits, or use other quantization approaches; the libraries come with tools to do that. It also allows me to build applications rapidly if I want. If you're building web-based applications — which you can also deploy as Electron apps, as mobile apps and so on — and the model is part of the output that goes into them, then it's easier to use some of these libraries to actually build products. It's especially important for low-latency requirements, where you want to capture the data immediately and get the inference, and because it can run even when there is no connection to the server, you can save the inference and send it back later. So for low-latency applications, or places with no connectivity, you can still make it work because it runs in the browser.

I'm going to cover one last thing, which is: how do we build this? We want a tool to rapidly prototype. I want to learn this really fast, I want to experiment and try new things, and that should be easy. You can obviously build a UI on top of it — this was the original deeplearn.js UI for running models, where you could add convolutional layers, set some hyperparameters, add your own data, and train it. It was just a demo, but you can build domain-specific UIs like that if you want. For a lot of us, though — at least for me — code is expressive in a way that building UIs may not be. UIs work for a specific domain, but can I get an immediate environment to write code? I really want you to check out Observable, which is a reactive notebook in the browser, free to use. You can just go and start typing. You can already explore a lot of existing material — visualization, explorables, maps — and if you search for tensorflow.js you'll find plenty of examples. It's a reactive notebook that gives you full access to the entire JavaScript environment; you can load TensorFlow.js or any other library you want and start running it. It's very similar to Jupyter if you've used that, but with two differences. Instead of linear execution, it uses reactive execution: it maintains the state of your data and updates every cell whenever a change happens, which is really good for asynchronous programming. And it lets you share those notebooks, build things, and really prototype. People have implemented new papers on Observable to understand, when something new comes out in deep learning, what's happening and how to explain it.

I'll quickly wrap up. Training on GPUs can be done if you integrate with Node, so you can get the benefits of both if you really want. Obviously this is a very young ecosystem, so some custom things you may be building, or feature parity with the Python and C APIs, may not be there yet, but the ecosystem is really improving. You should check out Vega and Vega-Lite, which are the easy way to do visualization, and Observable if you want the reactive runtime environment.
And if you're really sending a lot of data around, you should check out Arrow, which is a columnar in-memory format for sending data in a compact way. So I'm really excited about this because it fits my way of helping people learn, helping people create, helping people build things. And as the ecosystem matures and we get more stable libraries, I can show you better use cases to convince you to start experimenting with it — it would make an even better talk next year, when we can talk much more about the possibilities and people have built things on it. Thank you so much. I'm at amitcaps.com, and I'm pretty much amitcaps everywhere, so if you want to ask me any questions right now, great; if you want to do it afterwards, you can reach me there. — We have time for one or two. Questions? Yeah, go ahead.

In the creative coding part, you said how you could do art with images, and you also mentioned music in passing. I was really curious how that works. — Right, I didn't want to cover this because it's a little outside most people's interest here, but check out Magenta.js. They have a music library which tries to use deep learning models like variational autoencoders and GANs to generate music, so it's really focused on art and music. That's what I would recommend, or look at ML5.js — sorry, not P5, ML5.js — which is focused on the creative coding community, and you'll see a lot more interesting stuff. I could have put fifty examples from that space, but given the audience I didn't want to go too far into the creative side.

What are your thoughts on Distill.pub and Colab by Google? — A few of the examples I showed were from Distill. I really buy their idea of research debt: in a field like deep learning that moves really fast, unless we have an interactive environment to understand research papers and what people have done, it's really hard. People will even put reproducible code on GitHub, but just getting that working on my own machine is not that easy. So if something new comes out and someone has done some tweak, how do I really understand it and see how effective it is? With Distill I have a simple example that runs very easily, executable right where I am — I don't need to install anything — and that's great. Colab is great too. It's a little slow if you run GPUs, but it's free. For me the question is whether it becomes something that's always available, like a Gmail equivalent; I don't know how long it'll be free. It takes care of your infrastructure and it's really collaborative, and I like it, but I don't really use it much, because at this stage I find it very slow even for normal work, not just GPU work.

Hey Amit, great content. I wonder which audience benefits most from these UIs? Say there's a data scientist — he might not need this UI just to tweak hyperparameters and see the output, right? So which audience is best served by this UI?
No — and I think this is my bias from teaching — your job as a data scientist or data engineer is not only tweaking hyperparameters. Yes, that's part of it, but it's also building the case, explaining to people how and why things are happening, because those questions will ultimately come back to you, and providing people a way to understand the models you've built. And these are not just UIs. The UI capability can be richer because we're working in JavaScript — we don't need to write the kind of wrapper we'd write in Python or R to access it — but I think of it more as communication, which is an essential part of the job for any data engineer or deep learning engineer. How do I communicate my results, and communicate them in a way people understand? Allowing people to play with a simpler scenario is one way of doing that. We haven't even explored the space of possibilities here yet. — Thank you so much.

All right, thank you, Amit. The next talk in this auditorium will be at 3:20. Another reminder: the BoF on AI and product will now be at 3:20 — it's been pushed from 1:40. You should all go to the main hall for the conference keynote, which I think has already started. Thanks. You're welcome to stay back, though, if you want to work.

Good afternoon, everyone. Welcome back to Audi 2. We have Gunjan Sharma, who'll be talking about neural-network field-aware factorization machines for online behavior prediction. Wow — I don't even know what that means. Over to you.

Thank you all for coming, and thank you, organizers, for having me here. My name is Gunjan. I'll be telling you a little bit about neural nets and how you can... Hello, does that work better? Is it breaking up? Is it better now? Awesome — sorry, technical difficulties. So I'll start. Again, thank you, everyone, for coming and thank you, organizers, for having me here. Today I'll tell you a little bit about neural nets and how you can combine them with factorization machines to get a prediction system that does better than the existing models out there for digital behavior prediction. And I'll also explain why it works — why combining neural nets with factorization machines even makes sense. So let's start.

A little bit about the authors. My name is Gunjan; I've been working at InMobi for over three years now, and my good friend Varun here has been there for about five years as a senior research scientist. This work was done by both of us. So, moving on, this is the layout of the deck. I'll start by setting up the problem context and giving the motivation — why we even wanted to try neural nets, why factorization machines. Then I'll go to the fun part, which is building the model piece by piece, and take you through the journey of how we discovered the final model. Then I'll let the results speak for themselves — show you the results so you can see for yourselves whether and how much better they are. Then I'll explain why this final model works. And last but not least, we want a practical model; we don't just want a black box sitting there.
We want a model that can be taken to production — trained every day and able to serve billions of requests — so we'll also talk about how to actually do that.

So let's start. As many of you might know, InMobi is one of the largest mobile advertising platforms across the globe. We have a major presence in China, a strong presence in North America, South America, SEA and EMEA markets, more than two billion monthly active users globally, and we're one of the pioneers of mobile digital advertising. So what are the problems? For a typical ad system, you have to do a lot of behavior prediction: requests are coming in real time, and you need to give out a prediction — a bid — for each and every request. Every bidding strategy requires some sort of user behavior prediction. There are three major components. One is your typical click-through rate (CTR): whether the user will even click on the ad. Then, given that they clicked, what is the probability that they'll actually install the app, if it's an app-install advertisement? That is your CVR, essentially the probability of install. And there is another behavior we try to predict, called VCR, the video completion rate: if the user started watching the ad, what is the probability that they'll complete the video? All these predictions are very important for an advertising platform. Imagine you are running a multi-million-dollar business and for every request you're trying to predict bids as accurately as possible; any improvement here can affect your profit margin significantly. So just to give you an idea: we're here to do these predictions for billions of requests coming into our system, and we want to do it as accurately as possible.

Setting up the context of how this has traditionally been done at InMobi: traditionally we use a lot of linear models with a bunch of feature engineering, and a bunch of tree-based models — gradient boosted trees, random forests, or decision trees. Both families have their own strengths and weaknesses. LR models are better able to generalize to unseen combinations, whereas tree-based models like a decision tree are very rigid about the combinations they can predict for. Similarly, LR can at times underpredict, whereas tree-based models — GBT, random forest, or a decision tree — carry the burden that they can overfit. Another thing is that LR typically requires less RAM when you load it into memory, compared to a tree-based model — say a huge random forest with features whose cardinality runs into the hundreds of thousands. Essentially, your tree-based model starts to explode as the number of features and the cardinality of those features increase.
So we wanted to find a model that sits somewhere in the middle and enjoys the best of both worlds: it should be able to generalize to unseen combinations, but not explode as you add features, and at the same time actually beat both of these models on prediction quality.

Okay, since we've set up the problem and we know what we're trying to do and what the current issues are, let's try to understand why we even considered neural nets. With LR, as I said, we do a bunch of feature engineering: cross-features, GBT-based features, and whatnot. The thing is, at some point it becomes very cumbersome to keep adding these cross-features, both on the training side and on the serving side, and it's error-prone. And we know the functions we're trying to predict are complex curves, not straight lines, so LR essentially left much to be desired compared to the tree-based models. After that, since we were trying different models, we also tried two models called factorization machines and field-aware factorization machines, which I'll be covering later; but those two models were also not able to beat the tree-based models we had. That's where we started to think that we need some sort of powerful machine that can predict higher-order interactions and, at the same time, find these cross-features on its own so we don't have to feed them in. And that's when we started looking at neural nets — let's see what's out there and whether we can actually use them.

But first things first: before you start with neural nets in a practical scenario, where we're training every day and serving all these requests at high throughput, we thought about the challenges involved. Typically you see neural nets used for classification problems, and here we wanted to model a regression problem. Another problem is that most of our features are highly categorical, with cardinality reaching into the hundreds of thousands in some cases. Imagine using a one-hot encoded neural net, where every feature value becomes an input and you just switch on one bit saying this value is present and everything else is zero; that actually produces very bad results. From a practical standpoint, that kind of model is also very slow to train, and it takes a lot of time even to predict. Plus, as I said, we want a model that is productionizable — we want to be able to take it from training and put it into production. And it's very essential that the model is debuggable. Imagine you put something out there, you're running this tight business, something goes wrong, and you don't know what exactly is happening inside that black box.
So it's very difficult to explain to your product managers or your business folks what went wrong, right? You should be able to argue about what happened, or at least understand what is happening, why it's doing better and why it's not doing better. So it's very important that the model is debuggable. So basically these were some of the hurdles we were facing before using neural nets, and generally we wanted to prove that using neural nets can actually give you a big breakthrough before we even take it out into production and give it a spin. Anyway, the boring part is over, I guess; now the most fun part: building the model piece by piece. I guess that's what most of you would like. So imagine you have this dummy data set, where you have these three publishers, these four advertisers, the gender values, and then the corresponding historical CVR values, right? And you want to build a model that can do the prediction at runtime. The first model that we tried is called a factorization machine, FM for short. What the factorization machine says is: you know what, let's take every feature value and represent it in some sort of latent feature space. So imagine a K-dimensional latent feature space. You have all these categorical features, and what you say is, ESPN will have a K-dimensional representation in that space. Similarly, CNBC, Sony and all these values will have a K-dimensional representation, some numerical values, okay? Now consider this training row where you have ESPN, Nike, male, and this is the value that you want to predict. You will take ESPN's K-dimensional vector, let's call it the publisher latent vector. You'll take the advertiser's K-dimensional vector for Nike, let's call it the advertiser vector. And similarly for gender, for male, you will get a gender vector. What the factorization machine says is: take these vectors, take the dot product of the publisher vector with the advertiser vector, take the dot product of the advertiser vector with the gender vector, take the dot product of the gender vector with the publisher vector, sum them linearly, and that is your predicted output. Each of these dot products is a scalar, and the sum of them will again be a scalar, and that's your predicted output. Now you can write your own loss function. We used a weighted RMSE, but you can use any loss function, and you essentially run back-propagation and you can learn the values of all these latent features. So initially all of these are unknowns; basically you put random values in, you try to optimize your loss function, and essentially you learn the values of all these latent feature vectors. Let's try to see in words what we're trying to do. Basically we are trying to represent every feature value in the same K-dimensional space. The second thing we are doing is taking vector dot products. When you take a vector dot product, essentially it measures how close these vectors are in the latent feature space, as well as what their strength is. A simple example that I like to think about whenever I'm thinking about factorization machines is that of a movie. Say you want to predict how much a movie will make in revenue, right? So here, let's say these three features: movie, city, gender, and this is what you can imagine your latent features will look like. So this would be K equal to four.
So: horror, comedy, action, romance. Essentially what you're trying to do is find, for every movie, how much horror, comedy, action and romance content it has. Similarly, for every city and every gender value, you're going to try and figure out how much they like horror, how much they like comedy, how much they like action, how much they like romance. Some of you might be thinking about collaborative filtering right now, and yeah, it's kind of similar, but collaborative filtering is limited to two dimensions, whereas here you can have multiple features, right? Anyway, once you have these latent features, here is what second order means: when you multiply those two vectors, what is happening is that for every latent feature, assume horror, and for every pair of features, let's say movie and city, you're trying to figure out how much these two come together and affect the revenue in the horror space. Similarly, how city and gender come together in the horror space and affect the revenue, and similarly for movie and gender. Then you do it for each and every latent feature, and you are claiming that the final revenue will be the linear sum of all of these contributions combined, okay? So that's the whole idea of second order interactions: how features come together and affect the revenue, okay? Another way to think about how it is different from LR: LR assumes that every feature is independent of every other, whereas what FM assumes is that the way a particular feature moves in the latent feature space should affect the way other features are represented in the same space. So whereas in LR you just independently have these weights for each and every feature, here you say that the weight of this particular feature will affect the weights of other features as well, okay? Having said that, it's not a linear model: you are basically taking the cross product of two features and using second order interactions, so it's kind of a bunch of hyperbolas coming together to make your final curve. It definitely works better than LR, but it was still not able to beat our tree-based models. So the next model that we came across was called FFM, which is a slight upgrade of FM: field-aware factorization machines. This comes from a published paper, by the way. So I'll talk about FFM. What FFM says is: instead of representing every value with one single K-dimensional vector, let's find a K-dimensional representation for every other field it can cross with. So for example, for each of ESPN, CNBC and Sony, we'll have one K-dimensional representation for the advertiser field and one K-dimensional representation for the gender field. Similarly, for every value in the advertiser space, you'll have a K-dimensional representation for the publisher field and a K-dimensional representation for the gender field, and similarly for gender. Now, instead of multiplying the vectors as they are, what you do is: you take the publisher's advertiser vector and multiply it by the advertiser's publisher vector; you take the advertiser's gender vector and multiply it by the gender value's advertiser vector; and similarly you take the dot product of the gender value's publisher vector with the publisher's gender vector. Sum those up and that is your final predicted output. Again, you can write your own cost function — we used a weighted RMSE — and you can run the entire back-propagation and learn the values of each and every interaction.
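To pin down the two formulas just described, here is a minimal NumPy sketch of the FM and FFM predictions for the toy row (ESPN, Nike, male). K, the field names and the random initialisation are purely illustrative; in the real model these vectors are exactly what back-propagation learns.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4
fields = ["publisher", "advertiser", "gender"]
row = {"publisher": "ESPN", "advertiser": "Nike", "gender": "male"}

# FM: one K-dimensional latent vector per feature value
fm_vec = {v: rng.normal(size=K) for v in row.values()}

def fm_predict(row):
    """FM prediction: sum of pairwise dot products of the values' latent vectors."""
    vals = list(row.values())
    return sum(fm_vec[vals[i]] @ fm_vec[vals[j]]
               for i in range(len(vals)) for j in range(i + 1, len(vals)))

# FFM: one K-dimensional vector per (value, other-field) pair
ffm_vec = {(v, f): rng.normal(size=K)
           for fld, v in row.items() for f in fields if f != fld}

def ffm_predict(row):
    """FFM prediction: each value uses the vector 'aimed at' the field it meets."""
    items = list(row.items())  # (field, value) pairs
    score = 0.0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            (fi, vi), (fj, vj) = items[i], items[j]
            score += ffm_vec[(vi, fj)] @ ffm_vec[(vj, fi)]
    return score

print(fm_predict(row), ffm_predict(row))
```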
Again, let's try to look under the hood at what it's trying to do differently, what the intuition behind this particular model is. We are representing every value with a K-dimensional representation for every other field it crosses with. So if there are n features, every feature value will be represented by n minus one K-dimensional vectors, right? We are still trying to capture the second order interactions; it's just that now we have given the model more degrees of freedom to express those interactions. And what that does is that it's able to beat FM, right? The intuition behind it is pretty simple: the way, let's say, the movie feature will interact with city will be very different from how the movie feature will interact with an age feature — sorry, the gender feature. Similarly, a gender feature will interact very differently with a movie feature versus a city feature. So we are just giving it more degrees of freedom to express itself, while at the same time keeping the cardinality reduction that we were doing in FM, right? That densification into the latent feature space. Yeah, this model definitely worked better than FM, but still was not able to beat the tree-based models in certain cases. So at this point we collected our thoughts and started thinking: we need a model that can do better than this. Basically it was pretty clear that adding higher order features was helping us, but we still hadn't found the final solution, right? We were not able to beat the tree-based models. So that's when we started investigating a couple of papers that are out there which use neural nets for this kind of behavior prediction. The first paper that we came across was called deep neural nets with factorization machines, DeepFM for short. Don't get scared by the graph, it's pretty simple. It just uses the neural nets and factorization machines that we just learned about and tries to figure out the output. The way it does it is: this dense embedding is nothing but your latent features. So for every feature you'll have the K-dimensional vector, and you run it through the FM machine and you get some sort of output, let's say. Then what it says is: use these latent features that you have found, append them together to create this huge vector, and then, assuming this is the input layer to your neural net, just run it through the neural net and try to find higher order interactions from there and get an output. So your final prediction will be the sum of the factorization machine output and the neural net output, run through a sigmoid activation, and basically that's your final prediction. Just putting it in a formula: you have some output from FM, you have some output from the neural net, which you got by appending all the latent features together and running them through the neural net, and the sum of those is your final predicted output. Again, let's try to understand it a little bit. Basically it uses both a neural net and a factorization machine. Remember I told you that one-hot encoding is very bad for neural nets. So now what it's trying to do is use these dense features that we got through FM and run them through the neural net, so that you have a denser input layer for the neural net, right? Not the 0-1-0-1 kind from earlier.
Plus it's using the factorization machine to find the second order interactions, and telling the neural net to take those latent features and find the higher order and non-linear interactions. So the neural net, we know, is able to find higher order interactions; we know factorization machines find second order interactions; and the sum of the two together is your final output. This is the whole intuition of the paper — if you go through it, that's basically what they claim: FM is able to find the second order interactions and the hidden layers can find the higher order interactions. One important bit to note here is that there is another model called FNN, which this particular paper talks about, where you train the factorization machine separately, you get some latent feature representations, and you use those latent feature representations to run through another neural net and get the final output. So basically you're running two different optimizations: one for the factorization machine to learn the latent vectors, and then, using those latent vectors, you're trying to find the higher order interactions for your final predicted output. In DeepFM, the idea is that this entire model is trained together: you define your cost function, you run back-propagation, it goes through all the hidden layers, it goes through the FM, and all the latent features are learned together instead of doing two different optimizations. The reason it's better than FNN is that both the neural net and the factorization machine get a say in how the features are represented in the latent feature space. Whereas if you train them separately, only the FM has a say in how the latent features are represented, and then you use them statically — you basically freeze them and then run them through the neural net, so the neural net won't be able to affect the values of the latent features. So it's very important that you train the entire model together, which is what the DeepFM paper talks about. Cool, yeah. Anyway, one of the things I forgot to mention is that this model was able to beat FM, but it was not able to beat FFM, let alone the tree-based models. So again, it was a little bit of a low point, but we started looking at the papers out there. The second paper that we came across was called neural factorization machines, or NFM for short. It's a pretty curious model. It's similar to DeepFM in that it uses both FM and neural nets, but the architecture is a little different. What this model says is: you have the K-dimensional vector representations for every feature. Instead of just taking them directly into the neural net, first change the formula for FM: instead of taking scalar dot products and summing them linearly, keep things as vectors — use element-wise vector products and vector summations. So you take the element-wise product of the publisher vector with the advertiser vector and you get a K-dimensional vector; similarly you'll get another K-dimensional vector and another K-dimensional vector. You take the element-by-element sum of these K-dimensional vectors and essentially end up with a final K-dimensional vector. That vector is the bi-interaction pooling output, and then, assuming this is your input layer, you run it through the neural net. It was pretty curious in the sense that just when we ran this model, we got better results than DeepFM. Then we started trying to understand what this model is doing. So let's try to put this in words. The architecture is again FM plus neural nets.
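Before getting to the intuition, here is a minimal sketch of that pooling step followed by a tiny hidden layer. The shapes, the random weights and the single ReLU layer are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_fields, hidden = 4, 3, 8
emb = [rng.normal(size=K) for _ in range(n_fields)]  # latent vectors of the row's values

# Bi-interaction pooling: element-wise products of every pair of embeddings,
# summed element-wise into a single K-dimensional vector
pooled = np.zeros(K)
for i in range(n_fields):
    for j in range(i + 1, n_fields):
        pooled += emb[i] * emb[j]   # element-wise product, not a scalar dot product

# That single K-vector (not the concatenation of all embeddings) feeds the neural net
W1, b1 = rng.normal(size=(hidden, K)), np.zeros(hidden)
w2 = rng.normal(size=hidden)
y_hat = w2 @ np.maximum(W1 @ pooled + b1, 0.0)  # one ReLU hidden layer
print(y_hat)
```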
The intuition here is: we know that FMs are able to find the second order interactions in that horror, comedy, action, romance feature space. So instead of using the raw latent features, why not use the second order interactions themselves as the feature space and try to find the higher order interactions from there? You already have the second order interactions; use them to find the higher order interactions. Some of you might wonder: it seems like DeepFM is some sort of superset of NFM, and DeepFM should be able to do everything the NFM part does as well. But practically, this model works a lot better than DeepFM, for two major reasons. One of them is that the size of the neural net is very small, right? You just have K inputs in the input layer instead of K times the number of features. And the second, most important reason is that it's easier for neural nets to take the second order interactions and find higher order interactions, versus just taking the latent vectors and trying to find those higher order interactions. So all in all, practically this model works a lot better than DeepFM, but it was still not able to beat FFM, right? At this point things were becoming a little clearer to us. We knew that FFMs were doing better than FM — they are able to find better second order interactions — and we also knew that NFM was doing better than DeepFM, right? Because it's easier for neural nets to take the second order interactions and find higher order interactions. So the next step was pretty obvious to us: basically take these two papers, replace all the FM machines with FFM machines, and see what happens. So those were the curious things that we tried, just experimenting with different kinds of models. The first thing we tried was called DeepFFM. In this particular model, again, we just replace the FM machine with FFM. So these are all your cross-feature interactions, you run them through the FFM machine to find the second order interactions, append all these vectors together to get the final huge input vector for the hidden layers, and then you get an output. In the interest of time I won't spend a lot of time on this particular model, because it was very slow to train in practice; it was taking a lot of time. And again, as I said, we don't want a black box, or just a model that looks good on paper; we wanted a model that can actually be run in production, right? So the training was much slower. But we still looked at the results, and the results were better than FFM. So now we were actually able to beat FFM. One of the major problems this model was suffering from, though, is that most of the time the predictions were actually controlled by the FFM part rather than the neural net part. This was an interesting thing to find out, because when the gradients flow back, it's easier for the gradients coming through the FFM part to affect these latent vectors than for the gradients coming through the hidden layers. If you remember how gradients flow, it takes a long path for the gradients to come back through the hidden layers, and essentially they start to die down. So they are not able to affect these latent vectors as much as the FFM machine is.
So essentially most of the heavy lifting of the prediction is done by the FFM machine compared to the hidden layers. And anyway, the results were not great in the sense that it was very slow, and we wanted a better model than this, right? So the next upgrade was pretty intuitive: we took the NFM paper and just removed the FM and inserted FFM into it. So essentially you have these three features, you have the cross features, you have the FFM machine, which we know is able to find better second order interactions, and we just take those second order interactions, run them through the neural net, try to find the higher order interactions and get a predicted output. This particular model was the one that really made our eyes light up. We instantly saw a huge improvement in the predictions. And this is when we started investigating more as to what it's trying to do, how it does so much better than anything else, and started to check the results across the other models that we had. The intuition, again, is pretty simple. We have established that FFMs are better able to find second order interactions. We know it's easier for neural nets to take the second order interactions and find higher order interactions. So that's the whole intuition of the entire model, and it does better than every other model. And of course it converges faster too, because the size of the net hasn't changed compared to NFM — whereas in DeepFFM you had a huge input vector, here you still have a K-dimensional input, okay? Okay, before I actually explain why this model works, or what the major pieces of this model are, I want all of you to see the results for yourself. This was one of the small data sets that I was testing this model against. You can see FM was doing better than the linear model, but then NFFM came along and just blew all the results away compared to every other model that we had tried. We used weighted RMSE for the loss function, and this is our accuracy function. After that, when I saw these results, the next obvious thing was to try it on the entire production set, and I got a huge improvement over both the linear models and the tree-based models. Just don't compare these numbers directly — the data set is a little different — but all in all, what I want to establish is that we are doing better than both the linear models and the tree-based models. At this point I actually met with Varun and told him, okay, I have this model, it's doing pretty well. At the time he was working on the VCR problem and he had tried a bunch of models, so I asked him to try the NFFM model on top of VCR. So I'll tell you what these three models are. The first is your vanilla logistic regression: essentially you have N features, you run them through logistic regression and you get a predicted output. The second is LR with second-order cross features: basically, let's say you have N features, you take every two-feature combination of these, so NC2 combinations, treat every combination as a feature, run IGR, information gain ratio, on top of them to get the top X features, and use those in logistic regression to get the final predicted output. The third model that he had tried was GBT-based LR, so essentially you train a gradient boosted tree.
Every path in that gradient boosted tree becomes a feature, you feed those features to LR, and then using that you can do the final predictions. So as you can see, he did a lot of feature engineering and was able to get some improvements, but then he just tried vanilla NFFM — essentially took the same inputs as the logistic regression and ran them through NFFM — and the results were staggering. I mean, for a second he had to actually stare at the results to understand whether they even made sense or not. But anyway, as you can see, without doing any sort of feature engineering, he was able to simply beat all these models together. These are some of the details of his implementation, how he implemented all those things. One of the cutest things he found out is that as he increased the number of layers in the neural net to three, he got another 20% improvement on top of the results you saw before. After that, there was improvement but it wasn't significant enough. So in that sense, depending on your production requirements, you can keep adding hidden layers and get better predictions. At this point, of course, we were clear that, okay, we have something in our hands. Let's try to break it apart, because we again want to understand what it's trying to do. So here are the three major pieces of this model that actually make it work. First things first is the factorization machine part, which takes these high-cardinality feature vectors and converts them into denser, lower-cardinality data. Essentially you took features with cardinalities in the hundreds of thousands and converted them into, let's say, K equal to 16, so you represent them with just 16 numerical values. These we are calling latent features. So basically we are changing the representation, trying to represent all these features in a numerical space. I think a lot of people have talked about embedding models and the like — this is kind of similar to that, but you are learning it as part of your entire graph. The second important piece of this particular model is the field-aware part. As we said, factorization machines are awesome, but it's even better when you give them more degrees of freedom. Basically, when you go field-aware, you are letting every feature value represent its own interaction with every other field. Hence you are better able to find the second order interactions, by giving them more degrees of freedom, right? Third, and definitely last but not least, is the neural net part, which takes the second order interactions and is able to find higher order interactions and give you a better prediction compared to just summing them linearly as we were doing in FFM, right? That's the whole addition of the neural net: take the second order interactions and, instead of summing them linearly, combine them non-linearly through higher order interactions. Another important bit, again, is that we are training the entire model together. As I said for DeepFM, it's better to train the entire model together and let both the neural net and the factorization machine affect how the latent features turn out. And it actually also saves you time, because you're not running two different optimizations; you're running one optimization and learning the entire graph together, right?
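Putting those three pieces together, here is a minimal sketch of the forward pass as described: field-aware embeddings, pairwise interactions pooled into one K-dimensional vector, then a small ReLU network. The sizes, the random initialisation and the element-wise pooling are illustrative assumptions on my part, not the production configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
K, hidden = 16, 32
fields = ["publisher", "advertiser", "gender"]
row = {"publisher": "ESPN", "advertiser": "Nike", "gender": "male"}

# Field-aware embeddings: one K-vector per (value, other-field) pair, learned jointly
emb = {(v, f): rng.normal(size=K)
       for fld, v in row.items() for f in fields if f != fld}

# FFM-style second-order interactions, kept as K-vectors and pooled element-wise
pooled = np.zeros(K)
items = list(row.items())
for i in range(len(items)):
    for j in range(i + 1, len(items)):
        (fi, vi), (fj, vj) = items[i], items[j]
        pooled += emb[(vi, fj)] * emb[(vj, fi)]

# The pooled K-vector is the (small) input layer of the neural net part
W1, b1 = rng.normal(size=(hidden, K)), np.zeros(hidden)
w2, b2 = rng.normal(size=hidden), 0.0
y_hat = w2 @ np.maximum(W1 @ pooled + b1, 0.0) + b2   # one ReLU hidden layer

# Squared error for one example, weighted by how often the combination was seen
count, y_true = 120, 0.03   # illustrative numbers
loss = count * (y_hat - y_true) ** 2
print(y_hat, loss)
```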
All the neural net weights as well as the latent features get learned together in that single optimization. Okay, at this point you might ask: Gunjan, it's good that this particular model works and everything, but how are we even implementing it? You said you want to run it at huge scale, and TensorFlow serving is generally not known for that. Okay, one thing I forgot to tell you is that this entire model was built in TensorFlow. So at this point I want to give you a little bit of detail about the entire model. Basically K, lambda (which is your regularization parameter), the number of layers, the number of nodes, and the activation functions that you use become your hyperparameters, which you can tune with cross-validation. Then we implemented the entire model in TensorFlow. Again, traditionally at InMobi we had been using Spark for the majority of our prediction models; this was the first time that we actually tried a TensorFlow model for a production-like system. Then we used the Adam optimizer. And we didn't use dropout or batch normalization. The NFM paper actually does talk about using dropout, both for the latent feature space and for the neural net. We haven't tried that; L2 regularization worked just fine for us. But these are some of the things we want to explore in future, both batch normalization and dropout. We used a single hidden layer and were able to get a huge improvement in prediction compared to FFM, and as I said, if you increase the number of layers it just keeps improving, but at the cost of training time. We used ReLU activations because they converge faster than the other activation functions out there. Again, we want a model which is practical, so if you can get performance improvements, and for that matter prediction improvements, using ReLU, it's better to use ReLU. For K we tried various values — two, four, six, eight, 16, 32, that kind of thing. At 16 we were pretty happy with the results, in the sense that beyond that the training time starts to increase again at the cost of a very insignificant improvement. We stopped at K equal to 16, but the FFM paper does talk about larger values of K giving you progressively better results. So that's one thing you can always play with and actually get better improvements depending on your use case. We used weighted RMSE as our loss function, as I said, for both use cases. Weighted RMSE here would be the predicted value minus the actual value, squared, multiplied by the number of times you saw that particular combination. That's your weighted RMSE. One of the curious things about this model is what you do when you want to predict for unseen values. Let's say you trained your entire model with three publisher values, ESPN, CNBC and Sony, and at runtime you get a new publisher, right? Sometimes, in certain use cases, what you do is essentially a default prediction: okay, we'll just predict this fixed value. But what this model has as a specialty is it says: if you want to do an average prediction, take the advertiser-space vectors across all the publishers and average them, and similarly take the gender-space vector representations across all the publishers and take their average.
Use these as your average latent feature representation, run them through your FFM machine and then through the neural net, and whatever you get as output is your final predicted output. Practically, it turns out this works much better than using a simple default value. So this is one of the benefits: if you have a use case that requires a default prediction, you can use this. Anyway, again, we want to implement this entire model at very low latency, because we have to reply back with a prediction within a few milliseconds, and we want to do it at high scale — we are talking about billions of requests every day. So there are three major pieces that come together to make it happen. First is MLeap. As I said, traditionally we have been using a lot of Spark models. We wanted a framework that can support both TensorFlow models and Spark models, and MLeap is an open source framework which actually supports this. As far as your prediction systems are concerned, they just talk to MLeap, and under the hood MLeap takes care of whatever kind of model you have, right? So basically this helped us support both kinds of models in our production system. The second thing was to actually be able to train this model. We wanted to train it on a GPU, because training it on a CPU was very slow and training it on a YARN cluster was again not possible for us at that time. So what we did is we got a GPU machine and made it a gateway to the HDFS where our data resides. We pull the data from HDFS onto this gateway, train the model on this GPU, and hand the trained model back to MLeap for serving. The third very important part here is that if you have used TensorFlow serving for a real-time system, or MLeap serving, you might know that the throughput of TF serving is very, very low. Essentially, if you bombard TensorFlow serving with a lot of requests very fast, it just starts to choke, and it will take a lot of time to respond. So we did a simple addition on top of MLeap: we just added an LRU layer, a caching layer, on top of it. Every time a request comes to MLeap, it first checks whether it is in the cache. If it is, it just returns it in a millisecond or so. If it's not, it goes back to the TensorFlow serving system behind and gets the prediction. The idea is that most of the time you'll hit the cache; if you don't, very few requests will actually reach TensorFlow serving, and given that the number of requests hitting TensorFlow serving is now small, it can reply back in a much faster time, right? Anyway, these are some of the future works that we are currently pursuing. As of now, I can tell you that we have kind of cracked distributed training on a YARN cluster — take a TensorFlow model and train it on a YARN cluster, a Spark cluster, basically. We are still evaluating whether the results are similar or not. We want to try dropout, batch normalization, and some sort of hybrid NFFM model, which is a combination of some feature engineering run through NFFM. And, as I said, debuggability is very important to us, so we want to be able to understand what these feature vectors represent.
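Coming back to the serving side for a second, here is a minimal sketch of that cache-in-front-of-the-model-server idea. The key format and the stand-in `model_server_predict` function are hypothetical; the real system would make an RPC into MLeap / TensorFlow serving:

```python
from functools import lru_cache

def model_server_predict(feature_key: str) -> float:
    """Hypothetical stand-in for the slow call to the model server
    (TensorFlow serving behind MLeap); the real system would do an RPC here."""
    return 0.5  # dummy score

@lru_cache(maxsize=1_000_000)
def predict(feature_key: str) -> float:
    """Cache-first prediction: repeated feature combinations are answered from the
    in-process LRU cache; only misses fall through to the model server."""
    return model_server_predict(feature_key)

# e.g. a key built from the request's categorical feature values
print(predict("publisher=ESPN|advertiser=Nike|gender=male"))
print(predict("publisher=ESPN|advertiser=Nike|gender=male"))  # served from cache
```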
On that point, there is a well-known algorithm called t-SNE which, given high-dimensional vectors, can project them onto a 2D space so that you can visualize how these particular vectors are grouped together. We are still pursuing this — we have kind of cracked it, we are just still evaluating it. These are some of the references: the NFM, DeepFM and FFM papers. And thank you guys, that's all I had. If you have any questions, feel free, guys. Can you just go back once to the field-aware factorization? I mean, just to where you showed how you construct the vectors. Sure, maybe this one, yeah. So why do you have, for example, for ESPN, two different vectors? What does that exactly mean? So the intuition here is that in a factorization machine you're assuming every feature value has a single vector, and we established that the way one vector changes should affect the way another vector changes. But the intuition here is that it doesn't have to be one vector: it could be very different how a particular feature interacts with one feature versus another to give you the final prediction. So essentially we tell it: the way ESPN is represented in the K-dimensional space when you are trying to find the interaction with the advertiser will be very different from the way ESPN is represented when you are trying to find the interaction with the gender vector. That's the whole idea. Okay, okay. Why are you only including these? Hello? Hey. Yeah, so I wanted to ask: in your example, you have taken three features, and then the combinations come out to be two to the power three, I mean, something like that. NC2, yeah. Now what I wanted to understand is, if I want to scale it to 10 features, because my problem statement has 10 to 20, I think it has more, will this still be useful? Absolutely. I mean, this is just a toy example, but I can say with certainty we train it with more than 10 features, so it definitely works. This is just a toy example; I didn't want to confuse things, right? But yeah, it does work. Could you elaborate a bit more on the number of features you have, like publishers and... So I don't know if I'm legally allowed to say that, but we use... Is it the total number? Yeah, we use a bunch of features. I mean, this is just a toy example. We use a bunch of user features, a bunch of demand-side features, if you know what I mean — publishers, advertisers, device IDs or device types, something like that, right? And did you use any vectors or representation for seasonal variation in preference? Not right now. The thing is, when you are using the latent feature representations, they cannot express the seasonal variations. Maybe you want to use some sort of RNN or something like that to capture them. But essentially, as far as this model is concerned, it's more reactive rather than proactive, right? So yeah, it doesn't capture any seasonal variations. Sorry. Hello, Gunjan. Hey, yeah. Okay, sure, sorry. I'm still wondering, why are the sizes of the embeddings — let us call them embeddings — why are they all the same? Is it the way or the... I can see that they are being multiplied, so the vectors have to be multiplied accordingly. Yeah. Is there any other reason?
It's simpler to understand in that sense... See, first of all, you cannot take a vector dot product if they are not the same size, unless you maybe pad with zeros, which essentially doesn't help. So imagine this one is of size K and this one is of size L, right? Whatever extra values the bigger one has essentially act as nothing but a linear term, the output from a linear-regression kind of thing, right? It doesn't really help a lot; it's not much of an addition on top. You're getting my point, right? So basically, if this one is smaller in size and this one is larger in size, whatever extra values it adds just get added linearly. In theory, mathematically, you can show that you could simply sum them together into one of the features and get the same value out of it. Getting my point? It's like adding a constant, right? Basically all you are saying is C plus this vector is equal to something. It makes sense, right? So if they are of different sizes, the extra dimensions are essentially loose and they just combine together linearly as a constant. All right. Thanks, Gunjan. Some of you have questions remaining; Gunjan will be available offline, please catch up with him. Talks here will be resuming at 4:20, so you can either catch the talk going on in the main auditorium... Hello. Okay. Should I start? Okay. So the title of my talk is What You Cannot Do With Machine Learning — some strange things. Next up we have Harsh Gupta. He'll be talking about What You Cannot Do With Machine Learning. Hello. Hello. I'm Harsh Gupta. I work at Nilenso. Nilenso is India's first software cooperative. So I'm from an engineering background and many of my friends are also engineers. And many of these friends believe in this god of data science: they believe that given enough data, any problem is solvable. Many of those people would happily replace lawyers, judges, doctors, pretty much everyone, with machine learning. You might be aware that there are already companies who are using machine learning to determine who they should hire, or even who should be put in jail, right? In the U.S. there are courts which are using machine learning for criminal risk assessment and so on. So I'm quite skeptical of their claims, and I think these things can backfire a lot. So let's think about the process of machine learning. The way you do it is: you have some data, you put it into a machine learning algorithm, and you come up with some predictions or output. But the data comes from somewhere, and the predictions are used for something. And not taking into account how the data was generated, or how your predictions are being used, can be fatal. It can have a lot of hard consequences for society and for innocent people. Let's talk about data. Here's an example from World War II. You see a picture of fighter aircraft from World War II. These planes would go to the battlefield, fight with the enemy and come back, and they would come back damaged — and some wouldn't come back at all. So the task of Abraham Wald and his team was to determine where to put more shielding on the aircraft, and they could not put more shielding everywhere because that might make the aircraft too heavy; there are also cost constraints, right? So they had the data of where there was more damage on the aircraft and where there was less damage, and the team had this opinion that the places which are more damaged are more likely to be hit by the enemy.
And I guess a naive machine learning algorithm would reach the same conclusion. But Abraham Wald was smart. He realized that not all planes come back, and these planes were able to come back even after being damaged. So the places where we see no damage are the places where, if an aircraft was hit, it didn't return. So Abraham Wald decided to put more shielding on the places which were less damaged. What do we learn from this? We learn that the context of the data is really important, and taking the context into account, from the same data you can reach a different conclusion, even the opposite conclusion. Let's take another example. This is from the US presidential election of 1936. The candidates were Franklin D. Roosevelt and Alf Landon. The Literary Digest was in the business of conducting polls, that is, predicting who would be the winner of the election even before the election happened. They had a sample size of 2.4 million. There was another person called George Gallup who had a sample size of just 50,000. The Literary Digest predicted that — sorry, the Literary Digest predicted Landon would win, George Gallup predicted Roosevelt would win, and Gallup also predicted that the Literary Digest's prediction would be wrong. Who do you think was right? Well, Gallup, even with a roughly 50-times smaller sample. And why was that? It was because the Literary Digest ignored how their data was generated. The way they generated their data was that they were a very popular magazine and they sent out poll forms with their magazine; people filled out the forms and sent them back. At that time, more Republicans were likely to subscribe to magazines than Democrats, so the sample over-represented Republicans relative to Democrats, and even with a far bigger sample, they produced wrong results. Another thing to note here is that a wrong bigger sample can be worse, because it might lead you to make wrong predictions with more confidence. So getting the right data can be really hard, and if you talk to any statistician, they'll tell you how hard it is to get the right data. But let's assume somehow you have got the right data. You have been very careful with it; you have thought through all the complexities of the process and what kinds of things can affect your data quality. Even after you've done that, you are using the data to model some underlying system, right? And that system can be really, really complex. Okay, so here are some data points. Do you see a pattern? Yeah, I see two lines diverging. Here are some more data points from the same underlying function. Here are some more. Here are some more. And some even more. Do you see a pattern now? I see a sine curve. And the data points were indeed from the same sine curve. But from the amount of data you had at the start, you could not have predicted that it was a sine curve. So the amount of data you need corresponds to the complexity of the problem you're trying to deal with. What does this mean in real life? Well, there are problems for which you'll never have enough data. Why do you think cryptographers aren't worried by developments in deep learning? Because you'll never have enough data. To break AES with a key size of 256 bits, you'd probably need data of the order of 2 to the power 128, and I guess the whole of humanity doesn't have the processing power to process that amount of data. We'll never have enough data for this kind of problem.
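As a toy illustration of that point (my own example, not one from the talk's slides): fit a straight line to a handful of points sampled from a sine curve and the fit looks excellent, even though the underlying function is nothing like a line:

```python
import numpy as np

x_small = np.array([0.0, 0.3, 0.6, 0.9])        # a few points from y = sin(x)
y_small = np.sin(x_small)

# least-squares straight line through the small sample
slope, intercept = np.polyfit(x_small, y_small, 1)
residual = y_small - (slope * x_small + intercept)
print(np.max(np.abs(residual)))                  # tiny: the line "explains" the data

# the same line is badly wrong once you see more of the curve
x_large = np.linspace(0, 2 * np.pi, 50)
print(np.max(np.abs(np.sin(x_large) - (slope * x_large + intercept))))  # huge error
```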
What happens when you don't have enough data corresponding to the problem you're dealing with? The complexity of the problem increases with the number of variables you have in the process, and also with the interactions they have. When you don't have enough data for your problem, you'll find patterns which don't exist. So here is a graph, for a fixed amount of data, of the number of variables against the number of false patterns you can find, and you see the number of false patterns you can find increases with the number of variables you have. How does this translate into real life? Well, I like food, and I guess a lot of you do too. My friend Jyoti will tell me that not eating non-veg is unhealthy, but my mom will tell me otherwise. Keto appears to be the fad diet at this time, but five or ten years ago there was some other diet which was the most popular, and they would say it was scientifically proven that that diet was the best diet. I'm sure five years later there will be some other diet which will claim that the current diet is not the right diet. Why does this happen? Because there are so many factors which go into how your body reacts to the diet you have that it's very hard to get the right amount of data. You need to take the right things, in the right quantity, in the right form — for example, taking vitamin pills is not the same as taking vitamins through food sources. It also depends on what geography you live in, what your body type is, what your genetics are. And all of these diets essentially take only certain variables of the equation and claim that they have got the right answer. But when you take only certain things in isolation in complex systems, you might think that you have got the right thing, but you probably haven't. And diet is not the only complex system, right? Social systems are complex, finance is complex, genetics is a complex system. In complex systems you cannot take certain variables in isolation. So complex systems can fool you. Now let's talk about the output. So you are using this machine learning to predict some output, and you are essentially reducing complex things into simplistic metrics. For example, health is a very complex phenomenon and you often reduce it to BMI; you reduce the health of an economy to GDP, or someone's knowledge to their grades. And when you reduce very complex goals to simplistic metrics, these metrics can backfire. Why? Because people are not rocks: they respond to the ways you try to judge them. There's this law called Goodhart's law, which says that when a measure becomes a target, it ceases to be a good measure. And history has lots and lots of examples of that. In 18th century France and England, the government wanted to tax people on their wealth. But it's very hard to determine how much wealth someone has. So some smart person in the government realized that the wealth someone has is probably proportional to the size of their house, and the size of their house is probably proportional to the number of doors and windows it has. So they started counting the number of doors and windows people's houses had, and they started taxing people on that. Well, what happened as a result? People started having these weird buildings where they would simply remove the doors and windows.
So the government eventually removed the tax, this stupid tax, but they still have these buildings, and these had a lot of repercussions in terms of the amount of light exposure people had, their health, their ventilation and so on. And there are lots of examples like that. So when we believe we'll be judged by silly criteria, we adapt in silly ways. So to recap: you need to take into account how your data was generated, because your data is at the core of the whole thing, and if your data is flawed and you're not thinking it through, everything which follows might fall apart. Even if your data is right, the underlying system can be very complex, and these complex systems can fool you. You also need to be very careful about the metrics you use, because they can backfire. Cool. So, some questions which I think you might ask. Do I mean machine learning does not work at all? Well, I don't know what "works" means, but I think there are so many things which can go wrong that it probably doesn't work unless you are super careful. And a good fit on a test set does not prove anything — maybe rigorous field testing does. But you cannot just take a static data set, come up with some predictions and assume they make sense. Why doesn't a good fit on the test set prove anything? Well, the reason is that the way you usually do testing in machine learning is that you have a static, fixed data set and you divide it into a test set and a training set. But the test set is supposed to be used only once. What if your algorithm doesn't perform well on it? You change some hyperparameter, train the algorithm again and test it on the test set again. If you do that enough times, you might end up overfitting on the test set itself. Also, if your data is bad, a good fit on bad data does not mean anything. What about machine learning at Google, Facebook, Microsoft and Netflix? Well, the same thing: if it works for them, it doesn't mean it will work for you. Also, I guess there's this survivorship bias in the machine learning stories we see. There'll be a lot of companies which used machine learning and failed, and we won't hear their stories, because the media has no incentive to popularize them and they don't have any incentive to popularize them either. Machine learning can work, and it has worked in certain scenarios, but you should think about your own case. But doesn't it hurt to try? Well, it depends on who is hurt, right? If you are some e-commerce company and you have a fancy machine-learning-based pricing model and you blow yourself up — well, you blow yourself up, no one else is hurt. But if you are, say, putting people in jail based on machine learning, I'd be very scared, because you are putting innocent people's lives in danger. All models are wrong, but some are harmful. Some further reading: I really love this guy, Nassim Nicholas Taleb; I would recommend you read everything by him. There's this book called Weapons of Math Destruction by Cathy O'Neil, and she talks about a lot of areas where people have tried to use mathematical models, including machine learning, in scenarios where they have done a lot of damage to people. And there are some other things which you can read. Okay, thank you. Questions? Questions? Yeah. Raise your hands where I can see them. Nope. Okay, what's that? Where?
I see that what you are telling us is that the data should be generated in the right way, and you're also telling us that complicated systems may sometimes not work with machine learning, right? No, I'm saying that you won't know whether it is working or not. You'll think it is working, but it probably isn't. But in the context of this talk — what cannot be done through machine learning — you are saying that it's kind of risky, or that machine learning may not work for many things, but will these not pose challenges that, okay, we have to resolve? I hope my question is clear. So, okay, to paraphrase your question: don't we see these things as challenges to resolve instead of things which you cannot do? So take the cryptography example, right? For that system you don't have enough data at all, and you will probably never have enough data. For very complex systems there's this result called the no free lunch theorem, which says that if you assume anything is possible in the underlying system you are trying to model, you cannot have any predictive power at all. I guess that also applies to complex systems, because complex systems are so complex and the underlying functions can be so varied that you might not have enough data. So I think of it this way: the problem is you might think that things are working, but you have no proof that they are, right? You can do field testing, go out there and actually see if things are working — that's different — but from the data alone you can always find patterns which seem to be there but are not true. Does that resolve your question? Any more questions? Okay, thank you. Next up, we have the paper discussion. I see that's in demand. Why don't some of you who are standing at the back take the seats which are remaining? All right, we have a paper discussion up next by Ashwin, Subha and Smoot. They'll be discussing attention mechanisms and machine reasoning. Testing, all right. So after the contrarian point of view on machine learning, we will bring our attention to attention. All of you are so tired, aren't you? All right, so the format of the session is as follows. This is a paper discussion session, so all of you, if you have a mobile phone, go to the talks funnel — you just go to the link, you can click there and you can look up the paper. And what will happen is we'll introduce ourselves, then we'll start off with the first paper and the second paper; we'll do a quick introduction of the main idea of each paper and then we'll open the floor to everyone. In the meantime, we'll also do an introduction in terms of what mechanisms are out there for attention, and then we'll take questions and continue the discussion. So that's the big picture. All right, so my name is Smoot. I lead a group doing machine learning, NLP and computer vision at Soliton, and I run my own consulting firm as well, called AutoInfo. And over to Ashwin to introduce himself. Hi, my name is Ashwin and I work at Fidelity, where I lead a group on cognitive computing. And Subba? Yeah, I'm Subba Swilnendran and I work at Soliton Technologies. I have been working on image classification and NLP related tasks for two years. All right, cool. So we'll start with the hierarchical one. Sure.
So how many of you here have used recurrent neural networks? I mean, have you heard of attention? Okay, so I don't need to actually explain what attention is, then. In brief: whenever we try to answer a question, we focus on things around us, re-evaluate our thinking based on the question, and then we try to answer it. That's the basic concept of attention. The way it works in a recurrent neural network is that you have two RNNs, and you take the output of one RNN and use it to focus the attention of the other RNN. That's how it works at the most basic level. Now think of it in terms of, say, answering questions. I am having a conversation with you and I need to understand, say, what the context of this conversation is, or how to capture the context of this conversation. This can be done using something called hierarchical attention networks. This basically takes multiple RNNs and stacks them on top of each other. Consider, say, a conversation or a chat that you are having with someone: this series of chats is actually a series of sentences, and these sentences can be broken down into words. So if you look at it from a very high level, your overall conversation itself is a context, your individual sentences are nothing but utterances, and your sentences are made up of words. What you finally end up doing is you build one RNN to encode your words, you train another RNN to encode your utterances, and finally you have the context which comes out, and you have the decoder which decodes your responses according to the context that is available. And throughout this you have a feedback signal which is nothing but your attention. So what's the advantage of this attention? The main advantage is that at each point — say even at the context level — you can say that, based on this context, these are the utterances or words that are interesting; and based on your utterances, you can say that these are the words that are interesting. So at its most basic level, attention opens you up to interpretability in neural networks. Now you can actually explain why your neural network's output is the way it is, because of these attention mechanisms. And that's where the next step comes in: if I know why my neural network behaves this way, can I actually change my neural network to capture machine reasoning? Can I use these attention mechanisms to capture how I can train my machine to do reasoning? And this is where the second paper comes in. So we'll talk more about the second paper. Yeah, first I'll explain the reasoning problem it takes on. The problem here is that we are given an image and a question about that image. Let's say we're given a photo with four or five people standing, and you ask a question like: how many people are standing, or how many people wearing a particular T-shirt are standing? So this needs attention of the question towards the image; this is the kind of reasoning problem we mean. And most of the neural networks we have used so far will not perform well on this reasoning task, because they treat the question as just a series of words. But this paper introduces a concept called MAC cells, which attend over the question. So we first look at a part of the question, attend to the part of the image related to that part of the question, and then answer the question.
And it will keep on repeating that. So it's similar to an LSTM or GRU kind of recurrent cell, and in each iteration it attends to a particular part of the question and then gives the answer. So I'll just give a little more detail with the example here. For example, you can ask how many people are sitting on the dais here, or how many people are in front of the violet-coloured — I mean, the person wearing the violet-coloured shirt, right? So here it's not a trivial question like "how many people are there with this particular colour?"; there's a dependency of one piece of information on the next one, right? And that's where the challenge is. This particular paper comes up with an idea of how to solve such non-trivial visual question answering. There's a whole area called visual question answering, and this particular kind needs more reasoning than simple immediate reasoning, right? That is where the challenge is, and this particular paper comes up with a new architecture which allows you to do that in an efficient way. A few of the related works to this paper: any traditional RNN or LSTM by itself fails miserably on this one, because there are two different inputs. So a few people have tried to encode symbolic structure along with the inputs: instead of giving only the image and the text, they add more symbolic information, like how many objects are mentioned in the question, and some expert-crafted information, as input to the network. And only those kinds of networks which use this extra symbolic information seem to perform well; there was no particular network which takes only the image and the text and can perform well. So these guys introduce these cells called Memory, Attention and Composition cells, which can do this kind of iterative reasoning without any handcrafted supervision. So, the details. One thing to mention about this whole MAC structure is that they derive the whole thing from design principles of computer architecture, so most of the terms will be related to computer architecture terminology. There are three important units to this architecture: the first is the input unit, then the MAC cell, and then the output unit. The input unit converts both inputs, the image and the text, into a representation that can be given to the MAC cell. First it takes the image and converts it into some feature representation using a ResNet architecture, and those are taken as the features from the image. Similarly, the question text is encoded with word embeddings first, and then a bidirectional LSTM is used to get the word representations. That is the input unit, and these two inputs are fed into the MAC cell. The MAC cell is the main component of the paper, and there are two states in the MAC cell. If you're familiar with LSTMs, you know about the memory state; similar to that, here we have two states, a control state and a memory state. So the big picture here is this: you've got the input and the output layers — specific kinds of inputs and outputs — just like a normal RNN, but you have these MAC units which are tied one after the other, and that is the big picture. We'll talk a little bit about the input and output in a moment, but this MAC cell is the real key component, and the MAC cell itself has three units.
Those three units are the control unit, the read unit, and the write unit, and in that sense the cell is similar to, say, a Turing machine. That is the basic idea here: how do you simulate some of these computer-architecture principles in order to do reasoning with these networks? So, given an image and a question, what they do is translate that information and give it to the MAC unit so it can reason over it. What is really interesting is that if you have your phone with you and look at the visualizations in the paper, you can see how it is very precisely able to narrow down and show, when you ask a particular question, for example "who is in front of something?", that the attention is actually on the thing you asked about. For example, if you ask "how many people are in front of the person in the violet-coloured shirt?", it will show that during one part of the question the attention is on me, for example, because you are asking who is in front, and during the next part of the question it will show both of them. That is the interpretability that comes out of the network, and it is very, very interesting. Now, in terms of the network, let's talk a little about the basic building block, the MAC cell. Say you are asked a question: how do you typically respond? You think about it, you access your memory, you go back through your knowledge base, and you see that you have some specific information that might be relevant to the question. You take that information, compare it, realize it may not be enough, add more information to the memory, and try to answer again. That is basically what happens in every MAC cell; you chain a bunch of these MAC cells together, and each one keeps refining the knowledge against the question every time. It is just like how we reason: divide the problem into many small pieces, reason about each of them, and then use the previously solved sub-problems to combine everything together in the later stages. That is essentially what they are doing, and to do that, each cell has control, read and write. For example, take "in front of the violet shirt": at first the network does not understand the meaning of "in front of", then it has to; it has that in text form, it has to connect it with the visual input and work out that "in front of" means the things that are in front, and it might also have the context that here you are looking for people. The network automatically has to figure out that that is what you are looking for. So it is breaking the question into many different pieces and using reasoning along with that. Again, there are three basic units in a MAC cell: the control unit, the read unit, and the write unit (a simplified sketch of one full control/read/write step is given below). The control unit applies attention over the given question: it first sees the entire question and then applies attention over it.
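Before going through the read and write units in detail, here is a minimal sketch of one control/read/write step. It is heavily simplified relative to the actual paper: the gating, the extra projections, and the self-attention over previous steps are reduced to the bare attention pattern described in this talk, and every dimension, initialization, and layer choice here is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MACStep(nn.Module):
    """One reasoning step: control attends over question words, read attends over image
    regions conditioned on control and memory, write folds the result into memory."""
    def __init__(self, dim):
        super().__init__()
        self.ctrl_score = nn.Linear(dim, 1)     # control unit: scores each question word
        self.read_proj = nn.Linear(dim, dim)    # read unit: projects memory for interaction
        self.kb_proj = nn.Linear(dim, dim)      # read unit: projects image regions
        self.read_score = nn.Linear(dim, 1)     # read unit: scores each image region
        self.write = nn.Linear(2 * dim, dim)    # write unit: new memory from old memory + read

    def forward(self, word_states, knowledge, control, memory):
        # --- control unit: attention over the question words ---
        ctrl_attn = F.softmax(self.ctrl_score(word_states * control.unsqueeze(1)), dim=1)
        control = (ctrl_attn * word_states).sum(dim=1)                      # new control state
        # --- read unit: attention over image regions, guided by control and memory ---
        interaction = self.kb_proj(knowledge) * self.read_proj(memory).unsqueeze(1)
        read_attn = F.softmax(self.read_score(interaction * control.unsqueeze(1)), dim=1)
        retrieved = (read_attn * knowledge).sum(dim=1)                      # information read from the image
        # --- write unit: update the memory state ---
        memory = self.write(torch.cat([memory, retrieved], dim=-1))
        return control, memory, ctrl_attn.squeeze(-1), read_attn.squeeze(-1)

# chaining a few steps, as the recurrent MAC cell does
dim, b, n_words, regions = 256, 2, 12, 49
step = MACStep(dim)
words = torch.randn(b, n_words, dim)        # encoded question words (from the input unit)
kb = torch.randn(b, regions, dim)           # image region features, projected to dim
control = words.mean(dim=1)                 # stand-in for question-vector initialization
memory = torch.zeros(b, dim)
for _ in range(4):
    control, memory, q_attn, img_attn = step(words, kb, control, memory)
print(memory.shape, q_attn.shape, img_attn.shape)   # (2, 256) (2, 12) (2, 49)
```

The attention vectors `q_attn` and `img_attn` are the kind of thing the paper visualizes, and they are what gives the interpretability discussed here.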
So the control unit attends to a specific part of the question, takes that specific word, and passes it to the read unit. In the read unit we also apply attention, but this time over the image: using the output from the control unit, which is the relevant part of the question, we focus on a part of the image. So if "in front of" is the part of the question being attended to, the corresponding part of the image is focused on by the read unit. These two are then given to the write unit. The write unit is the one that updates the memory state: using the information from the control unit and the read unit, it updates its memory state, and from that memory state we can come to a conclusion for the question. So if "in front of" is the phrase focused on by the control state, and the matching region is the part of the image focused on by the read unit, the write unit can store what is in front of that particular person, and the network can answer the question. The whole memory is held in the memory state maintained by the write unit. That summarizes the basic working of it, and the paper describes nicely how it is applied to VQA. What they don't talk about is how you apply it to, say, text: how do you apply it to NLP, how does it translate? That is a very good open area of research: these MAC units could be used for concept reasoning in text rather than in images. There is a lot of work done on images, and I'm not belittling that, but text is that much harder to get context out of. So that is the direction I am working on: how to apply these MAC cells to NLP tasks. Another important aspect is interpretability. Since we are using attention at each stage, when the question is given as input there is an attention mechanism there, and when you are looking at the images there is also an attention mechanism, so it gives some interpretability to the whole network. If you try to visualize the attention, the paper gives many examples of such visualizations: based on the question, the model attends to the relevant part of the image very well, and that addresses one of the common criticisms of deep learning, namely whether you can explain what the model is doing.

So we'll start taking questions and go from there. Yeah, go ahead, at the back. Could you say that loudly, we cannot hear you. I was wondering where you would use these models in real business; could you give some examples? Oh, sure. Okay. So, this is mostly an experimental setting: there is a dataset called CLEVR, and the experiment is to find out how well a deep learning model is able to reason about things. Yeah, this particular one is about visual question answering, right? So, for example, one of the problems that we worked on earlier was questions in school curricula, where there is a question that uses an image, reasoning with an image. Now, that is for this particular paper, but in general attention is used in a lot of places.
For example, and I'm taking this back to one of the projects we worked on: you have a question and you need to find the answer, and the answer could come from a chapter in a textbook. You can use attention mechanisms to figure out which narrow area of the book the answer is actually coming from; attention helps you figure out, in text, which region of the book we are looking at. If you just do something simple, say word embeddings and a plain cosine distance on those embeddings, that does not work well: all it is going to look at is a few keywords and the frequency of their usage. You need more sophisticated mechanisms, and attention actually helps quite a bit there (a toy sketch contrasting the two kinds of scoring is shown at the end of this exchange). We have ourselves used it in production for these services, so that's one example I can talk about. Another thing it can be used for is abstractive summarization: you have a huge amount of text, say a legal document, and you want to summarize it and send it out as short as possible. Attention is very good for that; there is a particular line of research here called pointer-generator networks, and that can be used for abstractive summarization.

Can you hear me? I was thinking that this could be a control problem: how much to attend could be a reinforcement learning problem rather than a supervised problem. So the specific question is, how do we train the model? For conversational networks, and I have not worked on MAC specifically, but for hierarchical networks you can use your entire conversation history itself as training data: the last line in the conversation is the response, and correspondingly you keep building on top of that.

There was a paper called Neural Turing Machines; is this concept really much different from that paper? There are some differences, in terms of how the units are laid out. In the Neural Turing Machine, importance is given to the computation unit itself: it is able to simulate computation. Whereas this is more about information extraction from the previously stored state. For example, a Neural Turing Machine style question could be: John gave two apples, then three apples, then four more apples; how many apples do you have in total? That is the kind of question a Neural Turing Machine handles. This, on the other hand, is about interpreting a given image and understanding what particular things are in it; it is more about extracting information, using two modalities and figuring out how to combine them. So there is a slight difference there, but in general, yes, they are both from the general area of reasoning. Now, maybe going back to the question he was asking: it is slightly different from the reinforcement setting. In the reinforcement setting, you take an action and, because of your action, the world changes. That does not quite happen here: your input and your output are given, so it is really a supervised classification setting; it is really not a reinforcement setting.
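Coming back to the point about attention versus plain cosine similarity for finding answers in a chapter, here is a toy sketch of the contrast. It is illustrative only: the sentence and question vectors are random stand-ins and the scorer is untrained, so it shows the mechanism, not the quality. The idea is that a learnable attention layer can be trained from question-answer pairs to decide what "relevant" means, whereas cosine distance over bag-of-embedding vectors is fixed and largely keyword-driven.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, n_sentences = 128, 40
torch.manual_seed(0)
question = torch.randn(1, dim)               # encoded question (stand-in)
sentences = torch.randn(n_sentences, dim)    # encoded chapter sentences (stand-ins)

# fixed similarity: cosine between the question and every sentence
cosine_scores = F.cosine_similarity(question, sentences, dim=-1)     # (n_sentences,)

class AttentionScorer(nn.Module):
    """Bilinear attention scorer whose weights would be trained from QA pairs."""
    def __init__(self, dim):
        super().__init__()
        self.bilinear = nn.Bilinear(dim, dim, 1)
    def forward(self, q, sents):
        logits = self.bilinear(q.expand_as(sents), sents).squeeze(-1)  # (n_sentences,)
        return F.softmax(logits, dim=0)                                # distribution over sentences

scorer = AttentionScorer(dim)
attn = scorer(question, sentences)
print("cosine top sentence:", cosine_scores.argmax().item())
print("attention top sentence:", attn.argmax().item())
```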
Coming back to the reinforcement question: you could potentially put this in a reinforcement setting if you set up a very complicated state where you are asking questions to reach a goal and so on, but it is so complicated that no one really models it that way today.

I was wondering about the image-based questions: what is the dataset, because I don't have the paper in front of me; is it images and questions about the images? Yes, it's called the CLEVR dataset. Oh, the CLEVR dataset. You have images and questions regarding the images, and mostly these are computer-generated. There will be multiple shapes, and the questions will be like: what is the circular object near the rectangular object in front of the cylindrical object? Questions similar to that, so mostly they are compositional questions.

Just to give some context here: there is a class of simple visual question answering problems where you have an image; this was done maybe in 2015 or 2016, around the time of Karpathy's work, I don't remember exactly. There you give an image, and these are normal images, real-world scenes, and you ask: what colour is the shirt of the person holding the bag? And it will tell you it is pink, or whatever it is. Those were the initial models, and they were only answering very, very simple questions, where you ask a relatively straightforward correlational reasoning question. In this particular work, they take computer-generated objects, created in a virtual environment, and move them around in very complicated ways. Because they are simulating it, they know the ground truth, and on top of that they ask these questions: what is in front of what, and so on. But the questions are much harder here because of the compositional nature we talked about. So that is the data we're talking about.

That goes back to another question, now slightly outside visual question answering, to the general problem of question answering. I forget the exact name of the dataset; it is from Allen AI. For simple question answering in text you get decent accuracy, say 70 to 80 percent depending on the dataset. But they generated a dataset where you had to do a little more reasoning, and on those questions the algorithm accuracies came down quite drastically. Some of the best models, which were at 80 to 85 percent accuracy elsewhere, had something like 20 to 29 percent accuracy on these questions, which are essentially multiple-choice questions with A, B, C, D options; even on those, the accuracy was around 20 to 29 percent. So you can imagine how much of a difference the type of question makes. The focus in question answering has been moving, over the last year or two, to questions that are somewhat trickier than the simple ones, and remember, we never even thought we could ask the first set of questions, the "who is holding the bag" kind. If you had asked me seven years back whether algorithms would be able to do that in five years, I would never have said yes.
But it actually did happen: we are able to do that simple correlational reasoning across two modalities, from text to images. And now we are going to the next level and asking, can we do slightly more? That is where we need slightly more complicated mechanisms for reasoning, and that is the context of this particular paper.

I was actually listening to one of the talks given by LeCun, and an idea like this was being discussed, for that kind of question: how many people are standing wearing a violet T-shirt, or whatever. The question is passed to a neural network that generates another neural network on the fly, and then the image is passed through that generated network to get the answer. Okay, I'm not sure which work that is. I wanted to give you the citation, but it is about dynamically generating: the neural network outputs another neural network, and the image is passed through that to get the answer, like how many people are standing with a violet T-shirt; are you aware of it? I'm not sure about that particular paper; were they really generating a neural network? I'm not sure how that works. It is generated on the fly based on the question you have asked. Yeah, I don't know it. And I think that idea can be generalised for NLP-related things as well, not only images but also question answering on natural language. Yeah, I don't know; maybe we can catch up offline and talk after the session.

All right, in that case I have a question for you: there are people here who are using attention, right? What do you use attention for? I'm curious to know, and what are your success stories or failure stories? People who are using attention, yeah. We used it in object detection; the object can be text as well as other kinds of imagery. Okay. And the results really improved compared to the normal object detection approaches we used before. Okay, and what kind of attention were you using? An R-CNN-based RPN is one such example, where we use a region proposal network to determine where exactly our objects are. We needed to modify it slightly, but it worked pretty well compared to similar object detectors and others. I see. Others? Attention, okay. I didn't use it, but I'm wondering about the kind of computational savings you get from looking at only part of the sentence rather than the entire sentence; what kind of gains do you see? It's not about computation; it's a question of how you want to respond. You're looking at an entire sentence, but some of the words may not actually be useful when you're generating the response. That was exactly the question: rather than looking at the entire context, you're localising the context, so you must be getting some kind of advantage. So, yeah, there's a very interesting point here, in the sense that attention itself comes from an area in psychology called visual saliency.
The idea, essentially, is that in the human eye we have a small area called the fovea, which is a very high-resolution region. When I look at you, for example, most things are blurred for me; I am really only seeing this one region sharply. Saliency is essentially the mechanism for knowing which region my fovea should attend to next. There were some old papers, by people like Laurent Itti and others at USC, for example, who built those kinds of models; they are not deep learning models, but they were used, and there they did actually use saliency for computational savings. But I am not aware of a neural network setting where it really reduces computation by much. There may be some improvement, but that's not the real reason for using it. Currently what it is used for, at least for me, is interpretability: when I'm having a problem, I can go back, backtrack, and see where it is going wrong, especially in text, where it is harder to reason about what the model is doing. The interpretability is what really helps. And it may not necessarily be computationally more efficient: in hierarchical networks you have something like three layers of RNNs, so that may not actually be much better than a sequence-to-sequence model doing plain machine translation. So it's not always computationally efficient, and the MAC may not be computationally efficient either, because you're stringing together a bunch of MAC cells. Sometimes it's also about the focus: there's so much work happening that it's difficult to keep track of, but there is a lot of work, for example on model compression, where they use such techniques and compress models quite a bit. These particular papers are not focused on that; they're really about pushing the barrier in terms of reasoning, and that's the focus here. So no, this particular one doesn't reduce the computation. Although in the paper they do say it converges very quickly, with much less data compared to earlier models, so you can get an advantage there. Yes, fast convergence; in terms of the total computational flops you spend, maybe that is better, so that's a very interesting point. Any other questions?

Here's a thought experiment: you're talking about images, but what if you have, say, a three-second video clip, which is a collection of images over three seconds of time? What would be the level of complexity there? It's tough for me to even imagine, so I just wanted to bring that out. So what kind of problem do you want to solve there? For example, I have a ball in my hand and I just drop it, and the three seconds cover the ball falling to the ground; can you ask what happened to the ball, or something like that? That's a very interesting question; it might actually be a future extension you could do. There has been work on video in general. Video question answering I have not seen so much, but with video and object detection and so on there has been work, including work on how to do it efficiently.
Because the moment you go into video, there is a lot of reuse of information across frames, and exploiting that efficiently is actually still an open research problem. But about visual question answering on video, I don't know of anything. Subha? I think you can have attention over video frames; it should be a mechanism similar to what you see in self-driving cars: you're driving and you have these video sensors coming in, but you don't know whether the next car is actually going to cut into your lane or not, and that's a question you want to answer. I do have some friends who work in self-driving, and my general understanding from what they have told me is that most of them still work only at the frame level. They do the computation per frame and hierarchically propagate that information to keep track of where am I, what direction am I going in, what objects are in front of me; you have a scene that you create and update over time, but you don't look at a series of frames together in order to build any understanding. It's always done at the frame level, because looking at frames together requires quite a bit of computation, which is why people usually don't do it. There was work before deep learning where they did a few things like, if I remember correctly, tracklets and so on, where they used some of these ideas for certain purposes, but that is more a research problem than a core, application-specific thing. So I'm not really aware of reasoning across video, because, for example, answering "did the ball drop?" after watching the clip means I need to look at all the frames together in order to do that reasoning. Any more questions?

So here's a question about a question. I came in late, so maybe I should not be asking this: is this the session where we ask what machine learning can and cannot do? We are on attention at the moment, but at the end I can take your question. Anything more on attention? Okay, I think we're done with attention, maybe; then I can take your question. Yeah, please go ahead. No worries, hold up the mic.

I like fitness apps, so can a machine learning model running in the browser or somewhere detect what sort of exercises I'm doing and count them for me? Okay, interesting. So the answer depends on what kind of information you're going to give it. If you're asking whether, given the accelerometer data alone, it can predict what exercises you're doing, that's a fairly complex problem, and I don't know of any app that does it with just that. Fitbit and the like do it to a certain degree, but I don't know anyone who does it very well. Are you aware of the device called Moov? Yeah, I've used that too. So that does it; if you've used it, how is it? It doesn't do it very well. Okay. That's why; there are a couple of other apps I have which can count the reps, but still. Okay, so that is with one modality. I know someone who started a company, actually in Bangalore itself, who looks at video data in order to understand what kind of exercises you're doing, and in that particular case I think he's more interested in giving feedback about how correctly you're doing the exercises.
And I think they did have some constraints: they were doing it mostly with yoga, with certain constraints on how you stand and so on. You set up your camera, stand in front of it, and do the exercise you're supposed to do, and it gives you feedback saying that you're doing this incorrectly, or you're not quite there, et cetera. But again, I don't know if you can solve it with a single modality. At least my personal feeling is that it takes a combination of multiple modalities; you need to combine them all to solve it. The reason, again, is that when we have used accelerometers together with vision sensors, for example in augmented reality, you see that a lot of sensors have drift: their readings are no longer correct relative to the origin, which means the readings you're getting can't really be trusted much. So you have to account for the uncertainty in those readings and also in the visual data, because the algorithm might be throwing you off too. Now you take these multiple uncertainties, put them into a model, and keep track of all of it; that's still a pretty hard problem. So unless you're ready to wear one sensor on one hand, another on the other hand, and a bunch of sensors all over your body, it probably won't work well. I have an advisor who actually runs a startup where the only thing they want to do is measure the number of bites you take, and that actually works quite well; I've used it, and I'd say it works quite well. But that's pretty much the only case where I've seen it work well; for the others, my feeling has been that it's not that great.

There is some research work you can look up, I think it's on the MIT website, and there's a startup: they basically use your indoor wireless networks to detect your movement. Right now they're using it for elder health care, I believe. Based on the wireless signals, they can track how your body is reacting, where your body is, whether your heart rate is slowing down or not, and they can trigger alerts; for elderly people who are at home alone, they're able to do that. My quick story is that I was making games where you work out for 30 seconds and then play the game for 30 seconds, like a HIIT game. I see, interesting.

There's one more similar question if you want to take it. Last question, probably, yeah. Hi, just for confirmation: can you use an attention mechanism for autonomous driving or autonomous drones, as well as for speech recognition? Speech, yes; you can use the same kinds of models, trained on speech. And can you use it for autonomous driving or autonomous drones, for self-driving cars? So there is a huge amount of work in general on autonomous driving. There are particular models that use attention in various settings, for example deciding which frames to pay attention to, and there is other work where, given an image, you decide which regions to focus on. In some cases, I'm aware of one particular piece of work where they figure out what in autonomous driving is called the navigable region, alongside obstacle regions and other miscellaneous regions, and so on.
And for that, for example, I've heard of people using attention mechanisms. Again, it's a mixture of many different things, not attention alone: you have attention along with sensor data, meaning lidar data and a whole bunch of other sensors, so it ends up being a combined model of a lot of different things. All right, we can take one last question.

It's about taking photographs of food items and figuring out what you're eating, what's in that food item, maybe how many calories are in there. How easy or how hard is that to solve in the Indian context? Okay, I don't know about the Indian context specifically. There was work at the University of Wisconsin where they were trying to do this: you take a picture and it works out what the ingredients in the food are, and from the ingredients, or if you know the dish type, you roughly know the calorie count. One of the usual problems there is volume: how much volume is there? Estimating that is on the trickier side, and that is one problem I'm aware of, but people are trying out various things. The real question here, I'll put it this way, is how you define your business problem. If you define it so that you need to solve the whole problem in order to be useful, it may not work. But maybe what you want to do is verify what the user is doing: the user already enters "I'm having ice cream" and takes a picture, and the task is really verification. So I'm going back to my morning talk: the definition of your business problem matters a lot. If you pose very hard, open vision or NLP problems, they are very hard to solve; the narrower and tighter you can make the problem, the more functional and usable your application will be. That's one thing I would recommend. Sorry, yeah, I'm just reminded of that episode of Silicon Valley.

Yeah, there's one more question there, I think. I was wondering if there is a way to flip the attention problem: instead of having the model figure out where to pay attention, can we have the annotator signify where the attention should be, and use that as a prior so the model learns something better or faster? In an image, if I can point out where the saliency is, are there any mechanisms for doing that? I haven't been specifically looking for something like that, but I haven't come across it. So, Laurent Itti's group at USC does quite a bit of work on attention, and I have actually used their algorithms; I would say they're pretty reasonable. These are all computer vision, non-machine-learning algorithms: they are hard-coded intuitions of psychologists, user-tested in psychology labs. I've seen them be quite useful, but at the same time they're not foolproof; there are plenty of cases where they break. Now, I'd actually have to go back and read up on whether people are using those kinds of mechanisms as priors, because this used to be my big qualm: they call it attention, but it's not really attention.
It is attention in a sense, but not in the saliency sense of the word: it's not doing what humans do, because what humans do is figure out which region to pay attention to next, whereas here it is deciding which things should be given the most importance. That's slightly different from that notion of attention, though it's still a really important mechanism: what should I weigh more than other regions? So I'm unfortunately unaware of work like that; I don't follow it up, actually, so that would be a good takeaway for me. Yeah, I've been looking for that but haven't come across anything; that was the other thing I had in mind. Maybe the keyword to look for is saliency; that's what it is. I tried looking at CAMs. You can also search on a website, which you might already know, called Arxiv Sanity; you could try searching for saliency on Arxiv Sanity. Yeah, I tried saliency, I tried CAMs, and so on. Okay, so maybe the speaker from today who spoke about interpretability in models might be a good person to talk to; he works quite a bit on that side of things, so he might know, or we can take it offline and I can tell you.

I don't know if you've seen it, but there's a new implementation of Capsule Networks which was released, I think last week or the week before. There they have actually shown, in an image, based on annotation, how the Capsule Network is able to zero in on exact spots of the image itself. It's a very first cut of Capsule Networks, but it does look very promising. I think this is the first implementation they've done, and they've released it.

Thank you all, I think we are done; we are out of time. Thank you. A couple of announcements before you leave: please drop your feedback forms at the registration desk, that helps us improve the conference. The videos from this conference will be available on the YouTube channel starting next week, and if you want to access the speaker slides, they are available at anthillinside.talkfunnel.com. The videos will also be available on hasgeek.tv. Good evening, everybody.