Hi all, I am Abhishek, CTO and co-founder at Infillect. First of all, thank you to ODSC for inviting me here to talk about some of the work we are doing at Infillect. Today I am going to talk about a core problem that we face as engineers: having access to enough labeled data. Most often, because we do not have enough data, we do not end up training models that reach very high F1, precision, or recall. So I am going to talk about a technique that can take you from a small set of labeled data to a large set of labeled data. Before I begin, a quick show of hands: how many of you are deep learning engineers? Okay, perhaps around 30 percent. How many of you work on computer vision problems? Okay, the same set. We all know that the deep learning stack has three main building blocks: data sets, algorithms, and architectures. If I ask you to rank these three from most important to least important, how many of you would put architecture at the top? None? Okay, one person. Hopefully by the end of my talk I will have convinced you that the data set is the thing that has the most impact on how we solve problems. And it is not just me saying this; pioneers in deep learning talk about having not just more data, but better data. We know that more data trumps algorithms, and better data will trump more data. So I am going to talk about a technique that can get you better data sets. If nothing else, at the end of my talk I want you to leave with three takeaways. First, that real-world problems are messy and chaotic.
If you have worked on more than one computer vision problem, you know that problems differ from one another in schema, in the number of classes, and in the number of instances per class. There is a lot of chaos out there. I am going to talk about a technique which, given any kind of data set, helps you transform it into a data set that is near optimal, and hopefully we will see that this technique gives us a decent bump in accuracy. Taking a step back to the three building blocks I mentioned at the beginning: data sets, algorithms, and architecture. No matter what the press says about AGI and robots taking over the world, we know it is all a game of taking a configuration of weights from one state to another, and all three of these blocks help us tune those weights. In the end, we want to arrive at a set of weights that gives very high precision and recall on our test set, so these three building blocks are the parameters we can use to tune our weights. I am not going to talk about algorithms or architecture; I am going to talk about data sets. I will assume the algorithm, say backprop, is fixed, and that the architecture used to solve the problem is fixed, and we will see how to tune the data set so that the weights move from one state to another and give very good accuracy. Roughly, there are three ways to build a data set: collect more data, synthesize more data, or augment data. Each has pros and cons. Collecting data, if you take the example of self-driving cars, is quite expensive: you have to drive the car all over, and it takes a huge effort to collect data covering the conditions of Indian roads.
It is also expensive to annotate this kind of data, and sometimes it is just not possible to get data at all. If you work with enterprise customers, or with different teams inside big companies, you will understand that it is sometimes hard to go to another team and say, please give me 10,000 more photos so that I can train my models. It just becomes hard to get data. Next, synthesis. We all know about GANs; they have been used to synthesize images, and I think we all know they work pretty well in constrained settings. Take human faces: GANs are pretty good at generating fake human photos. But beyond that realm, in unconstrained settings like Indian roads, it is just hard to produce pictures that look real. It is also quite expensive: a GAN can take anywhere between two days and two weeks to train to the point where the photos it generates look realistic. And there are recent studies suggesting that the pictures GANs generate do not really model the underlying distribution, so you may end up training your models on data that is not a true representation of the underlying problem. So, as deep learning engineers, all of us apply augmentation techniques. Those of you who raised your hands for computer vision plus deep learning: how many of you do not apply augmentation when you solve classification or detection problems? None, right, so all of us apply it. We know it is a fast, inexpensive technique. But the part where we most often struggle is choosing the augmentations. If I am working on the problem of classifying different types of cars on Indian roads, how do I choose? Should I go for rotation, blur, or noise? Should I avoid color transforms? We end up making intuitive guesses.
We pick a few augmentations and we get some kind of lift in accuracy. But it leaves us wondering: are there optimal augmentations we could apply to the underlying data set that would give us the best performance? That is exactly what I am going to talk about: finding a good augmentation strategy that removes this trial-and-error approach. I am going to take the example task of detecting different home decor objects. Here you see a few example objects: TVs, bookshelves, tables, lamps. For this particular example there are around 15 classes and we are given 5,000 photos, and there is some imbalance in the data set: objects like tables or beds appear frequently, while objects like bookshelves or lamps may not. We want to train a detector with an F1 of perhaps 85% or more. If I train a detector, say RetinaNet as a single-shot detector, and apply no augmentations at all, this is what I get: an overall F1 of about 70%, and a lower F1 for the classes without enough instances, here around 51%. Now, what do we as deep learning engineers do? We choose a set of augmentation techniques, as I mentioned earlier. Here I am showing an example of a color augmentation: there is an input image, and I have applied color inversion. The inversion magnitude has a range of 0 to 1; I have chosen 0.5, and this is the output: my input image is transformed into this output image. Another example is geometry augmentation, where you can apply crops, rotations, and shear. Here is one example of a crop, where the chair is cropped.
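To make the magnitude idea concrete, here is a minimal sketch of a color inversion applied with a magnitude in [0, 1], as in the slide. The blending formula is my own illustration of how a continuous magnitude could parameterize the transform, not the speaker's actual implementation.

```python
import numpy as np

def invert_color(img, magnitude):
    """Blend an image toward its color-inverted version.

    magnitude=0.0 leaves the image unchanged, magnitude=1.0 fully
    inverts it, and 0.5 gives the half-inverted look shown in the talk.
    The linear blend is an illustrative assumption.
    """
    img = img.astype(np.float32)
    inverted = 255.0 - img
    out = (1.0 - magnitude) * img + magnitude * inverted
    return out.astype(np.uint8)
```

The same pattern, one scalar knob controlling the strength of a transform, applies to contrast, rotation angle, shear, and so on.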
And the last example is bounding-box augmentation, where you focus on the object inside the box and apply augmentations to it. So suppose I choose color, resolution, and shear as my augmentations. I get some bump in F1: the overall 69 increases to 77, and the F1 for the class without many instances increases from 51 to 53. I am still not satisfied with these scores, so I change my strategy and apply rotation, noise, and shear instead. Now my overall F1 has decreased, but the F1 for the class without many instances has increased. This still leaves me wondering: are there other augmentation strategies I could apply to take that 69 to at least 85%, which was my goal? Let me also mention a few other quirks. Compared to classification, in detection it is not easy to solve the class imbalance problem. Imagine a data set where one object appears in every photo: that object will always dominate your distribution. If you have played with object detection architectures, you know that if an object always appears at the same size and location, only a few anchor boxes get trained; at test time, if that object appears at a different location or size, the architecture falters. And of course, how do we model the background class? We want to ensure there are not many false positives. Then there are other challenges: the camera sensor changes, the photo quality changes, or suddenly the compression technique in your data set changes. Also, one augmentation does not apply to all kinds of data sets.
If you have to detect number plates and you apply a flip augmentation, it changes the whole story. Similarly, if you apply a color augmentation to the image on the right, it is a recipe for disaster: turning red to green will land you in trouble. So the point I am making is that different problems demand different augmentation strategies. How do we build a technique that is built once but applied many times, and that removes the guesswork of which augmentation to apply to a given data set? Hopefully I have convinced you that this problem exists. So how do we find a technique which, irrespective of the data set, helps us do data augmentation? I envision a future where the task of choosing the right augmentation strategy is automated, which is what I have shown at the bottom. You get tagged data, you first spend time figuring out the policies you need to apply to this data, then you train a model, and hopefully in just one shot you get the accuracy you need. What happens today? We get the data, we apply some intuitions, we create an augmented data set, we train a model, we are not satisfied with the evaluation, we go back, we change a few things again, and it takes time. Hopefully this future will take less time and also give us very high accuracy. So let me start talking about the technique. There are two parts. In the first part, given a tagged data set, I run an algorithm which may take time, but which gives me a set of policies. Once I have these policies, at train time, imagine you are training and you sample a batch of, say, four images. You pick the first image; this is your input image.
You apply the learned policy at train time, so while training your model you are not doing any extra work: the policy is already learned. To the input image you apply the set of augmentations given by your policy. Here I am saying we apply n augmentations, where n can be 1, 2, or 3, and that transforms the image into the image fed to the architecture during training. What I am going to talk about now is the first part: given tagged data, how do you learn the best augmentation policies? Any questions so far? [Audience question] It could. What I am going to talk about is specific to computer vision, but there are certain ideas you can apply even to text data sets. What I will present is tailored to computer vision, but I think the technique is generic enough to apply to the kind of data you are describing. Okay, so just to get the terms right: when I say policy, I mean a set of augmentations that we apply to an image, where each augmentation is applied with a magnitude and a certain chance. A set of policies I am going to call a strategy. As I explained, at training time you sample a photo, you sample a policy from the set of policies, and once you apply that policy you get a new photo. On the same picture, across different batches, I may get different outputs. Here I have applied policy P1, and because each augmentation in a policy is applied with a certain chance, you will not always generate the same picture. Across different batches, applying different policies to the same photo gives you different sets of photos. [Audience question] You could, yes. Okay, so let me now talk about the core problem.
What is the core problem? We are given A augmentations, the kinds I mentioned earlier, from color to cut-and-paste augmentations; this could be 30 or 40 of them. In the input data set, you have C classes. What we need in the end is, given a class and given an augmentation, to arrive at a number I call the magnitude. Think about the color contrast augmentation I mentioned: there is a parameter, alpha, which ranges from 0 to 1. When I say magnitude, it means I am choosing a number between 0 and 1, and I am going to divide this whole range into, say, 10 buckets. So if I choose color contrast as an augmentation for a particular class, I am going to say: apply that augmentation with a magnitude of, say, 0.3. Similarly, I want to know with what chance I should apply this augmentation, with this magnitude, to this class; it is not that I will apply the same augmentation every time. Given a picture, I want to apply n augmentations; as I said, n could be 1, 2, 3, or 4. Essentially, given a photo and a policy, I am going to get n output pictures. Now think about the overall search space. Earlier, we would just intuitively choose a set of augmentations and a certain magnitude: if you want rotation, maybe you rotate by 20 degrees; if you want translation, maybe you translate by 10%. All of those choices we are now trying to model as part of a discrete space. This is the entire space from which we are going to sample a strategy.
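To pin down the vocabulary, here is a minimal sketch of the policy and strategy terms just defined: each augmentation step carries a magnitude and a chance, a policy is a short sequence of such steps, and a strategy is a set of policies sampled from at train time. All names here are illustrative, not the speaker's actual code.

```python
import random

def apply_policy(image, policy, rng=random):
    """A policy is a list of (fn, magnitude, chance) steps; each step
    fires stochastically with the given chance, so the same image can
    come out differently across batches."""
    for fn, magnitude, chance in policy:
        if rng.random() < chance:
            image = fn(image, magnitude)
    return image

def augment_batch(images, strategy, rng=random):
    """At train time, sample one policy per image from the strategy."""
    return [apply_policy(img, rng.choice(strategy), rng) for img in images]
```

With `chance < 1.0` on every step, some images pass through with no augmentation at all, which matches the combinations discussed later in the Q&A.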
So there is not going to be any guesswork. But this search space is very large. You have A augmentations and C classes, and for each combination of the two you want to learn two numbers: the magnitude and the chance. And since I want to do this n times on a given picture, the whole search space is huge. If the earlier bulleted list is not clear: essentially we want to learn a set of S policies. For each policy, I choose a class; once I choose a class, I can sample an image of that class from my data set; and then I apply an augmentation A with magnitude V and chance P. On each input image, I want to apply two such augmentations. That is the overall problem formulation. Since this is a search in a discrete space, we venture into control theory. Here there are four main pieces. First, we run an experiment with an input, where the input is a strategy sampled from this space. Once I have that input, I run an experiment and get some output. That output is used to tune a controller, which in turn gives me better strategies. As I spin this wheel, I can expect to generate better and better strategies, because the controller is optimized to give better and better accuracy each time. Let me jump to the main slide of my talk: how do we apply this kind of theory to the problem at hand? As the experiment, or the environment, we have an architecture, say an SSD-style architecture, where we train an object detection model. It takes a data set as input, it takes a strategy as input, and it produces accuracy as output.
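To see why this search is hard, here is a back-of-the-envelope count of the discrete space just described. The counting formula is my reading of the formulation (one class choice plus n augmentation/magnitude/chance triples per policy, S policies per strategy), and the concrete numbers are only rough stand-ins for the ones mentioned in the talk.

```python
def search_space_size(C, A, M, P, n, S):
    """Number of distinct strategies: each policy picks one of C classes
    plus n (augmentation, magnitude-bucket, chance-bucket) triples drawn
    from A augmentations, M magnitude buckets, and P chance buckets;
    a strategy is S such policies."""
    per_policy = C * (A * M * P) ** n
    return per_policy ** S

# Rough numbers: ~15 classes, ~30 augmentations, 10 magnitude buckets,
# say 10 chance buckets, n=2 augmentations per image, S=5 policies.
size = search_space_size(C=15, A=30, M=10, P=10, n=2, S=5)
```

Even with these modest settings the space has more than 10^40 strategies, which is why exhaustive search is off the table and a learned controller is used instead.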
This block should be simple to understand. Now I have this block acting as my controller; allow me to use it to explain what is happening. Each of these cells is an LSTM, with about 100 neurons inside. It takes a vector as input and produces a vector as output. The vector at the output of the LSTM feeds three softmax heads. The first softmax has an output range of C, where C is my number of classes. The second softmax has an output range of A, and the third has an output range of B, where B is the number of magnitude buckets. So we are modeling how to sample C, how to sample A, and how to sample B. The same output vector acts as input to the next block, so as you unfold the LSTM you get a sequence of blocks, and these seven blocks form one policy. From the first block we sample C, which means choosing a class; once we choose a class, we choose an image. From the output of the next block we sample an augmentation A; from the following block we sample the magnitude of that augmentation; and from the next block we sample the chance with which to apply this augmentation, with this magnitude, to this input image. Since n equals 2, I want to apply two such augmentations to each input image. So these seven blocks constitute one policy, and since I want S policies, the whole unrolled sequence has 7 × S blocks. Now imagine that the controller's 100 neurons are initialized with random weights.
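The three-headed sampling step can be sketched in a few lines of numpy. A single random linear layer stands in for the real ~100-unit LSTM; the point is only to show three softmax heads over classes (C), augmentations (A), and magnitude buckets (B), with one discrete sample drawn from each. All sizes and weight names here are illustrative.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def sample_heads(hidden, W_c, W_a, W_b, rng):
    """Sample (class, augmentation, magnitude bucket) from three
    softmax heads driven by the same hidden vector."""
    c = rng.choice(len(W_c), p=softmax(W_c @ hidden))
    a = rng.choice(len(W_a), p=softmax(W_a @ hidden))
    b = rng.choice(len(W_b), p=softmax(W_b @ hidden))
    return c, a, b

rng = np.random.default_rng(0)
hidden = rng.normal(size=100)     # stand-in for the LSTM hidden state
W_c = rng.normal(size=(15, 100))  # C = 15 classes
W_a = rng.normal(size=(30, 100))  # A = 30 augmentations
W_b = rng.normal(size=(10, 100))  # B = 10 magnitude buckets
c, a, b = sample_heads(hidden, W_c, W_a, W_b, rng)
```

In the real controller the sampled output is fed back in as the next block's input, so the seven samples of one policy are conditioned on each other.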
In the first iteration of this whole loop, we end up with a strategy that is somewhat random: it just samples some class, some augmentation for that class, some magnitude, and some chance. We feed that as input to the experiment: we have our fixed data set of 5K photos, split into train, val, and test; we train a detector; we get accuracy as output; and that is fed back as input here. What we measure is the change in accuracy, that is, the change in our reward. If my strategy was good and I am getting better accuracy, it implies that the output of the whole LSTM block was good, and I can reinforce the fact that we are choosing better strategies. Let me show you this slide. The core thing I am explaining is how you train this LSTM: where does the ground-truth information come from? As I mentioned, each block has three softmax heads, so it produces probability distributions. But how do you train it? Usually, in image classification with a softmax, we use cross-entropy as the loss function, and we always know the target output. Here, that target is generated from the reward signal: if the reward is higher, I artificially inflate the probability of what was sampled. This is called proximal policy optimization. Once I inflate the output distribution, I am in effect saying: you were supposed to reach that kind of distribution, but you are only here, so change your weights so that you produce better and better policies. Similarly, if I sample a strategy that is not good, I reinforce a negative signal.
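The "inflate the distribution when the reward improves" idea can be shown with a heavily simplified, REINFORCE-style update: push the sampled action's probability up or down in proportion to the change in reward. The actual method described is proximal policy optimization, which adds clipping and other machinery; this sketch only illustrates the intuition.

```python
import numpy as np

def reinforce_step(logits, action, reward_delta, lr=0.1):
    """One gradient-ascent step on log p(action), scaled by the
    reward change; positive reward_delta inflates p(action),
    negative deflates it."""
    e = np.exp(logits - logits.max())
    probs = e / e.sum()
    grad = -probs                 # d log p(action) / d logits
    grad[action] += 1.0
    return logits + lr * reward_delta * grad
```

Repeating this over many sampled strategies is what steers the controller toward policies that raise the detector's accuracy.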
[Audience question about the LSTM's state] This LSTM unfolds; it is just a single LSTM block, and as I mentioned, at the output I have three softmax heads. When I say there is a policy, I mean that these blocks together form one policy. I am not sure what you mean by state. [Audience question] Yeah, that is a good question. You could choose not to use it. But since I am applying n augmentations to an image: say n equals 2, and each augmentation is applied with some chance; that gives me four different combinations, so in some cases I may end up not applying any augmentation at all. Correct, yes. Okay, so let me talk about a few details. As I mentioned, this LSTM has 100 units and three softmax heads. If you end up working with proximal policy optimization, be warned: unless you figure out the learning rate and the clipping thresholds, it becomes very hard to make it work in practice. I will mention one interesting phenomenon we observed. In this data set, my train/val/test split is fixed, and I am searching for a set of policies. If the val set is fixed, then this controller block, because it is learning on its own, tries to overfit on the val set so that it always produces good accuracy. Typically, overfitting happens when you have a large model and a small amount of data. Here, because I have a small val set and this block is aggressively trying to produce very high accuracy, it simply overfits on that val set. So what we had to do was sample from the val set every time we ran an experiment.
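The fix just described, drawing a fresh validation subsample for each experiment so the controller cannot overfit one fixed val set, can be sketched as follows. The subset size and split handling are illustrative assumptions.

```python
import random

def sample_val_subset(val_indices, k, seed):
    """Draw a fresh k-item validation subsample for one experiment.
    Seeding per experiment keeps each run reproducible while still
    varying the subset across experiments."""
    rng = random.Random(seed)
    return rng.sample(val_indices, k)

val_indices = list(range(1000))             # e.g. 1000 held-out photos
run_a = sample_val_subset(val_indices, 200, seed=1)
run_b = sample_val_subset(val_indices, 200, seed=2)
```

Because each candidate strategy is scored on a different slice of the held-out data, a strategy only gets a high reward if it generalizes rather than memorizing one val set.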
One round of this experiment means that every time you run the loop, you initialize your architecture with fresh initial weights, because to evaluate a new strategy you have to run a new experiment. There, we had to subsample the val set so that the controller does not overfit. A key thing here is the cost involved. One run of training the detector takes around 150 epochs, and to find the best strategy it took us around 1K iterations of that whole loop, which amounted to roughly eight days with a certain amount of GPU compute. Is this cost worth it? Well, in the end we achieved that elusive 85%, and for the class that did not have many examples, we saw a jump of around 17%. These are some of the learned policies. I am not sure it is visible, but there are a lot of edge-related augmentations, and a certain set of color contrast values it has learned. [Audience question] Yeah, that is just what the network chose to learn; nobody fixed these magnitudes or probabilities. Starting from the data set, the controller learned to produce these kinds of augmentations. Here I am showing the same input image across different batches and different policies, and that is the output. [Audience question] That we can perhaps discuss offline. Here, for example, it learned a lot of edge detectors; there is a lot of texture, and you can see it trying to bring that texture out. There is a set of augmentations we can discuss offline. Go ahead. [Audience question] That is a good question. Some of these augmentations we also applied to adjacent problems, and they worked quite well. I was coming to that, hold on one minute.
As a deep learning engineer, I would almost always choose rotation, blur, or shear, and some of you may too. But the controller did not learn that. It instead focused on contrast, edges, image crops, or flips inside the bounding box: out-of-the-box, unusual augmentations. Considering the time it took to learn all these policies, it was also a good lesson for us as a team that we should have a better approach to augmenting our data sets, because it takes eight days, and if it is going to take eight days for every problem, it is not scalable. But it was a very good learning, an aha moment for me: what we were doing to solve problems was perhaps not right. This kind of judgment, choosing a magnitude and a chance for an augmentation, is a judgment we as humans cannot make reliably all the time. So I would say it was worth spending eight days of GPU compute. Later we converged on a better technique where you replace the controller: instead of an RNN, you can use Bayesian optimization. I hope we soon reach a future where this auto-augmentation block runs in 10 hours or less. My advice to you: instead of being an AI research scientist, become a data set hacker; that will pay off a lot in your work. I will stop here, and if you have any questions, I will take them now. [Audience question] On the augmentation parameters: how do you know that the same optimized set of parameters would also be optimal on the full model? And second, why did you choose LSTMs? It is such a complex controller, whereas the parameters for each policy are only seven, and even with multiple policies the number of parameters is still small.
Can't you choose a simpler controller to search over that space? Answering your first question: we cannot expect the same performance if we apply the same policies to a larger architecture. But when we applied these policies to a larger architecture, we still got a considerable bump compared to what we were doing blindly. Second, the LSTM has only 100 units; it is a simple unit, and it is not the most time- or resource-consuming part. The core issue is how to run those experiments fast, which is why we had to strip down the architecture. [Audience follow-up] The reason I was asking is that LSTMs are especially good for sequence-type data, and in this case the order of the policy parameters matters, is what I am saying. So why that specific order? Here, we first sample a class; then, with respect to that class, we choose an augmentation; and with respect to that class-augmentation pair, we choose a magnitude and a chance. So there is a sequence involved. Thanks. Go ahead. [Audience question] Actually, I come from the oil and gas industry, and I am translating your idea into that domain. We have oil fields producing for 30 years, and every 10 years you get newer kinds of sensors and newer kinds of tools, so your legacy data is different from the new data. How do you augment that? And nowadays you have lots of wells and you are collecting data from lots of sensors, so how do you handle it across space and across time? I do not think I can answer that in one minute, but we can talk offline. I think Bayesian optimization has been applied to the kind of data you are describing, where you essentially try to learn the data augmentations, so you do not have to go for such a complex technique every time.
What I am saying is that there are certain existing algorithms you can apply to the kind of data you are describing. The time axis is something you would have to worry about; I am not sure about that, but we can talk offline. [Audience question] Yeah, my answer would be that it is possible, but it will take perhaps 10 times more time and cost, so if you have that budget, feel free. If you have more questions, please come up. Sure, thank you.