graphs on text to doing image processing. So, how do we learn a real-time object detector with less data? That's the question I'm going to answer by the end of my presentation. If you are a deep learning practitioner, which most of you are, I'm going to leave you with a technique and a case study that can help you efficiently extract objects from pictures and videos. This is especially useful in those domains where it is not easy to get large-scale labeled data.

So, I'm Vijay. I am the co-founder and CTO at Infillect. At Infillect, we build products in the domains of retail, media, and advertising. This is joint work with my colleagues at Infillect, Bala and Anand, and Uma Savant, who is a machine learning scientist at LinkedIn.

I have a three-point agenda for my talk. First, I will talk about the problems and challenges that arise when you go beyond open-source datasets like Pascal VOC or ImageNet. Then I will describe an architecture that has the ability to ingest a lot of unlabeled data when you're doing object detection. And then I will show how this object detector outperforms most modern object detectors. So the goal of my talk is really to give you enough detail and intuition about how modern object detectors work.

Before I begin, how many of you work on object detection? A significant number. And how many of you have used or applied the technique called Single Shot MultiBox Detection, SSD? Quite a few. The architecture I'm going to present is essentially an extension of this open-source architecture.

So let's start with something most of us are familiar with: image classification. On this slide, I have shown a few example pictures. These are sports equipment: a helmet, a bat, and a tee. Note that these labels exist for the entire image. Given these images and labels, we know we can apply convolutional neural networks and the backpropagation algorithm to build a classifier. At runtime, that classifier can take an image as input and classify it.

Object detection is a step beyond image classification. If you observe, all the equipment we saw on the previous slide is now part of a single picture. To detect these objects, we not only have to locate where they exist inside the image, we also have to classify them. This can be done in two main ways: either you learn to put bounding boxes on the objects, or you learn to produce segmentation masks. In the case of segmentation masks, note that you have to make a prediction for every single pixel of the image. For most of my talk, I'm going to focus on methods that put bounding boxes on the objects inside a picture.

This has several applications; here I have shown two. On the left is a visual search application: there is a picture from the home decor domain, and the object detector has learned to identify multiple instances of objects like chairs, tables, and plants. Given a box around a chair, we can now search over an e-commerce catalog and find matching chairs, where matching can mean color, pattern, or shape. There is another very popular application where you mine social media pictures and detect the presence of brands. This is especially useful for brands because they want to know how they are performing with respect to their competitors.
There are many more applications of object detection, and you can build some of them on your own using these open-source datasets. There is the Pascal VOC dataset, which has around 11,000 images and 20 categories: chair, cat, table, and so on. And there is the recently released Open Images V4 dataset, which is a large-scale dataset: it has about 2 million images and around 600 categories.

So if these open-source datasets exist, where is the problem? The problem comes when you have to build applications for domains beyond what you find in these datasets. Take, for example, identifying equipment for sports popular in India, or identifying Indian food items. If you go north to south and east to west, there are more than 400 items that we consume, and there is no way you are going to find all those classes in these open-source datasets. So you have to collect your own datasets and build your own custom classifiers. And sometimes it is not that easy, first, to collect a lot of images, and second, to get them tagged.

Secondly, open-source datasets are not necessarily clean; they are quite noisy. On the left, I have shown an example where only one box exists around a dining table, but if you look closely, there are lots of dining tables that are not marked. On the right, there is a chair that is wrongly marked as a dining table. These are images from the Open Images V4 dataset. To show you a few more examples: again, a popular class, trees, is not marked, and in the image on the right, the sofas and the tables are not marked. Now, why is this a problem? Because when we train the models, we are giving the model a wrong signal that these objects do not exist in those pixels when they actually do. And if you have a lot of such images, you don't have much hope of learning a robust object detector. There is no way to solve this problem except to go through all the images in your dataset, scan them manually, and correct the labels.

Thirdly, the problem you are working on could be more complex than the problems in these open-source datasets. Here I have shown a table comparing different open-source datasets on the number of classes, the samples per class, and a complexity score. If you look at the MS COCO row, it has about 91 classes and about 7,500 samples per class, and the complexity score is 25. A low score indicates that it is easy to locate and classify the objects; a high score means it is not. If you look at the kind of data we work on, we have about 98 classes and around 500 samples per class, which is significantly lower than what you find in these open-source datasets. This is purely because MS COCO has spent a large number of man-hours collecting images and getting them labeled. And our problem is more complex than what you find in these open-source datasets.

So where is the way out? Collecting large labeled datasets is expensive. Models that can generate synthetic data are not yet mature. Data augmentation is one way out, but in the first place you need a lot of labeled data to do augmentation. So: if we have a decent-sized labeled dataset, but we also have access to a large amount of unlabeled data, can we solve this problem?
So can we make use of that large amount of unlabeled data to learn robust object detectors? That's the question we are going to answer. To formulate the problem: we want to detect objects of different sizes and shapes. We have a small set of labeled data, but we also have access to a lot of unlabeled data, for which boxes and classes do not exist. And we want to learn an object detector with as much speed and as much accuracy as possible, of course.

So let's build a solution, starting again with something all of us know. An image consists of a 3D matrix of RGB channels, and each pixel can take a value between 0 and 255. We also know that we can apply convolutional filters to the image. When you apply a filter to one part of the image, the image responds with activations, and the strength of the activations indicates to what extent the pattern in the filter matches a pattern in the image. We can apply several such filters, and the image responds differently to every filter. If you look at the first filter here, the image is responding to either the brown color of the bat or the shape of the bat. And if you look at the bottom right one, it is actually responding to the green color in the background. So different filters get learned.

As we keep applying these filters and doing convolution and pooling operations, what we get in the end is a feature map, which all of us are aware of. As I said, a feature map consists of a set of activation maps; here I have shown examples of two. As you go deeper in the network, different activation maps learn to identify different patterns. Here, the maps have learned to identify a face and a bat. And again, all of us know that we can convert this feature map into a feature vector, apply a prediction function like a softmax classifier and a loss function like cross-entropy, and learn this network end to end. If you are a beginner in deep learning, I especially want you to remember this slide, because what happens in object detection is not very different. We still have to choose a prediction function, we still have to choose a loss function, and we have to figure out a technique by which we can take this feature map, convert it into a feature vector, and do the object detection.

Now, think about how we as humans detect objects: at any point in time, we can focus on only one or a few objects. If you are looking at me speaking, it is not that easy for you to focus on my slides. Taking a hint from this observation, you can see that at a very high level we have to answer two questions. Where should we look? And if we are looking at one part of the image, what exists inside that part? Prior work answers these two questions in two ways. Approach one is to learn to answer each question separately: these are two-step object detectors, where you first learn where to look in the image, and then learn to identify what exists inside that part. Approach two is called single-shot detection, where you scan the image quickly, and as you scan, you keep asking: what exists inside this part of the image? So let's look at a few examples of these two approaches.
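To make the classification pipeline above concrete, here is a minimal sketch in PyTorch (my choice of framework; the talk does not specify one). The class count, channel widths, and input size are illustrative assumptions, not the speaker's exact network.

```python
import torch
import torch.nn as nn

# A minimal image classifier: conv/pool layers produce a feature map,
# which is flattened into a feature vector and fed to a softmax head.
# All sizes here (3 classes, 32/64 channels, 300x300 input) are illustrative.
class TinyClassifier(nn.Module):
    def __init__(self, num_classes=3):  # e.g. helmet, bat, tee
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((7, 7)),   # fixed-size feature map
        )
        self.classifier = nn.Linear(64 * 7 * 7, num_classes)

    def forward(self, x):
        fmap = self.features(x)             # feature map (set of activation maps)
        vec = fmap.flatten(1)               # feature vector
        return self.classifier(vec)         # class logits

model = TinyClassifier()
logits = model(torch.randn(1, 3, 300, 300))
# Cross-entropy combines log-softmax and negative log-likelihood,
# so the whole network can be trained end to end with backpropagation.
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0]))
```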
As I mentioned, in the case of a two-step object detector, we have these two questions to answer: where should we look, and what exists inside what we are looking at? If you look at the middle image, we have identified a few boxes. On the left, if you consider boxes of all sizes and shapes, there are tens of thousands of boxes you could put on the image. But in the middle image, we have identified only a few tens of boxes, around the bat, the helmet, the tee, or the player. So we have answered the first question, where to look. And in the last image, we have filtered out most of the boxes, and what remains are the most important boxes and their labels.

So let's go through this process. How do we learn to do these two steps? We know that from an image we can extract a feature map, and that map contains a set of activation maps. Now think about applying a set of anchor boxes on each part of this feature map. On the left, I have shown examples of a few anchor boxes, and they come in all aspect ratios: they could be squares, rectangles, and so on. You apply all of these anchor boxes on each part of the feature map. Since a feature map is a representation of the image, applying these boxes on different parts of the feature map is as if you are applying them on the original image.

There are still thousands of boxes that get applied on the feature map, so we haven't solved the problem yet. The problem we want to solve is: of these thousands of boxes, which ones are significant? Which ones are promising? And this is what we can do by making use of data. We have images with bounding boxes, so we can project the box we have on the feature map back onto the image. That yellow box on the feature map is now projected onto the image, and we also have access to the ground-truth box. We can compute a very simple metric, intersection over union, and figure out the chance that this box contains some object. Remember, what we are trying to learn is where to look. It is a simple binary classification problem where we filter out most of the boxes and keep only those that have a very high chance of containing some object. And we can do this because we have access to labeled data.

After this step, at runtime, when you get an image, you do this binary classification, filter out most of the anchor boxes applied on the feature map, and what you get is only a small set of boxes. Now we have to answer the second question: what exists inside each of those boxes? As I mentioned, we have a set of candidate boxes, and it is not necessary that all of them contain an object. So we take a candidate box, project it onto the feature map, and extract that part of the feature map. Where earlier our feature map was of size 13×13, now we have one of size 7×7. And now we know the pipeline: it is just image classification, where you convert the feature map into a feature vector, apply a softmax classifier, and learn the pipeline.
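As a concrete sketch of the "where to look" step: below is a small example of computing intersection over union and using an IoU threshold to label anchor boxes as object versus background for the binary classifier. The (x1, y1, x2, y2) box format and the 0.5 threshold are common conventions I'm assuming, not details given in the talk.

```python
import torch

def iou(boxes_a, boxes_b):
    """Pairwise IoU between two sets of boxes in (x1, y1, x2, y2) format."""
    # Intersection corners: max of the top-lefts, min of the bottom-rights.
    tl = torch.max(boxes_a[:, None, :2], boxes_b[None, :, :2])
    br = torch.min(boxes_a[:, None, 2:], boxes_b[None, :, 2:])
    wh = (br - tl).clamp(min=0)          # zero if the boxes do not overlap
    inter = wh[..., 0] * wh[..., 1]
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)

# Label each anchor by its best overlap with any ground-truth box:
# 1 = likely contains an object (keep it), 0 = background (filter it out).
# The 0.5 threshold is an assumption for illustration.
anchors = torch.tensor([[0., 0., 50., 50.], [100., 100., 220., 200.]])
ground_truth = torch.tensor([[110., 105., 210., 195.]])
objectness = (iou(anchors, ground_truth).max(dim=1).values > 0.5).float()
```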
There is a slight technicality here, region-of-interest pooling, because as you can observe, anchor boxes come in different sizes and shapes, which means the size of the extracted feature map can vary. Since we want a fixed-length feature vector, we have to do some kind of pooling operation so that this pipeline can be trained in an end-to-end fashion.

But we are not yet done. It is possible that the box does not enclose the underlying object properly, which means we have to correct the box. And again, we follow the same pipeline: we convert the feature map into a feature vector, but now we have a four-output linear layer that does offset correction. That yellow box can now be corrected by adjusting its top-left and bottom-right coordinates. So we have seen these two steps, where to look and what exists inside the part we are looking at. We learn both steps, and that's why it is a two-step object detector.

What happens in a single-shot object detector is not much different, except that you apply anchor boxes on the feature maps and do the prediction directly; there is no selection step in between. This is the architecture of Single Shot MultiBox Detection, and there are two points I want you to note. First, in this architecture, instead of looking only at the feature map at the end, we look at several feature maps. It is as if you have a main branch doing the convolution operations, and on each of the feature maps you have sub-branches doing prediction. Because you are applying anchor boxes on so many feature maps, there are almost 8,000 boxes on which we have to do prediction. You might think this is a problem, but because each sub-branch is lightweight, SSD is in fact much faster than two-step object detection, and it is also more accurate.

Let me talk about why it is accurate. It is mainly because of the choice of applying anchor boxes on different feature maps. We know that feature maps close to the input have a smaller field of view: the kernels are of size 7×7 and we have an image of 250×250, so we are looking at small parts of the image. Hence these feature maps are especially useful for identifying objects that occur at small sizes. As we go deeper towards the output layer, a feature map of size 3×3 has a larger field of view, which means it can identify objects that occur at medium or large sizes.

But if you observe, we are still not exploiting the full strength of these feature maps. Feature maps close to the input do not necessarily learn semantic features; they learn low-level features like edges and circles, and hence they are not really capable of identifying different variations of small objects. And feature maps close to the output, because they have a larger field of view, do not necessarily perform well on objects that occur at small and medium sizes. And of course, this architecture does not let us ingest the large amount of unlabeled data we have. Remember, we started with the problem that our labeled data is on the smaller side, but we have a lot of unlabeled data.
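Here is a hedged sketch of the RoI-pooling step just described, using torchvision's built-in roi_pool to turn variable-sized box regions into a fixed 7×7 feature map, followed by the softmax head and the four-output offset regressor. The channel count, class count, and the 250×250 input assumed for spatial_scale are illustrative, taken from the numbers quoted in the talk rather than from a published implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

# A 13x13 feature map with 256 channels, as in the talk's example.
feature_map = torch.randn(1, 256, 13, 13)

# Candidate boxes in image coordinates: (batch_index, x1, y1, x2, y2).
boxes = torch.tensor([[0., 30., 40., 180., 220.],
                      [0., 10., 10., 120., 90.]])

# RoI pooling maps each variable-sized region onto a fixed 7x7 grid, so
# the downstream feature vector always has the same length. spatial_scale
# converts image coordinates to feature-map coordinates (13/250 here,
# assuming a 250x250 input as mentioned in the talk).
pooled = roi_pool(feature_map, boxes, output_size=(7, 7),
                  spatial_scale=13.0 / 250.0)   # -> (2, 256, 7, 7)

vec = pooled.flatten(1)                          # fixed-length feature vector
classifier = nn.Linear(256 * 7 * 7, 21)          # e.g. 20 classes + background
box_regressor = nn.Linear(256 * 7 * 7, 4)        # offsets for both corners
class_logits, box_deltas = classifier(vec), box_regressor(vec)
```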
And there is no way to ingest that unlabeled data in this kind of architecture. So before I proceed to explain how we solve this problem, do you have any questions about what we have discussed so far? I can take one question if any part is not clear. Okay, so let me go ahead and talk about how we extended this architecture.

On the left, we have the SSD architecture, which is what I explained: you have a main branch, and then you have sub-branches where we do prediction. We call our work SSD++, and the first thing you can observe is that we have an encoder-decoder architecture, which means we can give unlabeled data as input and expect to reproduce it as output. Why do we do this? Because we have a lot of unlabeled data, and if we can learn from that data, it is going to help us learn better feature maps and initialize our network with the right weights, which we then use when we do supervised learning for object detection. So this is the first deviation from prior work.

The second thing we do is combine feature maps across layers. As I explained, feature maps close to the input are not semantically strong, but they are useful for detecting small objects; feature maps close to the output have a larger field of view and a lot of semantic information, but we are not using that information to detect small and medium-sized objects. To solve this problem, we combine layer i with layer i+2, and the reason is that there is enough delta between the information in two layers that far apart. If we look at the first fusion block, the one with four inputs, it has two inputs from the encoder part and two inputs from the decoder part. We combine these layers and form a new feature map, and once we have this feature map, we can apply a set of anchor boxes on it and do prediction and regression. After that, the sub-branches we have are similar to what you have in SSD.

Let me go a bit deeper into how this is done; I will read this slide from the left. We have an input of size 300×300. We have the encoder part, which is essentially a ResNet architecture. We take the feature map at the output, give it to the decoder part, and try to reproduce the image. This happens in stage one, when we are doing unsupervised learning. In stage two, as I mentioned, we combine four different feature maps. Let's again look at the first fusion block with four inputs: it has input from a layer of size 38 on the encoder part, a layer of size 10 on the encoder part, and again the counterparts from the decoder. (Yes, ten minutes, okay.) So now the key question is: how do we convert the layer of dimension 10 into dimension 38, so that we can do meaningful element-wise addition or multiplication? You can see that we can easily combine the layer of size 38 with another of size 38, because they are the same size. To answer that question, we apply a deconvolution, where we upsample that feature map to size 40×40, and then we apply a convolution operation to bring it down to 38×38.
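Here is a minimal sketch of that upsample-and-fuse step, assuming the sizes quoted in the talk: a 10×10 map upsampled to 40×40 by a stride-4 deconvolution, then a 3×3 convolution with no padding to reach 38×38, followed by element-wise addition with the 38×38 encoder map. The channel count is my assumption; the actual SSD++ layer widths are not given.

```python
import torch
import torch.nn as nn

C = 256  # channel width, an illustrative assumption

# Deep 10x10 feature map (semantically rich) and shallow 38x38 map
# (fine detail, useful for small objects).
deep = torch.randn(1, C, 10, 10)
shallow = torch.randn(1, C, 38, 38)

# Deconvolution (transposed convolution): (10 - 1) * 4 + 4 = 40,
# so the 10x10 map is upsampled to 40x40.
upsample = nn.ConvTranspose2d(C, C, kernel_size=4, stride=4)
# 3x3 convolution with no padding: 40 - 3 + 1 = 38, bringing it to 38x38.
shrink = nn.Conv2d(C, C, kernel_size=3)

fused = shallow + shrink(upsample(deep))   # element-wise addition
assert fused.shape == (1, C, 38, 38)
# Anchor boxes and the prediction/regression heads are then applied
# on `fused`, as in the SSD sub-branches.
```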
And now we have feature maps that are all of the same size, so we can do element-wise addition and construct a new feature map, which is a far better feature map than what you get in a purely encoder architecture.

We applied this architecture to the Pascal VOC dataset, which has about 20 classes; for each class, there are on average 400 images and around 800 instances. We used an NVIDIA K80 GPU. The main parameter we tuned was the Jaccard overlap: in most prior work it is 0.5 or 0.55, but in our case 0.65 worked best. We do the unsupervised learning for the first 20K iterations, where one iteration is going through one batch, and then we do supervised training for around 80K iterations.

This table compares our work with most of the prior work, like the You Only Look Once (YOLO) architecture and the SSD architecture, with respect to mAP, the mean average precision metric. Compared to SSD, which has an mAP of 77.5, we gain almost 3%, and that's a very significant improvement. In terms of frames per second, we are not doing that badly either: SSD runs at 62 frames per second, whereas we run at about 50. So we have improved accuracy with only a marginal decrease in speed.

We have made several design choices: confluence of convolutional feature maps, confluence of deconvolutional feature maps, and unsupervised learning. How does each of these choices affect accuracy? To answer that question, I'm going to look at mAP for small objects, because these are the stringent cases. When we add confluence of feature maps just on the encoder part, we get an increase of about 4%. When we add confluence across the decoder part as well, we get a jump of almost 10%. And on top of that, if we add unsupervised learning, we get another bump of around 3%. So each of these choices helps improve accuracy.

These are a few examples of applying SSD++ on Pascal VOC test data. The yellow boxes are the ones we produce, whereas the blue, red, and green boxes are the ones SSD produces, and as you can see, we are doing significantly better. So where do we stand with respect to prior work? There is a recent work called RetinaNet, which some of you may be aware of. We do slightly worse in terms of mAP, but significantly better in terms of speed.

I don't have time to go through this in detail, so I'll just touch upon these key questions. We looked at the architectures and understood the intuition behind them, but that is just the tip of the iceberg. When you have to make these things work in practice, you have to answer all of these questions. How do you do transfer learning, and what kind of weights do you start with? How do you choose filters and anchor boxes? What do you do when you have class imbalance? That's a very practical problem that I'm sure many of you have faced. How do you choose data augmentations? Can you just apply 30 augmentations blindly, or is there a way to apply the right augmentations so that you improve accuracy? How do you do hard negative mining? How do you choose a maximum-entropy sample, which is what Madhu talked about in the morning? And finally, how do you visualize things, so that you know your object detector is training in the right direction?
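To make the two-stage schedule concrete, here is a hedged sketch of the training loop: an unsupervised reconstruction stage on unlabeled images for the first 20K iterations, then supervised detection training for around 80K. The model interface, the data loaders, the MSE reconstruction loss, and the optimizer settings are all illustrative assumptions, not the exact SSD++ recipe.

```python
import torch
import torch.nn as nn

# Hypothetical handles: `model` is the encoder-decoder detector,
# `unlabeled_loader` / `labeled_loader` yield batches, and
# `detection_loss` combines softmax classification with box regression.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

# Stage 1: unsupervised pretraining for the first 20K iterations.
# The decoder reconstructs the input image; the reconstruction loss
# shapes the encoder's feature maps before any labels are used.
for step, images in zip(range(20_000), unlabeled_loader):
    loss = nn.functional.mse_loss(model.autoencode(images), images)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

# Stage 2: supervised detection training for around 80K iterations,
# starting from the pretrained weights. Anchors are matched to ground
# truth at a Jaccard (IoU) overlap of 0.65, as tuned in the talk.
for step, (images, boxes, labels) in zip(range(80_000), labeled_loader):
    class_logits, box_deltas = model.detect(images)
    loss = detection_loss(class_logits, box_deltas, boxes, labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```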
So these are all questions that are not easy to answer. I hope I have given you enough intuition about the kind of problems and challenges you face when you have to go beyond open-source datasets. I talked about an architecture which, as we saw, has the ability to ingest a lot of unlabeled data, and I showed that it outperforms most of the prior work.

Before I take questions, I just want to show you a few failure cases. On the left, there is a picture where a dog is marked as a bag, and that's purely because we have lots of images where men and women are carrying bags, and not enough images of dogs. On the right, there is a picture where a TV monitor is wrongly tagged, because the model just assumes that certain shelves contain a TV monitor. So whether to call this kind of object detector reasoning, or artificial intelligence, or brain-inspired deep learning, or simple statistical learning, I will leave that to your wise judgment. With that, if you have any questions, feel free to email me, or I can take them now. How much time do I have? I think I can take the mic here.

Hi, could you just take us through that slide again, the one with the convolution and deconvolution? Some more explanation.

This slide? Yeah. Why do you have to take the feed from the 5×5 feature map to the 38×38 feature map? What is the intuition behind that?

Are you asking me why we have to combine two feature maps? Is that what you're asking?

Yeah, because the 5×5 feature map eventually contains the information from the 38×38, right? It will have the receptive field from the 38×38 in a smaller format. So...

Yeah, the question is not fully audible to me, but if you're asking why we do this confluence of feature maps, I can answer that. As I mentioned, the intuition is that feature maps close to the input learn pretty generic features, like identifying circles, edges, and lines. And as I showed in my earlier slides, as you go deeper, the feature maps learn to identify abstract concepts, like faces. Now, the feature maps close to the input, since they have a smaller field of view, are especially good at identifying small objects; in our case, in fashion and home decor, that might be specs or earrings. If you can equip the feature maps close to the input with the semantic features that exist further down the line, then they will not just identify generic patterns like circles or squares, but will also learn to identify that there is a bird or there is a ring. That's the reason we combine features across layers.

Hello, hi, my name is Teja from Deloitte. I have a question: the feature maps you have shown for the images, are these only for static images, or can they be applied to live images as well?

I'll just repeat, I'm not able to hear properly.

The filter maps you've shown for the images, identifying a T-shirt or a bat: are these only for static images, or can we apply them to live video as well?

Yes, so this is an object detector for images, which means you can also apply it to video if you consider a video as a set of frames.
And we are also working on an extension where you can look at a set of frames and do the detection across them, identifying different parts from different frames.

Hi, I have one question related to a problem we often face: we will not have a lot of labeled data relevant for our category. So have you tried this to see how much improvement it brings if you train on a larger corpus but apply it to a different category or a specific area? I clearly see one advantage here: if the objects are small, this deconvolution step helps detect them better than other algorithms would. The question is about application to an entirely different category where there aren't many labels, say entirely unsupervised. How would it help?

I think the answer to your question is in this slide, where we applied the detector to 98 classes of Indian fashion and home decor. We only had 500 samples per class, but we had about 40,000 to almost 80,000 images which were not labeled; that was the training setting. When we applied vanilla SSD, we got an mAP of 56%, but when we applied SSD++, we got 72%. And as I said, I don't think it is an apples-to-apples comparison, because SSD++ is also doing unsupervised learning, ingesting a lot of unlabeled data. But then, that is our architectural innovation. So, does that answer your question?

In part, yes, thank you.

I guess I'm out of time. Thanks for your time. If you have any other questions, feel free to reach out to me.