We'll be starting with our first talk of the day, which is fine-grained image classification with bilinear CNNs. The talk is by Rajesh Shreedhar Bhat. He's working as a senior data scientist at Walmart, Bangalore. His work is primarily focused on building usable machine learning and deep learning solutions that can be applied across various domains at Walmart. He has been a speaker at various international and national conferences, from ODSC to the Silicon Valley AI Summit, Kaggle Days meetups, and a lot of other places. He has also been a mentor for the Udacity Deep Learning and Data Scientist Nanodegrees for the past three years and has conducted numerous workshops. So, without wasting any more time, let's get started with the talk for today. Welcome, Rajesh.

Hey, thank you. Thanks for the introduction. Hi everyone, good afternoon. I'm Rajesh Shreedhar Bhat, working as a senior data scientist at Walmart, and recently recognized as a Google Developer Expert in machine learning. Today I'll be talking about fine-grained image classification with bilinear CNNs.

This is how the agenda looks -- I hope you're all able to see my screen. I'll start with a brief introduction to what fine-grained image classification is, the challenges that come with it, and its applicability in multiple domains. Then I'll touch upon the prerequisites, basically an overview of CNNs and receptive fields. After that I'll get to bilinear CNNs, which are the core of today's talk, and to some modifications we made to bilinear CNNs with attention -- how we can reduce the number of parameters while keeping the accuracy intact. Finally, I'll show some sample code in PyTorch and talk about the results we got.

Coming to classification versus fine-grained image classification: everybody is familiar with classifying images. Let's take an example. In the image shown above, we want to classify between, say, birds and animals -- in this case, a crow versus a cow. It's pretty easy to distinguish between these two classes; as humans, we can look at the two images and immediately say which is the crow and which is the cow, whatever the breed. But in a fine-grained image classification task, the classification is within the crow class: is it a common raven or an American crow? These are two different classes within a single coarse label, and as you can see from the two images, even for humans it's very difficult to tell them apart -- both are just crow-like birds sitting there, and the differences between them are very minute. That is image classification in general versus fine-grained image classification. So what is the formal definition? Fine-grained image classification is focused on differentiating between hard-to-distinguish objects or classes.
It could be bird species, or flowers with very minimal differences between classes. It could be animals, or product images, where the difference between two classes might come down to flavor or some other small aspect of the product. And in the medical domain, if you're looking at MRI scans or other imagery, the difference between two classes can also be very minimal: two conditions can look visually very similar, with only minor differences marking them as different diseases. These are the kinds of domains where fine-grained image classification techniques can be applied.

As for the challenges: the image shown above has two species of birds that are visually very similar. The only differences are the beak color -- yellow on one, with a black spot on the other -- and slightly different legs; the rest of the appearance is nearly identical. Or look at the Siberian Husky, the American Eskimo Dog, and the Alaskan Malamute: all different breeds, so different classes, but visually very difficult to tell apart. The same goes for the Irish Terrier example shown above. It's very complex to identify the minute discriminating parts within the images. Labeling such images is also very difficult: it requires domain expertise, because just by looking at an image, without any knowledge of the different species or breeds, it's hard to assign a label.

Also, look at the number of samples in typical fine-grained datasets -- flowers, cars, birds, aircraft, and Stanford Dogs. The number of classes is pretty high and the number of samples per class is pretty low. For Stanford Dogs there are 120 classes with around 100 images per class, and for the Oxford Flowers-102 dataset the samples per class are again very few. Compare that with datasets like CIFAR or MNIST, where the number of samples is high and the number of classes is low. In that case it's easier to build classification models: you can just use transfer learning approaches and train a network on those datasets. But here, since the number of classes is large relative to the number of samples, it's pretty hard, and we can't afford a very complex model. We'll come back to that.

Before moving to bilinear convolutions, let's quickly go over what convolutions are, and then receptive fields. In a convolution layer, given an input image, we have a set of filters; we apply each filter to the image and get feature maps out of it. Then there is a max-pooling layer for downsampling; a minimal sketch of this pattern follows.
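To make that concrete, here is a minimal PyTorch sketch of one convolution-plus-pooling stage. This is illustrative only, not the code from the talk; the filter count and input size are arbitrary, chosen to match the 28×28 → 14×14 example below:

```python
import torch
import torch.nn as nn

# Toy example: a single conv + max-pool stage. The filter count (8) and the
# 28x28 single-channel input are arbitrary stand-ins.
x = torch.randn(1, 1, 28, 28)        # (batch, channels, height, width)

conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
pool = nn.MaxPool2d(kernel_size=2)   # halves each spatial dimension

features = conv(x)                   # -> (1, 8, 28, 28): feature maps
downsampled = pool(features)         # -> (1, 8, 14, 14): downsampled maps
print(features.shape, downsampled.shape)
```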
So you can see the feature map getting reduced from 28×28 to 14×14 -- that downsampling happens via the max-pooling layer. These convolution and max-pooling patterns are repeated, and finally we get a feature map that is flattened; the flattened output is then passed to a fully connected neural network. So, given an input image, we get features from the convolution layers, and those features are used for downstream tasks: detection, classification, or in our case fine-grained image classification. I hope everyone is at least familiar with what a convolution is, so I won't spend much time explaining each of the terminologies here.

Coming to receptive fields: the receptive field is simply how much visibility a feature value has on the input image. In this example, in layer one we have L0, which you can treat as the image or as a feature map. Say you apply a 3×3 filter: you get a single value out, and that single value has visibility over a 3×3 patch. That patch is its receptive field. The same thing is shown on the right: convolve with a 3×3 filter, get a single feature value, and that value sees a 3×3 patch. Now apply one more 3×3 filter on top: the new value sees a 3×3 patch of the intermediate feature map, and each value in that patch in turn sees a patch of the input, so the new value has visibility over a larger region of the input -- in this case 5×5. Intuitively, the receptive field increases as you move away from the input image, and values close to the input see only small patches: it starts at 3×3 with a 3×3 filter, and eventually a single value can have visibility over the entire image. A small helper for computing this follows.
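As a quick aside, receptive field growth can be computed with the standard recurrence r ← r + (k − 1)·j, where j is the cumulative stride. This small helper is my own illustration, not from the talk; it reproduces the 3×3 → 5×5 example above:

```python
def receptive_field(layers):
    """Receptive field on the input for a stack of conv layers.

    `layers` is a list of (kernel_size, stride) pairs, applied in order.
    """
    r, j = 1, 1                      # receptive field and cumulative stride
    for k, s in layers:
        r = r + (k - 1) * j          # each layer widens the field by (k-1)*j
        j = j * s
    return r

print(receptive_field([(3, 1)]))           # 3: one 3x3 conv sees a 3x3 patch
print(receptive_field([(3, 1), (3, 1)]))   # 5: two stacked 3x3 convs see 5x5
```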
I hope the concepts of CNNs and receptive fields are clear. Now let's move to the bilinear CNN architecture and see how exactly it helps us classify images with very minute differences.

Bilinear CNNs were introduced in 2015 by Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji, in a paper published at ICCV 2015. This is how the architecture looks. Given an image, we have two CNN streams, stream A and stream B -- that's why it's called bilinear. It's not restricted to a single network: two different networks can be used, or the same network can be used twice with weight sharing. The input image passes through stream A and also through stream B, and each stream gives a feature map of shape W × H, with M being the depth; it's the same on the other side, except stream B can have a different number of filters and hence different features. From the features of the two streams we then calculate something called the feature interaction matrix. Don't worry about it at this point -- treat it as a black box for now; we'll go into the details. Once we have the feature interaction matrix, we flatten it and pass it to the fully connected layers. In typical CNNs, the feature maps themselves are flattened and sent to the fully connected and softmax layers; here, it's the feature interaction matrix that is flattened and sent on. That, together with having two CNN streams, is the major difference.

Now, the details of how we calculate the feature interaction matrix. Say we have a feature map with height h, width w, and depth m -- that many filters were applied, so that many feature channels are present. Instead of flattening it, we can convert this 3D tensor to a 2D representation: w·h rows and m columns, so the depth becomes the columns. Each row now refers to one location in the input image -- this is where the concept of receptive fields I spoke about earlier comes in -- and each location has an m-dimensional feature descriptor. I hope that's clear: we're just converting a 3D representation into a 2D one, which will be helpful for calculating the feature interaction matrix.

So this is how the feature interaction matrix is calculated. The image passes through stream A, and we get its feature map in the 2D shape shown on the left, with a transpose taken: m feature rows and w·h columns. For the feature map coming from stream B we keep w·h rows corresponding to locations and n feature columns. Then it's just a matrix multiplication: the shared w·h dimension has to match, and multiplying the two matrices gives a matrix of shape m × n. I've highlighted a particular row and a particular column here just to show how a single element arises: that row and that column are taken, and the element is essentially a dot product between them. A minimal sketch of this computation follows.
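Here is a minimal PyTorch sketch of that computation for a single image. The tensor names and sizes are mine, and random values stand in for real CNN features:

```python
import torch

# Stand-ins for the two streams' feature maps: stream A has m channels,
# stream B has n channels, both over an h x w spatial grid.
m, n, h, w = 512, 512, 28, 28
fa = torch.randn(m, h, w)
fb = torch.randn(n, h, w)

# 2D views: for A, rows are the m features and columns the w*h locations
# (i.e., the transpose); for B, rows are locations and columns the n features.
A = fa.reshape(m, h * w)             # (m, w*h)
B = fb.reshape(n, h * w).t()         # (w*h, n)

# Feature interaction matrix: entry (i, j) is the dot product of feature i
# from stream A with feature j from stream B, accumulated over all locations.
interaction = A @ B                  # (m, n)
print(interaction.shape)             # torch.Size([512, 512])
```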
If the similarity is high -- the two features respond strongly at the same locations -- the resulting value is high; if the similarity is low, the value is low. And the same is done for all combinations: the first row with the last column gives the element at the end of the first row, the first row with all columns fills in the first row, and similarly for every other row. That is the feature interaction matrix.

Now let's get to the intuition behind the feature interaction matrix; I've taken these slides from the authors themselves. We know that CNNs capture features from input images based on the task at hand, and it's known that features learned in the initial layers are very generic while features learned close to the fully connected layers are very task-specific. Say the filters in stream A are trying to capture parts -- beak, tail, belly, legs, things like that -- and the features coming from stream B are mostly colors. The interaction between them is captured by the feature interaction matrix: if there is a gray belly in the input image, the value at that particular row-column pair will be very high. And these interactions are exactly what's important for classifying such images into different species. That is the intuition.

Let me just go through the architecture once more: we have stream A and stream B, we get the features, and we calculate the feature interaction matrix. One stream's features could be parts like beak or tail, the other's could be colors -- I'm reiterating because this is a very important aspect -- and there are interactions between these features. If a combination is present in the image, its value is high; if, say, a red tail is not present, its value is low. That is what the feature interaction matrix looks like.

Now, my colleague Souradip Chakraborty and I were discussing: are all of these interactions important, or can we take only a few important interactions that actually help with classification? That was the intuition, and that is how we started with attention models. Attention models were initially introduced in the NLP space. Say you're translating a sentence from German, or any other language, to English: while translating a particular word, you need not look at the entire input sentence; you can focus on just one word and get that particular word translated. That's the idea behind attention techniques -- being selective over some input space or vector space. The same idea was then introduced in the image space as well. There was a paper called Show and Tell, about generating a caption for a given input image -- say, 'a woman is throwing a frisbee in the park'.
Then came a paper called Show, Attend and Tell, which put an attention mechanism on top of this. The intuition: if I'm generating the word 'frisbee' for a given input image, my focus need not be on the entire image -- it can be on just the frisbee, and the attention mechanism takes care of that. I'm not going into the details of how exactly attention is calculated; I hope you got an intuition of how the attention mechanism works, both in the context of sentences and in the context of images.

Now, coming to the modifications we made. This is not yet published -- it's work in progress, done with my colleague. Given the input image, everything remains the same up to the feature interaction matrix. But instead of flattening the feature interaction matrix and sending it to the fully connected layers, we put attention on the interaction matrix, trying to capture the most useful, most important interactions. Since we have two CNN streams, we apply the attention mechanism both row-wise and column-wise. The row-wise attention tells us which feature's interactions are most important for classifying a particular image -- in a blue-belly example, where the difference lies in the belly, the belly is the most important feature coming from that CNN stream, and we can focus on just that. And if a particular color is what's important, the column attention takes care of that. So we give importance only to certain features that matter, through a row-wise attention and a column-wise attention.

With these added, instead of a matrix we end up with two vectors, which we call z_row and z_col. If we had flattened the m × n matrix, there would have been m·n elements fed to the fully connected layer; now it's a row vector and a column vector concatenated -- m + n elements. So we reduce the feature space, and that definitely reduces the complexity of the model. The complexity needs to be reduced, because the number of samples is very small compared to the number of labels: in the Dogs dataset we used there were 120 classes but only around 100 samples per class, and a very complex model would need more data to train. A rough sketch of this attention idea follows.
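Since the paper isn't out yet, the exact formulation isn't public. The following is only a rough sketch of the idea as described in the talk, assuming a simple learned softmax attention over rows and columns; the scoring layers and the names (InteractionAttention, z_row, z_col tensors) are my assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractionAttention(nn.Module):
    """Sketch: pool an (m, n) interaction matrix into two vectors instead of
    flattening it into m*n features. Row attention weights the m rows and
    column attention weights the n columns; the pooled vectors are
    concatenated into m + n features. Not the authors' exact formulation.
    """
    def __init__(self, m, n):
        super().__init__()
        self.row_score = nn.Linear(n, 1)   # relevance score for each row
        self.col_score = nn.Linear(m, 1)   # relevance score for each column

    def forward(self, x):                  # x: (batch, m, n)
        row_w = F.softmax(self.row_score(x).squeeze(-1), dim=1)                  # (b, m)
        col_w = F.softmax(self.col_score(x.transpose(1, 2)).squeeze(-1), dim=1)  # (b, n)
        z_row = torch.einsum('bm,bmn->bn', row_w, x)   # attention-pooled rows
        z_col = torch.einsum('bn,bmn->bm', col_w, x)   # attention-pooled columns
        return torch.cat([z_row, z_col], dim=1)        # (b, m + n) features
```

Whatever the exact formulation, the payoff is the same: with m = n = 1024 and 120 classes, the classifier weights shrink from roughly 1024 · 1024 · 120 ≈ 126M to (1024 + 1024) · 120 ≈ 0.25M.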
Currently, the experimentation is done on the Stanford Dogs dataset: around 20k images across 120 classes. This is how the split looks: 16k for training and validation and 2k for testing. Bounding boxes are also available for each image, so we crop only the region where the dog is present, resize it to a standard size, and that is how we prepare the dataset before training the network. I'll come to which networks we used and how the experimentation was done.

I hope this code is visible. I'll mainly focus on the bilinear convolution part and the attention part -- we are getting ready to publish, so the full code will be available soon. We tried ResNet-50, pretrained on ImageNet, as the backbone model. We didn't train it from scratch; it's basically a transfer learning approach, and we ignore the fully connected layers, using only the convolutional features that come out. Since the final layers of the ImageNet model are very task-specific, we drop them. The final convolutional layer of ResNet-50 gives around 2048 feature channels -- this could be anything, 512 or 1024, depending on which network is used -- so the feature interaction matrix is 2048 × 2048. Flattened, that is the input to the final output layer; with our 120 classes, the weight matrix learned between the convolution layers and the final classification layer is of shape (2048 · 2048) × 120.

Just to summarize: we take a ResNet-50 ImageNet model, ignore its final layers, and introduce a head that is specific to our task of classifying images into 120 breeds. That is how the initialization looks. Then in the forward propagation, the feature interaction matrix is calculated: we pass the image through the ResNet, and -- as I said earlier -- streams A and B can be a single network. Right now I'm using a single network, but this could just as well be a second variable, say y = self.inception(x) or any other network. The feature interaction matrix can be computed using torch.bmm, batch matrix multiplication, and then it's fed to the fully connected layer. The signed square root and normalization applied on top make sure everything is differentiable and backpropagation is possible -- exactly as described in the paper. A condensed sketch of the whole model follows.
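Here is a condensed sketch of the model as described in the talk (the full code is on Kaggle): a shared ImageNet-pretrained ResNet-50 backbone, torch.bmm for the interaction matrix, signed square root and L2 normalization, and a 120-way head. Details such as the epsilon and the division by h·w are my choices, not necessarily the talk's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class BilinearCNN(nn.Module):
    def __init__(self, num_classes=120):
        super().__init__()
        resnet = models.resnet50(pretrained=True)          # ImageNet weights
        # Keep only the convolutional feature extractor (drop avgpool + fc).
        self.features = nn.Sequential(*list(resnet.children())[:-2])
        # Flattened 2048 x 2048 interaction matrix -> 120 classes. This huge
        # head is exactly what the attention variant above shrinks.
        self.fc = nn.Linear(2048 * 2048, num_classes)

    def forward(self, x):
        f = self.features(x)                               # (b, 2048, h, w)
        b, c, h, w = f.shape
        f = f.reshape(b, c, h * w)                         # (b, 2048, h*w)
        # Both streams share one backbone here; a second network (e.g.
        # y = self.inception(x)) could replace one operand of the bmm.
        phi = torch.bmm(f, f.transpose(1, 2)) / (h * w)    # (b, 2048, 2048)
        phi = phi.flatten(1)
        # Signed square root + L2 normalization, as in the original paper,
        # keep the pooled features well-scaled and differentiable.
        phi = torch.sign(phi) * torch.sqrt(torch.abs(phi) + 1e-10)
        phi = F.normalize(phi)
        return self.fc(phi)
```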
So, coming to the results we got -- this is actually very important.

Sorry Rajesh, we have just five minutes left for the talk.

Yeah, this is the last slide. Thank you. In both bilinear CNNs and bilinear CNNs with attention -- the modification we applied -- the feature map size is the same. What changes is the input size to the fully connected layer. In bilinear CNNs we had the flattened feature interaction matrix: say there were 1024 filters; since I was using the same convolution backbone for both streams, m and n were equal, so the input was m·n -- a very large number. In the attention version we instead take the row attention and the column attention: a 1024-element vector comes from the row attention and another 1024 from the column attention, and the two are concatenated. The multiplication m·n effectively becomes the addition m + n, and that definitely helps in reducing the number of parameters. I've given an example here: with our 120 classes, the number of parameters to learn in the plain bilinear case is huge, while in the attention case it's very small -- a 99.8% decrease in the number of parameters to learn. But if you look at the accuracy, it's still intact: close to 83% with bilinear CNNs, and around 82.5% with attention and far fewer parameters. Without bilinear CNNs -- just transfer learning with ResNet models -- it was around 75%.

So I hope I gave you an idea of what the fine-grained image classification task is and what its challenges are. Bilinear CNNs was the first deep learning paper for the fine-grained image classification task; before that there were traditional approaches using handcrafted features such as histograms of oriented gradients (HOG) or Fisher vectors. This was the first paper on using convolutional networks for fine-grained image classification. And with the modification we made, we've seen a significant reduction in the number of parameters while keeping the accuracy intact. That is the summary of the talk. A blog is also available on Weights & Biases -- as you can see here, it's published there -- and the code is available on Kaggle. Just let me know if you have any questions.

Thanks a lot, Rajesh. This was actually an amazing talk -- coming from an OpenCV background, I can really relate to it. We have a few questions from the chat, so I'll just read them off the screen.

Sure. So: the stream A and stream B CNNs could be ResNet models, and it can be either two different CNNs or the same CNN used for both streams. I hope that answers it. In our case it was a ResNet-50 model, pretrained on ImageNet.