One of the most intriguing advancements brought by deep learning and neural networks is in the field of computer vision. We associate any problem that has an image or camera input with computer vision. Self-driving cars, fMRI analysis, Mars exploration rovers, facial recognition systems, object detection, and augmented reality are just a few breakthroughs in the field. In this video, we will take a look at a neural network architecture called Mask region-based convolutional neural network, Mask R-CNN for short, and in the process highlight some key sub-problems in computer vision as well. Mask R-CNN tackles the problem of instance segmentation: the process of detecting and delineating each distinct object of interest in an image. Instance segmentation is a combination of two sub-problems. The first is object detection, the problem of finding and classifying a variable number of objects in an image. The number is variable because the count of detected objects can change from image to image. The second part of instance segmentation is semantic segmentation, the understanding of an image at the pixel level. That is, we want to assign an object class to each pixel in the image. In this figure with the motorcyclists, apart from recognizing the bike and the person riding it, we also have to delineate the boundaries of each object. Using object detection and semantic segmentation together, we get instance segmentation. In these images, the bounding box is created by object detection, and the shaded masks are the output of semantic segmentation. Now that we have a high-level intuition of instance segmentation, we'll take a look at the architecture behind Mask R-CNN. Since there are two sub-problems, the architecture has two parts: for object detection it uses an architecture similar to Faster R-CNN, and for semantic segmentation it uses fully convolutional networks. So first off, what is an R-CNN?
An R-CNN is an approach to bounding-box object detection that produces a number of object regions, or regions of interest (ROIs). A later version, Faster R-CNN, does a better job by incorporating an attention mechanism using a region proposal network, an RPN. Faster R-CNN performs object detection in two stages. First, it determines the bounding boxes, and hence the regions of interest; this is done using the RPN, as I just mentioned. Second, for each ROI, it determines the class label of the object; this is done with ROI pooling. Mask R-CNN incorporates these tasks, but there is a problem of data loss in ROI pooling. ROI pooling applies pooling, usually max pooling, on a region of interest, the bounding box computed during object detection, hence the name ROI pool. In this method, the stride is quantized. Pooling is used to downsample features and to introduce invariance to minor distortions in the input; these minor distortions could be something as simple as a slight rotation of an image. So consider this five: even if it is rotated, our model should still recognize this image as a five, that is, as the same input. Pooling helps a model become invariant to such a rotation. The stride is the number of cells by which we move our sliding window during pooling or during convolution. If you want more information about pooling and the intuition behind stride, check out my video on convolutional neural networks. Now, coming back to ROI pooling: when I say that the stride is quantized, what do I mean?
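As a quick refresher on pooling and stride, here is a minimal NumPy sketch of max pooling; the function name and shapes are illustrative, not from any particular library:

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max-pool a 2-D array with a square window and an integer stride."""
    h, w = x.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            # Take the maximum over the current window.
            out[i, j] = x[r:r + size, c:c + size].max()
    return out

x = np.arange(16).reshape(4, 4)
print(max_pool2d(x))  # 2x2 output: [[ 5.  7.] [13. 15.]]
```

Each output cell keeps only the strongest activation in its window, which is what makes the result insensitive to small shifts and distortions of the input.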
Consider a region of interest of, say, 17×17 that we need to map to a 7×7 output. The required stride is 17 divided by 7, which is about 2.42. Since a stride of 2.42 is meaningless for a window that moves over whole cells, ROI pooling quantizes this value by rounding it down to 2, so it uses a stride of 2 along both the width and the height. However, in doing so it only considers the top-left 14×14 pixels of the 17×17 region; the remaining points are lost. Not only is there a loss of data, but this can also lead to misalignment. If we use an 18×18 input and map it to a 7×7 output, the required stride becomes about 2.57, which rounds to 3 in ROI pooling, so you can see that there's a misalignment when we perform pooling here. To address this problem, ROI Align is used, and no quantization takes place. In the case of the 17×17 input region, we keep the 2.42 stride as it is. However, a fractional stride no longer lands on exact cell centers, so each output cell is divided into a 2×2 set of bins, creating four sampling points: top left, top right, bottom left, and bottom right. Each of these points is sampled through bilinear interpolation, leading to four values per cell, and the final cell value is then computed as either the average or the maximum of the four sampled values. By addressing the loss and misalignment problems of ROI pooling, ROI Align leads to improved results. ROI Align is thus better than ROI pooling: it preserves pixel-to-pixel spatial alignment for every region of interest, and no information is lost because there is no quantization. Conceptually, Mask R-CNN is similar to Faster R-CNN, but Mask R-CNN additionally outputs an object mask using this pixel-to-pixel alignment. This mask is a binary mask output for each region of interest. Not much overhead is incurred in computing this mask, as it is done in parallel with bounding-box creation and classification. Consider a region of interest of m×m pixels, and let's assume that there are K possible objects that it could
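To make the contrast concrete, here is a small sketch, not the actual Mask R-CNN implementation, showing the floored stride of ROI pooling on the 17×17 example next to the fractional-coordinate bilinear sampling that ROI Align relies on:

```python
import math

# ROI pooling: quantize the stride, losing the tail of the region.
roi, out = 17, 7
stride = roi / out            # ~2.4286, the fractional stride
floored = math.floor(stride)  # ROI pooling rounds down to 2
covered = floored * out       # only 14 of the 17 rows/cols are seen
print(stride, floored, covered)  # -> 2.428..., 2, 14

# ROI Align: keep fractional coordinates and sample by bilinear interpolation.
def bilinear(img, y, x):
    """Sample a 2-D grid at fractional (y, x) via bilinear interpolation."""
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1 = min(y0 + 1, len(img) - 1)
    x1 = min(x0 + 1, len(img[0]) - 1)
    dy, dx = y - y0, x - x0
    top = img[y0][x0] * (1 - dx) + img[y0][x1] * dx
    bot = img[y1][x0] * (1 - dx) + img[y1][x1] * dx
    return top * (1 - dy) + bot * dy

grid = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
print(bilinear(grid, 1.5, 1.5))  # exactly between four cells -> 7.5
```

The point of the sketch is the arithmetic: flooring 2.42 to 2 throws away 3 of the 17 rows and columns, while bilinear interpolation lets ROI Align read off a value at any fractional position, so nothing is clipped or shifted.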
be. For example, if we were trying to categorize humans, dogs, and cats in an image, then K would be equal to 3. For each of the K classes, a binary m×m mask is constructed, analogous to a one-versus-rest approach, so the mask branch produces an output of size K·m² for each ROI; the loss is computed only on the mask for the ground-truth class. This is different from the typical approach of predicting a single mask over K classes, where the classes would compete within the mask. This lack of competition between classes is key to good performance in instance segmentation. A mask is predicted for each region of interest determined in the object detection phase. Now let's take a look at semantic segmentation with FCNs, fully convolutional networks, which are used to predict the mask for each ROI. So why are we using convolutional layers? Because convolutional layers retain spatial orientation, and such information is crucial for location-specific tasks like creating an object mask. You can see why the traditional use of fully connected layers won't work here: in fully connected layers, the spatial orientation of pixels with respect to each other is lost as they are flattened into a feature vector. In the Facebook AI Research paper, the COCO dataset is used. It's a large-scale dataset for object detection, segmentation, and captioning, with over 200,000 labeled images containing 1.5 million object instances. Mask R-CNN takes about one to two days to train on this dataset using an 8-GPU machine, and it achieves good results even for challenging images. Here's a comparison with the previous state of the art, the fully convolutional instance segmentation system, FCIS. FCIS is an alternate framework that also uses semantic segmentation and object detection to categorize, box, and mask objects in an image, and it does it fast. But FCIS exhibits systematic errors on overlapping instances and creates spurious edges, showing that it is challenged by the fundamental difficulties of segmenting instances. Here are some key things to remember. Instance segmentation is object detection combined with semantic segmentation. Mask R-CNN is an
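The "no competition between classes" idea can be sketched as follows. This is an illustrative NumPy version, assuming K classes and m×m masks; the function name and shapes are mine, not from the paper's code. Each class gets its own per-pixel sigmoid mask, and only the ground-truth class's mask contributes to the loss:

```python
import numpy as np

def mask_loss(logits, gt_mask, gt_class):
    """Binary cross-entropy on the ground-truth class's mask only.

    logits:   (K, m, m) raw mask predictions, one per class
    gt_mask:  (m, m) binary ground-truth mask for this ROI
    gt_class: index of this ROI's true class
    """
    z = logits[gt_class]              # the other K-1 masks are ignored
    p = 1.0 / (1.0 + np.exp(-z))      # per-pixel sigmoid, not a softmax
    eps = 1e-7                        # avoid log(0)
    bce = -(gt_mask * np.log(p + eps)
            + (1 - gt_mask) * np.log(1 - p + eps))
    return bce.mean()

K, m = 3, 4
rng = np.random.default_rng(0)
logits = rng.normal(size=(K, m, m))
gt = (rng.random((m, m)) > 0.5).astype(float)
print(mask_loss(logits, gt, gt_class=1))
```

Because each mask is scored with an independent sigmoid rather than a softmax across classes, a strong prediction for one class cannot suppress the masks of the others, which is the lack of competition described above.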
architecture to achieve instance segmentation. It combines Faster R-CNN with fully convolutional networks (FCNs). Mask R-CNN uses ROI Align, which preserves the spatial orientation of features and leads to no loss of information. And that's the new Mask R-CNN for instance segmentation. I'll leave a link to the main paper, their code, and links to other cool blog posts and papers in the description down below, so check those out too. Leave a like and comment down below with your thoughts on this new technology. Subscribe to the channel for more super duper content, and I will see you in the next one. See you!