Conceptually, Mask R-CNN is similar to Faster R-CNN, but it additionally outputs an object mask, predicted with pixel-to-pixel alignment. A binary mask is output for each region of interest (RoI). Computing it adds little overhead, because the mask branch runs in parallel with bounding-box regression and classification. Suppose the mask for each RoI has a resolution of m × m pixels and that there are k possible object classes. For example, if we were trying to distinguish humans, dogs and cats in an image, k would be 3. For each of the k classes, a separate m × m binary mask is predicted, analogous to a one-vs-rest approach. Hence the mask branch produces an output of size k·m² for each RoI.
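To make the shapes concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes k = 3 and m = 28 purely as illustrative values, uses random numbers in place of real network outputs, and the function name mask_loss is hypothetical. It follows the Mask R-CNN paper's description of applying a per-pixel sigmoid and averaging the binary cross-entropy over only the mask that belongs to the RoI's ground-truth class.

```python
import numpy as np


def mask_loss(mask_logits, gt_class, gt_mask):
    """Average binary cross-entropy on the ground-truth class's mask only.

    mask_logits: (k, m, m) raw scores, one m x m mask per class.
    gt_class:    integer in [0, k), the RoI's true class.
    gt_mask:     (m, m) array of 0/1 target values.
    """
    # Select the single mask for the true class; the other k - 1 masks
    # do not contribute to the loss.
    logits = mask_logits[gt_class]
    # Numerically stable sigmoid cross-entropy computed from logits:
    # max(x, 0) - x * z + log(1 + exp(-|x|))
    per_pixel = (
        np.maximum(logits, 0)
        - logits * gt_mask
        + np.log1p(np.exp(-np.abs(logits)))
    )
    return per_pixel.mean()


# Illustrative values (assumptions): 3 classes, 28 x 28 mask resolution.
K, m = 3, 28
rng = np.random.default_rng(0)

# Stand-in for the mask branch output of one RoI: k masks of m x m.
mask_logits = rng.normal(size=(K, m, m))
# Stand-in for the ground-truth binary mask of that RoI.
gt_mask = (rng.random((m, m)) > 0.5).astype(np.float64)

loss = mask_loss(mask_logits, gt_class=1, gt_mask=gt_mask)

print(mask_logits.size)  # k * m * m = 3 * 28 * 28 = 2352
print(loss)
```

Because each class gets its own mask and only the true class's mask is scored, the classes do not compete with one another inside the mask branch, which is what the one-vs-rest analogy is pointing at.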