The Vision Transformer (ViT) has recently become popular for computer vision tasks thanks to its ability to capture long-range dependencies. However, it is difficult to train when data is scarce, so the authors propose an improved variant called ViT-Patch. The model attaches a shared multi-layer perceptron (MLP) head to the output of every patch token, balancing feature learning between the class token and the patch tokens. In addition, an auxiliary task is introduced: the output of each patch token is used to predict whether that patch overlaps the tumor area. This extra supervision reduces the model's dependence on large datasets while still achieving good results. This article was authored by Hao Feng, Bo Yang, Jingwen Wang, and others.
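The shared per-patch MLP head and the auxiliary overlap-prediction loss described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the layer sizes, variable names, and the way the auxiliary loss is combined with the classification loss are all assumptions.

```python
import numpy as np

def shared_mlp_head(patch_tokens, W1, b1, W2, b2):
    """Apply one shared two-layer MLP to every patch token.

    patch_tokens: (num_patches, dim) output embeddings of the ViT encoder.
    Returns one overlap logit per patch, shape (num_patches,).
    """
    hidden = np.maximum(patch_tokens @ W1 + b1, 0.0)  # ReLU
    return (hidden @ W2 + b2).squeeze(-1)

def overlap_loss(logits, overlaps):
    """Binary cross-entropy between per-patch logits and 0/1 labels
    that indicate whether each patch overlaps the tumor area."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-12
    return -np.mean(overlaps * np.log(probs + eps)
                    + (1 - overlaps) * np.log(1 - probs + eps))

# Toy forward pass with assumed sizes (196 patches of a 14x14 grid).
rng = np.random.default_rng(0)
num_patches, dim, hidden_dim = 196, 64, 32
tokens = rng.normal(size=(num_patches, dim))
W1 = rng.normal(size=(dim, hidden_dim)) * 0.1   # shared across all patches
b1 = np.zeros(hidden_dim)
W2 = rng.normal(size=(hidden_dim, 1)) * 0.1
b2 = np.zeros(1)
labels = rng.integers(0, 2, size=num_patches)   # 1 = patch overlaps tumor

logits = shared_mlp_head(tokens, W1, b1, W2, b2)
aux_loss = overlap_loss(logits, labels)
# Training would minimize: classification loss on the class token
# plus a weighted auxiliary term, e.g. total = cls_loss + lam * aux_loss.
```

Because the MLP weights are shared across patches, the head adds only a small number of parameters while supervising every patch position, which is what allows the auxiliary signal to compensate for limited training data.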