Hello everyone. My name is Alen, and I'm going to present my capstone project and give a high-level overview. My capstone project was titled "Smaller models for 3D semantic segmentation using the Minkowski Engine and knowledge distillation methods," supervised by Eric Haruzinian. Today we are going to talk about knowledge distillation and 3D semantic segmentation, give an overview of the datasets and state-of-the-art models, present the results and their details, and discuss future work that could be done.

So what is knowledge distillation? Knowledge distillation is when we have a big model and want to transfer, or distill, its knowledge into a much smaller model. The term became popular with the 2015 paper "Distilling the Knowledge in a Neural Network" by Geoffrey Hinton and colleagues, and the basic idea is shown on the top left, where we have a teacher model and a student model. In the example on the right you can see one of the simplest methods in use. With feed-forward convolutional networks, besides the standard loss function on the model's output, we also apply a loss function between the outputs of the large, cumbersome model and the small model, so the student trains faster and inherits the general, non-overfit behavior of the teacher. Hinton also suggested the formula shown on the left: when T equals 1 it is the standard softmax, but we vary T. T is a temperature parameter, and it smooths the output distribution so the student can approach the teacher's result more gradually.

There are a lot of other knowledge distillation methods that were popularized later. The one on the left is "Paying More Attention to Attention"; the basic idea there is to transfer the teacher's attention maps as feature knowledge to a standard convolutional student network. On the right you can see further methods that use not one student model but N student models, with ensemble methods or a voting system.
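To make the temperature idea concrete, here is a minimal pure-Python sketch of the temperature-scaled softmax and the resulting distillation term. The function names are mine and this is illustrative code, not the project's training loop; the T-squared scaling follows Hinton's paper, which keeps the gradient magnitude comparable across temperatures.

```python
import math

def softmax_T(logits, T=1.0):
    """Temperature-scaled softmax: T=1 is the standard softmax,
    larger T yields a smoother ("softer") distribution."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between the teacher's softened distribution and the
    student's softened predictions, scaled by T^2 as in Hinton et al."""
    p = softmax_T(teacher_logits, T)   # soft targets from the teacher
    q = softmax_T(student_logits, T)   # student's softened predictions
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q)) * T * T

teacher = [6.0, 2.0, 1.0]
student = [4.0, 3.0, 1.5]
print(softmax_T(teacher, T=1.0))  # sharply peaked distribution
print(softmax_T(teacher, T=4.0))  # much smoother distribution
print(distillation_loss(student, teacher))
```

At T = 1 this reduces to the ordinary softmax; raising T flattens the teacher's distribution, which is exactly the "softer targets" effect described above.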
They use the combined knowledge of the N student models to train a much better student model. There is also an example from another paper where they use multiple teacher models with different approaches. The left one is the approach used in the Born-Again Networks paper. The second one also uses feature-map similarities between the student and teacher models. There is also an augmentation technique that generates sub-teacher networks, which is used. And there is a method that trains different teachers by learning a distribution over parameters rather than the exact parameters; the bottom-left one is about that, and so on.

What is 3D semantic segmentation? In 3D semantic segmentation we have a 3D scene and we want to do semantic segmentation of every instance in it. This is an example from an indoor dataset where we have a table and chairs, and each chair is colored differently because it is a different instance of the same class.

The simplest 3D data representation is the point cloud, shown on the left. We represent the scene as separate points, each with three-dimensional X, Y, Z coordinates and, for color, RGB values as well, so it is basically a six-dimensional representation. The second one is the voxel representation: we take the upper and lower bounds of the 3D space and divide it into voxels, for example of size 5 centimeters, and accumulate all the points. There are different optimization algorithms for assigning points to specific cells of the 3D voxel grid, and at the end we have a three-dimensional matrix where each cell records whether any points exist there and what color they are. On the right is the mesh representation. We don't actually use it in this study, but it is very useful in industry, and there are a lot of optimization techniques tied to the mesh representation, so eventually we would want to reach that point.

The simplest 3D semantic segmentation algorithms are PointNet and PointNet++.
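The voxelization step described above can be sketched as follows. This assumes simple floor quantization and per-cell color averaging; real pipelines use optimized routines, but the idea is the same.

```python
from collections import defaultdict

def voxelize(points, voxel_size=0.05):
    """Quantize a point cloud of (x, y, z, r, g, b) tuples into a sparse
    voxel grid: each occupied cell keeps the mean color of its points."""
    cells = defaultdict(list)
    for x, y, z, r, g, b in points:
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        cells[key].append((r, g, b))
    # Average each color channel over the points that landed in the cell.
    return {key: tuple(sum(ch) / len(colors) for ch in zip(*colors))
            for key, colors in cells.items()}

points = [
    (0.01, 0.02, 0.03, 255, 0, 0),   # these two fall into voxel (0, 0, 0)
    (0.04, 0.01, 0.02, 0, 0, 255),
    (0.26, 0.00, 0.00, 0, 255, 0),   # this one into voxel (5, 0, 0)
]
grid = voxelize(points, voxel_size=0.05)
print(grid[(0, 0, 0)])  # mean color (127.5, 0.0, 127.5)
```

Only occupied cells are stored, which already hints at why sparse representations matter at small voxel sizes.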
They are relatively simple networks that operate directly on point cloud data, but they introduce concepts that are still used. For example, how do you handle the rotation of a 3D object? They suggest the T-Net block: a separate small network branch that learns the object's 3D rotation and applies it back through matrix multiplication, so the network becomes robust to orientation. They use a similar trick for the order of the 3D points: when we change the order of the points, the 3D scene stays the same but the input array is different, so the network also has to be invariant to point permutations.

The next one is PointCNN. It takes the 3D point cloud and progressively pools it down to a much smaller number of points. Here you can see that the number of points gets lower and lower: they use convolutions to extract details and pooling to reduce the point count. But these are older methods, and they now score 20-40% lower in accuracy than the state-of-the-art methods.

The one on the left is the first method that applied transformers, which my colleague talked about recently. They use transformer methods on the 3D representation, but based on the point cloud data: they apply multi-head attention on the 3D point cloud, then a classification network, and then segmentation based on standard algorithms. On the right is one of the state-of-the-art methods. Actually, there is a slight error on the slide: it is not based only on the point cloud but also on the voxel representation. They use a sparse convolutional feature network based on the Minkowski Engine, which I will talk about later.
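The permutation-invariance idea can be illustrated with a toy symmetric function. `point_feature` below is a hand-made stand-in for PointNet's learned shared MLP; the only point being made is that an element-wise max over per-point features is order-independent, so shuffling the points leaves the global feature unchanged.

```python
import random

def point_feature(p):
    """Stand-in for PointNet's shared per-point MLP: maps one (x, y, z)
    point to a small feature vector (here a fixed hand-made mapping)."""
    x, y, z = p
    return (x + y + z, x * y, max(x, y, z), z - x)

def global_feature(points):
    """Symmetric aggregation: element-wise max over per-point features.
    Because max ignores ordering, the result is invariant to any
    permutation of the input points."""
    feats = [point_feature(p) for p in points]
    return tuple(max(col) for col in zip(*feats))

cloud = [(0.1, 0.2, 0.3), (0.9, 0.1, 0.4), (0.5, 0.5, 0.0)]
shuffled = cloud[:]
random.shuffle(shuffled)
print(global_feature(cloud) == global_feature(shuffled))  # True
```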
On the right, they extract information from each feature map, similar to what is done in various vision transformer methods: by combining the information extracted from each feature map with a transformer, they build a deeper understanding of the data. The corresponding blocks are shown on the right in blue.

The next one is Swin3D. This is the state of the art on the validation set of ScanNet, which I will talk about later. The basic idea is to take the Swin Transformer approach used in standard computer vision tasks and apply it to 3D data. In the standard Swin Transformer, they apply transformer blocks by dividing the image, or grid, into patches, for example 2x2 or 16x16 patches, and by combining all of them with a transformer they extract as much information as they can. Swin3D uses the same method but with different voxel sizes. The figure is a 2D illustration, but the data is 3D: at the bottom a window covers only 2x2 cells, while at the top a window spans the whole grid at the smallest voxel size, and the levels combine their knowledge. On the right you can see the two blocks they use: the left one is the regular window block and the right one is the shifted-window block.
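The window partition and its shifted variant can be sketched on a small 2D grid (the 3D case just adds a coordinate). `partition_windows` and its `shift` parameter are illustrative names, not Swin's actual implementation.

```python
def partition_windows(grid_size, window, shift=0):
    """Assign each cell of a grid_size x grid_size grid to a window id.
    shift=0 gives the regular partition; a nonzero shift reproduces the
    shifted-window step, so cells near a former window border now share
    a window and can exchange information in the next attention block."""
    return {(i, j): ((i + shift) // window, (j + shift) // window)
            for i in range(grid_size) for j in range(grid_size)}

regular = partition_windows(4, 2)           # four 2x2 windows
shifted = partition_windows(4, 2, shift=1)  # borders move by one cell
print(regular[(0, 0)], regular[(0, 1)])     # same window
print(shifted[(0, 1)], shifted[(0, 2)])     # same window after the shift
```

Cells (0, 1) and (0, 2) sit in different regular windows, but after the shift they share one, which is how information crosses window borders between consecutive blocks.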
this sliding-window idea is also the core idea behind convolutional networks; in the transformer they use the same principle, shifting the windows to get different patches with different offsets.

Finally, this is the Minkowski Engine, based on the paper on 4D spatio-temporal ConvNets. The basic idea is that when we use as small a voxel size as we can, the 3D grid becomes mostly empty: at 2.5 centimeters, 98% of the 3D space is empty. If we use standard ConvNets, for example dense 3D convolutions, we won't get far, because the data is sparse. So the Minkowski Engine proposes sparse convolutional networks, where instead of standard kernels they use generalized kernels with a limited number of points in 3D space, and not only in 3D. They show that with sparse convolutional networks they can achieve much higher results with much simpler models. Their original benchmark was four-dimensional data, 3D videos of 3D scans, and they use a U-Net-like network built from their proposed sparse convolutions. Here on the left we have a point cloud and on the right the semantically segmented data.

The next paper is built on top of the Minkowski Engine and is called Mix3D. The main idea is to take two different scans, combine them with different rotations, and get new training data out of it. This is basically an augmentation technique, and it made the model the best on the ScanNet test set.

The dataset we used is ScanNet v2, with 1,514 scans in total and 20 labels. There are also a lot of other datasets, for example nuScenes, an outdoor LiDAR scan dataset, and S3DIS, while ScanNet consists of indoor scans.

What method did we apply? We took the Minkowski Engine with the Mix3D architecture and applied different teacher methods, using the Mix3D augmentation and the standard Minkowski Engine. We also trained two different student models: one with each layer's channel count reduced by a factor of 2, and the
second one reduced by a factor of 4, which in the end gives models with 4 and 16 times fewer parameters. We used the same U-Net-like structure with sparse tensors and applied the Res16UNet34C architecture that was also suggested by Mix3D.

The first method we applied is the standard output distillation for 3D semantic segmentation: we passed the input through both the student and the teacher and applied a loss function on the last layer with the suggested temperature parameter T. The second method is a decoder loss: we took the last feature map of the student, upsampled it, and applied a loss function between this block and the corresponding one of the teacher. We also used the same idea for an encoder loss.

As for the results: we used a 5-centimeter voxel size. The state of the art uses 2 centimeters, but with limited resources we could only afford 5 centimeters. We replicated the Mix3D result and got 69 percent; with the half-size model we got a 2.6 percent lower result, and with the quarter-size model almost 8 percent lower. This table shows the accuracy we achieved per class, and it is interesting to note that on some classes our model performs better than the replicated state-of-the-art result. The replicated result is also shown next to the result from their paper. Training with the encoder and decoder loss functions also became much more stable; as far as we understand, this is because convergence becomes much faster and we do not let the model jump around with large gradients, restricting it to stay close to the states the teacher model passed through.

For future work there are a lot of things to do. The first is to apply this same method and study its performance in large-scale settings. We showed multi-teacher methods, and there are also methods that use teachers with different architectures and end up with a smaller network that carries the knowledge of several different architectures.
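The decoder-loss computation described above can be sketched in one dimension, assuming nearest-neighbour upsampling of the student's coarser feature map followed by a pairwise MSE; the shapes and values here are purely illustrative.

```python
def upsample_nearest(features, factor):
    """Nearest-neighbour upsampling of a 1-D feature sequence
    (a stand-in for upsampling a 3-D feature map)."""
    return [f for f in features for _ in range(factor)]

def mse(a, b):
    """Pairwise mean-squared error between same-shape tensors."""
    assert len(a) == len(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

student_map = [1.0, 3.0]            # student's coarser decoder feature map
teacher_map = [1.0, 1.0, 3.0, 2.0]  # teacher's feature map, 2x resolution
up = upsample_nearest(student_map, 2)   # -> [1.0, 1.0, 3.0, 3.0]
print(mse(up, teacher_map))             # decoder distillation loss term: 0.25
```

Upsampling first is what makes the two feature maps the same shape, so a simple pairwise loss can be applied between them.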
Also, additional loss functions could be explored: when we use different methods there can be transformer-based methods, convolution-based methods, or other approaches, and we can design different solutions to distill the knowledge from the teacher networks even better. There are also a lot of architectures we didn't apply here, first because they were heavy, and second because the Minkowski Engine is currently tied to specific CUDA kernels and specific versions in its implementation, so a lot of engineering would be needed to run transformer-based methods and Minkowski Engine-based methods in the same system and distill much smaller models from there. Thank you, that was it.

Q: Thanks for your interesting talk. Could you give me a little information about the loss function you currently used?

A: Yes, it is a standard MSE: with upsampling we end up with same-shape tensors for the teacher and student networks, and we apply the MSE loss pairwise. That's it.

Q: Thank you. I was thinking of maybe using something like an SSIM loss function.

A: Yes, it could be used, but by our estimation MSE was enough, and the papers that use similar approaches also use MSE, so we didn't go deeper in that direction. An alpha parameter could also be used; we had one, but in our experiments we set it to 1. An alpha weighting between these L1, L2, L3 loss terms could be tuned to get, in the end, a model closer to what we actually want.

Q: Okay, thank you. Can you say anything about the Minkowski Engine, how it creates its tensors?

A: It uses non-standard kernel functions. With standard kernels, for example a 3-by-3 kernel in 2D, we always have nine parameters. In the Minkowski Engine that is not always the case: for example, there could be a kernel that has only three weights on one side and another three weights elsewhere. There are also studies on standard 2D networks that use such non-standard kernels that are not full shape,
and with these kernels they achieve the same results with much fewer weights. The Minkowski Engine does the same for n-dimensional convolutions: it is not only 3D, it is n-dimensional.
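The generalized-kernel idea from this answer can be sketched as a toy scalar sparse convolution: features live only at occupied coordinates, and the kernel is a small set of offsets rather than a full 3x3x3 block. This is an illustration of the concept only, not the MinkowskiEngine API.

```python
def sparse_conv3d(features, weights, bias=0.0):
    """Toy sparse 3D convolution: `features` maps occupied voxel
    coordinates to scalar features, `weights` maps kernel offsets to
    scalar weights.  Work happens only at occupied sites, so the
    mostly-empty space costs nothing."""
    out = {}
    for coord in features:
        acc = bias
        for offset, w in weights.items():
            neighbor = tuple(c + o for c, o in zip(coord, offset))
            if neighbor in features:      # skip empty voxels entirely
                acc += w * features[neighbor]
        out[coord] = acc
    return out

# A generalized kernel with a limited number of points: the center plus
# its 6 face neighbors instead of a full 3x3x3 = 27-point kernel.
weights = {(0, 0, 0): 1.0}
for axis in range(3):
    for sign in (-1, 1):
        offset = [0, 0, 0]
        offset[axis] = sign
        weights[tuple(offset)] = 0.1

features = {(0, 0, 0): 1.0, (1, 0, 0): 2.0, (5, 5, 5): 3.0}
print(sparse_conv3d(features, weights))
```

The isolated voxel at (5, 5, 5) has no occupied neighbors, so it only receives the center weight, while the two adjacent voxels exchange information through the face-neighbor offsets.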