Okay, thank you so much. So good morning, esteemed attendees. It is truly an honor to present in front of such a distinguished audience today. My name is Razan and I'm delighted to share with you insights from our project titled DeepBlock: Deep Learning Based Pathology Localization and Classification in Wrist X-ray Images.

In fact, bone diseases are a common and serious problem. According to statistics, there are around 1.5 million cases of fractures due to bone diseases. Also, it was found that fractures of the wrist have the highest frequency of misdiagnosis, accounting for 32% of all cases. It's important to note that the manual analysis of X-ray images is time-consuming even for experienced doctors. Also, bad-quality images may hide some important details for the diagnosis. Furthermore, the experience of the radiologist is a key factor in the error probability. All in all, the problem can be stated as: the detection of wrist bone diseases is manual and time-consuming, with a high probability of errors, and that's why there is a strong motivation to automate this process. So our goal in this project is to increase the speed and the efficiency of the diagnostic procedure. As we can see here, this will be the input image for our model. It would take several minutes for doctors to evaluate the image, while it takes less than a second for our model to evaluate it and draw these bounding boxes around the injured areas.

The objectives of this project were to, first of all, apply the YOLOv7 architecture to the modified dataset that I will explain later, then try to enhance the results by adding an attention mechanism, then try to add a transformer, a vision transformer, and finally to combine all of these techniques together, which are the attention mechanism with a Swin Transformer with YOLOv7, and finally fine-tune the final architecture.

So before I start talking about the proposed approach, let me first introduce what the Swin Transformer is. It stands for Shifted Windows, and it is basically a vision transformer variant, but the idea here is that it processes the image in a hierarchical way. Like any vision transformer, it still relies on patches, but instead of choosing one patch size and sticking with it through all the layers, the Swin Transformer starts with a small patch size and merges patches into bigger ones as we go deeper into the transformer layers. And here you can see that each of these transformer blocks consists of the shifted-window multi-head self-attention module followed by a two-layer multilayer perceptron, and we also have layer normalization before each of them, with residual connections.

Talking about attention mechanisms, there are actually different attention mechanisms that were suggested, either what's called channel attention or spatial attention, in a way that suppresses the less important channels. There is one study called CBAM, the convolutional block attention module, which considered both spatial and channel attention together, sequentially. But the problem with this approach is that it ignores the channel-spatial interactions. I mean that sometimes we have common, cross-dimension information, and in CBAM this cross-dimension information is lost.
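For reference, here is a minimal PyTorch sketch of this kind of sequential channel-then-spatial (CBAM-style) attention; the reduction ratio, kernel size and tensor shapes are illustrative, and this is not the authors' code.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze the spatial dimensions and re-weight channels (CBAM channel part)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling
        return x * torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    """Squeeze the channel dimension and re-weight spatial positions (CBAM spatial part)."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, applied sequentially."""
    def __init__(self, channels):
        super().__init__()
        self.channel = ChannelAttention(channels)
        self.spatial = SpatialAttention()

    def forward(self, x):
        return self.spatial(self.channel(x))

# y = CBAM(256)(torch.randn(1, 256, 40, 40))  # output keeps the input shape
```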
So to amplify these cross-dimension interactions, GAM, the global attention mechanism, was proposed, which is capable of capturing significant features across all three dimensions. As you can see here, before going into the channel attention we start with a permutation; it is a 3D permutation to preserve the information across the three dimensions, and we also have a two-layer multilayer perceptron to amplify this cross-dimension information. Then we have the spatial sub-module, which is two convolution layers for spatial information fusion.

Now we can talk about the proposed approach. As we can see here, YOLOv7 was used as the baseline. YOLOv7 is made up of three main components: the head, the neck and the backbone; sometimes the head and the neck are merged together. The backbone is where the convolution layers detect the key features of the image for later processing. The main convolution block here is called ELAN, the efficient layer aggregation network, which uses group convolution to increase the cardinality of the added features and combines the features of different groups in a shuffle-and-merge-cardinality manner. This way of operating can enhance the features learned by different feature maps and improve the use of parameters and calculations. Then we have the neck: it takes these features from the backbone into the fully connected layers in the head, which finally predict the bounding box coordinates and the classification probabilities. The final output layer, the head, makes predictions, as we can see, at three different levels. The key thing here is that this improves the model's ability to detect small, medium and large objects. If you notice these red boxes, you can see that we suggested inserting the Swin Transformer and GAM attention before each of the detection heads. And of course, now you are wondering why exactly in this position, so let me first describe the dataset before answering this question.

We use a pediatric wrist trauma X-ray dataset which has nine classes, and here you can find some samples for the different classes. It has around 20,000 images with doctors' annotations, which are the bounding boxes and the diseases; the classes are the diseases. But after discussion with doctors in hospitals, we decided to modify this dataset to get only four classes. We kept the fracture and periosteal reaction classes as they are, and we merged two classes, foreign body and metal, into one class we called foreign body. All the other classes we grouped as bone lesions, which includes soft tissue, pronator sign and other findings. And we also deleted a class that is not important for the doctors, which is the text class, the letter R or L for the right or the left hand. After this modification, we evaluated the performance on this modified dataset. It was divided into training, validation and testing sets, 70%, 20% and 10% respectively, and the results are shown in the table. The average precision metric is used to evaluate and compare the different models. Before starting all of the experiments, we first had to find out what the best pre-processing and augmentation technique is.
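Since all the comparisons that follow are reported in terms of (mean) average precision over predicted boxes, here is a minimal usage sketch of how that metric is commonly computed; it assumes the torchmetrics library (with pycocotools installed) and is not necessarily the tooling the authors used.

```python
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

metric = MeanAveragePrecision()   # COCO-style mAP over a range of IoU thresholds

# One image: a single predicted box versus a single ground-truth box (xyxy, pixels).
preds = [dict(
    boxes=torch.tensor([[100.0, 120.0, 200.0, 260.0]]),
    scores=torch.tensor([0.87]),
    labels=torch.tensor([0]),     # e.g. 0 = fracture
)]
targets = [dict(
    boxes=torch.tensor([[95.0, 115.0, 210.0, 255.0]]),
    labels=torch.tensor([0]),
)]

metric.update(preds, targets)
print(metric.compute()["map_50"])  # AP at IoU threshold 0.5
```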
We tried different techniques like unsharp masking with different filters, the CLAHE filter, and also a technique we called Mix, which involves resizing the original dataset to 640 by 640 and applying mosaic, mixup, rotation and horizontal flipping. And in terms of the mean average precision, you can see that it has the best result among all the other pre-processing techniques. Also, we tried to find the best place to insert the GAM attention. We tried it before each of the heads separately and before all of the heads, and we compared the results with CBAM, because theoretically we say that GAM attention is better than CBAM, but we have to see this in experiments. And actually, yes, it was better in terms of the mean average precision, but if we compare placing it before all the heads with placing it before each one of the heads, we can see that there is only a small difference; it is not a very big difference.

Here you can find a table that compares the different models. We tried YOLO itself, we tried it with CBAM attention and with GAM attention, we tried to insert only the Swin Transformer before the heads, and we also tried inserting the Swin Transformer even in the backbone. And finally, we tried to mix all of these techniques together: YOLO with Swin and CBAM, and YOLO with Swin and GAM, which we call DeepBlock because it is the winning model. You can see, in terms of the average precision of each of the classes, that it has the best results, and even for the mean average precision it reaches 0.65. We also compared the different models in terms of the number of parameters, GFLOPs, layers and inference time. Of course, the more layers and modules we add to this architecture, the more the number of parameters and layers increases. Even the inference time increased a little bit, but it is still acceptable.

Here you can find a comparison between YOLO itself and our modified, winning model DeepBlock in terms of the precision-recall curve. Specifically, looking at the curve related to the bone lesion class, you can find a real enhancement using our suggested idea, and also in terms of the mean over all of the curves there is an enhancement. For the ablation study, we tried to find the effect of rotating the images by different angles. For each of the classes, for the angles of 5, 15 and 30 degrees, the results are somewhat close to ours, but using 60-degree rotations results in a notably inferior mean average precision score, and this is due to the increased degree of distortion and deformation introduced to the image data. And in this table you can find the detailed precision and recall results of the winning model for all of the classes, each one separately.

Now let's look at real images to see the effect of our winning model. Using YOLO itself on this image, you can see that it wrongly predicts the edges of this image as foreign bodies, while this problem does not exist in our modified version: it correctly predicted that there is a fracture. Let me mention here that the blue bounding boxes are the ground truth and the other colors are predicted by the models.
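As an aside, the agreement between such predicted and ground-truth boxes is exactly what the IoU-based metrics above build on; a minimal sketch using torchvision, with made-up coordinates:

```python
import torch
from torchvision.ops import box_iou

# A ground-truth box and a predicted box, both as (x1, y1, x2, y2) in pixels.
gt   = torch.tensor([[ 90.0, 140.0, 210.0, 260.0]])
pred = torch.tensor([[100.0, 150.0, 205.0, 270.0]])

print(box_iou(pred, gt))  # pairwise IoU matrix; here a single value around 0.75
```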
In another example here, using YOLO, it was able to predict only one of the two instances of the periosteal reaction class that you see here, while our modified version was able to predict both of them correctly. And here is a very important example, because YOLO was not able to predict any of the ground-truth boxes of the bone lesion class, while using the winning model, you can see that even though it is only one box, it somehow covers all of these ground-truth bounding boxes together. We also tried to test our model on some external data from the hospital. Actually, the doctors were interested in the fracture class only. You can see here that the model was able to correctly predict that there is a fracture, even though it was somewhat covered by this metal object, and exactly the same here, with very good accuracy. This is also a very important example, because even experienced doctors were not able to find that there is a fracture here, since this is an example of a bad-quality image, while the model was able to predict that there is a fracture here and here.

As a conclusion, I can say that applying YOLO itself enhanced the state-of-the-art results by 6%. When we add the GAM attention mechanism with the Swin Transformer, we have another positive impact on the detection results, by 8%. We found that the position of the attention mechanism had a relatively small impact on the result, and it is worth noting that we still have a limited amount of bone lesion class images, and this negatively affected the average precision result; of course, adding new images to the dataset would definitely improve the results. For future work, I can say that expanding the dataset and exploring more transformers or attention mechanisms are important areas for further experiments. And thank you so much.

Colleagues, do we have questions? Okay, we have questions. Okay, start with you, Gini. Thank you for the talk. I have two particular questions. The first question is: unfortunately, YOLOv7 has a research-only license, so no hospital could use it for free. Could you please tell me whether replacing it with some alternatives or earlier versions would affect the result a lot, or would it be more or less the same? Could you please comment on that? Yeah, thanks for the question. Actually, I did not know that its use is limited. But I think, yes, if you use alternatives, of course it will affect the result, because the baseline we used is YOLOv7. But actually I also tried using YOLOv5 and even YOLOv8, but YOLOv7 had the best results on the dataset. Can you comment on what the difference is between v5 and v7? Because v5, at least for now, is more or less free. Yeah, I think the main difference is the use of ELAN, and also, in the detection heads, the sizes of the anchor boxes used to find the small, medium and large objects in the image itself. Okay, thank you. A second small question: image augmentation really helps in computer vision, with different approaches like RandAugment and so on, and my question is, are there specific augmentations that can be used for X-ray images? Thank you. Yeah, thanks for the question. Yes, actually I used something called the mosaic augmentation technique. You know that in an X-ray image the object is often in the center of the image, but when you use this mosaic technique, you combine many images together into the same image.
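For readers curious about the mechanics, here is a toy sketch of the mosaic idea: a fixed 2x2 grid with a crude nearest-neighbour resize, whereas real YOLO-style mosaic augmentation also randomizes the mosaic centre and scales. Boxes are assumed to be in (x1, y1, x2, y2) pixel format; this is an illustration, not the actual implementation used.

```python
import numpy as np

def mosaic_2x2(images, boxes_per_image, out_size=640):
    """Paste four images into the quadrants of one canvas and shift their
    (x1, y1, x2, y2) boxes accordingly; a bare-bones mosaic illustration."""
    half = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    all_boxes = []
    offsets = [(0, 0), (half, 0), (0, half), (half, half)]     # (x_off, y_off)
    for img, boxes, (ox, oy) in zip(images, boxes_per_image, offsets):
        h, w = img.shape[:2]
        rows = (np.arange(half) * h) // half                    # nearest-neighbour resize
        cols = (np.arange(half) * w) // half
        canvas[oy:oy + half, ox:ox + half] = img[rows][:, cols]
        sx, sy = half / w, half / h
        for x1, y1, x2, y2 in boxes:
            all_boxes.append([x1 * sx + ox, y1 * sy + oy,
                              x2 * sx + ox, y2 * sy + oy])
    return canvas, np.array(all_boxes)
```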
So in this way, you have objects in different areas of the image itself. Thank you, one more question. Thank you for your talk, and a simple question: which framework did you use to build your models? Do you mean PyTorch? Yeah, PyTorch, or something else? Yeah, PyTorch, PyTorch. Okay, and now a simple question, maybe I was a bit too late: just to clarify, what is the number of samples in the training data? We have 20,000 images, and 70% of them are the training data. So 13,000? 13,000... no, 20,000 in total. Okay, thank you, thank you. Okay, colleagues, if anyone has further questions, the authors of this paper are here, so you can ask later. And let us thank Razan again. Thanks a lot for a great talk.

And now we proceed. Our next talk is on greedy algorithms for fast finding of curvilinear symmetry of binary raster images. Okay, so Nikita will present. Hello, everyone. My name is Nikita Lomov, and in this report I represent Tula State University, and the title of my report is Greedy Algorithm for Fast Finding of Curvilinear Symmetry of Binary Raster Images.

Let me start with the field of our activity. It is the analysis of symmetry of planar forms, and it is intuitively clear that a planar shape is symmetric when we can cut it into two halves and they will be equal, but the problem is that when we take a photo of some object from real life, there will be noise, there will be deviations caused by the point of view or some pose, some occlusions. So usually we cannot really talk about pure symmetry in real-life images; we can talk about quasi-symmetry or approximate symmetry. Here we can see that all the subjects that produce these binary silhouettes are intrinsically symmetric, but sometimes their shape is affected by the pose; for example, for this alligator, its pose can be expressed as some sort of curved line, and we need a straightening procedure and an algorithm for finding this curved line to conclude that our initial object was really symmetric.

In fact, the task of assessing the symmetry of planar objects is well studied. There are bunches of approaches, and they can be divided into two main parts. One part is contour-oriented and the second one is interior-oriented. In the first class of approaches there is some procedure for comparing parts of the contour, for example assessing the Fourier transform or maybe finding some critical points, and the problem with these approaches is that the estimate, the number that expresses the quality of our symmetry, is hardly interpretable: it is just some distance in Fourier space or some discrepancy in critical points, so we need maybe something more accessible and more intuitively clear. The other class of approaches is interior-oriented; our last research, for example, was about this class, and there are, for example, mechanisms for calculating the best transform according to Jaccard measures. We just reflect our image and overlay the two; in fact, it is just the intersection-over-union measure well known, for example, in segmentation by neural networks. We can at first analyze the projections of the Radon transform, and of course for a symmetric shape these projections will be symmetric as functions too. And the more advanced approach is connected to arbitrary affine transforms of our image.
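To make the Jaccard symmetry measure mentioned here concrete, a minimal sketch for a binary mask reflected about the vertical axis through the image centre; the toy mask is made up.

```python
import numpy as np

def jaccard_symmetry(mask: np.ndarray) -> float:
    """Intersection over union between a binary mask and its reflection about the
    vertical axis through the image centre; 1.0 means perfect mirror symmetry."""
    mirrored = mask[:, ::-1]
    inter = np.logical_and(mask, mirrored).sum()
    union = np.logical_or(mask, mirrored).sum()
    return inter / union if union else 1.0

mask = np.zeros((64, 64), dtype=bool)
mask[20:44, 16:48] = True                  # a centred rectangle: perfectly symmetric
print(jaccard_symmetry(mask))              # 1.0
mask[20:44, 48:56] = True                  # add an asymmetric bump on the right
print(round(jaccard_symmetry(mask), 3))    # drops to about 0.667
```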
So we can find the optimal solution for the Jaccard measure and the optimal parameters for straightening and symmetrizing our image over all possible affine transformations. For this part, I can conclude that it is all about the parametrization of our transform. In fact, the real task is how to find a transform that makes our image symmetric, say, about the vertical axis. This is clear for strict symmetry: we just need to rotate and shift our image. But for curvilinear symmetry, the task is much more complicated, because we can of course express our curved line by splines or by some points via a Bézier curve and so on, but it is too hard to optimize. The task of searching for curvilinear symmetry is very well established in bioinformatics, for example in the analysis of worms or fishes and so on. Sometimes there are different representations of the images because of the nature of the photo-shooting procedure; sometimes we cannot, for example, obtain the whole outline of our object, we can only obtain some point cloud, and we need to approximate our curved line through this point cloud.

Our past approach to this task was dedicated to dynamic programming. We can express our curved line just as a sequence of steps with a small angle, and we can compute the optimal solution for this, but we have two problems. The first one is that this approach is too time-consuming, because in dynamic programming we should in fact check all possible lines and then see whether we can stitch parts of them within the dynamic programming paradigm; it is still too tedious and time-consuming. The other problem is that we need to forbid cycles in our trajectories, and this imposes too hard restrictions on our possible curves. So we need to improve it somehow to reach a better speed and better flexibility of our line.

What is the idea of straightening? We just draw some perpendiculars evenly distributed along our curve, and we fill our straightened image row by row along these perpendiculars. The approach to straightening will in fact be the same, but the approach to constructing the line will be completely different: it will be constructed with a greedy approach. So it is the same sequence of steps, but of course we trace only one line, maybe with some range of possible directions at every step. As I said, we use the Jaccard measure as the measure of our symmetry, and we can apply the Jaccard measure not to the whole image but to a line: we have some segment and some point on this line, and we can compare the left side and the right side of this segment, compare the lengths, and this gives a partial Jaccard measure along this perpendicular; from these partial measures the total measure is collected.

So in fact our approach is simple. We start from some point, user-predefined or maybe extracted from the skeleton or the boundary of our object, and we take some steps, and at every step we can adjust the direction using our Jaccard measure, with some penalty for deviation from the past direction. We trace our line with an adaptive choice of step and an adaptive choice of the possible range of angles, until we leave the interior of our object, that is, until we go out of the image itself or out of the black part. There are a lot of formulas, and it is maybe too time-consuming to explain everything, but maybe this one is the most interesting. It is just the measure for choosing the proper angle, and it contains three terms. The first is a proportional Jaccard measure.
It is just a local measure of symmetry along our future perpendicular. The second term is the square root of a constant alpha, which penalizes rotations from the past direction: if the direction stays the same, i.e. the rotation angle is zero, it simply collapses to one. And in the denominator we have some factors: the length of our perpendicular, or rather the portion of this width line that lies inside the figure. It means that our method prefers stepping into rather narrow parts of our shape, because we faced the problem that the line was otherwise drawn towards the wide parts of the image, whereas our aim was to trace the shape along its elongated lines.

The method starts by choosing one principal direction out of eight evenly distributed ones, and as a measure of a direction we just estimate the distance to the first boundary point in that direction: when the boundary is far in this direction, the direction is considered to be better. Another parameter of our algorithm is the step, and there is a procedure for the adaptive choice of the step: it is just the square root of the distance to the boundary. So when the distance to the boundary is short, our algorithm becomes more accurate and less risky, so to say, and when the boundary is far, we can make much longer steps; that is the idea. Another parameter tied to this distance is the range of possible angles from which we can choose our direction. The idea is similar: when the distance to the boundary is far, we can make larger steps, while in a narrow and very curved part this distance will be small and the range of angles will be much greater, because there we need much more freedom in our curvature.

And of course, after applying the straightening procedure, we can compute the classical Jaccard measure between the straightened and reflected image. For example, here it is 0.777, so it is pretty good, but still not quite enough to consider this image fully symmetric. But now we have some advanced procedures to align the other parts as well, not just the main body but all the parts of our lizard together.

The main result is that our algorithm is successfully applied to a wide variety of images, and it has a very good speed of dozens of milliseconds; this is the main stage of time consumption of our algorithm, because the straightening itself is much faster. There are successful images, but sometimes we can face problems, because as our algorithm is greedy, it cannot see the future. Sometimes it chooses the wrong direction and, by its nature, has to follow it to the end. Sometimes we need to find some apex of our shape and end our curved line at that point, but before we reach it, we do not know about its existence; here is an example of such a problematic image.

So, yes, we successfully solved the task of searching for curvilinear symmetry for elongated and curved binary shapes, and we have some prospects for future improvement, for example to align more subtle parts of the image, to prevent cycles, and so on. Thank you very much for your attention; I am ready to answer your questions. Colleagues, any questions? My question is from a different angle. Can you show the straightening, yes, of the shape straightening? Yes, this question arises for me: all objects are by nature fractal. Could you do that?
Could you define the fractality of the shape, not doing such a computation with your algorithm but defining its fractality, which would make it easier for you to understand? Because fractals and fractality are very closely connected to symmetrization and symmetry. You could do that, for example, with another algorithm, to make it easier to take this curved object and make it straight, for example by shape straightening, for example for this object. Could you do that, or is it just a reflection? That is the question for you. It is a very broad question, and for me, yes, it seems that other approaches and other techniques would be more suitable for this search for fractality. For example, I think of graph representations via a skeleton. The skeleton is just the medial axis, so it is some sort of thinning, the result of thinning our form, and it will be a graph, and it seems that we can analyze the self-similarity and the fractal parts with this skeletal graph, maybe, but not via this rather restricted and targeted procedure. Any more questions, colleagues? Yeah, I will also ask one question. Probably I was not listening that well in the beginning, but what is the ultimate goal? So you achieved that, you understood your shape very well, but how do you use it in the end? I mean, what is the end application? Okay, so, for example, in graphics; this is a result from the past, but in biological research, for example, we need to get a straightened texture for our object. We have a mask in the form of a silhouette, a binary image, and once we have constructed this curvilinear symmetry line, we can straighten the image along it and, for example, get these straightened textures and compare them, because it is much more comfortable to deal with straightened images of the same size, with the same coordinate system, and so on. But our research is in fact mostly theoretically inspired, so we are interested in parameterizations of symmetry of every possible kind and form, also symmetry in graphs and symmetry in some complicated shapes and so on. Okay, thanks, Nikita, and I think we have no more questions, so let us thank Nikita again.

And the next talk will be again by Nikita, right? Is Maxim Kuprashevich here, or Irina Tolstykh? No? Okay, okay, let's proceed according to the schedule. Yes, sorry, we should have thought about that better. Okay, so the next talk will be on handwritten text recognition and browsing in an archive of prisoners' letters from the Smolensk convict prison. Wow. Okay, please go ahead.

This is a completely different field, topic and task. We analyze historical documents because we have a grant from the Russian Science Foundation; it is affiliated with the Higher School of Economics, but some participants are primarily from Lomonosov Moscow State University. It is about the analysis of handwritten historical documents and, of course, some sort of automatic machine analysis. The problem with these tasks is that there is a lack of large-scale and general-purpose datasets of handwritten documents in Russian. There are some, of course, in English and maybe in other languages, but we have just several such datasets, and some of them are too simple.
For example, Handwritten Kazakh and Russian, because it is too restricted in phrases and its structure is just forms filled with predefined words, so it does not contain much diversity. And there is another problem: there are several scripts, several writing styles, and in fact we found that the way of writing changed dramatically over the last century, so these datasets cannot be regarded as historical. We also have some datasets dedicated to historical persons, for example Peter the Great, who had a very difficult and unique style of handwriting, and of course this cannot be directly generalized to other historical archives, because the nature of the handwriting will be completely different, with different features. Of course these datasets are commonly used and a lot of papers use them in experiments, but for practical tasks we need to find another way.

Our basic data consisted of only 67 photographs; it is just a notebook containing letters from the Smolensk convict prison, and they were all written in the same handwriting: because of censorship, these letters were opened and rewritten by the same gendarme, and this stability of handwriting is a great feature of our dataset which, of course, we used extensively. There was a transcript, but a bit noisy, with a lot of mistakes and missing parts, and there were some preprocessing tasks, because there are overlaid pages and some parts of the same letter can be distributed over several pages; in fact this preprocessing was done manually.

What are the possible representations of handwritten text data in practical approaches? There are three main ones. The image can be split into individual lines and aligned with the text transcript; but sometimes you can pass the whole paragraph to the handwriting recognition system, and this is suitable when the lines are well aligned, just one above another, and there are, for example, no marginalia, no extra lines, no portions of text in between. And sometimes there is the possibility to get a more structured representation of a page, with different paragraphs, with different types of text, but of course in real applications collecting data with such representations, with all fragments isolated and all fragments marked, is much more tedious. So we will work primarily with the line and paragraph representations and we will compare algorithms applied to these two types.

We need some preprocessing to obtain these representations, and it was achieved mainly by using some ready-made neural networks. The first one extracts the pen trace, just doing some segmentation on the grayscale image, and the second one extracts baselines, also as a sort of binary image where connected components represent these baselines. The previous step can be done automatically, but in this case we found that applying ready-made modules to extract text fragments was not successful, because the fragments are overlaid and they are in the same script, so the system cannot distinguish, for example, between a letter and an underlying letter, because they have almost the same structure; so this step was done manually, and it resulted in 86 text fragments and image fragments corresponding to parts of the same letters. To obtain our line representation, we perform a step of distributing the portions of the pen trace to baselines; it is done by analyzing connected components: we filter out the noise, the parts that are too far from our baselines, and distribute the components to baselines according to a majority rule.
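A toy version of this majority-rule assignment might look like the sketch below; it assumes a binary pen-trace mask, roughly horizontal baselines each summarized by one representative row, and a made-up distance threshold, so it only illustrates the idea, not the authors' implementation.

```python
import numpy as np
from scipy import ndimage

def assign_components_to_baselines(pen_mask, baseline_rows, max_dist=40):
    """Each connected component of the pen-trace mask votes with its pixel rows
    for the nearest baseline; components too far from every baseline are dropped
    as noise. Returns the label image and a {component: baseline_index} mapping."""
    labels, n = ndimage.label(pen_mask)
    baseline_rows = np.asarray(baseline_rows)
    assignment = {}
    for comp in range(1, n + 1):
        rows = np.where(labels == comp)[0]
        dists = np.abs(rows[:, None] - baseline_rows[None, :])
        votes = np.bincount(dists.argmin(axis=1), minlength=len(baseline_rows))
        best = int(votes.argmax())                 # majority vote over the pixels
        if dists[:, best].min() <= max_dist:       # otherwise treat as noise
            assignment[comp] = best
    return labels, assignment
```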
But sometimes there are two or more candidate baselines, because our script has underlined elements, overlines, some loops, and they can be stitched together, so the same connected component gets distributed to several baselines. In this case we need to cut these components into several parts, and this is done at the narrowest parts of the pen trace, just because these correspond to the places where different strokes, different lines of the pen, were stitched together. Another preprocessing task is text straightening; it is straightening again, because we have some rotation, some inclination in the text direction, and it is also a simple algorithm: we just need to correct our baseline to make it fully horizontal and apply the same transformation to our initial color image. Another very valuable step is to eliminate the pen-trace pixels that are informative but belong to other lines; it is achieved by analyzing the mask and filling the pixels related to this mask with the nearest background pixels. Without this denoising the result is much worse than the denoised option in the bottom row. The same procedure can be applied to the page as a whole, so this transformation is done not line by line but for the page itself.

There was a lot of manual work, of course, in preparing the transcript, because we need to achieve a full correspondence between the image and the text to use it in machine recognition, so every feature of the spelling, every feature of line breaking and so on was retained when preparing this ground-truth transcript. There are some popular architectures for handwritten text recognition, and we have taken the vertical attention network, because our dataset is rather small, so we cannot use full transformers. In fact it contains some sort of attention mechanism, but it just expresses the distribution of our line between the rows of the image, maybe with some pooling or striding, so the attention in this case is just a column vector, and the representation of a line is just a weighted sum of the feature representations of the rows; the rest is done in a fully convolutional manner. Using this approach we achieved satisfying results in terms of character error rate and word error rate. Here is just an example on one portion of text; of course this neural network itself knows nothing about the Russian language, it is just analyzing the graphical structure and features of our image, so there are a lot of strange mistakes and a lot of words and letter sequences that do not exist in the Russian language. The final recognition rate was slightly over 5% character error rate, and we can see that our preprocessing is also very valuable, because without denoising it is almost 7% character error rate, so the denoising matters a lot; the straightening is maybe not so valuable in terms of error rate, but it speeds up the network, because the size of the image becomes much smaller.

So we are dealing with some noisy, machine-generated output, and to use it in practical tasks we need to do some post-processing. There were two directions of this post-processing. The first one was dictionary-based and rule-based: we just transformed our automatic transcript to modern spelling, correcting the alphabet, correcting the rules for prefixes, since these can be prescribed, of course; and to make our words realistic we used a dictionary-based method relying on Levenshtein distance, plus lemmatization with an external library, for further analysis, because the text should be prepared for some navigation procedures.
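A minimal sketch of this kind of dictionary-based correction via Levenshtein distance; the tiny vocabulary and the distance threshold are purely illustrative, not the actual resources used in the project.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def correct(word, vocabulary, max_dist=2):
    """Replace a recognized word by the closest dictionary word, if close enough."""
    best = min(vocabulary, key=lambda v: levenshtein(word, v))
    return best if levenshtein(word, best) <= max_dist else word

print(correct("писмо", {"письмо", "писать", "тюрьма"}))   # -> "письмо"
```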
The other direction is GPT-based correction. It removes the need to search for the proper instruments: we just give instructions to the model to correct our text. We found that we need to provide some additional instructions, not related to text correction itself, for example to pay attention to named entities, to the actors in the text. The problem was that the result with the initial prompt was too loose: GPT can use synonyms, can reorder words, so we added instructions to forbid these transformations and to keep the structure as similar to the original as possible. Here our quality measure is the word error rate, because it is more suitable for GPT: it can correct spelling but it still does some reordering of words, and the character-level penalty for such reordering would be much higher. After improving our instructions with some examples and a role prompt, it beats the error rate achieved by the dictionary-based methods. One might conclude that GPT corrects the text perfectly; that is not the case. We counted the mistakes in the automatic transcription, and less than half of these mistakenly transcribed words are corrected by GPT, sometimes even in cases when the solution seems evident to a human being: there is no such word in the Russian language, but it is kept as is. Another problem is that GPT produces the output as one portion of text without line breaks, so we invented some sort of dynamic programming, very similar to edit distance, just to set line breaks in the proper places of our text, minimizing the edit distance not between the whole texts but paying attention to the line breaks.

In fact, all this was a preparation for navigation systems for our colleagues from the humanities, historians and philologists, and one task they are interested in is search by keywords and topic extraction, in the case where a topic can be expressed as a set of keywords. What is the flow of our system? We have some automatic transcription, and as the system uses CTC loss during training, we can align the predicted text with the columns of our image, and we can also distribute the text over lines, so we have placement over both the x-axis and the y-axis, and we can make a visualization, just underlining the keywords found, using this placement. Here are the results of our system; it was a model task where the expert set four topics expressed by sets of keywords, so there are four colors corresponding to these topics, and we made some visualization; some analysis of collocations and some topic extraction is future work. The rightmost image is the visualization of our page-based model, and we can see that this placement, this highlighting, is much smoother, simply because of the nature of its attention: it is a very simple distribution over vertical levels, so the underlined part will be completely horizontal.

Another possible task using this machine-generated output is the search for personal names. We utilize two simple approaches: just extract everything that starts with a capital letter, and analyze the punctuation, since when, for example, a capitalized word comes right after a full stop, it may not be a proper name but just an ordinary word. Here we have two types of highlighting: for ordinary capitalized words and for words regarded as proper names.
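A toy version of that capitalization heuristic; the example sentence is made up, and the real pipeline of course runs on the recognized transcript rather than on clean text.

```python
import re

def candidate_proper_names(text: str):
    """Collect capitalized words, but skip a capitalized word that immediately
    follows a full stop, since there it may simply start a new sentence."""
    words = re.findall(r"\S+", text)
    names = []
    for prev, word in zip(["."] + words[:-1], words):   # treat the text start like a stop
        core = word.strip(".,;:!?")
        if core[:1].isupper() and not prev.endswith("."):
            names.append(core)
    return names

print(candidate_proper_names("Вчера Иван писал письмо. Оно дошло до Смоленска."))
# -> ['Иван', 'Смоленска']  ('Вчера' and 'Оно' are sentence-initial and skipped)
```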
Interestingly, the search for proper names is the task where GPT-based correction was the most successful. This is because it corrects maybe not the particular words, but it restores the overall structure of the text, the separation into sentences and the capitalization, much better than it corrects spelling itself. It is maybe an unexpected conclusion from our experiments, but there it is: we cannot use GPT as a solution for everything, we have to search for the proper task in which to utilize it. Some slides on our future prospects: of course, visualizing and analyzing an archive of fewer than 100 images looks like a toy problem, although it can still be very helpful, but now we have a much more extended archive of four volumes of the diaries of Admiral Fyodor Litke; it contains several thousand pages, and only several hundred are manually transcribed, with great effort, so we are planning to transcribe it automatically, and we are on the way towards a successful solution. A nice feature is that the notebook is lined, so the line spacing seems stable across our data, and in this case we can, for example, invent some new type of attention for equally spaced lines. Another trick is CTC loss with missing parts: missing parts are ubiquitous in this archive, especially when transcribing other languages like German and Latin, so we can use special symbols when the true character is unknown, and it still works and still gives the ability to train. The main result of our work is that we can achieve a rather satisfying recognition rate using just one thousand lines, provided they are written in the same handwriting. So thank you very much, thank you for your attention and interest.

Thank you very much, Nikita; the project is amazing, I think, and also your effort today, it was longer than a plenary talk, which is definitely difficult. Thank you very much for the talk and for sharing this task with us. Did you use any specific or exotic augmentation techniques for the training process of the OCR model here, something like handwritten stroke augmentation or so? Thank you. There is some sort of augmentation, but it deals with images, so all possible geometrical distortions or tone corrections and so on, maybe stitching parts of words from different lines together, but we have not generated new text; maybe it could improve the results, and we probably should do it because of the low amount of data, and of course it is possible. Any more questions here? Did you use only a GPT-based model for your transcripts and recognition, or other models as well, for example Anthropic's, Google's Bard and so on, Meta's LLaMA? In fact, dealing with language models is the task of my co-author, who is maybe more experienced than me, but he checked some language models, maybe not so many, maybe three or four, and concluded that GPT is better and is the most accessible one. But of course there are a lot of tasks here and everything can be improved, so yes, there are a lot of open questions and a lot of drawbacks in our system. There are no drawbacks, those are the directions for improvement; let's be positive. Any more questions? Well, if there are no more questions, then thank you, Nikita, again.

I think that our next talk is going to be online; let's check. Maxim, I guess you are presenting, do you hear me? Yeah, we hear you very well, so please share the screen. Do you see my screen?
Yeah, we see it well, and actually I think we are ready to start, so please go ahead. Okay, hi everyone, my name is Maxim Kuprashevich, I'm a computer vision team leader at the Layer team, SaluteDevices, and together with Irina Tolstykh we are the authors of the paper MiVOLO: Multi-Input Transformer for Age and Gender Estimation. One second... okay, let's start with the task definition.

Our task can be divided into two slightly different sub-tasks. The first one is very straightforward and much more common, and a vast number of methods try to solve exactly this task: let's say we have some crop of a person and the face is visible enough to be used for prediction with our black box, and we predict age and gender. This task is, I think, very plain and very common. But there is another type of task that not so many methods try to solve, when the face is not available or is heavily occluded. Despite this task not being so common, it is still very important for science and business, for example for surveillance cameras, for personalization, or, as in our case, for clothes and accessories recommendation systems. Because, let's say, we cannot estimate gender from pictures like the one at the bottom; therefore we have to search through the entire index, and of course we will search just by visual similarity, therefore we can easily find something wrong, for example, I don't know, men's pants for a kid here, and of course this would be an absolute fail. So that's why it's important. Our goal was to create a model that can operate in both cases, be as precise as possible of course, and be real-time, because we are solving a business task first of all.

Okay, let's discuss metrics. For gender everything is super simple, so I'm not describing it here: we use just accuracy, as most works before us have used this metric. For age it is also quite simple: the main metric is usually the mean absolute error, so most works use this metric and we did the same. An additional metric is the cumulative score; it can be seen as the portion of predictions with an absolute error not higher than L, and usually L is 5. It is also a very clear metric, I think it's easy to understand.

Okay, so what about data? Not so many datasets have age and gender ground truth, and you can see the biggest open datasets for this task on the slide. These datasets can be separated into regression and classification, the same as the methods. Of course classification is much cheaper, you can mine much more data with this approach, but it is less precise, less strong, and the regression approach guides the model to better generalization. The work we refer to in our paper, called Deep Imbalanced Regression, has a really good argumentation for this; if you're interested you can read the original. As you can see, we perform most of our experiments on IMDB-clean and UTKFace, and as you can see, there is not so much data, and IMDB is heavily biased toward celebrities, so this dataset mostly contains celebrities. Obviously we had to mine our own data, and we did: we collected many images from the Open Images dataset, the huge Google dataset, from our production system, since we have many products we were able to do this with, and from some additional sources like web scraping and so on, and we sent these images to crowdsourcing. We asked users to estimate age roughly and, I hope not roughly, gender, we set an overlap of 10, and we collected more than half a million images. After we finished, the question arose of how to aggregate these 10 votes into one precise prediction. For gender everything is quite simple: you can just use the mode, the majority vote, and we did this.
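As a rough sketch of this aggregation: the gender part really is just a majority vote; the age part below anticipates the exponentially weighted scheme described next, where each annotator is down-weighted by their error on control tasks, but the exact exp(-MAE/tau) form and the tau constant are assumptions rather than the paper's formula.

```python
import numpy as np

def aggregate_gender(votes):
    """Majority vote (mode) over the crowd labels for one image."""
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]

def aggregate_age(ages, annotator_mae, tau=5.0):
    """Weighted mean of age votes, down-weighting annotators with a high MAE on
    control tasks; the exp(-mae/tau) weighting is illustrative only."""
    ages = np.asarray(ages, dtype=float)
    weights = np.exp(-np.asarray(annotator_mae, dtype=float) / tau)
    return float((weights * ages).sum() / weights.sum())

print(aggregate_gender(["f", "f", "m", "f"]))                  # 'f'
print(round(aggregate_age([24, 27, 40], [3.0, 4.5, 9.0]), 1))  # pulled toward the reliable votes
```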
We also declined some samples with inconsistent votes, like 50-50 or 40-60; those are usually images of bad quality, and these crops were also generated with a detector neural network, so some mistakes are possible. But with age everything is not so simple. Of course you can use some statistical methods, which you can see in Table 1, and you can see that the resulting MAE, like 4.77 for example, is not terrible, so it can be used for training, but the error is still quite high. And yes, we calculated these results based on control tasks: we use them for quality control, so we can compute these numbers. And here is one quite simple idea if you have control tasks: obviously some people estimate age better and some worse, and a few slides later I will show you that the MAE is distributed almost normally. If we have this MAE and can calculate it individually for each annotator, we can weight the votes, and we did this with an exponential term; the results are much better, by a significant margin over the other methods, so 3.5 is a really nice result, I can say.

So after we finished, we had collected more than half a million images, and half a million we used for training; it is our closed, proprietary dataset, because we cannot share it, as some of it comes from our production. But we decided to create a new benchmark from images from the Open Images dataset; we called it the Layer Age and Gender dataset, LAGENDA. You can see the statistics on the slide, you can see the histogram, and you can see it is almost perfectly balanced, except for the very right part, because it is very hard to mine those ages, obviously. We decided to create it because previous benchmarks are all either heavily biased toward celebrities or very small, because they have been collected, say, in a police office or a studio, so obviously they are quite small. You can download this new benchmark via the URL, for free, without any forms, so you're welcome.

Okay, let's move to the methods. We wanted to start with some baseline, some strong model, so we could be sure before moving on to something more complex, and we took VOLO, the Vision Outlooker, as our baseline. If you're not familiar with this network you can read the original paper, but all you need to know right now is that it is a modified vision transformer with a modified attention block. We started with the simplest task, face crop as input and age as output, and then we added another output for gender. I won't go deep into the training procedure here; it is of course slightly modified compared to the original classification model, we describe the details in our paper, and we don't have so much time. Let's take a look at the results. You can see that our baseline already takes the state of the art for both IMDB-clean, for example here is the previous state of the art, and the same for UTKFace, and also very good results for the test set of our new LAGENDA benchmark. And what's really interesting here: you can see that the age-and-gender model with a double output is much more precise than the age-only one. The difference between these two models is just the gender output; nothing else has been modified in the training procedure. It is a well-known phenomenon that a model generalizes better with multiple tasks, but it's not so often that you can observe it, so that's amazing, and you can see the same for all the datasets: the model with the additional output is more precise.

Okay, so we beat the state of the art, but that wasn't our goal. Our goal was to create a model that can operate in any case, with any combination of faces and bodies, where some of them can be unavailable, and so obviously we cannot use a single input with the entire person crop.
The resolution of such an input would be very small, and the face features would just vanish, so we cannot do this, and we have to create a model that uses a multi-input architecture. So what we did: you can see that we separate the face and body crops, and to perform this cross-view feature fusion we created a feature enhancer module. First we pass our inputs to the original patch embedding, and it is also very important that we need to preserve the original dimension sizes, because otherwise we cannot use transfer learning, and otherwise our model would be slow; so we have to use some early fusion and keep the same dimensions as the original. Our feature enhancer module fuses the features in the way you can see on the right: we do cross-view attention, so first we perform face-to-body cross-attention, then vice versa, body-to-face cross-attention, then we concatenate the features, pass them to a multilayer perceptron, and eventually we have a fused joint representation of the same dimensions as the original, so we can use transfer learning and everything else. The model is quite fast because of this early feature fusion strategy, and everything else is the original network, except the output of course.

So let's check the results. Here you can see the baseline at the top, and here you can see our multi-input model, and you can see that we achieved a significant improvement for age, and also for gender on LAGENDA, the hardest benchmark for now, I think, and slightly lower results for gender elsewhere, but the differences are so small that it is hard to say why that is; this needs to be researched more deeply. And what is most interesting here: we can of course use only the face or only the body inputs separately; the best option is, of course, joint face and body inputs, but when you use just the body input, the model still performs quite well, for example for age it is 6.66, let's remember this number, we will need it a few slides later. And you can see that for all the datasets gender works quite well too, lower than with the face of course, since the face is the most reliable way to predict age and gender, but it still works with quite good accuracy.

At this point we took the state of the art for every benchmark, and we became interested in what would happen if we ran more benchmarks. There remain not so many benchmarks we can run with a regression output for age; we tried AgeDB, but only some old measurements were available there, so of course we easily beat those results. But it was much more interesting to run our model on classification benchmarks. Of course we cannot use any of their training data for our model, because we use regression and these datasets have classification outputs for the age ground truth, and also the ranges are different, they define these classes differently; so we cannot use any of their training data, but we can simply map our regression output onto the class ranges, and we ran our model on the validation part of these datasets. And you can see that we also took the state of the art for both of these datasets; for example, for Adience, quite an old dataset, here is the previous state of the art for gender, and you can see that the margin is really significant for our model, and that's amazing because it proves the really good generalization power of our model. And here are a few samples from the internet. With a visible face everything is very simple, the error here is less than one, and you can see that even with tricky samples, for example quite tricky hair features here, our model still performs very well, and age estimation works very well too.
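Returning for a moment to those classification benchmarks: mapping a continuous age prediction onto class ranges can be as simple as the sketch below; the bins shown are only illustrative and are not necessarily the exact ranges of Adience or the other benchmarks.

```python
# Illustrative age-group bins of the kind used by classification benchmarks.
BINS = [(0, 2), (4, 6), (8, 12), (15, 20), (25, 32), (38, 43), (48, 53), (60, 100)]

def regression_to_class(age: float) -> int:
    """Map a continuous age prediction onto the closest class range."""
    def dist(rng):
        lo, hi = rng
        return 0.0 if lo <= age <= hi else min(abs(age - lo), abs(age - hi))
    return min(range(len(BINS)), key=lambda i: dist(BINS[i]))

print(regression_to_class(23.7))   # closest to the (25, 32) bin -> index 4
```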
You can see that even when the face is not available, the model performs really well. And you need to remember that the model has never seen samples like these, for example without visible faces, because we trained our model on face-centric datasets, since it is hard to annotate data without faces; for humans this would be super hard. We just removed faces from the data during training, dropping them randomly, and we also randomly dropped body crops, so the model has never seen samples like this, and it still works very well; the generalization power is really good.

And what about the human level? Of course, because we had control tasks, we can calculate it, and you can see the human average, mean and median, here. So, okay, we are quite a bit better at this task: you can see that the human MAE is higher than 7, and that is the peak of the MAE distribution, and if you remember, our model performs with an MAE of 6.66 even when faces are not visible, even when the faces were removed. So the model can operate better even without faces, on average of course, because some of the annotators have quite a good MAE, like 4.5, but only a few. Another plot shows the relationship between MAE and age: you can see that the model beats humans over almost the whole range except the very left, but this data was created from IMDB, so the control tasks were created from IMDB, and this dataset is quite harshly imbalanced, so there are just a few samples on the very left and very right, and it is hard to say why that is; it needs to be researched more deeply, but maybe it's just because there are few samples there, so there is not enough data here for firm conclusions.

And about speed, about performance: the original model is very fast, you can see that with a big batch size on an NVIDIA V100 this model can run at more than 1000 frames per second. Of course our model is a little slower because of the multiple inputs, but thanks to the early fusion strategy the slowdown is just about 20%, so it is still very fast, almost 1000 frames per second can be achieved. Okay, so validation code, all the models trained on open datasets, a demo with our full closed model, and everything else you may need is available via this URL on GitHub; our contacts are there as well, feel free to contact us, and we also have a Telegram channel of our team, it's in Russian, but if you're interested you can also follow us via the QR code or search by name. Thank you, do you have any questions?

Thank you very much, Maxim. Yeah, we have questions in the audience. Hi there, thank you for your talk, I have quite a few questions. Maybe first a straightforward question: how does your model perform versus the real state of the art, namely that challenge, since you were comparing with some baselines in academia which are not quite up to date, because there is less and less data. Have you tried to submit to it, or are you planning to, or have you already submitted and know the results? Can you please comment on that? Thank you for the question. No, we relied mostly on Papers with Code, so we took all the benchmarks, the big benchmarks we could find there that reflect our real-world task, without celebrities. Okay, yeah, with benchmarks we took all the benchmarks, even those with celebrities, yes; so we relied mostly on Papers with Code and we took first place for everything we could find, choosing from big benchmarks like this. Okay, but you're aware of this challenge, right? Okay. So the second question is about the data. Maybe I misunderstood, but too big a share of your data is celebrities, people whose age was estimated by crowdsourcing, so you don't have the real age, I mean you don't really have an age, you just have an estimation from different people, or from celebrities who look better
than they should, right? Yeah, okay, so it is very hard to explain in a short time, there is much more information here. So yes, again, existing datasets have usually been collected either in some studios or in police offices, of course, because only there can you establish the real age, so they are small; and most datasets contain celebrities, like IMDB for example, for obvious reasons, because you can obtain the ages there. That's why we used crowdsourcing, and yes, we cannot estimate the real ground truth, of course, but we can estimate how good our annotation eventually is, and it should be around three and a half MAE, based on control tasks of course. This estimation is based on control tasks generated from IMDB, but I expect it should be even lower, because IMDB is harder for annotators, as you mentioned, because of the celebrities, because they are biased very hard. So yes, we cannot obtain the real ground truth, but we expect our annotation to be quite precise, thanks to the weighted mean strategy, because with other statistical methods, as you can see, the MAE would be quite high. Okay, maybe this is the last question, not to take all the spotlight. The question is that you use the standard MAE because it is used in most of the papers, but there are quite a few papers, and it is even stated in talks at this venue, that it makes sense to have a different metric for age estimation, because a two-year difference for children or for old people is not the same as for middle-aged people; as far as I remember, quite a few relative metrics were proposed back in 2014-2015. Could you please comment on that? Yes, I agree that MAE isn't a perfect metric here, and for sure an error of, say, two years on the very left or the very right is not the same, and we even have some samples with an age of more than 100, so of course two years there is not an error at all, because it is impossible to see such a difference from the picture. So I agree, but we used something that can be used for comparison with other works, that is why; so maybe in the future we will use more advanced metrics, but because we have this plot we can be sure that the model performs quite well: you can see that the error grows, of course, as it obviously should, but not as sharply as for humans, for example. Thank you, and overall really great work, thank you for this work and the dataset, it's really important because there are problems with datasets right now. Thank you very much, and now for your question. One comment and one question. The comment: your model, by using celebrities as a reference, might not be right, because celebrities regularly undergo cosmetological procedures and they are age cheaters, so the age on the face, for example in a picture, cannot show the definite age. The question is: how does your age estimation, your age regression model, differ from, for example, Microsoft's and Insilico Medicine's age estimation models, and what was the range of the plus-minus, the range of the right answers, for your age estimation model, and how does it differ from Microsoft's and Insilico Medicine's models? Sorry, did I get you right, what do you mean, do you mean the accuracy of their models or what? Yeah, accuracy is meant. I don't have results for their models, so I can't compare, I have never benchmarked them; so if you have some, we will try to run it if there is some benchmark, I don't know one, but you can use our demo, for example, if there are no benchmarks, and compare manually, but I don't know a benchmark I can run to compare with these models from Microsoft and Insilico. Any more questions?
It seems not. I can just comment on the Insilico model — I don't know, probably they have an updated one by now, but I saw a demonstration of it back in 2016, I guess, and it was actually very funny: Zhavoronkov was presenting it to the Skoltech president Alexander Kuleshov, and Kuleshov is well known to look much younger than his age — at that time he was around 70 but looked maybe 60 — and the model output was something like 82. So apparently there were issues for elderly people, probably because there isn't much data for that category.

It's the same for us — the distribution is super hard, and we are trying to solve this; we are developing the next version and trying to address it. So yeah, that's a real mess. Okay, so let us thank all the speakers of this part of our session. Now we have a coffee break; we ended a little late, so I suggest we still take the full 10 minutes and just start a little later — at 15 minutes to one we meet again.

So let's start, because every minute we use now will be subtracted from our lunch, so please let's try to keep it brief. I'm the replacement chair for the competition track, and our next talk is "Interactive Image Segmentation with Superpixel Propagation".

Good afternoon everyone, I'm delighted to present our paper titled "Interactive Image Segmentation with Superpixel Propagation". This is a collaboration between Picsart and the American University of Armenia. Our main task is interactive image segmentation: creating a user-friendly tool that enables users to achieve precise segmentations by actively guiding the process.

Let's look at some characteristics of the state-of-the-art methods. They are mostly click-based — they use click user interactions; they are commonly deep learning methods trained on huge datasets; and they are designed to reach 85–90% intersection over union with a minimal number of iterations. Here is an example of one such method, FocalClick, after one iteration and after eight iterations of user clicks. To address this, we concentrate on a method that does not require a training dataset, and we prioritize achieving higher than 95% intersection over union.

Let's take a look at our method's workflow. First, the user zooms into a part of the image, and this sub-image gets partitioned into superpixels using the ETPS algorithm. Next, the user clicks one or more times inside the object of interest. The next step is the fast marching method and the arrival-time distribution, which is the main contribution of our workflow — this is the crucial step for controlling the propagation wave, superpixel by superpixel. Finally, the user can use a slider to control the overflow of the wave outside the boundaries of the object. This is a cyclic process that keeps improving until it reaches an acceptable segmentation, and the user can then extract the mask for future use.

Here is the user interface, shown in the video below: the right part is the final mask and the left part is the segmentation process. This example reached 99.86% intersection over union in about two minutes.
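Before moving to the datasets, here is a rough sketch of the propagation step just described. It stands in for the fast-marching arrival-time computation with a Dijkstra-style pass over a superpixel adjacency graph; the graph, the per-superpixel costs and the threshold are toy assumptions, not the authors' implementation:

```python
import heapq

def arrival_times(adjacency, costs, seeds):
    """Dijkstra-style stand-in for the fast-marching step: propagate a wave
    from the clicked superpixels; crossing into superpixel v costs costs[v]
    (e.g. its colour difference from the clicked region)."""
    times = {v: float("inf") for v in adjacency}
    heap = [(0.0, s) for s in seeds]
    for s in seeds:
        times[s] = 0.0
    while heap:
        t, u = heapq.heappop(heap)
        if t > times[u]:
            continue
        for v in adjacency[u]:
            nt = t + costs[v]
            if nt < times[v]:
                times[v] = nt
                heapq.heappush(heap, (nt, v))
    return times

def segment(times, threshold):
    """The slider: keep every superpixel the wave reaches before `threshold`."""
    return {v for v, t in times.items() if t <= threshold}

# Toy 4-superpixel graph: 0-1-2 belong to the object, 3 is background.
adj = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
cost = {0: 0.1, 1: 0.1, 2: 0.2, 3: 5.0}   # high cost to leak into the background
print(segment(arrival_times(adj, cost, seeds=[0]), threshold=1.0))  # {0, 1, 2}
```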
Now, the datasets we used for our experiments. We are interested in accurate image segmentation with well-defined boundaries, so we use datasets with reliable ground-truth masks. The first two, Berkeley and DAVIS, are benchmark datasets also used by other state-of-the-art methods, and in addition we added our in-house logos dataset, for experiments in another setting, another domain.

Let's take a look at the graphs summarizing our evaluation experiments. The first one is the mean number of iterations per intersection-over-union value. It is worth noting that we extended our experiments up to 500 iterations instead of the commonly used 20, to explore how the methods behave at higher precision. The purple line in these graphs is our method; the other two lines are state-of-the-art deep learning approaches. As you can see, the deep learning methods reach the initial segmentation more rapidly than our method, but as the number of iterations increases, our method achieves a much higher intersection over union — it keeps improving over time. It is also worth noting that on the logos dataset some methods did not even reach 85% intersection over union, which may come down to the specifics of the training datasets of these two deep learning approaches.

The next metric is the cumulative number of images that achieve certain intersection-over-union values. In these graphs a higher curve corresponds to better performance, and as you can see, on all three datasets our method outperforms the others. To run such extensive experiments on large datasets and to reduce user effort, we had to simulate the user interaction for all methods using the ground-truth masks. Here are two example videos of such an optimization process — one for ITM, one of the state-of-the-art methods, and one for our method. The left side is the segmentation process, and the right side of each video is the difference mask between the current segmentation and the ground-truth mask.

In general, we make several contributions to interactive image segmentation. State-of-the-art deep learning approaches usually achieve a good segmentation in under 10 iterations, but it is hard to reach higher precision with them; so, while they are faster at the initial segmentation, our approach outperforms them on high-accuracy segmentation and on detailed boundaries. Our approach does expect considerable user effort for good results, but it is significantly less than manual segmentation — for this part, please refer to our paper, as we have limited time to discuss it during this presentation.

Looking ahead, there are some directions for future research: improving the method to better handle textured images, adding negative clicks for fewer iterations, and investigating hybrid approaches that use deep learning for the initial segmentation and our method for the final refinement, to reduce overall user effort. Thank you for your attention. Do you have any questions? Dear colleagues, do you have any questions?
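For reference, the intersection-over-union figures quoted throughout this evaluation can be computed as below (a minimal sketch over boolean masks; the toy masks are made up):

```python
import numpy as np

def iou(pred, gt):
    """Intersection over union of two boolean masks; the paper targets > 0.95."""
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0

gt   = np.zeros((4, 4), bool); gt[1:3, 1:3] = True
pred = np.zeros((4, 4), bool); pred[1:3, 1:4] = True
print(iou(pred, gt))  # 4 / 6 ≈ 0.667
```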
Well, I have many. Thanks for the paper — we are really into the field of interactive segmentation ourselves. The first question is straightforward: have you compared with the Segment Anything Model, which has been quite loud in the last few months, I would say?

We did this research quite a bit earlier, and we actually didn't compare with that method, but we compared with state-of-the-art interactive segmentation approaches, and our main advantage is that we are better at higher precision than the deep learning approaches.

Yes, I see. Could you please also go back to the slide with these very good curves? My understanding is that these clicks are generated by some algorithm, right? So how do you distinguish between a good clicking procedure and a good part of your algorithm — is there a way to do it? Maybe if I click in the right way with a different model, I will achieve better results faster.

We used the interaction automation mainly used by the other methods: we take the click as the center of the largest connected component of the difference mask, and we tried to make our automation very similar to that of the other approaches, such as ITM or FocalClick.

How well does your model perform in more complex scenarios — not images like these, but, say, segmenting a human from the background, or their hair, which is not always a hard mask; sometimes you also need to give a soft mask?

The main advantage of our method is that it gives full control to the user, unlike deep learning approaches, where it can be hard to achieve fine boundary delineation as in the example you mention. With our method the user can fully get the segmentation of the desired parts, because it is a traditional method, not a deep learning approach.

Any other questions?
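A minimal sketch of the click-simulation protocol described in the answer above — a new click is placed inside the largest connected component of the error mask. Placing it at the point farthest from that component's boundary is an assumption on my part about how "the center" is chosen; SciPy is assumed to be available:

```python
import numpy as np
from scipy import ndimage

def next_click(pred, gt):
    """Simulated user click: find the largest connected component of the
    error mask (pred XOR gt) and click at its interior point farthest
    from the component boundary.  Returns (row, col, is_positive)."""
    error = np.logical_xor(pred.astype(bool), gt.astype(bool))
    labels, n = ndimage.label(error)
    if n == 0:
        return None                                 # already a perfect mask
    sizes = ndimage.sum(error, labels, index=range(1, n + 1))
    comp = labels == (np.argmax(sizes) + 1)
    dist = ndimage.distance_transform_edt(comp)
    r, c = np.unravel_index(np.argmax(dist), dist.shape)
    return int(r), int(c), bool(gt[r, c])            # positive click if the region was missed

gt = np.zeros((6, 6), int); gt[1:5, 1:5] = 1
pred = np.zeros((6, 6), int); pred[1:4, 1:4] = 1
print(next_click(pred, gt))   # a positive click inside the missed part of the object
```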
Well, in that case let's thank the speaker again. Our next talk is "Acne Recognition: Training Models with Experts".

Thank you. I'll present the work "Acne Recognition: Training Models with Experts". I represent the Higher School of Economics, and we did this work in collaboration with a skin research institute in the US.

We'll start with the motivation. Acne is a huge problem nowadays — many people have it, and it affects their quality of life; they can feel insecure about it, and so on. The doctors who study and treat this disease are called dermatologists, and to diagnose it and then choose the treatment procedure they use so-called severity grading systems. There are several severity grading systems, but all of them are based on visual features, and the majority of them simply count the number of acne lesions on the face of the individual and then assign a severity score. Examples of such systems are IGA and GAGS.

Our goals are, first, to develop our own dataset with our own grading criteria — because the majority of the existing criteria have drawbacks, as they focus only on the acne lesion count — and, second, to develop an automatic severity grading system.

We start with the labeling issues from the point of view of dermatologists. First, if we take two different dermatologists, let them discuss what they think the right criteria are, and then put them in separate rooms and give them the same image, they will most likely return different scores. Another issue is that the same dermatologist can give one score on one day and another score on another day. And, as I mentioned, most systems are count-based. To illustrate the issue, here are two dermatologists: in the first plot the red line shows the sorted scores of the first dermatologist, and the blue line shows the scores of the second one, ordered according to the index of the first sort. You can see there are a lot of outliers and distortions — the general trend is preserved, but the distortions are there. The second plot is a histogram of both scorings, and you can observe the differences there as well.

So we collected our own data and let the dermatologists determine, all together, the optimal guideline for scoring the images; this table represents the consensus they reached. There are around 670 images, and the score is real-valued, not categorical. As you can see, a lot of different factors are taken into account apart from the counting. After this consensus was reached, the images were labeled according to it, and the distribution of the scores in the dataset is shown here.

Later on I will explain why we need an additional dataset, but first some facts about it. It is the ACNE04 dataset, and it is open source. One important thing to mention is that our dataset consists mostly of selfies taken from the front of the individual, while the photographs in this dataset were taken from different angles; those angles and shooting conditions are called the Hayashi requirements. In total there are approximately 1,500 images, and, most importantly, this dataset has bounding boxes around each lesion — around 90,000 bounding boxes in total.
The distribution of bounding boxes is shown here: most images have almost no bounding boxes, and the more bounding boxes, the rarer the images. Here are examples from this dataset — you can tell the photos were taken at an angle, and you can see the bounding boxes around the lesions.

Since our main dataset has real-valued target variables, we solve a regression problem, and to evaluate the quality of the automatic grading we chose the following metrics. The first one is well known — the mean absolute error: it measures the absolute error for each prediction and averages them. The second one is the symmetric MAPE. In short, it is a symmetric version of the mean absolute percentage error; it is basically an analogue of the relative error, but computed over the whole dataset. The symmetric part is about the denominator: in the standard MAPE formula the denominator consists only of the true value, while the symmetric version also involves the prediction, so over- and under-estimation are penalized differently — this denominator is basically a symmetric correction to the metric. For all the experiments that follow, we use the techniques shown here to increase our data size.

The first baseline approach is just to take some backbone, add a fully connected layer at the end, and get a single value for the score. You can see the results for different backbones. While the mean absolute error can be considered more or less decent — especially for MobileNetV3, given that our score ranges from 0 to 1 — the symmetric mean absolute percentage error is quite big, so the relative error jumps a lot. This is the transfer learning paradigm.

Next, we thought about why we have such deviations, and the main conclusion is that our main dataset has no positional information about the acne lesions, so we had to search for something additional — and the additional dataset I described earlier is the one we liked. We then used it to build a detector, or a segmentation model, you could call it that. To build the segmentation model, we had to adapt the targets to the segmentation problem: everything inside the bounding boxes is labeled as a target pixel and everything outside as background. This way we can train some segmentation, but we see that it performs a bit worse than the detection model. Visually it looks like this: when there are not many acne lesions we get the picture shown above, and the more lesions there are, the better the segmentation becomes — but it is still not suitable for our use.

For detection we trained a YOLO model, and you can see on the image that there is a significant improvement: we observe correctly placed bounding boxes around the acne lesions. Using this detector we can now improve upon our initial baseline, so we propose the following scheme: we take a detector — in our case YOLO — get the bounding boxes, and then the simplest thing is just to count them and build some classical regression technique on top of that count.
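Going back to the two evaluation metrics for a moment, here is how they could look in code — a minimal sketch using the standard definitions, which may differ in detail from the exact symmetric correction the speaker uses:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error over the whole dataset."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def smape(y_true, y_pred, eps=1e-8):
    """Symmetric mean absolute percentage error: the denominator mixes the
    target and the prediction, so the relative error stays bounded even for
    scores close to zero (the severity score here lives in [0, 1])."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0 + eps
    return float(np.mean(np.abs(y_true - y_pred) / denom))

print(mae([0.2, 0.5], [0.3, 0.4]), smape([0.2, 0.5], [0.3, 0.4]))
```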
On top of that count we can put some classical technique — a plain regressor, random forests, boosting, and so on — but the count is just one feature for the regressor, and we found that simple linear regression works best. This approach produced an improvement, but later we discovered we could improve slightly further by introducing two more heuristic, basically handcrafted, features. The first one measures the coverage of the detected lesions — the total area that the detected lesions cover. The second one — this seems to be an old version of the presentation, but basically the second handcrafted feature measures how many lesions were detected in different regions of the individual's face. To do that, we split the face into an N-by-N grid and count how many boxes were detected in every cell of this grid. This way we acquire two features, but the second feature, which we call positioning, actually amounts to N-squared additional features, because the grid is N by N. The proposed scheme is shown here: we add the two handcrafted features, and we can now discard the count feature, because the positioning features simply sum up to the count — the count is directly related to the positioning features. With this we get a slight improvement over the previous results: compared with the initial baseline, the mean absolute error did not change much, but in terms of the symmetric mean absolute percentage error we improved twofold over the initial results.

To sum up: we developed new grading criteria for the acne severity score, we proposed a model that automatically grades the severity according to these criteria, and we proposed new handcrafted features. For future work there are many degrees of freedom to improve upon — for example, choosing the number of cells in the positioning grid, varying the detectors, and so on. So there is a lot of room for improvement. Thank you; if you have any questions, I'm happy to answer them.

Do you have any questions, dear colleagues? How do you think, is it possible to take into account some additional information about a person, such as age?

There was also a lot of discussion about taking the race of the individual into account, since it's a sensitive topic in the US. But yes, you are right — it would be better to take such additional features into account; we just didn't do that.

If no more questions arise, let's thank the speaker again. Our next presenter — online or offline? — the talk is named "Learning Facial Expression Recognition in the Wild from Synthetic Data Based on an Ensemble of Lightweight Neural Networks". We don't have the speaker... we don't have any of the speakers... Okay, I think we have some kind of a technical pause... Oh, hi there, can you hear us? Yes, can you hear me? Yes, great, please share your presentation. Thank you. Yes, please start — you have 20 minutes for the talk and three for questions.

Thank you. Hi everyone, my name is Long, and today I also represent my professor and research colleagues from the Higher School of Economics university. Our paper is "Learning Facial Expression Recognition in the Wild from Synthetic Data Using Lightweight Neural Networks". Facial expression recognition is the task of classifying the expression in digital images or video frames into categories like anger, fear, surprise, sadness, happiness and so on. It has a wide range of applications — in marketing, in gaming, in monitoring, or in any human-machine interaction, and so on.
I believe this is also a very familiar topic. We have summarized some recent progress in facial expression recognition, and we found that, despite numerous proposed methods, the performance has not improved significantly over the last two years on the AffectNet dataset. This motivated us to explore the concept of ensemble models: combining different single solutions to leverage their advantages and enhance the overall performance.

In terms of datasets, I think the challenge now is that we are lacking data for this task. For example, in the table on the right, I believe this number of images is not large when we compare it with the number of people in the world. This leads to a common problem: these models perform well in a controlled lab environment, but their performance drops significantly in real-life conditions. Gathering a proper amount of real-life data is time-consuming and expensive; also, getting the consent of individuals to use their facial images is not an easy task. Using synthetic datasets can address these problems, because it offers numerous advantages, primarily due to the way the data is generated — so we were also motivated to use synthetic datasets in our research.

In our paper we use the synthetic dataset from the Learning from Synthetic Data (LSD) competition of the ABAW workshop — the Affective Behavior Analysis in-the-Wild workshop. We use the LSD training set to train our models and the LSD validation set from this competition in our validation step. Because the test set was not public, we sampled some images from the Multi-Task Learning (MTL) competition and used them as the test set to evaluate our models. We use the F1 score as the metric in our validation step, the same as in the original LSD competition, and after that we deployed our models on an embedded device and measured the frames per second on a random input tensor.

This slide provides a more detailed illustration of the data pre-processing used in our research. In total we have 277K images in the LSD training set, and because the original images have some blur and their input size is quite small, we decided to add an extra pre-processing step: a super-resolution algorithm to upscale and deblur the images, and we run the experiments on both variants of the dataset. We don't apply any augmentation to the images, and we apply the same training procedure to every model, because we want a fair comparison of the models on this limited data.

In terms of the proposed methods, we set out to select the solutions used by the top performers of the LSD competition. First we selected the multi-task MT-EmotiEffNet, which is currently the state-of-the-art model on the AffectNet dataset; the second one is a transformer-based model, which is well known for its advantage in generalization; and the last one is the graph convolution approach that was used in the second-place solution of the LSD competition. We then propose several ensemble approaches to utilize their advantages: in the first one we take the backbone from MT-EmotiEffNet and use the attention layer and classification head from the transformer model; in the second one we take the MT-EmotiEffNet backbone together with the graph convolution layer and classification head
from the graph-based model; and the last one is simply the combination of all three solutions.

In our results we find that the best results were achieved by the ensemble models. In particular, the MT-EmotiEffNet-plus-graph ensemble achieved the highest F1 score of 0.771 on the original validation set, and the MT-EmotiEffNet-plus-transformer ensemble achieved the highest F1 score of 0.419 on the MTL test set. In general, the results show that the ensemble models achieve better results than any single model on the LSD dataset; the MT-EmotiEffNet-based ensemble even increases the F1 score by a significant factor, about 10%. Similar results were observed on the MTL dataset, except for the most complex ensemble model: although it achieved a high score on the LSD dataset, it failed to provide a good solution on the MTL dataset. Another observation is that the ensemble of MT-EmotiEffNet with the transformer not only achieves a higher score compared to the single models, but is also able to achieve the best results on the MTL dataset. The strength of MT-EmotiEffNet is that when we use its backbone we can extract better embeddings from the data, and when we pass these better embeddings to a more complex classification head, we can improve the performance significantly. However, when we compare the F1 scores between the LSD and MTL datasets, we see that the F1 score drops significantly; this could be due to the limited generalization of the synthetic data and the difference in label distribution between the LSD and MTL datasets.

In terms of inference speed, the ensemble models are more complex than the single models, so their inference is slower. However, since real time is around 30 FPS, we can see that almost all the models achieve faster or near-real-time speed, and if we stopped at this point we could say this speed is good for real-time applications. But we also evaluated them in a more practical condition: we included the models in a typical end-to-end video analytics application with a face detector, face recognition and our emotion recognition model — I had used such a pipeline in a use case in my previous work and got some positive feedback. If we look at the frames per second there, we see that even with only one face in the frame the rate drops below 20, and when the number of faces increases, the speed drops very quickly. So I believe further work is needed here to make the models faster and bring them closer to real-time applications.

Two conclusions. First, we propose an ensemble approach that combines different single state-of-the-art models to achieve better results on the facial expression recognition task. Second, although there is some gap in F1 score between the LSD and MTL datasets, using synthetic data is still highly viable. In further work we will explore our approach not only on this synthetic dataset but also on other facial expression recognition datasets, and, given the inference speed, many deep learning models still do not meet the near-real-time requirement in a real application, so there is further work to make our models faster. That's all, and thank you for your attention.
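To make the ensemble idea from this talk concrete, here is a minimal sketch of reusing one model's backbone with a different classification head. The stand-in backbone, feature size, number of classes and the plain MLP head are all illustrative assumptions — the actual work combines MT-EmotiEffNet features with attention- and graph-based heads:

```python
import torch
import torch.nn as nn

class BackboneWithNewHead(nn.Module):
    """Sketch of the ensemble idea: reuse a frozen backbone that already
    produces good face embeddings and attach a different classification
    head (a plain MLP here, standing in for the attention / graph heads)."""
    def __init__(self, backbone, feat_dim=1280, n_classes=6):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():      # keep the pretrained features frozen
            p.requires_grad = False
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, n_classes))

    def forward(self, x):
        return self.head(self.backbone(x))

# Tiny stand-in backbone: any module mapping an image to a feat_dim vector will do.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 1280))
model = BackboneWithNewHead(backbone)
print(model(torch.randn(2, 3, 32, 32)).shape)     # torch.Size([2, 6])
```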
Thank you for your talk. As a former face recognition worker, my heart is filled with joy when I see that people not only combine models but also implement them on hardware and measure FPS — so thank you for that. Do you have any questions, dear audience? Well, it seems I have a question. I'm really curious — I'm not familiar with the original challenge, so can you please comment: does each face represent one emotion, or can there be several labels, since some faces may show, say, both disgust and anger?

I think emotion is a very complex thing. For example, some politicians can hide their emotions — maybe on the face you don't see anything, but they are feeling happy or something like that. I think it depends on the requirements of the specific use case; we can develop models for specific tasks, for example for children in a class or at a hobby, something like that.

Let's thank the speaker again. Our next talk should be "Semantic-Aware GAN Manipulations for Human Face Editing" — is anyone from the authors here to present, I mean online? Can you please stop sharing the screen? Oh, hi Pavel, do you have any problems? We see your message but couldn't... yes, okay, how can I help you? Maybe you are actually using the wrong link — let me double-check with the organizers; one second, we're working on it. Yeah, it seems it just started to work. Yes, we can hear you and see you. Thank you. You have 12 minutes for your talk.

Hello, my name is Khrusov Pavel, and today I will present the paper "Semantic-Aware Generative Adversarial Network Manipulations for Human Face Editing". This study is devoted to manipulations of the generative adversarial network latent space in the context of human face editing. For this study we chose several unsupervised methods for detecting semantically meaningful directions in the StyleGAN2 latent space; we evaluated the quality of the obtained directions and of the images obtained with them, and we also analyzed the results and proposed an original method that improves the quality of manipulations with large shifts.

In this work the StyleGAN2 model is used. This model has a style-based architecture: as opposed to the original architecture, where random noise is fed directly into the synthesis network, here a non-linear mapping is used that produces an intermediate latent vector — the style. This architecture makes the factors of variation more separable; in particular, it has been shown that the intermediate space W is better linearly separable than the original latent space, which is useful for image editing.

In this study we chose three methods for discovering semantically meaningful directions in the GAN latent space. The first method is optimization-based and includes two main parts: the first is a matrix A, and the second is a reconstructor that takes two images — the original and the edited one — and tries to predict the direction and the shift size. These two parts are trained together, and the loss function uses cross-entropy for the predicted shift direction and mean squared error for the predicted shift size. The next method is based on principal component analysis: to get the directions we sample n random vectors from a Gaussian distribution, feed them to the mapping network to obtain intermediate latent style vectors W, calculate the principal components on their basis, and use these principal components as directions. The last method is closed-form factorization of the latent space; it is based on the assumption that the weights of the modulation layers contain knowledge about the semantics, and as directions we use the singular vectors of the weight matrices of the modulation
layers. For the experiments we calculated the singular vectors for these layers and used those vectors as semantic directions.

As metrics, we use the Fréchet Inception Distance, which allows us to evaluate the quality and variability of the generated images. The next metric is the Learned Perceptual Image Patch Similarity, which evaluates the perceptual similarity of images. In order to evaluate how well the target attribute changes, a pre-trained regression predictor was used, and the last metric is cosine similarity.

As an example, consider a shift in the direction associated with age change. As can be seen in the images, all methods are able to find a direction associated with age change. At the same time, it is clear that when we manipulate in both directions — decreasing and increasing — all methods add additional attributes, such as glasses, changes in face position, hair color, and so on. Generally, though, the optimization-based method shows the best performance, and this is confirmed by the metric values: for the optimization-based method we get the best Fréchet Inception Distance, the best Learned Perceptual Image Patch Similarity, and the best values for the other metrics.

During the experiments it was found that it is difficult to change only local attributes, such as eye size or eyeglasses. To overcome this problem we used the so-called layer-wise editing technique: the shift is applied only to the vectors w that are fed to certain layers. For example, consider a shift in the direction associated with eyeglasses, detected by the optimization-based method. The first row shows images obtained with the plain shift, while in the second case the shift is applied only to the first two layers, and we can see that in the second case the identity of the face is well preserved — there are practically no changes in the independent attributes. With such layer-wise shifts we also get much better metrics, especially for the Fréchet Inception Distance.

To improve the quality of images obtained with large shifts, we developed an original method based on an extended latent space for StyleGAN2. Instead of a single mapping network, several are used — their number is equal to the number of modulation layers, so each modulation layer receives its own vector w. Initially, all mapping networks are obtained as copies of the original mapping network, and only the last four layers of each one are modified — the exact number of layers is a hyperparameter. Since the manipulations do not change the domain, the images obtained by the modified generator and the edited images should belong to the same distribution as the images generated by the original generator; therefore we minimize the deviation of the discriminator's prediction between images obtained by the original and the modified generators, and likewise between images from the original generator and the shifted images. As the pre-trained discriminator, we took the one from the original StyleGAN2 model. In order to preserve identity between the images obtained with the original and the modified generators, we minimize the learned perceptual image patch similarity, and in order to improve training we also minimize the norm of the vector obtained as the product of the modulation-layer weight matrix and the output of the corresponding mapping network.
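As an illustration of two of the ingredients just described — PCA-based direction discovery and layer-wise editing — here is a minimal NumPy sketch. The random linear "mapping network", the number of synthesis layers and all sizes are stand-in assumptions, not the StyleGAN2 code:

```python
import numpy as np

def pca_directions(mapping, n_samples=10000, latent_dim=512, n_dirs=10):
    """PCA-style sketch: sample z ~ N(0, I), push it through the mapping
    network to get intermediate latents w, and take the top principal
    components of w as candidate editing directions."""
    z = np.random.randn(n_samples, latent_dim)
    w = mapping(z)                                   # (n_samples, latent_dim)
    w_centered = w - w.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(w_centered, full_matrices=False)
    return vt[:n_dirs]                               # each row is a direction in W

def layerwise_shift(w_plus, direction, alpha, layers):
    """Layer-wise editing: apply the shift only to the w vectors fed to the
    chosen synthesis layers, which helps preserve identity for local edits."""
    edited = w_plus.copy()                           # (n_layers, latent_dim)
    edited[layers] += alpha * direction
    return edited

# Stand-in mapping network (a random linear map) just to make the sketch run.
rng = np.random.default_rng(0)
A = rng.standard_normal((512, 512))
dirs = pca_directions(lambda z: z @ A, n_samples=2000)
w_plus = np.tile(rng.standard_normal(512), (18, 1))  # e.g. 18 layers at 1024px resolution
print(layerwise_shift(w_plus, dirs[0], alpha=3.0, layers=[0, 1]).shape)  # (18, 512)
```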
Now let's see the results. Despite the fact that only the directions obtained by the closed-form factorization method were used in training, improvements were obtained for shifts in the directions found by all the considered methods. For example, here is an image obtained with shifts in the direction found by the optimization-based method. In the upper row the images are obtained by the original generator, and in the lower row by the modified generator. We can see that the images in the first column are almost identical — they are obtained without applying any shift. For the images in the second and third columns, the images obtained by the modified generator look more natural: more natural colors, more natural features, as well as fewer artifacts. We see a similar picture when we apply shifts in the direction obtained by the PCA-based method and in the direction obtained by the closed-form factorization method.

The results of the visual analysis are confirmed by the metric values. Below are the metrics for shifts in the direction associated with age change. It can be hypothesized that the transformation of the latent code into the extended latent space by the proposed model prevents the emergence of details that could lead to significant quality degradation during shifts. This hypothesis is supported by the metric values for zero shift: we observe a slightly higher Fréchet Inception Distance, but the discriminator still recognizes the images as real with the same level of confidence as for the images obtained without mapping to the extended latent space.

In conclusion, the considered unsupervised methods can be applied to the task of human face editing; certain techniques can be used to improve the quality of editing local attributes; and the method proposed in this paper makes it possible to improve the quality of manipulations for large shifts. Thank you for your attention. Any questions?

Thank you for the talk, Pavel. Do we have any questions, dear audience? Unfortunately, I have questions. My first one: this work is really StyleGAN2-based, because it was the best network for face generation for a number of years, and these similar methods are fairly well known, but it seems you achieved significant results in terms of naturalness. So maybe I have two questions. One is about the disentanglement of the features: you kind of disentangled them by design, but could they be disentangled even better? Some of them are still connected — for example, long hair and gender. And the other question is whether it is possible to transfer these methods to, let's say, the new era of generators, namely diffusion models. Thank you.

All right, about the first question. Since we use unsupervised methods, editing really changes not only the target attribute but also independent attributes — when we change gender, we usually change the hair as well. But when we use the so-called layer-wise editing technique, we can, in some cases though not all, achieve more disentangled changes. About the second question: in this paper we did not consider the possibility of transferring the method to diffusion models, but I think it is a goal for future research.

Okay, thank you. Let's thank the speaker again. We are approaching our final talk. Yes, Pavel, you can stop sharing the screen. Thank you.
There's no option to turn on the mic. Pavel, can you stop sharing the screen, please? Yes. Giorgi, can you share the screen, or do you not have the option? One second. Can you hear me well? Yes, we hear you perfectly well. Hello. So I'm sharing now — can you see it? Yes, great. You have 12 minutes. Thank you.

Okay, thanks. I'll present our work on dynamic gesture recognition via contrastive pre-training on video sequences. We'll start with the motivation, then the problem statement we are trying to solve, the datasets that can be used, the current approaches and our proposed approach; then we'll show the results and conclude with the further research.

So what is the task? One second, I'm trying to move the slide. The motivation is basically to develop a system that recognizes dynamic hand gestures. It can be used in many scenarios, mostly human-computer interaction: controlling robots, computer systems, games, VR and AR applications, and translating and generating sign languages, especially for people with speech and hearing impairments. We expect to see even more applications as research and development in this area continues, and we mainly focus on human-computer interaction here.

The problem statement: we are given video sequences — image sequences — and we have to identify which gesture class has been performed in the last N frames, where N is the fixed-size window that we use along the whole sequence. The sub-tasks are: hand detection, because given a full frame we first have to find the hand; then coming up with the gesture class; preferably also segmentation, to remove any other irrelevant information from the image; and recognition of the neutral gesture, which is basically "not a gesture" — the hand is visible but isn't doing anything special, and we need to recognize that. The pipeline also has to be end-to-end trainable, because everyone executes gestures differently, and using a subsystem that is not trainable can produce noisy results and a poor solution.

The datasets that we found are listed here; we'll go over them quickly. First is the Cambridge hand gesture dataset — a very simple dataset, nine classes, RGB image sequences only, executed on a dark background. Then there is the DHG dataset; we will use it to evaluate hand-keypoint-based methods. It contains depth images as well as sequences of 22 3D keypoints, and there are 14 classes; the keypoint sequences are generated with the Intel RealSense depth camera. Then there is the EgoGesture dataset, with RGB and depth data; it is a very large dataset with 83 classes, and the camera captures from the top of the head, so it is an egocentric camera view. Then there is a dataset for dynamic hand gesture recognition systems, which also contains RGB and depth data and has the 27 classes shown here; its quality is probably the highest among these datasets, because every participant was trained to execute the gestures and they strictly followed the guidelines. Then there is the NVIDIA dynamic hand gesture dataset; it has RGB, depth and infrared data — an example is shown on the slide — and it has 25 classes. Then another gesture dataset: what is different about it is that it also includes neutral gestures, and it has 27 classes; only RGB data is captured. Then there is the IPN Hand gesture dataset; we will use it to evaluate image-based methods.
It contains RGB, optical-flow and hand-segmentation sequences, with 13 classes only. We will not use the optical flow and hand segmentation, because we want to focus only on RGB data — it is the least noisy kind of input and the one we want to concentrate on. Then there is the ChaLearn dataset: RGB-D image sequences, 249 classes, a very large dataset; the depth sequences are shown here. And the Sheffield Kinect gesture dataset: RGB and depth image sequences, 10 classes. This dataset is a bit unusual, because some of the classes are basically drawing figures, like the triangle shown on the left.

Since we focus on human-computer interaction and not really on sign languages — but there are very high-quality datasets for sign languages — we mention them here as well. First is an American Sign Language dataset: it contains RGB image sequences, and the 29 classes are basically the letters of the English alphabet plus the space and delete signs and a neutral gesture. Then there is the new Slovo Russian Sign Language dataset: a very large dataset with 1000 classes covering words, phrases, numbers and even sentences. It contains RGB data and 21 3D hand keypoints, but the keypoints are generated with the MediaPipe framework, so they are not manually annotated and are prone to inaccuracies.

Now the current state-of-the-art approaches; as I said, we consider only hand-keypoint-based and RGB image-based approaches. First is ParallelCon: it applies a separate convolution to each keypoint coordinate in the sequence, but it doesn't really treat the sequence as a sequence, so it is unlikely to be superior to sequence-based methods such as transformer architectures. The other drawback is that it works only on complete sequences, which is not the case in real-world scenarios, where we have to use a sliding window because we never know when a gesture ends. Then there is DG-STA: it applies spatial and temporal attention to the keypoints and makes use of positional encoding, and, as we'll show, this is the best approach we have found so far for hand keypoints. For the image-based methods we use 3D ResNeXt-101; there is also MViTv2, which appears in the results table later. Here, when we say 3D, we mean that we are dealing with 2D image sequences — 2D images in time.

Now our proposed approach: what inspired us, what we actually propose to do, and the sub-tasks we solve. We were inspired by the OpenAI CLIP method, which does the following. It takes image-description pairs. It has a text encoder, which transforms the given text into a vector of size k, and an image encoder, which does the same thing with an image. Both encoders are trained to maximize the cosine similarity within a pair and minimize the cosine similarity outside of the pair — this is called contrastive pre-training. It uses a symmetric cross-entropy loss computed in both directions, text-to-image and image-to-text. The rest of CLIP is less relevant for our task.

What we propose is to replace the original image encoder with a 3D image-sequence encoder, because we have not just one image but a list of images. We take pairs of gesture image sequences and their textual descriptions, and we can take a large pre-trained text encoder that generates the text embedding from the text; we then train specifically the image-sequence encoder to produce embeddings similar to those of the text encoder, in the same way CLIP does it.
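A minimal sketch of the CLIP-style objective described here, written for (image-sequence, description) pairs; batch size, embedding size and temperature are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(seq_emb, txt_emb, temperature=0.07):
    """CLIP-style symmetric loss: for a batch of matching (image-sequence,
    description) pairs, pull each pair together and push everything else
    apart, averaging the sequence->text and text->sequence cross-entropies."""
    seq_emb = F.normalize(seq_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = seq_emb @ txt_emb.t() / temperature      # (B, B) cosine similarities
    targets = torch.arange(len(seq_emb))              # the diagonal is the true pair
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# seq_emb would come from the trainable 3D sequence encoder,
# txt_emb from a frozen pretrained text encoder.
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```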
So we replace the original classification task with a metric learning setting. Do we even need the text encoder afterwards? No — it serves its role through text supervision during training, so at the inference stage we can use just the image-sequence encoder and forget about the text encoder. This also means we can use any huge model for the text encoder, as long as it is trained to produce good text embeddings. We picked a sliding window of size 32; it should really be determined separately for every dataset, and how to choose the window size properly still needs to be studied.

The sub-tasks we solve: we don't need the irrelevant information from the scene, so we can use a pre-trained hand detector to crop the hand, taking the largest visible hand as the one we work with, and we can also remove the other irrelevant parts of the image using hand segmentation, as discussed before.

The results for the hand-keypoint-based approaches: we can see in the table that MLP and CatBoost beat the LSTM, which is pretty interesting because MLP and CatBoost don't treat the sequence the way the LSTM does; so the relationship between the noisy hand keypoints across the sequence is apparently non-trivial. For the image-based results we use the IPN Hand dataset, with C3D, then ResNet, ResNeXt and MViTv2-small; the baseline was chosen to be MViTv2, and it could also be substituted with a more recent transformer.

Conclusions: we concluded that hand-keypoint approaches are more prone to errors by design, since they rely on several subsystems to generate, for example, the hand keypoints (if we use MediaPipe). We then designed a novel system that is able to recognize dynamic hand gestures, can also handle new, unknown gestures, and does not need to be trained on all possible gestures, thanks to the metric learning setting. We combine vision and text, which is in line with how we humans understand gestures, as they provide a language of their own — and there is a link to sign languages here.

For further research we propose the following. Current datasets have no textual descriptions, just labels, but we want to use a text encoder, so we would like detailed descriptions for every sequence. We therefore propose to use a large pre-trained video-to-text model to generate the descriptions for us. For example, for a wave gesture we could have: "a person's hand is moving from side to side with an open palm — it is the wave gesture." Current datasets would just have the label "wave", but we need a detailed description, as I said, and we would generate such descriptions as the input to the text encoder. The implementation of the proposed approach is left for further work; it also needs an ablation study to understand which parts of the system actually benefit the solution, and we need to carefully study the sliding window for every dataset and come up with an algorithm to choose it properly — right now we just use a window size of 32. It is also important to investigate depth-aware models for these tasks: right now we only use RGB data, but we could add depth data to it. Thank you for your attention.

Thank you for your talk. Does anyone have any questions? Well, we're actually not even slightly out of time, but I also don't have any questions, so let's thank the speaker again. Thank you. Thank you. This finishes the computer vision section, so let's go to lunch.
Yay, thank you to all the presenters. This is today's session, and it's my pleasure to welcome the speaker, Gagik — yeah, Gagik, sorry — and the talk "Visualization-Driven Graph Sampling Strategy for Exploring Large-Scale Networks". The floor is yours.

Thank you. Okay, hi everyone, thanks for coming. Today I will present our research work, conducted with Irina Tyrosyan and Vartey Razaryan, on a visualization-driven graph sampling strategy for exploring large-scale networks.

So what are graphs? Graphs are data structures that represent pairwise relationships between objects, and when we speak of large graphs, it means we have a lot of entities that might or might not be interconnected with one another. Here you can see several examples of large graphs. On the left we have the social circles Facebook dataset, which basically represents the interconnections between different Facebook accounts. In graph terminology the entities — here, people — are called nodes, and the connections between them are called edges; for example, if two people are friends on Facebook, there will be an edge connecting them. The other graphs here, for example, represent the collaborations of different paper authors with one another.

The key thing about graphs is that they are gaining more and more usage nowadays, and we really have a huge amount of information; graphs are becoming too large and too complex to apply any analysis to them. Especially when we talk about visual analysis, with the number of nodes growing to tens of thousands or even millions, it becomes quite hard to perform good visual analysis on them. What we propose for that is an approach called graph sampling: it simply refers to selecting only some portion of the nodes and edges instead of operating on the whole graph, so that we have less information to process and it is much easier to gain insights. The challenge with sampling is that we must find a good way to apply it so that, when we conduct some analysis on the sample, we can replicate the results back to the original graph and all the retrieved information remains correct.

Given that, our research focuses on two questions. The first is the evaluation of the existing sampling approaches: which are the state-of-the-art approaches, and what drawbacks and advantages do they have. The second is the proposal of a novel approach that addresses at least some of the drawbacks of the current state-of-the-art approaches. Speaking of such approaches, after an extensive literature review we identified three algorithms that are more or less good at graph sampling: random jump sampling, random edge-node sampling, and the final approach, minority-centric graph sampling (MCGS), which is actually the best one among them. We will now go over each of them and explain how sampling is conducted in each case.

First, random jump sampling. It is based on the idea of a random walk. A random walk is when we randomly pick a node — an entity of the graph — and just start a random traversal between adjacent nodes: we can start, for example, at some point and, with random movements, go and explore the whole graph.
What random jump adds to the random walk is that at each step, with some fixed probability, we can jump to a completely random new region. This solves some generalization problems, because with a regular random walk we start at some point and can only move to the nodes connected to it, directly or indirectly; with the help of the random jump we are able to cover the whole graph much better. But of course, as the whole process is random, there are drawbacks: depending on where we start, we might miss some important components of the graph. (A small sketch of this procedure is given after this passage.)

The next approach is random edge-node sampling, which is pretty close to the previous one. In this case we randomly select edges from the graph, together with the nodes that form each edge. At the end, if there are also connections between the selected nodes — even though the corresponding edges were not themselves selected randomly — we add those edges between the picked nodes as well. In this way we again get a pretty good graph sample, but the main drawback is again related to the word "random": as the underlying process is completely random, we can still miss important components of the graph.

And finally the last approach, which, as we already stated, is the state of the art: minority-centric graph sampling. The idea is to break the selection process into two parts, minority selection and majority selection. In the minority selection we first try to pick "anomaly" nodes from the graph; by anomaly we don't mean anything bad — just nodes that are different from the others and might contain more important information than the regular ones. Four types of minority structures are defined. The first are super pivots — we can see one here, marked with the number one. Super pivots are nodes with a high degree, meaning a node that has a lot of connections with other nodes, and also with high connectivity between its neighbors; such nodes are among the most important structures in the graph. The next type of minority structure are huge stars — we can see one here. Huge stars also have a high degree, but they form a star-like structure, meaning their adjacent nodes have no connectivity with one another. The next important minority structures are the rims: the chain-like structures sticking out of the main clustered component. And we also have bridges — here — which connect different highly connected components with one another. This approach first selects these types of nodes from the graph, and after the minority selection is complete, the majority selection is applied. The idea of the majority selection is very simple: all the remaining nodes are evaluated iteratively. We add each candidate node to the sample, calculate several distance metrics, and the node that provides the best distance — meaning the sampled graph gets closest to the original one — is the node that gets selected.
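Going back to the first baseline for a moment, here is a minimal sketch of random-jump sampling over a toy graph (the jump probability, target size and the NetworkX example graph are illustrative choices, not the exact setup from the paper):

```python
import random
import networkx as nx

def random_jump_sample(graph, target_size, jump_prob=0.15, seed=0):
    """Random-jump sampling: walk to a random neighbour, but with a fixed
    probability teleport to a completely random node, so isolated regions
    of the graph still have a chance to be visited."""
    rng = random.Random(seed)
    nodes = list(graph.nodes)
    current = rng.choice(nodes)
    sampled = {current}
    while len(sampled) < target_size:
        neighbours = list(graph.neighbors(current))
        if not neighbours or rng.random() < jump_prob:
            current = rng.choice(nodes)          # jump to a new random region
        else:
            current = rng.choice(neighbours)     # regular random-walk step
        sampled.add(current)
    return graph.subgraph(sampled).copy()

g = nx.karate_club_graph()
sample = random_jump_sample(g, target_size=10)
print(sample.number_of_nodes(), sample.number_of_edges())
```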
Once the best node is actually added to the sample, the process repeats for all the remaining ones. We can immediately say that this is a very computationally expensive approach: with this greedy procedure we try all the candidate nodes, keep only the best one, and then the whole process repeats. In fact, with MCGS we are able to retrieve a pretty good sample, but the main drawback is the computational expense of this majority selection phase, because too many iterations are performed, and if we have a very large graph the process can take really long. We also have the problem of imperfect minority selection: four main minority types are identified, but there might be other important components of the graph that are simply missed by the algorithm.

So what we offer: we take MCGS as the best currently existing approach and apply several modifications to it. They are batch-major, enhanced-minor, connected-component, and also the ensembling of these separate approaches. Let's go over each of them.

The first modification is called batch-major MCGS. We take the regular MCGS but modify the majority selection phase: instead of picking a single node at each step — the one that performs best — we take a batch of nodes, say 10 nodes, at each iteration. This reduces the number of iterations roughly by a factor of 10, and the algorithm becomes much faster. Here you can see the actual results. This is the condensed matter collaboration network, the largest network we used in our analysis; it represents collaborations between paper authors in the field of condensed matter physics. With the original MCGS the running time of the algorithm was 79 seconds, but with batch selection — picking 10 nodes per batch instead of a single one — the algorithm finishes in 8.6 seconds, almost a 10x execution-time improvement.

How did we pick the batch size? Here, again, is an analysis on the same largest graph, the condensed matter collaboration network: we sampled with different sampling rates and different batch sizes and measured the execution time of the algorithm in each case. We then applied the elbow method to identify the best batch size, and we can see that, starting from 10, for almost all sampling rates the further decrease in execution time is not significant — it is getting close to zero. So we decided that 10 is a good breaking point, and a batch size of 10 is a good choice for our approach.
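A very rough sketch of the batch-major idea: score every remaining candidate node, then add the top ten at once instead of one per iteration. To keep it short, the sketch uses a single distance metric (average clustering coefficient) rather than the full metric set of MCGS, so it illustrates only the batching, not the real majority selection:

```python
import networkx as nx

def batch_majority_selection(graph, sample, budget, batch_size=10):
    """Batch-major sketch of the greedy majority phase: score every remaining
    node by how close the sample gets to the original graph when the node is
    added (here: difference in average clustering coefficient), then add the
    best `batch_size` nodes at once instead of a single one."""
    target = nx.average_clustering(graph)
    sampled = set(sample)
    while len(sampled) < budget:
        scores = []
        for v in set(graph.nodes) - sampled:
            trial = graph.subgraph(sampled | {v})
            scores.append((abs(nx.average_clustering(trial) - target), v))
        scores.sort()
        sampled |= {v for _, v in scores[:batch_size]}
    return graph.subgraph(sampled).copy()

g = nx.karate_club_graph()
minority = {0, 33}                     # pretend these came from the minority selection
print(batch_majority_selection(g, minority, budget=12).number_of_nodes())
```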
The next modification is enhanced-minor MCGS. Here we tried to change the minority selection process: even after selecting the four main minority types, we may still miss important minority structures. For example, looking at this graph of Facebook users connected with one another, we can see a bridge-like structure, a single node between two large, highly connected components, and the regular MCGS misses this bridge. It does not exist in the sample, even though the two highly connected nodes it links have been included, so the connection between them is lost. So we decided that for the minority structures we pick, in particular the super pivots and huge stars, we should also include the shortest paths between them. Once the minority selection is done, we add the shortest paths between the high-degree nodes, and only after that the majority selection process begins. We can see that with this modification the bridge is actually retrieved in the resulting sample. The last modification we offer is called connected-component MCGS. The regular MCGS works on the whole graph and does not differentiate between connected components, and as this example shows, some information can be missed. In our original graph we have a large central component, and at the top, in this cloud-like area, there are many small connected components of two, three or four nodes. When applying regular MCGS, the central structure is kept pretty well, but this cloud-like area is not retrieved, because the components were too small and the algorithm skipped them. So what we offer is to apply MCGS independently on each connected component and then combine the results; having done that, we get a much better representation in this case. Finally, as I said, we also offer the ensembling of these methods: we take all possible combinations of length two and three of our approaches, such as batch-major with connected-component, or enhanced-minor with connected-component, and in the end we also apply all three modifications together. Once our approaches were ready, we needed to evaluate their efficiency. For that, we picked eight different graphs from the Stanford Large Network Dataset Collection, with varying numbers of nodes and edges; as I mentioned, Condensed Matter is the largest one, with almost 23,000 nodes and approximately 93,000 edges. In the evaluation we went in two steps, first quantitative and then qualitative. In the quantitative evaluation, we tried to measure with actual metrics how close the sampled graph is to the original one. For this we picked five metrics. The average clustering coefficient and the global clustering coefficient tell us about the general node connectivity in the graph, and we compute the distance between their values on the sample and on the original graph. We also compute the difference between the numbers of connected components of the sample and the original graph, and the skew divergence and the Kolmogorov-Smirnov distance between the degree distributions of the two graphs. Having these five metrics, we take all eight graphs, and for each graph we run each of our algorithms at six sampling rates, starting from 0.05, which means that only 5% of the nodes are retrieved, up to 0.5. Also, as all the algorithms have implicit randomness, we run each one four times for a given graph and a given sampling rate, and we compute the stated metrics for each run, one by one.
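For concreteness, here is a sketch of a few of these sample-versus-original distances using NetworkX and SciPy; the skew divergence between degree distributions is omitted for brevity, and the function name is illustrative rather than the authors' own.

```python
import networkx as nx
from scipy.stats import ks_2samp

def sample_quality_metrics(G, S):
    """A subset of the sample-versus-original distances described above
    (the skew divergence between degree distributions is omitted)."""
    degrees_G = [d for _, d in G.degree()]
    degrees_S = [d for _, d in S.degree()]
    return {
        # Absolute differences between clustering statistics.
        "avg_clustering_diff": abs(nx.average_clustering(G) - nx.average_clustering(S)),
        "global_clustering_diff": abs(nx.transitivity(G) - nx.transitivity(S)),
        # Difference in the number of connected components.
        "components_diff": abs(nx.number_connected_components(G)
                               - nx.number_connected_components(S)),
        # Kolmogorov-Smirnov distance between the two degree distributions.
        "ks_degree_distance": ks_2samp(degrees_G, degrees_S).statistic,
    }
```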
Then, for each metric, we rank our algorithms: the one with the smallest distance to the original graph appears in the first position, and the worst one is last. We then sum up all the positions that each algorithm gathered across the different runs; the algorithm with the smallest final aggregated score appeared at the top most often, so it is the best approach. First, using this ranking mechanism, we identify which approach is the best among the ones we proposed, and it turns out that batch-major CC-MCGS, the combination of the batch-major and connected-component modifications, is the best one we can offer; it gets almost 10,000 points and takes the first position. So we picked batch-major CC-MCGS and compared it with the existing state-of-the-art algorithms. Here we can see that MCGS still leads and is slightly ahead of our approach in the final number of points, but the difference is not that big, and taking into account that our approach, especially with the batch-major step, makes the algorithm more than 10 times faster, we concluded that this is a pretty good and sufficient result. This was the first type of evaluation we performed, quantitative evaluation based on the metrics. But since the main focus of the research was the visual analysis of graphs, we also tried to visually compare the generated samples with the original graph representations, so we performed a qualitative evaluation of the algorithms. For this, we conducted a survey with 100 real users. How was the survey conducted? Here we can see a sample screen that the users received. In each question they got a triplet of graphs: the middle graph was always the original representation, and on the left and right sides there were samples generated by different algorithms, always with the same sampling rate. The users had to look at the original graph and pick which of the two samples was better; if a user could not decide, they could mark the two sampling algorithms as equal. If one algorithm won over another, it got two points; in case of a draw, each algorithm got a single point. So the algorithm that collects the most points is the best one, and we identified that in this visual analysis our approach actually wins. Surprisingly, MCGS even dropped to third position, performing worse than random edge-node sampling. So we can conclude that our algorithm performs sufficiently well in terms of the visual analysis that users apply. What conclusions can we make? We got pretty solid quantitative results, being behind MCGS only by a small margin; we got the best results in the visual analysis; and we achieved a very good improvement in running time. Regarding future work, the problem of not selecting all minority structures remains, and there might be room for improving the minority selection process by selecting additional complex and important components.
Also, we could test on a larger set of graphs, because, as you remember, we picked eight graphs for the main analysis. There is also room for potential improvements at the implementation level: the algorithm could probably be sped up further. And finally, we could look at applications in diverse domains, because graphs are used in many different fields, as we saw, from social networks such as Facebook up to natural sciences like chemistry or physics, so real applications in such domains would also help us a lot. And basically, that is all. Thank you. Do we have any questions? Yes, please. Thank you for your presentation. On the slide where you showed the running time of the algorithm, for what size of network was it? Yes, Condensed Matter was really the largest one that we had. Okay, thank you. More questions? Okay, let me ask one question. Can we please go to the slide with the manual results, the human evaluation? Are the points summed over all the people who participated in the survey, or was there some specific formula used to compute the final score? We simply showed each algorithm an equal number of times to each user; the triplets were shuffled randomly. Each of the 100 users was given 16 such triplets, and if an algorithm won, we added two points to its total, and in case of a draw, one point. In this way we got the final results for how many points each algorithm received. Okay, thank you. And maybe it would be interesting to check whether people agree on the same images: do they tend to choose the same samplings? Have you measured this agreement? Actually, we have not gone into much detail on this aspect, whether there are some trends there, but I think it will be a good direction for future work. Thank you. Any more questions? You said that you evaluate each of your algorithms four times. Isn't that too few? We made several rounds because the algorithms have implicit randomness, especially the random edge-node and random jump samplers. Also, the minority selection process in MCGS, as described in the original paper, is randomized: identifying all the minority structures with an exact, exhaustive approach would be too expensive, so the authors applied some modification. That is why basically everything we have contains some implicit randomness. So, in order not to rely on a single good result and claim that a sample is good, but to get a more general picture, we decided to run each algorithm four times and include all runs in the comparison. Okay, let's thank the speaker again. Thank you. The next talk is by Sergei Sidorov, Sergei Mironov and Alexey Grigoriev, on the limit distribution of the friendship index in scale-free networks. Okay, thank you. Thanks for the opportunity to present our work titled Limit Distributions of the Friendship Index in Scale-Free Networks, by Sergei Sidorov, Sergei Mironov and me, Alexey Grigoriev. So what is the friendship index? It allows one to measure degree disproportions in networks. The friendship index is closely related to the friendship paradox, which says that your friends are, on average, more likely to be popular than you.
In a network represented by a graph, the friendship index of a node can be calculated as the ratio of the average degree of the node's neighbours to its own degree. It is known that the friendship index allows one to measure the direction of influence in networks and also to compare networks with one another. In this presentation, we will find out how the friendship index is distributed in real networks, such as social networks, and in simulated networks produced by the Barabási–Albert model and the configuration model. We will also estimate the share of nodes for which the friendship paradox holds true, in other words, for which the friendship index is higher than one, and we will see how different real and simulated networks are. Okay, first let me briefly define some notation. A complex network is represented by a graph with a set of vertices and a set of edges, quite simply. d_i is the degree of node v_i, and the friendship index, denoted β_i, is calculated as the sum of the degrees of the node's neighbours divided by the square of its own degree; in other words, it is the average degree of the neighbours divided by the node's own degree. We are interested not only in the friendship index of individual nodes, but also in the average friendship index among all nodes with the same degree k, which will be denoted ψ(k). Don't worry, when you see it again, I will remind you what it is. I know all of you are probably tired already. Okay, let's move on. As I said, we will see how the friendship index is distributed in real and simulated networks. For the simulated networks, we picked the ones created by the configuration model. This model is convenient because it creates networks with a given power-law degree distribution with a known exponent, and the generated networks have no degree–degree correlations, which will be helpful for us later. To build a configuration model, we need a degree sequence; these degree sequences are obtained as n independent and identically distributed samples of a random variable ξ, and this variable ξ follows a power law with parameter γ. Okay, now we will see the limits of the average friendship index among all nodes of some fixed degree. First, let us introduce ν₁ and ν₂, the first and second moments of the random variable ξ, if they exist, and L₀, a slowly varying function, that is, a function whose value does not really change when its argument tends to infinity and is multiplied by some positive constant. What we have here is that in configuration models, if γ, the exponent of the power law, is more than two, then the average friendship index among all nodes with degree k tends to ν₂/ν₁ divided by k, that is, a constant divided by k, which is really convenient for us: we know this value and we can calculate it. However, when γ is between one and two, it is not so nice: ψ_n(k), the average friendship index among all nodes with degree k, when divided by the function L₀ multiplied by n to a certain power, tends to a γ/2-stable random variable with parameters one, one and zero. This result is not so nice, because the average friendship index among all nodes with degree k then depends on n, the size of the network, so, unfortunately, comparing networks by this quantity becomes problematic.
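The configuration-model networks described above could be generated along the following lines; this is an illustrative sketch, not the authors' exact procedure, assuming the degree sequence is drawn i.i.d. from a Pareto-like power law with tail exponent γ and minimum degree m.

```python
import networkx as nx
import numpy as np

def power_law_configuration_graph(n, gamma, m_min=1, seed=0):
    """Configuration-model graph whose degree sequence consists of n i.i.d.
    samples from a discrete power law with tail exponent `gamma`, starting
    at the minimum degree `m_min` (an illustrative generator)."""
    rng = np.random.default_rng(seed)
    # Inverse-transform sampling of a Pareto tail, rounded down to integers.
    degrees = np.floor(m_min * (1.0 - rng.random(n)) ** (-1.0 / gamma)).astype(int)
    if degrees.sum() % 2:            # the configuration model needs an even degree sum
        degrees[0] += 1
    G = nx.configuration_model(degrees.tolist(), seed=seed)
    G = nx.Graph(G)                               # collapse parallel edges
    G.remove_edges_from(nx.selfloop_edges(G))     # drop self-loops
    return G
```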
Speaking briefly about the proof of this theorem: when γ is more than two, the first and second moments are finite, and with the use of the central limit theorem we get the first result; when γ is between one and two, the second moment is infinite, and with the use of the stable-law central limit theorem we get the second result. Okay, this was the theorem; now let's see how the friendship index is distributed in simulated networks. We chose a number of combinations of a model and γ, the exponent of the power law, and for these combinations we created networks of size 300,000 and looked at three measures: the sample mean, which is the sum of the friendship indices of all nodes with degree k divided by the number of such nodes; the sample variance, which is the analogue of variance for the friendship index; and the sample coefficient of variation, defined as the ratio of the standard deviation to the mean. Let's see the results. There are a lot of plots here, but let's focus on some of them first. Each row shows the results for one network. Look at the left column first: these are log-log plots, with the degree on the horizontal axis and the average friendship index among all nodes of that degree on the vertical axis. For all networks, the average friendship index follows a power law with exponent minus one; on a log-log plot this appears as a straight line. In the middle column we see the variances; let's skip them for a moment. The right column is one of the most interesting, because at first glance the plots may look the same for all networks. However, it should be noted that only for the network where γ is more than two, namely 2.5, the coefficient of variation is less than one for nodes of all degrees, which means the mean is larger than the standard deviation; for the other networks it is higher for small degrees. Okay, let's move on. These were the results for simulated networks; let's compare them with some large real networks. These are networks from online sources, for which the data had already been collected before us; they come from different sources and have different sizes. Surprisingly, despite these networks being very different, we see similar results: the average friendship index also follows a power law with exponent close to minus one. The variances for real networks are much larger than for simulated ones, which makes the coefficient of variation also much larger and makes it harder to predict values in the network. So this was purely about the friendship index and its distribution; let's move closer to the friendship paradox. As I said earlier, the friendship index is closely related to the friendship paradox, which says that your friends are more likely to be popular than you. Some known facts about the friendship paradox: it is present in social networks; most nodes in social networks have a friendship index larger than one, which means the paradox holds for them; it holds true at both the individual and the network level; and, last but not least, the presence of the friendship paradox was confirmed in random networks generated by the Barabási–Albert model.
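To make the plotted quantities concrete, here is a small sketch that computes each node's friendship index, plus the per-degree sample mean ψ(k) and coefficient of variation; it assumes a NetworkX graph, and the helper names are illustrative.

```python
import math
import networkx as nx
from collections import defaultdict

def friendship_indices(G):
    """Friendship index of each node: sum of neighbour degrees divided by the
    square of the node's own degree (isolated nodes are skipped)."""
    beta = {}
    for v in G.nodes():
        d = G.degree(v)
        if d == 0:
            continue
        beta[v] = sum(G.degree(u) for u in G.neighbors(v)) / d ** 2
    return beta

def per_degree_statistics(G):
    """Sample mean psi(k) and coefficient of variation of the friendship
    index among all nodes of the same degree k."""
    beta = friendship_indices(G)
    by_degree = defaultdict(list)
    for v, b in beta.items():
        by_degree[G.degree(v)].append(b)
    stats = {}
    for k, vals in by_degree.items():
        mean = sum(vals) / len(vals)
        var = sum((b - mean) ** 2 for b in vals) / len(vals)
        stats[k] = {"psi": mean, "cv": math.sqrt(var) / mean}
    return stats

G = nx.barabasi_albert_graph(10_000, 3)
print(per_degree_statistics(G)[3])   # statistics for the minimum degree
```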
And the final theorem of my presentation: here we estimate the share of nodes for which the friendship index is larger than one. If a random variable ξ again follows a power law with parameter γ and its values start from a minimum degree m, we can estimate the proportion of nodes for which the friendship index is more than one; we can find bounds for it. It is bounded by one minus a₁ and one minus a₂, where a₁ and a₂ differ in the upper limits of the inner sum, highlighted in red. So what do we do with this? Here we plot the upper and lower bounds based on the theorem. As you can see, the values of the bounds depend on γ, the exponent of the power law, and on m, the minimum degree in the network. This is a plot of the bounds of κ, the share of nodes for which the friendship index is higher than one, or in other words the share of nodes that experience the friendship paradox. These were the theoretical results; now I show empirical results for simulated networks, based on the configuration model. As you can see, if I switch the slides back and forth, the results are quite similar, which is nice, of course; again, the result depends on m and γ in the same way. One thing I should mention is that when γ is between one and 1.5, the share of nodes with friendship index higher than one is close to one, meaning almost all nodes have it. But then the question arises: for which nodes is the friendship paradox present and for which is it not? We will see this for real and simulated networks. These are the simulated networks you are already familiar with, which I showed previously. Here we see, for each degree, the share of nodes of that degree for which the friendship index is higher than one, that is, for which the friendship paradox is present. For all these networks, we see that for small degrees the friendship paradox is present, while for large degrees, for hub nodes, it is almost never present. This quantity also depends on γ: if γ is smaller, the share of nodes with the friendship paradox is higher, and vice versa. Okay, I think that's everything here. Finally, let's see the same for real networks. The results are also quite similar; the shapes of the curves differ, they are not as clean as for the simulated networks, or you may consider them more interesting. It also depends on γ, the parameter of the power law, but real networks are more complex, so the results differ a little from the simulated ones. To sum everything up, we looked at the values of the friendship index, at their distributions in real and simulated networks, and at how they depend on the node degree. For networks without degree–degree correlations, as the network size tends to infinity, we proved that when the power-law degree distribution has a finite second moment, the average friendship index tends to a constant divided by k; when the second moment is infinite, this quantity is not bounded and converges, after normalisation, to a stable-distributed random variable.
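The empirical share of paradox nodes, overall and per degree, can be read off with a short helper; this continues the illustrative sketch above (same graph `G` and same hypothetical `friendship_indices` function) and is not the authors' estimator for the theoretical bounds.

```python
from collections import defaultdict

def paradox_share_by_degree(G):
    """For every degree k, the share of nodes of that degree whose friendship
    index exceeds one, i.e., for which the friendship paradox holds."""
    beta = friendship_indices(G)
    counts, paradox = defaultdict(int), defaultdict(int)
    for v, b in beta.items():
        k = G.degree(v)
        counts[k] += 1
        paradox[k] += b > 1
    return {k: paradox[k] / counts[k] for k in counts}

beta = friendship_indices(G)                      # same G as in the previous sketch
share_overall = sum(b > 1 for b in beta.values()) / len(beta)
```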
And secondly, the friendship paradox is present in all networks whose degree distribution follows a power law, and it depends on the exponent of the power law; if the minimum degree is larger, this leads to a stronger friendship paradox in the network. Thank you. Thank you very much for your talk. Do we have questions? Yes, please. Thank you for your talk. It reminds me of an interesting chapter in Proofs from THE BOOK by Aigner and Ziegler, the chapter about friends and politicians: the theorem says that if every two people have exactly one friend in common, then there is a politician who is everybody's friend. Your friendship paradox should also be a famous thing; maybe it is somehow related to this type of theorem, or not? Just curious. Yes, and adding to that, the well-known idea that everyone knows everyone through some four to six handshakes also came to my mind; maybe you can comment on this. Well, the friendship paradox is a known topic; it is not something I invented, of course, and it was usually developed in social studies rather than in network studies, while the friendship index as a measure, and this name, was introduced relatively recently. About your exact example, that is an interesting one; I cannot really say whether it is the same or not, but it is a nice thing to check out and I will do it. Thank you. One more technical question. You mentioned a bipartite graph and you also applied the power-law analysis to it, but you did not separate the vertices of one part from the other, so it is just the same sample, right? Thank you. Thank you very much. Do we have more questions? Okay, thank you, let's thank the speaker again. And the last talk is on approximate density computation for biclustering, by me, Dmitry Ignatov, Kamila Usmanova and Daria Kmisarova. Yes, this one. Thank you for introducing our talk. This is work done together with my former students and lab members, and also with some colleagues from an institute for molecular genetics, since our reviewers asked us to include some other examples where biclustering can be applied; we included this part, and those colleagues are responsible for the data and for the processing. As for the motivation of our research, it was so-called multimodal clustering. Imagine that you have data you can think of as a hypergraph, a uniform hypergraph, with several types of vertices, for example the authors of papers marked by some tags. Such a structure is sometimes called a folksonomy, because people use specific tags to mark some resources, some data. This is a three-dimensional case, but a lot can also be done in the two-dimensional case. Here we can find the so-called tri-communities, that is, users that share the same resources and mark them with the same subset of tags; they are also called tri-clusters. The question we asked was: what is a good approximation of tri-concepts, that is, of maximal triadic rectangles, in such data? We answered this question in our paper published in the Machine Learning journal several years ago. But in the 2D case, to which we now return, there are still many things to do concerning the performance of similar techniques. As a language we use formal concept analysis, which is rather a simple language.
As we will see, we have a set of objects, a set of attributes, a binary relation between them, and two operators called derivation operators or Galois operators: given a set of objects, what are their common attributes, and vice versa, given a set of attributes, what are the objects that share them all. A concept is just a unit of thought, as in philosophy, but here it is more formal: it has two parts, an extent A and an intent B. All objects from A share all the properties from B, and vice versa, all properties from B belong to all the objects from A. These concepts are hierarchically ordered. Let's have a look at this small world of geometric figures: equilateral triangle, right triangle, rectangle and square, and let us consider just four properties: has exactly three vertices, has exactly four vertices, has a right angle, and is equilateral. We can extract all the formal concepts, which are just the maximal rectangles in our data, and order them hierarchically by set inclusion of the first component. For example, the concept ({2, 3, 4}, {c}) is the concept of figures with a right angle; it is more general than the concept ({3, 4}, {b, c}), which is just the concept of rectangles. This is a tool for analysing real data. For example, we took all the publications on formal concept analysis, performed term extraction, or used keywords, and combined them into taxonomic terms, a sort of topics. At the top of such a lattice diagram we have about 1,000 papers devoted to formal concept analysis, and they are split into sub-topics, or sub-communities of authors who write papers on formal concept analysis combined with a specific sub-topic such as software engineering. These concepts may intersect. The key property of a lattice is that any two elements always have an infimum, a sort of intersection, and a supremum, a sort of union; here it is a bit more complicated. You can also analyse real websites and users: these are the visitors of HSE University websites, who also read other sites such as RIA News, Cosmopolitan or expert.ru. In the triadic case, as a source of such data we can use, for example, BibSonomy, a German project where authors can share their papers and tag them, so we have triadic data. But those were the motivating examples; let's talk about biclustering. A bicluster is the approximate version of a concept. The term was coined by Boris Mirkin, but biclusters were definitely analysed before, by Hartigan, for example. Biclustering refers to simultaneous clustering of both objects and attributes. In genetics, for example, the objects are genes and the attributes are tissues or conditions under which specific genes can be co-expressed. So this is actually a matrix where each cell is a number, an expression level, and here we can see that some genes are grouped because of their co-expression; the very large values are shown in red. If we consider the tissues that are suspected to contain malignant cells, they can be the source of cancer or serve as biomarkers of cancer. The formal definition of a bicluster is just a sub-matrix of an input matrix, such as the real-valued gene expression matrix; we will consider only binary ones. It turns out that in bioinformatics they rediscovered formal concepts as inclusion-maximal biclusters; there are theorems saying that. But we propose something that is a relaxation of this rigid notion in which all the objects have to share all the properties.
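Since both the formal concepts just described and the biclusters that follow are built from the two derivation (prime) operators, here is a minimal sketch of them in Python over a toy context that mirrors the geometric-figures example; the dictionary encoding and names are an illustration, not the speaker's notation.

```python
def up(objects, I):
    """Derivation operator A': attributes shared by all objects in A.
    `I` is the binary relation as a dict  object -> set of attributes."""
    objects = list(objects)
    if not objects:
        return set().union(*I.values()) if I else set()
    common = set(I[objects[0]])
    for g in objects[1:]:
        common &= I[g]
    return common

def down(attributes, I):
    """Derivation operator B': objects that have all attributes in B."""
    return {g for g, attrs in I.items() if set(attributes) <= attrs}

# The toy context of geometric figures from the slide.
I = {
    "equilateral triangle": {"three vertices", "equilateral"},
    "right triangle":       {"three vertices", "right angle"},
    "rectangle":            {"four vertices", "right angle"},
    "square":               {"four vertices", "right angle", "equilateral"},
}
A = down({"right angle"}, I)   # objects having a right angle
B = up(A, I)                   # their common attributes
print(A, B)                    # (A, B) is a formal concept, since A' = B and B' = A
```

Running it, A comes out as the right triangle, rectangle and square, and B as the single attribute "right angle", which is exactly the "figures with a right angle" concept from the slide.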
Actually, we can consider just one pair, an object g and an attribute m, maybe a gene and some condition, and apply the prime operators: m′ is the set of objects that have the attribute m, and vice versa, g′ is the set of all attributes that describe the object g. Then we can intersect such a rectangle with the incidence matrix and count the share of non-empty cells, and this is exactly the density of such a bicluster. Here is a geometric example. The (g, m) cell is here in the centre. These grey stripes are full of ones. We find the first primes, so to speak, and they shape this bounding box; we can also find the second primes, and they form this green dense cross full of filled cells. There might be some other filled cells, the black ones, which do not form such a cross-like structure, but they also belong to the bicluster. Actually, we may think of biclusters as sub-lattices in the whole lattice of formal concepts, that is, of maximal rectangles. Biclusters have some properties: the density lies in the interval from zero to one, and in fact zero is not attainable, for a non-empty binary relation the density is always a bit larger; they can also be hierarchically ordered. However, we cannot devise efficient data mining algorithms along the lines of the Apriori algorithm for finding association rules or frequent itemsets, because the relation of one bicluster being a sub-bicluster of another is neither monotonic nor anti-monotonic with respect to the density constraint. Okay, but we can set the density threshold to zero and consider all the biclusters, and then the following fruitful property holds: every formal concept, that is, every absolutely dense maximal rectangle, is contained in some bicluster. In terms of computational time, there are propositions saying that the total gain is polynomial versus exponential in the worst case, compared to finding all the maximal rectangles, or formal concepts: their number L, the size of the lattice, might be exponential in the sizes of G and M, the numbers of objects and attributes. Here is also an example from the past. We analysed a Yahoo dataset with 2,000 companies and 3,000 advertising terms and applied object-attribute biclustering for recommendation purposes. If we generate concepts with different constraints on the sizes of the A and B components, that is, on the number of firms that advertise their goods and on the number of keywords they use on Yahoo, then with zero thresholds we obtain about nine million concepts, which is infeasible for manual analysis. That is why biclustering was one of the means to reduce this number: when we raise the minimal density threshold, we obtain a number of patterns suitable for manual analysis, and we can interpret the found biclusters as markets. We do not show the first components of the biclusters, these are just firm IDs, which we did not get from Yahoo, but we have the real keywords: affordable hosting, web hosting and so on, this one is about the hosting market; another is about the hotel market, where the pattern is very simple, the name of a city plus the word hotel. We also applied object-attribute biclustering to scaled, that is, binarised, data from the UCI Machine Learning Repository, well known in machine learning. Here are just some figures: the number of concepts versus the size of the relation in terms of non-empty pairs.
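Here is a matching sketch of the bicluster density just defined, that is, the share of non-empty cells in the rectangle m′ × g′ built around a single pair (g, m); it reuses the toy context `I` from the previous sketch, and the names are again illustrative.

```python
def bicluster_density(g, m, I):
    """Density of the object-attribute bicluster (m', g') built around the
    pair (g, m): the share of non-empty cells inside the rectangle m' x g'.
    `I` is the relation as a dict  object -> set of attributes."""
    m_prime = {obj for obj, attrs in I.items() if m in attrs}   # objects having m
    g_prime = I[g]                                              # attributes of g
    filled = sum(1 for obj in m_prime for attr in g_prime if attr in I[obj])
    return filled / (len(m_prime) * len(g_prime))

# Reusing the geometric-figures context sketched above:
print(bicluster_density("square", "right angle", I))   # 6 of 9 cells filled
```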
We can also see how the number of concepts relates to the number of object-attribute biclusters found, and there is a drastic reduction, up to several times. As for SNA-related examples, probably one of the most famous small datasets in community detection is the so-called Southern Women dataset, about 18 women in the southern part of the United States. In the 1930s, I believe, anthropologists asked them what kinds of activities they attended together, like going to church, to a dinner or to a party; a cross means that a woman participated in an event, so the data form a bipartite graph. We applied biclusters here and compared them with cliques as communities, and you can see that biclusters can sometimes capture larger groups than cliques, thanks to their tolerance of sparseness. Here is another example, the karate club that split into two parts after the conflict between the instructor and the president of the club: some of the club members decided to stay with their instructor, the master, and the others decided to stay with the president. There are some key persons here, the president and the master, and subgroups of people identified with cliques and with biclusters; here I can say that the biclusters are better, but they captured similar structures. For three-mode data, where we have authors of papers and tags, we can analyse the data with similar tools. Now, the more focused result, which is maybe not as entertaining as the real examples, is how to compute the density of such a bicluster efficiently. We tried to use an (ε, δ)-approximation here based on Hoeffding's inequality, which allows us to use only a fraction of the cells to estimate the density of a bicluster. Unfortunately, the number of cells that we should test is unbounded at the most desirable settings, that is, when the required confidence tends to one and the accuracy ε tends to zero; still, we can at least test the approach in realistic scenarios. Here is a small example: we have a rather sparse bicluster. We can compute the density of the grey cross almost immediately by this formula, and with the formula for n as a function of ε and δ, the accuracy and the probability threshold, we find that we need to test 10 samples. We test these 10 cells and find that only three of them are non-empty, so the coefficient here is 0.3, applied to the size of the non-grey area. We tested the approach on three datasets, Southern Women, Zoo and Internet Advertisements: the greater the density of the dataset, the greater the accuracy. You can see in the white cells where the theory agrees with the experiment; this is mostly the case for the Zoo dataset, while for Internet Advertisements and Southern Women it was not so stable. And here is also an example from genetics, where we studied ischemic-stroke individuals versus non-ischemic-stroke individuals, with single nucleotide polymorphisms as attributes. We found biclusters and computed their density, size and purity, that is, the share of individuals not included among the healthy ones, since we are interested in finding single nucleotide polymorphisms describing non-healthy individuals who are at risk of ischemic stroke.
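A minimal sketch of the sampling idea under a standard Hoeffding bound, n ≥ ln(2/δ) / (2ε²): assuming ε and δ are the accuracy and the allowed failure probability, this many uniformly sampled cells suffice to estimate the density within ε with probability at least 1 − δ. The talk combines the exactly known grey cross with sampling of the remaining cells; for brevity this sketch simply samples the whole rectangle, and the exact formula and parameters used in the paper may differ.

```python
import math
import random

def hoeffding_sample_size(eps, delta):
    """Cells to sample so the density estimate is within eps of the truth
    with probability at least 1 - delta (standard Hoeffding bound)."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

def approximate_density(g, m, I, eps=0.1, delta=0.1, seed=0):
    """Estimate the density of the bicluster around (g, m) by sampling cells
    of the rectangle m' x g' instead of scanning all of them (illustrative)."""
    rng = random.Random(seed)
    m_prime = [obj for obj, attrs in I.items() if m in attrs]   # objects having m
    g_prime = list(I[g])                                        # attributes of g
    n = min(hoeffding_sample_size(eps, delta), len(m_prime) * len(g_prime))
    hits = sum(rng.choice(g_prime) in I[rng.choice(m_prime)] for _ in range(n))
    return hits / n
```

For instance, with ε = 0.3 and δ = 0.1 this bound asks for 17 samples, roughly the scale of the 10-sample example on the slide, whose exact parameters are not stated here.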
Later on, these SNP descriptors are decoded back to their identifiers, geneticists analyse them, and we are in the process of publishing a paper together with them in which the findings are discussed with experts in terms relevant to genetics. I will skip everything that is left; it is just what has been done and what can still be done, and I am ready to answer your questions. Thank you. Do you have any questions? Yes, please. Could you show the slide with the complexity of the algorithm? So, the complexity is, let's say, linear: it is linear in each of the input parameters, that is, the number of non-empty pairs in the input relation, the number of objects and the number of attributes. But for maximal rectangles, that is, maximal bicliques or formal concepts, it is not like that: it looks polynomial, but in fact it is not. If you have a very simple object-attribute matrix, say of size four by four, where all the crosses are present except the main diagonal, then, am I right? That seems so. Then, if this is G and this is M, and their size is n, here four, you will have two to the power n patterns. So the complexity has to be stated in terms of the input and also in terms of the output, and in the worst case you will even have exponential complexity. But we managed to use approximate patterns: we lose some information, but at least we can work with rather large data. Thank you. Any more questions? If not, let's thank the speaker again. This was the last talk of the session, and now I invite you all to go to the main hall, because there is going to be a closing ceremony and we are going to announce the best paper in this track as well. So thank you all, and let's go there. Thank you.