So thanks, everyone, for coming. I would like to start our section on Data Analysis and Machine Learning with a big thanks to our hosts in Armenia. Armenia is going through hard and dark times, and I would especially like to thank the hosts for having the strength and courage to still help us organize this event. My name is Sivgini Sambalov, I have a PhD in Computer Science, and right now I am a Machine Learning Scientist at Aptec. Together with Dr. Maksim Panov from TII, the Technology Innovation Institute in the UAE, we will be chairing this session on Data Analysis and Machine Learning. This year we had around 17 submissions and only five made it into the final proceedings, a little bit more actually, and unfortunately most of our speakers couldn't come, for different reasons. So I would like to thank the speakers who did come, and now we will have our first offline talk. It is about detecting design patterns in Android applications with CodeBERT embeddings and CK metrics, and the authors are Djilvet Dlamini, Ahmet Usman, Leonel Karquank and Vladimir Ivanov. Please welcome our offline presenter.

The time is limited, but nevertheless the topic is fascinating for me. First of all, a couple of words about why I am working in this direction. I am head of the NLP lab for software engineering at the university, and we have a large contract, supported by the Russian government for about seven years, to develop tools for software engineering automation: generative models, generative AI applied specifically to the software engineering domain. My lab is focused on code and text representations for source code understanding, summarization and similar tasks, and on the analysis of textual information. This talk is motivated by a specific problem that software engineers, and not only software engineers but also managers of software projects, may have in real life. Many applications are being developed and just stored somewhere, but program or project managers sometimes have trouble understanding whether the structure of a project (we will talk specifically about Android applications here) follows some predefined agreements about the architecture, the design, the patterns that the developers agreed to use when they implemented the system. The motivation of this talk is that we can try to solve this task, at least partially, using machine learning tools. Typical architectural design patterns are shown on the right part of this slide. Raise your hand if you don't know what MVC, MVP or MVVM are; basically, they are different approaches to splitting the internals of an Android application: separating the visualization part from the data management part and from the modeling part, and decoupling these three parts in order to make software development easier, more robust and more maintainable. Classical source code metrics were used maybe 20 or 30 years ago to analyze what is going on in software, and basically they extract structural information. We will talk about object-oriented programming here, and there is a well-known approach called CK metrics, which can be calculated from almost any piece of object-oriented code.
They deal with the structure of methods and classes: how they are organized, how many methods are in a class, and so on. These metrics are widely employed not just for assessing code quality but also for pattern detection. There are related approaches, and I will not spend too much time on this: approaches devoted to analysis of the bytecode (we are discussing Android applications), approaches based on semantic features of the source code, for example training a Word2Vec model to extract semantic features, putting them into an embedding and then training a classifier on this Word2Vec representation, and a third kind of approach that uses CK metrics only. Now you know what a CK metric is: you run through the project, calculate all the CK metrics for the classes, and then make a decision about the pattern. Our research questions were related to these two things. First, is it actually useful to add something on top of the CK metrics: is it possible to use a pre-trained machine learning model, one that already knows something about source code from many projects, to extract features on top of the CK metrics that will be useful? And if it is possible to improve, how well does such a model work? I will talk about these two contributions: the approach itself, which is very simple (maybe you will have some questions at the end, but it is not hard to understand what is going on), and the analysis of whether the embeddings and the classical features are useful for pattern detection.

The methodology may look a bit complicated in this picture, but nothing fancy is happening. If you have an Android application project with many Java files, you can calculate CK metrics, and you can also calculate embeddings, the vectors that correspond to each file. For each Java file, the pre-trained model outputs a fixed-size vector. The second step is to pool, to aggregate, those vectors into one single vector, and then to train a classifier on top of this. So we have a first path that uses only the CK metrics, and a second path where the CodeBERT model, a big deep learning model, extracts features from the source files. We then have two types of features and can decide to use either both of them or only the CK metrics. For the classification part we use CatBoost, which is a state-of-the-art model. The thing is that here we cannot train the feature extraction part: the usual approach is to fine-tune CodeBERT or some other transformer that extracts the features, but we do not have enough data for training, which is why we just use the pre-trained, frozen layers of CodeBERT. To show the whole process step by step: the metrics representing one source file end up in one long list of numbers, each number corresponding to a specific metric. You then have this array of metrics, around 10 or 20; actually more, because we use the extended package with 82 metrics.
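As an illustrative aside (not code from the paper), here is a minimal sketch of the frozen per-file embedding step described above, assuming the public microsoft/codebert-base checkpoint from Hugging Face and the CLS vector as the file representation:

```python
# Sketch of the per-file embedding step, assuming the public
# microsoft/codebert-base checkpoint; the authors' exact setup may differ.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()  # frozen: inference only, no fine-tuning

def embed_file(source_code: str) -> torch.Tensor:
    """Return a fixed-size vector for one Java source file."""
    inputs = tokenizer(source_code, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the [CLS] token representation as the file embedding.
    return outputs.last_hidden_state[:, 0, :].squeeze(0)
```

Each project then yields one vector per Java file, which are aggregated as described in the talk.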
Our hypothesis is that these metrics more or less uniquely identify an object-oriented design, measure the complexity of the design of the project or of a particular class, and can capture coarse-level characteristics related to coarse-level patterns, which is our target task. As for the CodeBERT embeddings: you may have heard about BERT, the pre-trained model from Google. CodeBERT is a similar idea, almost the same architecture, but pre-trained on a huge corpus of source code. Instead of Wikipedia and internet corpora they used, essentially, all of GitHub, and trained a masked language model on this data. The resulting model is of course not the state of the art right now, but it is still an application of natural language processing methods in the software engineering world that extracts information from source code. Our hypothesis, and again it is just a hypothesis, is that it may capture patterns at a finer level than CK metrics, because it captures token-level representations. The problem here is that for each file you get a set of representations, and the question is how to combine them. We tested several aggregation (pooling) ideas, such as taking the maximum or the average, and in the end we used the summation of these values, because we got better results with this approach.

About the experiments. The methodology just takes the data, and the question is where we get it. The data comes from an existing repository, an already annotated dataset with pattern labels. It is a very small dataset anyway: around 26 projects, of which 22 do not have any design pattern. Still, if you consider only the testing part, it is doable to apply this methodology in testing mode. So that is the dataset. I already discussed CatBoost a bit; the hyperparameters you can find in the paper, of course. We calculated classification metrics to evaluate the quality, and five-fold cross-validation was used because the classes are somewhat imbalanced, not hugely imbalanced, but this way we can resample the data and estimate the error. As for the results, the first table shows not very fascinating numbers, but for some patterns it achieves better results than the original paper on this topic. It compares CK metrics only against the model that combines both kinds of features. You can see that for some patterns there is a drop in quality, for some it is the same, but in most cases there is an improvement, a slight one, but it can be considered an improvement. Again, I am a bit pessimistic about these numbers, because they are still far from really good quality. And here we compare this by pattern type: how the CK metrics compare to the CodeBERT embeddings plus CK metrics.
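To make the aggregation and classification steps concrete, here is a minimal sketch (not the authors' code): per-file vectors are sum-pooled into one project vector, concatenated with the project-level CK metrics, and fed to a CatBoost classifier evaluated with five-fold cross-validation. The `dataset` iterable and `labels` array are hypothetical placeholders.

```python
# Sketch of the aggregation + classification path; feature shapes and the
# 'dataset'/'labels' objects are illustrative, not taken from the paper.
import numpy as np
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_val_score

def project_features(file_embeddings: np.ndarray, ck_metrics: np.ndarray) -> np.ndarray:
    """file_embeddings: (n_files, 768) CodeBERT vectors; ck_metrics: (82,)."""
    pooled = file_embeddings.sum(axis=0)          # sum pooling over files
    return np.concatenate([pooled, ck_metrics])   # combined feature vector

# One row per project, one design-pattern label per project (hypothetical data).
X = np.stack([project_features(e, m) for e, m in dataset])
y = np.array(labels)

clf = CatBoostClassifier(verbose=False)
scores = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")
print("5-fold macro-F1:", scores.mean())
```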
In most cases you can see that the plain CK metrics, without any embeddings, are outperformed by the embeddings, but sometimes, as in this example, the CK metrics alone are better. Now, the discussion of our initial research questions. We also analyzed the importance of the different metrics, and this is where it comes to practical application. You can train a CatBoost model that will predict the design patterns, but when it comes to real-world practice it is more important to get some interpretation of why the model predicts this or that. So this part of the work was about finding the top five metrics that contribute to the decision. The analysis confirms that some of those 82 metrics are actually not relevant to, not responsible for, the design pattern, and it identifies the metrics that you, as a manager, should pay attention to: whether or not you use machine learning, you can compute these metrics from the project's source code and say that this is an indicator of this or that design pattern, or of the absence of a design pattern. I will not go deep into this, because it requires domain knowledge; you should be an Android developer to understand what it means, and maybe it is not that interesting here. The improvement that machine learning gives us on top of the CK metrics is, as I said, not that big, and we call it moderate. You could say it is maybe not even an improvement, but it opens a big question about the applicability of machine learning models of this kind to source code analysis. It shows clearly that this is not an easy task. Current language models applied to code also struggle with the analysis of bigger contexts, large projects and so on; they usually process function-level or class-level context sizes, and this is still an open question, still ongoing research. I will conclude with the limitations. The major limitation is of course the size of the data, and then the interpretability of the embeddings themselves, which are just lists of numbers. Interpreting such a list of numbers is not that useful: CK metrics at least mean something, whereas the BERT-style embeddings may be much less meaningful. We are going to continue this work, develop a more robust and interpretable approach, and do an ablation study without the dimensionality reduction: for the sake of performance we reduced the dimensionality of the embeddings, but we could do this on a more granular basis to understand the exact impact of CodeBERT. Do I have time? One minute? Okay, just the conclusion then. You can find the source code and data here. This work was supported by the Russian Science Foundation and was done at Innopolis University, and I am ready to answer any questions.

Thanks for your really interesting talk. There is going to be more and more code, and we need to process it in a more automated way. The preference now is for offline participants to ask questions. Do you have any questions, anyone in the room? Okay, please use the microphone so that our online participants will also hear us.
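As a hedged illustration of the feature-importance analysis mentioned in the talk (not the authors' code), CatBoost exposes per-feature importances after training, from which the top five contributing metrics can be read off; `feature_names` is a hypothetical list of the 82 CK metric names plus embedding dimensions.

```python
# Sketch: rank features of a trained CatBoost model; 'X', 'y' and
# 'feature_names' are placeholders carried over from the previous sketch.
import numpy as np
from catboost import CatBoostClassifier

clf = CatBoostClassifier(verbose=False)
clf.fit(X, y)

importances = clf.get_feature_importance()
top5 = np.argsort(importances)[::-1][:5]
for idx in top5:
    print(feature_names[idx], round(float(importances[idx]), 2))
```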
I just want to ask: is CodeBERT available in only one size?

You mean the size of the input?

I mean the size of the model.

The model, right, the number of parameters. So the question is about the number of parameters. We used only one pre-trained model, one set of parameters. It is not like BERT itself, which comes in several sizes, small, large and so on. But the problem here is maybe not the number of parameters, because it is not a big number anyway compared to modern large language models; the problem is the input length. It is restricted to 512 tokens, and if you want to process the whole project you have to squeeze each source file into 512 tokens, and some of them are just too big. So we cut all the comments and some unnecessary parts; there was a pre-processing stage to make the files fit. Even so, some of the files still do not fit into the CodeBERT input, so that is another limitation, another problem of this approach. One idea could be to move to bigger models that have a larger input window, a larger context size, which could help; or to switch to fundamentally different models that can process the whole repository, not just file by file. Did I answer your question? Any other questions?

Thank you for your presentation. It is not really a question, more of a discussion point. I think it would be cool if you extended your work to finding anti-patterns in code. As a software developer, I think such a system might be helpful when new code is published and changes in a pattern appear: not measuring the whole pattern on the project, but checking a small part of the code against the whole project.

A good suggestion, thank you. There are such works on classifying and finding anti-patterns in source code, and as I understand it, your suggestion is to classify the commits that can break some pattern. Yes, there are such works, and it is definitely a direction. And coming back to the first slide about the Innopolis group: it is not only my lab at Innopolis that works on this; other people are trying to approach different problems. These days it is a bit hard even to find a good idea that will be feasible, but there is, of course, a number of problems both for practice and for research. So thank you for your suggestion, even though it was not a question.

Are there any questions from the Zoom participants? Okay, we don't really have any time left for that, so let's thank the speaker again for the talk.

The following talks will unfortunately be held online, and our next talk is a data-driven approach for identifying the functional state of hemodialysis fistulas using entropy-complexity and formal concept analysis, by Ekaterina Zvorykina, Yuriy Bishasnov, Mayid Sohradi and Vasily Gromov. They will be talking via Zoom. Dear online speakers, please share your screen or report to me if you have any problems with that. You have 18 minutes for the talk and seven minutes for the questions, thanks.

Before the entropy part, I would like to give some context about the problem in this field. As you know, chronic venous diseases are among the most common pathologies that we have in humans now.
There is a growing use of AI methods for diagnosing these problems. In our work we decided to tackle this issue for one group of people: patients with chronic kidney failure, specifically dialysis patients. The prevalence of chronic renal failure in these patients is growing, especially after the COVID pandemic, because people who need dialysis face many issues, and if they cannot get to the hospital in a timely manner and have to stay at home in unsuitable conditions, it of course leads to further complications. So what is an arteriovenous fistula? It is a special insert into the blood vessels in the arm of a dialysis patient, which maintains an always available vessel for the dialysis procedures. It is a small channel that connects an artery and a vein and stays there for some time while the patient has dialysis; it can be years. It is a well-researched method with very good results in dialysis. But the main problem is that patients can get thrombosis in this small channel, because it is, of course, not a natural vessel: it is easy for it to accumulate waste products and to develop thrombosis, especially in elderly patients. The issue for the patient is that at home this condition can develop very fast, for example within a few hours or one day; the patient cannot diagnose it by himself, and it can lead to a lethal outcome. The good thing is that the fistula makes a very characteristic sound called a bruit. When the patient visits for dialysis, doctors also check how this small channel sounds. The sound is produced by the blood going through it, which is why it resembles a heartbeat, a pulse, but is of course a bit different. An experienced doctor can usually diagnose easily whether something is wrong or whether the patient needs a replacement of this channel. But we are not living in an ideal world: the patient cannot go to the doctor every day, and as I said, this condition can develop very fast at any time. That is why we decided to use mathematical approaches to tackle this issue. In this study we analyzed 290 patients from different sites in Russia, with the help of doctors who also analyzed them. Our motivation was to propose two mathematical methods for analyzing these fistula bruit sounds as time series, and to try to distinguish them based on the assumption that a normally functioning fistula produces the sound of laminar blood flow, while a pathological one is closer to turbulent flow, so that we can distinguish these two cases. In this project we decided first, of course, to investigate existing approaches, then to compute our metrics, an overlay of the entropy and complexity values of each sample on the entropy-complexity plane, then to apply two different clustering methods to classify the results based on these two metrics, and of course to interpret our results against the real situation given to us by the medical professionals. So how do these fistula bruit noises look? If it is a normal fistula, we see a more or less non-chaotic time series; if it is a pathological one, it looks very different.
In our case, unfortunately for us, though for the patients it was of course good, most of the patients had normal fistula sounds. Only 50 patients out of 700 records had a malfunctioning fistula, and there were also 61 cases where the doctors were not sure whether it was a pathology or not. For the doctor it is easy if he can see the patient the next day and decide whether something is wrong; for us it was not so easy, and this is a limitation of the method. Analyzing these noises, we also decided to use short segments of sound, because we did not see a big difference in the results of our methods between very long and short records. For the entropy-complexity analysis we used the Shannon entropy and the Jensen-Shannon divergence for complexity, and then ran a clustering algorithm on the results: each time series is assigned two values, entropy and complexity, and we cluster the points on this plane. The other approach involved constructing an attribute-object graph based on the same two metrics and the time series. Each time series was turned into a binary matrix based on the record, and this binary matrix was organized into an attribute-object graph where each element is an object; for each object, we checked whether its binary matrix values were aligned. Here is a simple example of such a graph, with about six objects; I will show later how it looks for the time series that we actually used: it is much bigger, but harder to follow.

Now the most interesting part: the results for the first method, the entropy-complexity analysis. This is how our clustering results look. We can easily see three groups of clusters. Where the complexity and entropy are both low, we consider the time series very organized and non-chaotic, and we think the fistula still works well. The yellow cluster means high complexity and high entropy: a very chaotic time series, and we think the fistula is malfunctioning. There is a relatively big cluster in the middle for which we cannot give a certain answer whether it is a working or a non-working fistula; this cluster is the question for the next part of our work, how to analyze this data. We also got a very small cluster at the top, where complexity and entropy are very high; I think this is due to bad recordings of the fistula sound, because it is recorded with a dictaphone or a microphone, doctors often speak during the procedure, and sometimes we cannot eliminate this sound. For the object-attribute approach, these are the clusters we got. As you can see, for an organized time series we get a very nice scheme, while an unorganized one looks more like a mess, because most of the objects in those clusters had many aligned values, and this is considered a bad sign. We also show how it looks in the XY plane, so it is easier to follow. In this case, this method actually gave better, more distinguishable results: we can see two clusters, which is good, where one means a healthy fistula and the other means that the patient should go to the doctor. The two methods agree, but the question remains how to interpret the third cluster, which is not very chaotic and not very organized, and why we do not see it here.
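As an illustrative aside (not the authors' code), here is a minimal sketch of the entropy-complexity plane idea just described. The probability distribution used per recording is not specified in the talk, so this sketch assumes a simple amplitude histogram, with normalized Shannon entropy and a Jensen-Shannon-based complexity value, followed by clustering on the plane; `recordings` is a hypothetical list of 1-D signals.

```python
# Sketch of the entropy-complexity plane; the amplitude-histogram distribution
# is an illustrative choice, not necessarily the one used by the authors.
import numpy as np
from scipy.stats import entropy
from sklearn.cluster import KMeans

def entropy_complexity(signal: np.ndarray, bins: int = 32):
    p, _ = np.histogram(signal, bins=bins)
    p = p / p.sum()
    u = np.full(bins, 1.0 / bins)                       # uniform reference
    h = entropy(p, base=2) / np.log2(bins)              # normalized Shannon entropy
    m = 0.5 * (p + u)
    jsd = 0.5 * entropy(p, m, base=2) + 0.5 * entropy(u, m, base=2)  # Jensen-Shannon divergence
    return h, jsd * h                                    # (entropy, complexity-like value)

points = np.array([entropy_complexity(x) for x in recordings])   # hypothetical input
labels = KMeans(n_clusters=3, n_init=10).fit_predict(points)     # clusters on the H-C plane
```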
When we tried to combine the results of the two methods with a simple linear regression, we got, I think, five clusters, but the tendency looks the same: we can distinguish chaotic and organized time series quite well, but there is a very big cluster, actually a few of them, in the middle that we cannot assign as healthy or not. After talking to the doctors, this also aligns with reality, because they sometimes cannot diagnose these patients either. When we compared our results with the doctors' analysis, we got fairly good agreement, even though we had fewer than 300 samples to discuss with them. As a conclusion, I can say that we developed these two new methods, not only for time series analysis but also for diagnostics of fistula function. Both approaches agreed with the doctors, individually and in combination, and we observed a significant correlation between the classification results; we also compared them not only with the doctors' analysis but also with Doppler ultrasound for some patients, and it worked well. Regarding the limitations of the study: of course, the sample size is relatively small, fewer than 300 patients. The data was also anonymized, so we could not compare, for example, female and male patients, we did not know whether they were overweight, and we did not know the patient's age, all of which can lead to different sounds. It is also still a black box for us: we cannot say whether our clusters correspond to the same functional categories that doctors assign to patients. For the future development of this project, our perspectives are, of course, a bigger sample group and more information about the patients, for example their age, weight, or comorbidities. Also, after the fistula construction procedure, the fistula goes through several stages of maturation in the first months, and we can assume that the third cluster that we see in the middle may be due to fistulas that are still, as the doctors call it, "young"; that is why we cannot compare them with the other fistulas. Our method could therefore also help to see whether a postoperative fistula will be healthy and work for a long time, or whether the patient will need another operation soon. That's it, thank you for listening, and if you have any questions, I am here to answer them.

Thank you, Ekaterina, and special thanks from me as the chair: you made it in less than 14 minutes, so we can keep on track. Are there any questions from the offline audience? Yes, okay, I'll start with you, Dmitry.

Thank you very much for your work. I am especially happy that you used formal concept analysis, since this is one of our favorite topics in our department. My question is about clustering: whenever you use formal concepts and concept lattices, they form large search spaces rather than a single clustering. What kind of selection tools did you use here to find a good clustering, a good partition? Can you comment on that?

Yes, let me come back to the clusters. Actually, we did not develop any new method; we used one already described in another article. We only used an extra tool to go through the alignments of different objects in the lattices a second time, just to see whether they combine into larger objects or not. So it is more of a heuristic way to extract a clustering.
I must say that even brute force was faster for going through the lattices than any other method; we used it only because it was faster in this case.

Allow me one more comment. At the beginning of the millennium I was part of an international dialysis center, working as a system administrator, and I remember the people who took these procedures three times per week. It is very important to support their lives. And one more comment from Zhaumeb Bashariye, a program committee member: he had relatives who passed away from this, and he supports your work very much. Thank you.

Thanks, it's great to hear that it is not useless. Thank you, Dmitry.

Thank you for a very nice talk. I actually have a couple of questions. The first one, maybe to clarify: was the clustering done because you don't have ground-truth labels for the samples, or am I missing something? On the slide you have labels like 0 and 1, so the question is why you didn't treat it as a supervised problem.

Here we wanted to try clustering rather than classification, which is what most people working in healthcare AI would use, because we were not sure how many clusters we would get in the first place. Here, yes, we see two clusters, but we would also be happy to see more than two, and if I analyze more data, for example another batch of patients from different hospital sites, there could be more than two clusters. We are also very cautious about assigning labels, because we are not sure whether these clusters correspond to healthy and non-healthy patients or just to chaotic and non-chaotic time series.

Okay, thank you. And the second question: as I understand, you analyze time series, and the time series comes from the sound. Am I correct? Is it the sound wave amplitude?

Yes, it is the amplitude: just a recording from the microphone, and then we process this amplitude.

My question then: have you pre-processed it or transformed it into the frequency domain? And if you didn't, why? Maybe there is some motivation behind analyzing the pure amplitude data in your case, but usually people transform it to the frequency domain and do some pre-processing.

The only thing we did was to try to clean up the sound with different methods, but even after cleaning it did not help much, because the records were made by different doctors in different conditions. If we try to make them ideally clean, we lose the actual sound, and if we make them about 20% cleaner, there is no change in the results. But I appreciate this idea; we should try to analyze it in the frequency domain.

Yes, it is a kind of classical approach if you want to cluster or classify sound, so maybe it is worth at least trying to convert to frequencies. Okay, anyway, thank you very much.

Thanks. Are there any questions from the online audience? Okay, it seems we don't have any questions from Zoom. Any questions offline? No, okay, so let's thank Ekaterina again for her talk. The next talk is an application of dynamic graph CNN and IFICP for detection and research of archaeological sites, by Aleksandar Vakhmintsev, Olga Kristo-Dula, Andrey Milnikov and Matvei Romanov.
Dear presenters, are you here? Yes, we see some screen sharing.

Dear colleagues, I had a very unpleasant experience before the conference: it turned out to be a rare variant of coronavirus...

Could I please ask you: we have an international conference, and some of the attendees don't understand Russian. Could you please switch on the camera?

Well, one of the problems is that the head doctor allowed me to participate in the conference and take part in it, but given the requirements of the hospital, a military hospital, I was not allowed to give the talk in English.

Excuse me, could you please introduce yourself? We would prefer not to have the talk in Russian.

Sorry, yes. So you are forbidden to speak English, by the doctor? Okay, that is not entirely clear, but unfortunately some of the attendees really do not understand Russian. Maybe we could move your talk to the end: if you could please give your presentation as the last one, the people who are not prepared to listen to a report in Russian could leave for another section, so we would not waste their time. Would this be acceptable for you? Okay, thank you so much for understanding. Could you please stop sharing the screen then? I should now ask Vladimir Belikov, who is the next... oh, we have a second offline speaker. Please welcome him with applause.

Dear colleagues, let me introduce my presentation on a combination of clustering and ensemble models. Thank you very much. Clustering, ensemble models, and domain adaptation, or transfer learning. So what is transfer learning, sometimes also called domain adaptation? We have two domains. The first is the so-called target domain, or domain of interest, and besides that we have some additional information, the source domain. For example, we may have text in English and can use this information to improve classification or clustering of text in German, a related language. There are different kinds of transfer learning, supervised or weakly supervised, depending on the labeling process: some data may be completely unlabeled, or imprecisely labeled, and so on. In this work we consider so-called heterogeneous transfer learning, where the domains have quite different feature spaces. We have a target domain, for which it is required to obtain a partition of the dataset into some number of clusters, and the additional information is the source domain, where objects are described in a different feature space with a different dimensionality. The source domain is labeled, and we need to perform clustering of the target domain. It is hypothesized that the domains have something in common in their structure, some common regularities, and these regularities can be revealed by cluster analysis and used as additional information to improve clustering. What we mean by improving, we shall discuss later. We use the cluster ensemble methodology, where we have a number of clustering results and combine them to obtain a consensus variant of the partitioning. There are some known works on this problem, but these works have certain limitations.
For example, they assume common feature spaces, or their time complexity is cubic, which is too much for many applications, or they require multiple additional source domains, which are not easily found in practice. We propose a method with four basic stages. The first is independent analysis of the data: we perform clustering of both the source and the target domain in parallel, independently, and use a low-rank representation of the obtained similarity matrix to decrease the computational cost. The next stage is extraction of knowledge: we use a supervised classification algorithm and find a classifier for predicting the elements of the co-association matrix in the source data, because we know the labeling in the source data, and then transfer it to the target data using so-called meta-features. Meta-features are features that describe some common regularity of the data structure, for example the number of clusters, the form of the clusters, or some characteristics of that form, and they do not depend on the initial feature spaces. We then use the found regularities to predict the co-association matrix in the target domain and, in the last step, perform the final clustering: we construct the partition of the target data using the predicted co-association matrix. A few words about ensemble clustering; here you can see an example. We have several partitions of the data, obtained by different algorithms, or by one algorithm with different initializations, different working parameters and so on. We then compute the averaged co-association matrix: each element is the frequency, for a given pair of objects, of falling into the same cluster. Using this matrix we find a consensus partition with some algorithm that takes this similarity information into account, for example a hierarchical clustering algorithm or spectral clustering. The co-association matrix can be represented in a low-rank form using rectangular cluster membership matrices, and this gives significant memory and computational savings, because it is not necessary to store a large quadratic matrix in memory, for example a million by a million if that is the number of objects.
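A minimal sketch of the ensemble step just described (illustrative, not the authors' implementation): several k-means runs with different initializations, an averaged co-association matrix, and a consensus partition obtained by clustering that matrix. Parameter values are placeholders.

```python
# Sketch of average co-association ensemble clustering (scikit-learn >= 1.2
# for metric="precomputed" in AgglomerativeClustering); parameters are illustrative.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def co_association(X: np.ndarray, n_runs: int = 20, k: int = 3) -> np.ndarray:
    n = X.shape[0]
    coassoc = np.zeros((n, n))
    for seed in range(n_runs):
        labels = KMeans(n_clusters=k, n_init=1, random_state=seed).fit_predict(X)
        coassoc += (labels[:, None] == labels[None, :]).astype(float)
    return coassoc / n_runs  # frequency of each pair sharing a cluster

X = np.random.default_rng(0).normal(size=(200, 5))   # toy data
S = co_association(X)
# Consensus partition: treat 1 - co-association as a distance and cluster it.
consensus = AgglomerativeClustering(n_clusters=3, metric="precomputed",
                                    linkage="average").fit_predict(1.0 - S)
```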
The steps of the algorithm are shown here: we perform independent analysis of the source and target data and use spectral clustering for the low-rank representation of the similarity matrix. Some previous papers have considered probabilistic properties of cluster ensembles: if we suppose that there is some ground-truth variable that determines, for each pair, whether it belongs to the same cluster or to different clusters, we can define the conditional probability of classification error, and under some regularity and probabilistic assumptions it is possible to prove that this classification error converges to zero as the size of the ensemble grows; other things being equal, greater diversity in the ensemble gives a smaller error. However, in practice some, many, or all of these assumptions can be violated, and we may, for example, not have much memory, so for a small ensemble size we use the additional source data to improve the clustering results. As meta-features, first of all we use the frequencies of assignment of objects to the same clusters, that is, the elements of the co-association matrices from the source and target domains. Then we use a meta-feature based on the silhouette index, a well-established internal clustering index, which we define for each pair of objects and use as additional information. We then form the co-association matrix and find a decision function that predicts its elements depending on the meta-features used; for this, machine learning algorithms such as random forest, support vector machines, or artificial neural networks can be applied. Well-known techniques can be used to evaluate the quality of the classifier or to find the important meta-features. Then it is possible to transfer the fitted classifier to the target domain for predicting its co-association matrix, and the final step is clustering based on the predicted matrix. There is a problem: this matrix cannot be used for clustering directly, because some metric properties may be violated, so we apply an approximate solution. We start from some initial partition of the target data and then migrate individual points to other clusters, looking for the best improvement of the criterion. These are the steps of the algorithm: we perform the three or four stages, independent analysis, finding the meta-features, fitting the classifier and transferring it from the source to the target domain, and finding the final partition of the data. Unfortunately, the time and memory complexity is of quadratic order, because we need to consider all pairs of elements, but it can be improved: using methods such as stochastic gradient descent it is possible to consider only part of the data, not all pairs of points, via sub-sampling, and this reduces memory. We have performed experiments with artificial datasets using Monte Carlo simulation: we generate data multiple times, perform clustering, measure the quality of the result, and then average the results over all experiments. Here are some examples of the generated data; we use k-means as the base algorithm, and random forest and support vector machines for the classification and knowledge transfer. The next example is more realistic, though I think it is also mostly an illustration of the method: we used the MNIST dataset of handwritten digits and performed classification with a feed-forward artificial neural network with batch normalization.
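As a hedged sketch of the knowledge-transfer step (a classifier that predicts co-association entries from pair meta-features), assuming random forest as the decision function and simplified meta-features; `X_source`, `y_source`, `coassoc_source`, `X_target` and `coassoc_target` are hypothetical inputs, and the exact meta-feature definitions in the talk may differ.

```python
# Sketch: learn to predict co-association entries from per-pair meta-features
# in the source domain, then apply the classifier to the target domain.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import silhouette_samples

def pair_meta_features(X, coassoc, k=3):
    """Simplified per-pair meta-features: co-association frequency and mean silhouette."""
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    sil = silhouette_samples(X, labels)
    i, j = np.triu_indices(X.shape[0], k=1)
    feats = np.column_stack([coassoc[i, j], (sil[i] + sil[j]) / 2.0])
    return feats, i, j

# Source domain: the true "same cluster or not" targets come from the known labels.
feats_src, i, j = pair_meta_features(X_source, coassoc_source)
same_cluster = (y_source[i] == y_source[j]).astype(int)
clf = RandomForestClassifier(n_estimators=200).fit(feats_src, same_cluster)

# Target domain: predict its co-association entries from its own meta-features.
feats_tgt, ti, tj = pair_meta_features(X_target, coassoc_target)
predicted = clf.predict_proba(feats_tgt)[:, 1]
```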
In addition to the meta-features mentioned above, we use further meta-features such as the normalized pairwise distances between objects and the average distances to the closest centroids. To evaluate the quality of the clustering, we applied an external cluster validity index. There are different types of external indices, for example the adjusted Rand index: it estimates the degree of similarity between two partitions, the obtained one and the ground-truth partition, and the index is corrected for chance; it estimates the probability of coincident assignment of object pairs to the same or different clusters. The formula is given here. The closer it is to one, the better the matching between the two partitions, and an index close to zero indicates a nearly random correspondence. These are the results of the experiments. First of all, for the artificial data, it can be seen that the proposed algorithm gives some improvement in clustering quality. This is an example of the decision boundary obtained with the support vector machine algorithm: the silhouette-based feature also gives some information about the decision boundary, and both features are useful for classification, but of course the co-association-based one is more important. These are the results of the experiments on the real data, the MNIST database; it can be seen that the algorithm also gives some improvement, not ideal of course, in clustering quality, in terms of the average, best, and worst results. Here is an example of the clusters obtained: you can see that the first cluster contains correct assignments, with some mistakes in the second cluster. In conclusion, we propose a modification of ensemble clustering based on transfer learning. Of course, it is not an ideal algorithm, and some future work is planned: for example, together with my students from Novosibirsk State University, we are going to look at other types of meta-features and at applications of the method in different fields, for example text document analysis. That is all, thank you for your attention.

Thank you for an interesting talk. Any questions from the audience? Well, if I may, I have one question myself. To the best of my understanding, and please correct me if I didn't understand correctly, your work adds another layer on top of classification and clustering to solve this problem; you are doing it at the meta level. My question is actually about the interpretation of the features: when you add more and more layers, even if the underlying classifier is interpretable, say a linear classifier that you can interpret, is this interpretability passed through your algorithm? What do you think, could some approaches in this direction help those who want to do it?

Yes, I think it is possible to evaluate the importance of the features using methods such as random forest and so on, if I understood the question correctly.

Okay, so you are proposing to do the interpretability analysis on the produced meta-features, right, okay. Thank you. Any other questions?
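A small illustration of the external validity measure mentioned in the talk (not the authors' code): the adjusted Rand index, ARI = (RI - E[RI]) / (max RI - E[RI]), is available in scikit-learn and compares an obtained partition with the ground-truth one in a chance-corrected way.

```python
# Sketch: adjusted Rand index between a ground-truth and an obtained partition.
from sklearn.metrics import adjusted_rand_score

truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
found = [0, 0, 1, 1, 1, 1, 2, 2, 0]
# 1.0 means a perfect match; values near 0.0 indicate a nearly random labeling.
print(adjusted_rand_score(truth, found))
```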
Okay, maybe a second question from me, related to the number of hyperparameters, because, at least from my understanding, this is a fairly complicated data analysis with feature engineering on top of it. I am interested in how many hyperparameters your meta layer adds to the simpler parts.

If I understand the question correctly, I think about a dozen hyperparameters for the meta-features, following different literature; we tried varying some of them, but the effect was pretty small.

Okay. I assume there are also no online questions, since the selected people are asking from Zoom, but there may be some from the YouTube stream. So thanks, let's thank the speaker again. Thank you so much. Our next talk should be an online talk; the work is named "Metamorphic testing for recommender systems" by Sofia Yakusheva and Anton Hretankov, sorry if I misspelled something. Let us switch to the Zoom stream now.

Hello everyone, I am very happy to see all of you today and that I can present our work to you. My name is Sofia Yakusheva, I am an assistant at the department of algorithms and programming technologies at the Moscow Institute of Physics and Technology, and Anton Hretankov is my supervisor. Today we will talk about metamorphic testing for recommender systems. Recommender systems are a popular topic today, but the problem is that they can badly hurt users if we don't test them carefully, so we should test them as carefully as we can. However, we face a lot of problems, for example the lack of test data and the need for human judgment; all of these have a very high cost, so we cannot rely on humans much. Another problem is the stochastic nature of recommender systems, because not all testing methods can be applied to such systems. And last but not least, the test oracle problem. What is the test oracle problem? A test oracle is a partial function which, in simple words, says whether a test has passed or not. The simplest example: if we have the right answer for a test, we can just compare the answer of the program with the right answer and say whether the test passed. But the test oracle problem says that sometimes it is computationally expensive to obtain a test oracle, and testing recommender systems is such a problem: in general it does not have a test oracle. Fortunately, there are techniques for testing such problems, and one of them is metamorphic testing. The key idea of this method is not to check every single output; instead, we have many inputs of the program and many corresponding outputs, and our task is to check whether some relation holds between these inputs and outputs. The simplest example: if we query a database twice, using filter A for the first request and filters A and B for the second, then the answer to the second request must be a subset of the answer to the first (a small sketch of this follows below). We do not check whether the first or second answer is correct by itself; we only check the relation between them. We tried to apply this method to testing stochastic systems. I should say that some articles on metamorphic testing use statistical methods, but they use only the criteria and do not pay much attention to a general view of such stochastic testing, so we make a generalization and propose stochastic metamorphic relations.
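To make the database-filter example concrete, a tiny illustrative metamorphic test (not from the paper): the relation checked is that adding a filter can only shrink the result set, without needing an oracle for either individual answer.

```python
# Sketch of a metamorphic relation: query(filters A) vs query(filters A and B).
# 'query' is a stand-in for the system under test.
def query(records, filters):
    """Return records satisfying every predicate in 'filters'."""
    return [r for r in records if all(f(r) for f in filters)]

records = [{"price": p, "rating": r} for p, r in [(5, 3), (20, 5), (8, 4), (50, 2)]]
filter_a = lambda r: r["price"] < 30
filter_b = lambda r: r["rating"] >= 4

out_a = query(records, [filter_a])
out_ab = query(records, [filter_a, filter_b])

# Metamorphic relation: the second answer must be a subset of the first.
assert all(r in out_a for r in out_ab)
```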
So what is that? A classic metamorphic relation is just a deterministic function from many inputs to the set {0, 1}. We consider our relation as a composition of a sampling procedure and a decision function: the sampling procedure is stochastic, and the decision function is something like a statistical criterion. In this case we get much more information about the system than if we used only classic metamorphic relations. We formulated some requirements for recommender systems in general; you can see all of them here: technical reproducibility, the ability to learn, response to change, comparison of models with different parameters, some synthetic checks, individual features of algorithms, and the response to linear transformations of distribution parameters. Some of them look specific because we apply our method to the multi-armed bandit problem. A multi-armed bandit is a model of a slot machine with several arms: the user selects an arm and may gain a reward, and the task of the user is to maximize this reward. This model is used in services like Amazon or Spotify, where we have a long list of options but can show the user only a short sub-list. We propose metamorphic relations for the multi-armed bandit problem, six of them, but I want to show only two; the other four can be easily derived from the requirements, and these two are more interesting, I think. The first one comes from the assumption of homogeneity of parameters: if we permute the bandit's arms, the reward should remain the same, and the algorithm should not pay attention to the index of the arm. I think this is a very important property of an algorithm. The second is a comparison of less and more profitable bandits: if we apply a linear transformation to the probabilities of getting a reward, the reward will change, it will be bigger, and for an optimal algorithm this reward should scale according to the same linear transformation that we applied to the probabilities. I hope I explained this clearly.

We tested a lot of bandit algorithms; some of them were stationary, some non-stationary, for example FDWTS is non-stationary, and for comparison we used a random algorithm and an optimal algorithm. We used a multi-armed bandit model with a Bernoulli distribution on each arm, and these distributions were stationary in time. We obtained some interesting results. For example, we compared different parameters for the EXP3 algorithm and noticed that for some parameter values the algorithm works a lot worse than for others, so our stochastic metamorphic relations are useful for detecting better algorithm parameters. For FD3TC we noticed that permutation of the arms has a lot of impact, which is not good. Here are some examples of failures that we detected. This picture shows a failure in the configuration: our metamorphic relation was applied only to the first algorithm in the bench of algorithms while the others stayed the same; it was detected, and we fortunately corrected it. Another example is a configuration failure too: in our project the configuration files were a bit complicated, and this picture shows experiments that were supposed to be almost identical but turned out to be completely different. And the most interesting result: we applied the sixth SMR, the one about the linear transformation of probabilities, to the algorithms.
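As an illustrative sketch of such a stochastic metamorphic relation (not the paper's exact relation), one can run a Bernoulli bandit on the original probabilities and on linearly transformed ones, and use a statistical test as the decision function over repeated runs; the algorithm (epsilon-greedy), horizon and threshold are assumptions for the example.

```python
# Sketch of a stochastic MR for a Bernoulli bandit: with q_i = a*p_i + b the
# total reward of a good algorithm should scale roughly as a*R + b*T.
import numpy as np
from scipy.stats import ttest_ind

def run_eps_greedy(probs, horizon=2000, eps=0.1, rng=None):
    rng = rng or np.random.default_rng()
    counts, values, total = np.zeros(len(probs)), np.zeros(len(probs)), 0.0
    for _ in range(horizon):
        arm = rng.integers(len(probs)) if rng.random() < eps else int(np.argmax(values))
        reward = float(rng.random() < probs[arm])        # Bernoulli reward
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        total += reward
    return total

p = np.array([0.2, 0.35, 0.5])
a, b, T = 0.8, 0.1, 2000
q = a * p + b                     # linearly transformed reward probabilities

rng = np.random.default_rng(0)
baseline = np.array([a * run_eps_greedy(p, T, rng=rng) + b * T for _ in range(30)])
transformed = np.array([run_eps_greedy(q, T, rng=rng) for _ in range(30)])

# Decision function of the stochastic MR: a two-sample test on the two samples.
stat, p_value = ttest_ind(baseline, transformed, equal_var=False)
print("SMR holds" if p_value > 0.01 else "SMR violated", p_value)
```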
We expected that if we applied this linear transformation to the first experiment, we would get the same reward as in the second experiment, where the probabilities were already transformed. But we did not get such a result: the values were different. For the random algorithm the values were the same, but for the others they were completely different, and in the picture you can see, for the purple line, an algorithm that does not even multiply the reward like the others do; this was the FDWTS algorithm, which is non-stationary. There we used the stochastic part of the metamorphic relations to analyze the error. You can see that for the random algorithm this difference is almost zero; for EXP3 it is not big, but still not zero; for the Thompson sampling algorithm this difference decreases over time, while for EXP3 it is rising, I think; and for the FDWTS algorithm the difference is significantly bigger than zero and much bigger than for the other algorithms. We consider this result an individual feature of the FDWTS algorithm, because the model used in this algorithm is more complex than in the others. In conclusion, we considered the problem of recommender system verification, proposed stochastic metamorphic relations, formulated requirements for recommender systems in general, derived stochastic metamorphic relations for the multi-armed bandit problem, and found some failures. The code of our experiments is available on the internet; you can see it via the links on the screen. And a small addition about experiments that we did not include in our paper: we tested some contextual bandits, which use additional context information, for example the season, the time of day, or the day of the week. We considered two algorithms, C2-UCB and LinUCB, with a constant context, and discovered that the LinUCB algorithm learns very well while C2-UCB does not learn at all; we concluded that the C2-UCB algorithm is much more focused on the context than LinUCB. So our stochastic metamorphic relations are useful for discovering such things as well. Thank you for your attention, my talk is over, and I will be happy to answer your questions.

Dear Sofia, thanks for your interesting talk, and additional thanks from me as the chair: you made it in 12 minutes, so we will be just on time. Do we have any questions from the audience?
I have a question about this approach; I am actually interested. To the best of my understanding, you propose another way of doing, let's say, A/B testing in a sense, and another approach to the multi-armed bandit problem. I am curious whether you have thought about a strategy of combining your algorithm with other known algorithms in the field, such as epsilon-greedy or Thompson sampling, whether that could yield better results, whether a combination of algorithms would work, for example switching from one mode of decision making on the test to another. Thanks.

Metamorphic testing is more about offline testing of the algorithm, when we test models or our implementations, a verification of requirements, and not A/B testing, but the idea of these SMRs is quite similar, because we use statistics. For future work, we will try to make compositions of these metamorphic relations for complex systems: we can propose a relation for a component of the system, but for the whole system it is much more complex and complicated. So this is a direction for our future work. About A/B testing: maybe it is a good idea, but unfortunately, for A/B testing we would need users or bots, and this is quite expensive for our small research project; we just ran some experiments and got results. But your idea is very interesting, thank you.

Thank you for the talk. Am I right that you tested everything on Bernoulli rewards at this stage?

Yes, just bandits with a stationary Bernoulli reward; the reward is 0 or 1 at every step.

The question is, do you plan to extend it to different models of reward?

Of course we plan to.

Good. Any other questions? Well, it seems like we are out of questions, but we still have some additional time, so let's thank the speaker again.

Thank you very much for your attention. Our next talk is titled "Application of multimodal machine learning for image recommendation systems" by Mikhail Fanyakov, Anatoly Bardukov and Delya Makarov.

Good morning, my name is Mikhail Fanyakov, I am a graduate of the Higher School of Economics, of the Master of Data Science program, my supervisor is Anatoly Bardukov, and my topic is the application of multimodal machine learning for image recommendation systems. We are living in an era of abundance of information, in which it can be very difficult to find the necessary information or some important content. People use online stores, e-commerce platforms and marketplaces, and all this information needs to be systematized. People upload images for everything they like, whether it is a product, a landscape, or some funny photo, so images have become a basis for the development of recommendations and one of the main data types of recommender systems. What is a recommender system? It is a class of machine learning algorithms that use data to help predict and find what people are looking for among an exponentially growing number of options. These systems have one basic goal: to solve the problem of information overload, making it easier for users to search for goods. However, traditional recommendation algorithms are not perfect. There are many types of data that can be used for recommendation: images of the goods, descriptive text of the goods, or some metrics that we can obtain. These properties or characteristics are called multimodal information, and the systems that use this information are called multimodal. What is multimodality? The application of multiple approaches within one medium. What is multimodal data?
Multimodal data is data of different types, such as different embeddings, text descriptions and various kinds of metrics. It allows us to build different features, which simplifies the training process and helps obtain high-quality recommendations. For building my recommender system we need a dataset for training the model. I used a dataset with images from Yandex with the following structure: an initial image, which can be a completely random image; a candidate image, the picture which is or is not recommended for the initial image; and the target, a binary target for the pair of images: 1 means the images are similar, 0 means they are not similar. More than 75,000 pairs have target 1 and about 30,000 have target 0, so the data has an imbalanced structure and needs some preprocessing. For building a multimodal system we should use different features for each image. An important part of our recommender system is the descriptive text of the picture. I parsed it from Yandex Images using the Beautiful Soup Python library, which is the simplest and most accessible method, and obtained text in different languages for each picture. I processed all images using CLIP, a very useful technique for image embeddings with many available models. For text processing I used another model, BERT, a multilingual model covering 104 languages. The next step of preprocessing is computing metrics: for each pair of images and texts I used two metrics to measure pairwise distances between the two CLIP vectors for the images and the two BERT vectors for the texts, cosine similarity and Euclidean distance. I chose among four models: a decision tree classifier, a random forest classifier, an XGBoost classifier, and a CatBoost classifier with and without queries. Queries are groups of candidate images; their number is equal to the number of initial images, and with them I got the best results. So our dataset has the following structure: ten features in total, two embedding features of CLIP dimensionality, two embedding features of BERT dimensionality, four numerical metric features, one binary target variable, and one categorical query feature, the number of the related image. Here we see that the optimal learning rate is 0.001. For my multimodal recommender system I built the model based on the CatBoost algorithm, an algorithm for gradient boosting on decision trees developed by Yandex engineers. I used the CatBoostClassifier class, the classifier version of the CatBoost algorithm, because my dataset has only two target values, 0 and 1. For model construction the data needs some preprocessing: the metric features were scaled using the standard scaler. I trained my model with the following hyperparameters: iterations set to 2,000, the optimal number of iterations for training a model on such data; early stopping rounds set to 20, so the algorithm stops training if the evaluation metric does not improve; learning rate 0.001, since I need a small gradient step size, which helps the model learn more accurately; max depth 6, the optimal depth of the trees for the classification model; and scale_pos_weight 2.64, because I have imbalanced data, and this value is equal to the ratio of the majority class to the minority class. As for the model results: as a loss function I used logloss, which measures the performance of a binary classification model, and as a custom evaluation metric I used AUC, which is an effective way to assess the performance of the model.
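A minimal sketch of the described CatBoost setup, using the hyperparameters quoted above (illustrative; `X_train`, `y_train`, `X_valid`, `y_valid` are placeholder arrays, and the exact feature matrix and query handling in the talk may differ):

```python
# Sketch of the CatBoostClassifier setup described in the talk; the data
# arrays are placeholders (CLIP/BERT embeddings plus scaled metric features).
from catboost import CatBoostClassifier

clf = CatBoostClassifier(
    iterations=2000,             # number of boosting iterations per the talk
    learning_rate=0.001,         # small gradient step
    depth=6,                     # tree depth
    scale_pos_weight=2.64,       # majority-to-minority class ratio
    loss_function="Logloss",
    eval_metric="AUC",
    early_stopping_rounds=20,
    verbose=False,
)
clf.fit(X_train, y_train, eval_set=(X_valid, y_valid))
probs = clf.predict_proba(X_valid)[:, 1]   # positive-class probabilities used for ranking
```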
0.83. This means the probability that a prediction for the positive class is ranked higher than a prediction for the negative class is 0.83. The loss of the model is 0.29. However, I will use the predicted probabilities of the positive class, the output of our model; they will help us to rank the results of our further experiments. The ranking metric shows the quality of ranking, as we will rank our images; its value is 0.77. So we need some experiments with our data, and for the experiments the data needs another round of preprocessing. What does this preprocessing look like? We set up three ways of selecting candidates: using the CatBoost model, using only the CLIP embeddings, and using only the BERT embeddings. For the experiments my dataset needs some preparation: I clustered it using the k-means implementation from the scikit-learn library, the most popular clustering method, and separated my dataset into 1,000 clusters according to the CLIP embedding vectors. I also obtained the cluster centroid of each image's cluster. Then, using this data, I computed the three nearest clusters for each image, and I split my dataset into pairs according to membership in these three clusters. For each pair I computed the cosine similarity and the Euclidean distance for the CLIP and BERT vectors. So I have a dataset with the following structure: the CLIP vector of the related image, the CLIP vector of the candidate image, the BERT vector of the related image, the BERT vector of the candidate image, the cosine similarity between the two CLIP vectors, the cosine similarity between the two BERT vectors, the Euclidean distance between the CLIP vectors, and the Euclidean distance between the BERT vectors. In my experiments we take, for each image, a certain number of candidates with the highest cosine similarity and the lowest Euclidean distance; in this case the number is equal to 5. In the first experiment, for the CatBoost model and the CLIP vector the system selected the correct images without any mistakes, so the accuracy is equal to 1; for the BERT embedding the system made mistakes in the first and the second images, and its accuracy is 0.4. In the second experiment the CatBoost accuracy is 0.8; the CLIP model has mistakes in the fourth and fifth images, so the accuracy is 0.6; the BERT model has no correct images, so of course the accuracy is 0. In experiment number 3 the CatBoost model, the CLIP model and the BERT model have no mistakes, so the accuracy is 1. In experiment number 4 the CatBoost model has no mistakes, the accuracy is 1; the CLIP model has a mistake in the second image, the accuracy is 0.8; the BERT model has mistakes in the third, fourth and fifth images, the accuracy is 0.4. In experiment number 5, for the CatBoost model and the CLIP vector the system selected the correct images without any mistakes, so the accuracy is equal to 1; for the BERT embedding the system has a mistake in the third image, its accuracy is 0.8. I also analyzed 2,025 images and obtained the following result: for the CatBoost model the average accuracy is more than 0.8, the accuracy of the CLIP vector is less than 0.8, and the accuracy of the BERT vector is less than 0.5. I also ran an experiment on the Flickr image dataset; the Flickr image dataset is simpler than the Yandex dataset, so in experiment one all models have accuracy equal to 1, and in the second one too. In the third experiment there are no mistakes for the CatBoost model; for the CLIP vector there is one mistake in the fourth image, so the accuracy is 0.8.
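The experiment preprocessing described above could look roughly like the following sketch; the array names, dimensions and the exact pairing rule are my own illustrative assumptions rather than the author's code, and the cluster count is scaled down for the toy data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# Stand-ins for the real CLIP and BERT embeddings of the images and descriptions.
rng = np.random.default_rng(0)
n_images = 5000
clip_vecs = rng.normal(size=(n_images, 512))   # CLIP image embeddings
bert_vecs = rng.normal(size=(n_images, 768))   # multilingual BERT text embeddings

# Cluster the images by their CLIP vectors (the talk uses 1,000 clusters).
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(clip_vecs)
centers = kmeans.cluster_centers_

# For each image, find the three nearest cluster centroids.
dist_to_centers = euclidean_distances(clip_vecs, centers)
three_nearest = np.argsort(dist_to_centers, axis=1)[:, :3]

# For a candidate pair of images, compute the per-pair features used by the model.
def pair_features(i, j):
    return {
        "clip_cosine": float(cosine_similarity(clip_vecs[[i]], clip_vecs[[j]])[0, 0]),
        "bert_cosine": float(cosine_similarity(bert_vecs[[i]], bert_vecs[[j]])[0, 0]),
        "clip_euclid": float(np.linalg.norm(clip_vecs[i] - clip_vecs[j])),
        "bert_euclid": float(np.linalg.norm(bert_vecs[i] - bert_vecs[j])),
    }

print(three_nearest[0], pair_features(0, 1))
```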
For the BERT embedding the system made mistakes in the third, fourth and fifth images, so the accuracy is 0.4. I analyzed 50 images in total and got the following result: for the CatBoost model the average accuracy is 0.97, for the CLIP vector 0.95, and for the BERT vectors it is less than 0.85. So this is a distinctive multimodal recommender system for images: the model is based not only on image features but also on text features, such as embeddings and distance metrics. However, there is room for improvement: additional training of the BERT model to build better vectors, looking for another NLP model, getting additional texts for the pictures and finding additional features. So the multimodal model needs to improve constantly; in fact there will be more research on these topics, and the system has a promising future. Thank you for your attention.

Mikhail, thank you for your talk, extra points for making it short and precise. We have quite a few questions from the audience, so let's start. Thank you for your talk. I just want to clarify: you mentioned that for the CatBoost algorithm the depth was 6, and you mentioned that it was the optimal one. Can you please explain how you determined that it was optimal? Did you perform some search, or did you fix the other parameters and vary the depth only, or something else? Thank you. Would you please repeat the question? Yes, in simple words: why did you choose your depth to be 6 in the CatBoost model? A depth of 6 is the usual optimal depth for such decision-tree ensembles; there is no need for a depth greater than 6, and a smaller depth does not perform well. Thank you.

Another question. Thank you, Mikhail, thank you for the talk. If you can open the slides with the examples, one moment, they were very interesting. It seems that using the query you find similar images, so you consider this as the criterion of quality: the more similar the image, the better, right? Yes, of course, the most similar image is the related image, yes. Okay, I have two questions. First, why is that actually a recommender system? To me it looks like similarity search, looking for similar images or something like this; can you comment on this? And the second question: you know, in practice, if you recommend to users only images similar to what they like, your system will get positive feedback and the quality will degrade. I mean, if you show nice cats to a user and the user likes it, next time you show more nice cats, and after several iterations the user will get only nice cat pictures in his recommender system. How will your approach solve such a problem? And the first one again: why do you call it a recommender system if it is just similarity search? As I said, this recommender system is based on CLIP and BERT embeddings, so the recommender system is using these features, and it recommends based on two main parts, the CLIP embeddings and the BERT embeddings. The future work on this recommender system is to further train the BERT model, because the results show that the BERT model has a smaller accuracy than the CLIP and CatBoost models, and further training the BERT model is the best way to improve this recommender system, of course. Okay, I am not sure that was the answer, but what about the degradation of the model in practice, if you recommend the same images to the user all the time?
Degradation... well, the model has a good future, a promising future, when it will be trained regularly: the BERT model will be retrained, and further on, in the future, it may be better to use a ChatGPT-style model instead of BERT, or a more complex model than BERT, of course. Okay, thank you very much. Any other questions from the audience? Maybe I would just also like to comment on this work. These multimodal approaches are kind of new territory, and one of the reviewers also had this question, why it is a recommender system if it looks like just similarity search, and it is really hard to account for all the problems and complications of a newer approach. But I think we can all agree that multimodal approaches like the one in this work yield better results and should be considered when building parts of a recommender system. Yes, I guess we have no questions from the audience, so let's thank the speaker again.

Dear online participants, unfortunately one of our speakers is unable to deliver the report in English. We are living in dark and weird times, and unfortunately we also have certain responsibilities as an international conference, so in terms of the international part the next report will be off the record, but we will give a chance to disseminate the knowledge among our Russian-speaking colleagues. So, once again, unfortunately the next report will be in Russian only; we would like to thank our international guests, and we apologize to all international colleagues who do not understand Russian. The report is called Application of Dynamic Graph CNN and a Combined ICP Algorithm for Detection of Archaeological Monuments, and Alexander will now present it to us.

Hello again, colleagues. I apologize for the situation in which I have put myself personally and our entire scientific team. I present the report on the application of a dynamic graph CNN and a combined iterative closest point algorithm for the applied tasks of detection and research of archaeological monuments. I have to say that the results we obtained are connected with a project of the Russian Science Foundation which we received this year. As a result of this project we have to solve two scientific tasks; the first of them is to create methods for the detection of archaeological monuments. If we talk about the mathematical basis of these methods, they rely on methods of machine learning, geophysics and cartography. I have to say that until recently archaeologists studied the detection of archaeological monuments practically manually: for example, they used aerial and satellite photos to solve these tasks; in the left part of the slide you can see the photos, and in the right part the deciphered results. But over the last fifteen years or so, methods of the exact sciences have become quite widespread in archaeology, including methods of geophysics and machine learning, and quite a lot of work is being done in this direction. I will dwell on only one work, conducted by a group of German archaeologists. They researched settlements of the Middle Ages; these settlements are visible at the present day as circular structures, and the group worked with depth data which they obtained with a depth camera and a time-lapse camera installed on an unmanned aircraft. Here you can see a visualization of these circular structures. This work was interesting for the simple reason that the archaeological monuments located on the territory of the Urals and these circles located on the territory of Germany have similar deciphering signs.
I will tell you a little bit about what these monuments are. The archaeological monuments on the territory of the Southern Urals mostly belong to the so-called Bronze Age and are associated with the migrations of the Indo-Europeans across the territory of the Urals and the south-west of Siberia, where they organized a whole range of fortified towns; many of them belong to the so-called Sintashta culture. The most famous monument of this complex is the settlement of Arkaim, but actually, as of today, about 20 such fortified settlements are known on this territory. In the 80s of the last century Professor Zdanovich published a fundamental work in which he explored these towns, and it contains the deciphering signs of these archaeological monuments which we used in our project. There were 9 classes of interest, 5 of which are marked K1 to K5: these are classes of kurgans with various textures, that is, with some kind of scattered surface or other surface type; 2 objects are graves, one of them from the Bronze Age, the other graves associated with the later development of the Bronze Age; and there are also settlements, fortified and unfortified. On this slide there are some deciphering signs; there are actually quite a lot of them, and many of them, of course, cannot be used in machine learning methods, such as the size of an object or the position of objects relative to each other, but at least such signs exist. While the German archaeologists used only one source of data in their work, we rely on six sources of information. The first source is aerial survey materials: these are photos produced in the last century, when there was no economic development of these territories. They are of very great significance because on them many of these archaeological monuments are still visible; nowadays, of course, it is much more difficult to study these objects. The second source is remote sensing data: we use data from the Sentinel, Landsat, Resurs-P and Kanopus-V satellites. We have collected such imagery over a fairly long period, several hundred images in total, and it is very important that these images have a high spatial resolution. We also conducted surveys with geodetic instruments, in particular Trimble sensors, and built about 40 models which describe, in the form of point clouds, the Bronze Age settlements on the territory of the Southern Urals. Orthophoto plans that the archaeologists obtained for these areas in 2006 were also used in the project. Besides that, there is a rather interesting source we started working with only this year: colleagues from the Institute of Geophysics came to our help and performed, with the MQ-14 device, a magnetometric survey of three settlements, Stepnoye, Verkhneuralskoye and Levoberezhnoye, and we also use these data. This year, for two settlements, detailed three-dimensional models were obtained with the help of a drone and depth cameras. All the data sources can be divided, in principle, into two types: two-dimensional data and three-dimensional data. For processing two-dimensional data we use residual neural networks, that is, ResNet-type networks; in general we follow the path of our German colleagues. For the analysis of three-dimensional data we proposed, and I present in this report, a new method; we call it the DGCNN* method, DGCNN with a star,
or it has another name, MGCNN, the so-called multimodal graph convolutional neural network. I must say that methods for classification and segmentation of three-dimensional models and data can be divided into two large groups, direct and indirect methods, and first of all, of course, we are interested in direct methods, whose bright representatives are the DGCNN and RGCNN methods. But for the solution of our task these methods are not quite suitable, and this is due to their disadvantages. The first of these disadvantages is the limitation on the size of the data: these methods work well with industrial design tasks, when the size of the cloud is several thousand points, up to ten thousand points; if the number of points becomes bigger, and in our case it was on the order of 13,000 to 22,000 points per cloud, these methods no longer work well. The second problem concerns the shape of the three-dimensional models: archaeological monuments of the Bronze Age are located in the steppe or in semi-desert territory, so visually these structures are rather specific objects, different from the industrial models these methods were designed for. There are also problems related to the fact that these methods mainly use one or two modalities, and the color information is not used. To overcome these disadvantages we proposed a new architecture, which we called a dynamic multimodal convolutional graph neural network. Its input data contains 12 features per point: the point coordinates, the point normals, the color, and the normalized coordinates, so each point has 12 features. I will also add that since the RGB channels are not independent, we convert RGB to HSV, and in this way we manage to improve the quality of the segmentation. The essence of our concept is that we represent each block of points in the form of a dynamic graph, described with the help of an adjacency matrix; the construction of this adjacency matrix takes place in each layer of our neural network, and the convolution over the neighborhoods of the edges of this graph we call edge convolution. In addition, we added to this architecture our own metric classifier, which is based on two multi-layer neural networks and one classifier with radial basis functions. On this slide we present the algorithm which illustrates its main steps. I will add that an exact solution of the spectral filtering task exists, but it is computationally too expensive, therefore we use Chebyshev polynomials (with the parameters 6, 5, 3) for approximating the spectral filtering. The main processing cycle of our graph neural network consists of the following stages: construction of the adjacency matrix, normalization of the components of this matrix, approximation of the graph signal with the help of a Chebyshev polynomial, in this case of the third order, performing the graph convolution, and forming the output; then there are some nuances of combining local and global features, and the result of the radial basis function classifier is generated. What should also be noted is that another layer of preliminary processing was added to the architecture, which performs upsampling of the point cloud. Why is it needed? Because, as a rule, depth cameras give data that is quite noisy and uneven, which affects the quality of segmentation in the most negative way; with the help of the KNN method we add three points around each point and obtain a dense, homogeneous, even point cloud.
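To illustrate the Chebyshev approximation of spectral graph filtering mentioned above, here is a small NumPy sketch of a third-order Chebyshev filter applied to a graph signal; the toy graph and the filter coefficients are my own illustrative assumptions and are not the values used in the paper.

```python
import numpy as np

def chebyshev_graph_filter(adjacency, signal, theta):
    """Apply a Chebyshev-polynomial spectral filter of order len(theta)-1
    to a graph signal (one feature vector per node)."""
    n = adjacency.shape[0]
    deg = adjacency.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    laplacian = np.eye(n) - d_inv_sqrt @ adjacency @ d_inv_sqrt   # normalized Laplacian
    lmax = np.linalg.eigvalsh(laplacian).max()
    l_scaled = (2.0 / lmax) * laplacian - np.eye(n)               # eigenvalues rescaled to [-1, 1]

    t_prev, t_curr = signal, l_scaled @ signal                    # T_0 x and T_1 x
    out = theta[0] * t_prev + theta[1] * t_curr
    for k in range(2, len(theta)):                                # Chebyshev recurrence
        t_next = 2.0 * l_scaled @ t_curr - t_prev
        out += theta[k] * t_next
        t_prev, t_curr = t_curr, t_next
    return out

# Toy 5-node graph and random per-node features; theta defines a 3rd-order filter.
adj = np.array([[0, 1, 1, 0, 0],
                [1, 0, 1, 0, 0],
                [1, 1, 0, 1, 0],
                [0, 0, 1, 0, 1],
                [0, 0, 0, 1, 0]], dtype=float)
x = np.random.default_rng(0).normal(size=(5, 3))
filtered = chebyshev_graph_filter(adj, x, theta=[0.5, 0.3, 0.1, 0.05])
print(filtered.shape)
```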
You can also see that our graph neural network has two outputs. The first output solves the classification of objects, that is, the global decision about which of the objects is depicted in the scene, and the second output is associated with the solution of the segmentation task for the data. Now let us talk a little bit about the loss function we apply to our graph. In general this function is a multi-class cross-entropy, but another auxiliary term is added to it which is associated with the smoothness of the signal on the graph, and this term allows us to make the labels of nearby points in a neighborhood more similar. Let me also dwell on one more point: these blocks of three-dimensional data can be obtained in different ways. For example, we can use a tacheometric sensor; since the terrain is a fairly smooth surface, we can get the whole model of our archaeological monument in one pass. If we use a LiDAR installed on a drone, then we cannot capture the archaeological monument in one shot, we need several shots; the same can be said when, for example, the surface is not even, that is, it has hollows. In these cases, in order to match the data from different viewpoints, the task that has to be solved is the registration of the data. Quite a while ago, for the solution of another task, simultaneous localization and mapping, we proposed a new algorithm; we call it the combined iterative closest point algorithm, and it was also used, in the framework of the research I am talking about now, for the registration of the data of archaeological monuments. I will dwell only on the two main properties of this algorithm. Besides blocks of points, this algorithm uses special feature points for solving two problems of the iterative closest point method. The first problem is connected with the selection of the initial value of the rotation matrix: at the beginning we perform registration of the visual data, we choose some parameters, and thus we get these initial values, and this works much better than choosing the values of R and T in some empirical way. You can see from the form of the functional that it contains two terms, that is, we are actually performing a joint solution of the task with respect to the visual features and the three-dimensional data. Some experiments were carried out under so-called controlled and uncontrolled conditions; by uncontrolled conditions we understand various sensor noises and changes of the ambient lighting, and we see that our algorithm, marked with the dotted line, has certain advantages compared to the known ICP registration methods with the point-to-point metric and the point-to-plane metric. Now let us talk a little about the computer modeling. If we talk about the digital surface models obtained when shooting from a drone, they look like this. As I said, there are two main tasks before us. The first task is to detect an archaeological monument with the help of a convolutional neural network: it is important for us to determine the place where there may be a monument, even if it is not certain. With this red frame the neural network marked a potential archaeological monument. In another image you see two settlements, detected correctly, and kurgans; but if we talk about kurgans, there may be a lot of mistakes, as the results of the computer modeling will show. Once such an object has been detected, we can highlight this area of interest, and in this area of interest we have the object we are looking for.
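As a rough illustration of the point-to-point ICP registration mentioned above, here is a compact NumPy sketch of one classical ICP loop (nearest-neighbour correspondences plus a closed-form SVD alignment step); it is a generic textbook version, not the combined algorithm with feature points proposed by the authors.

```python
import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(src, dst):
    """Closed-form least-squares rotation R and translation t mapping src onto dst."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    h = (src - src_c).T @ (dst - dst_c)
    u, _, vt = np.linalg.svd(h)
    r = vt.T @ u.T
    if np.linalg.det(r) < 0:              # fix a possible reflection
        vt[-1] *= -1
        r = vt.T @ u.T
    t = dst_c - r @ src_c
    return r, t

def icp_point_to_point(source, target, iters=30):
    """Basic point-to-point ICP: iterate nearest neighbours + rigid alignment."""
    tree = cKDTree(target)
    current = source.copy()
    r_total, t_total = np.eye(3), np.zeros(3)
    for _ in range(iters):
        _, idx = tree.query(current)      # nearest-neighbour correspondences
        r, t = best_rigid_transform(current, target[idx])
        current = current @ r.T + t
        r_total, t_total = r @ r_total, r @ t_total + t
    return r_total, t_total

# Toy example: a random cloud and a rotated/translated copy of it.
rng = np.random.default_rng(0)
cloud = rng.normal(size=(500, 3))
angle = 0.2
rot = np.array([[np.cos(angle), -np.sin(angle), 0],
                [np.sin(angle),  np.cos(angle), 0],
                [0, 0, 1]])
moved = cloud @ rot.T + np.array([0.5, -0.3, 0.1])
r_est, t_est = icp_point_to_point(cloud, moved)
print(np.round(r_est, 3), np.round(t_est, 3))
```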
For this area of interest there are various methods of further research, including geophysical methods. Alexander, I am sorry, we are running a little over time. Let me finish, and that's it, thank you. This year we had field work: we went to the newly studied settlement of Verkhneuralskoye. I have to say that such fortifications have been discovered over the years up to now, and this year we were able to discover two new fortifications with these methods; we have now taken one of them for study, the settlement of Verkhneuralskoye. Here are the data from the drone survey; actually it is a digital model, and it looks like this: we see the archaeological monument, this is the corresponding aerial photo of the area, this is the same monument in different quality, taken in the 50s and 60s of the last century, and these are the results of the magnetometric survey. We also plan to use machine learning methods to work with these data, but for now this is the nearest-term plan. In this image you can see, over the archaeological monument, the magnetic field in the surface layers, and here a little deeper, at a few meters. After this monument was discovered, we also applied the known methods, including the DGCNN method, or our modification of it, for segmentation of its space. We are interested in two types of classes here: the presence of walls and of dwelling chambers. These are the segmentation results for this monument; you can clearly see the point cloud, and here the yellow-red gradations are the wall structures, while the blue-green ones are the dwellings. So it becomes possible, without excavation, to determine the location of archaeological monuments and to understand their internal structure; such a non-invasive method of research will allow archaeologists to study the structure of an archaeological monument without excavation, to understand where everything is located, and so on. We also conducted some experiments in terms of comparative analysis: we compared both direct and indirect methods. As an indirect method we used a multi-view CNN, and as a direct method we took DGCNN and also a combination of DGCNN with the YOLO detector, that is, we used two channels of information. As a result of these experiments we found out that for some classes of kurgans, K1, K2 and K4, the results are quite impressive, and the proposed method, here in the lower part of the table, has advantages; but classes like K3 and K5, the kurgans with disturbed surfaces, are detected quite poorly both by the known methods and by ours, so there is still a lot of room for further work. On the next slide there are results of detecting fortified and unfortified settlements, and we see that the use of our method for segmentation and classification of these objects gives good results, while with the baseline methods we could not obtain comparable results. That's all for me, thank you for your attention. Once again, sorry for this problem that arose: I planned to come to Armenia, I had already bought the tickets, but I got sick and ended up in such an unpleasant place as the infectious diseases department of a hospital, where I am now. Let us discuss the report. We are eating into the lunch break, but if there are questions from someone in the hall... I have one quick question; I think I know what you will answer, but I am very interested. Thank you for your interesting research. I am interested in the part of the data that you used from satellites: most satellites image at various wavelengths, not only in the visible spectrum but also in others. Is that used in your work,
because in your presentation everything is shown in RGB? I will answer now. We use all available channels, including the infrared channels. We do not use every satellite image, though: we do not take images that contain clouds, because when we tried to remove the clouds nothing good came of it; so we take all available channels, but we exclude images that contain some atmospheric phenomenon, such as clouds, at the moment of shooting. I see, thank you. So I would like to thank everyone; yes, let's thank the presenter again. I would like once again to thank everyone for participating in our section, both online and offline presenters; thank you for your contribution, and please enjoy the rest of the conference.

Hello everyone, one more time. This session is on theoretical machine learning and optimization, and since unfortunately we do not have the session chair offline, let me start this session with my talk, as scheduled. My name is Dmitry, I am from HSE University in Moscow; I work at the Data Analysis and Artificial Intelligence department and also chair a lab which is doing all sorts of machine learning and NLP, but this talk will be mostly about theoretical matters, I would say. I am sorry, Dr. Ignatov, do you hear me? Yes, yes, I can hear you. I am the chairman of this session. Yes, that is correct, but since you are online and able to introduce me in that manner, it is my pleasure to restart. Okay. Okay, colleagues, so we can start our session on theoretical machine learning and optimization. We have only three presentations in this session, so we can be a bit more free with our time limits, but I ask all the authors and speakers to keep to the regular time limit of about 25 minutes including the answers. So our first speaker is my old friend and colleague Dr.
Ignatov, from the Higher School of Economics, Moscow, and I think it is a very interesting theoretical talk about mathematical logic and formal concepts, as you can see from the title. So, Professor Ignatov, the floor is yours.

Dear Mikhail, thank you for the introduction, let me start the talk. This talk will be about partition lattices and a specific problem: maximum antichains in partition lattices, their size and enumeration. I cannot say that these results are, how to say, breakthrough results, but at least they add some bricks to the current state of the art on the problem. Since one of our referees asked for an applied introduction, I decided to include motivation from the data mining perspective and omitted some theoretical material. Here is the outline of my talk: after that I am going to move on to the motivation from the combinatorics perspective on the problems, then we will go through partition lattices as concept lattices, then we will consider the problem statement, actually three problems which are closely interrelated, and I will talk about our solutions to these problems, both the theoretical ones and the practical ones, meaning that we added some new numbers to the Online Encyclopedia of Integer Sequences by counting the respective patterns. The motivation from the data mining perspective may be well familiar to you: if you go to the supermarket, you can buy some items there, and some of the items are bought frequently, by many other customers, on a daily basis. One famous collection of such product items was diapers and beer, and here you can see two researchers from the area; one of them gifted diapers and beer to the developer of the best algorithm, in terms of, say, running time, able to find such patterns in the data. What kind of data do we mean here? These are transaction data, and they can be represented as binary tables, or in transaction database format, or in vertical database format, but in essence we have transaction IDs, say one to six, which are customers, for example, or the same customer on different days, it does not matter, and five items A, B, C, D, E; a one means that a particular transaction contains a particular item. Here is just the same reformulation: we have a transaction ID and an operator I which returns the set of items bought in a particular transaction, and similarly for vertical databases we have an operator T which says in which transactions a particular item was bought; this is closely related to inverted indices in information retrieval. Let us have a look at the result of the search performed by one of the algorithms, Apriori, for finding such frequent patterns. For example, we have item B, which was bought six times in this particular transaction database; we also have itemsets that were bought five times, so at least five customers bought B and E together; some itemsets were bought four times; and this collection, A, B, E for example, was bought three times. The minimal threshold on the number of such purchases is called minimal support, and the fundamental structure behind all this is the lattice of closed itemsets, or the concept lattice, as we call it in formal concept analysis; I will talk about it a bit later. Here we have some itemsets with the same support, for example item A was bought in four transactions and a larger itemset containing A was bought in exactly the same four transactions, and such classes form a partition: those are equivalence classes.
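As a tiny illustration of the support and closure notions just described, here is a hedged Python sketch on a toy transaction table; the transactions and the helper names are made up for illustration and are not the example from the slides.

```python
from itertools import combinations

# Toy transaction database: transaction id -> set of purchased items.
transactions = {
    1: {"A", "B", "E"},
    2: {"B", "C", "D"},
    3: {"A", "B", "D", "E"},
    4: {"A", "B", "C", "E"},
    5: {"A", "B", "C", "D", "E"},
    6: {"B", "D"},
}

def support_set(itemset):
    """Operator T: the transactions containing every item of the itemset."""
    return {tid for tid, items in transactions.items() if itemset <= items}

def closure(itemset):
    """The closed itemset: all items common to every supporting transaction."""
    tids = support_set(itemset)
    common = set.intersection(*(transactions[t] for t in tids)) if tids else set()
    return frozenset(common)

items = sorted(set.union(*transactions.values()))
min_support = 3

# Enumerate all itemsets with support >= min_support and report their closures.
for size in range(1, len(items) + 1):
    for cand in combinations(items, size):
        cand = frozenset(cand)
        sup = len(support_set(cand))
        if sup >= min_support:
            print(set(cand), "support:", sup, "closure:", set(closure(cand)))
```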
Actually, they have a unique representative, the closed itemset: in terms of support it cannot be extended without violating the maximality of support within the respective class. There are also the so-called minimal generators, but they are not necessarily unique; you may think of them as proper subsets of closed sets which cannot be diminished any further. What you can read in the textbook on data mining by Zaki and Meira is that the concept of closed itemsets is based on the elegant lattice-theoretic framework of formal concept analysis by Ganter and Wille, and we will use this as the main tool for enumeration later. I would also like to mention some related works where not only such concept lattices are used but also partition lattices, and we can show that partition lattices can be represented as concept lattices. Partition lattices may be considered as search spaces for clusterings if we solve the problem of partitioning, or for community detection in social network analysis; they can also be used for granular computing, for building functional dependencies in relational databases, and even for a variation of binary data analysis which is called independence analysis. Here are some of the links. But let us go to the theoretical motivation, which dates back to the problem of Rota, published in the Journal of Combinatorial Theory. He states the problem as follows: it is well known that for the Boolean lattice the largest size of an antichain, that is, a family of sets none of which is a subset of another, is given by the middle binomial coefficient, and he proposed to prove or disprove the following generalization of this theorem: whether for the partition lattice the size of the largest antichain coincides with the largest Stirling number of the second kind. That was his proposal, and since the name of Emanuel Sperner was mentioned, it is a good time to say that Emanuel Sperner studied Boolean lattices and formulated a theorem which proves that the central binomial coefficient gives the size of the maximum antichain of sets in the Boolean lattice; here you can see two such antichains for the case of a three-element set, in red and in blue. The problem studied by Emanuel Sperner in turn dates back to the problem posed by Richard Dedekind on the number of antichains in the Boolean lattice; the former is from the beginning of the 20th century and the latter from the end of the 19th century. As for partition lattices, a lot had already been done by the end of the 70s, and in the paper by Ronald Graham, for example, the co-author, with Donald Knuth, of the famous book Concrete Mathematics, he summarized the state of the art by that time. You can see the partition lattice on four elements from his paper; here you can see, for example, the level antichains, and here by R the sets of elements of a certain rank are shown, so we will also use this terminology, in a slightly modified manner, later. What he told us in this paper is as follows: at least for n less than or equal to 20, the size of the largest antichain coincides with the Stirling number of the second kind. Unfortunately we do not know where the discrepancy arises, but such a discrepancy exists: Rodney Canfield showed that the largest antichain is actually not a level antichain, so it does not coincide with the largest level in such a lattice. According to Canfield the discrepancy may arise for n which is very big, and Ron Graham said that we will probably never know where exactly it happens.
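For readers who want to play with the numbers involved, here is a small Python sketch computing Stirling numbers of the second kind (the level sizes of the partition lattice) and Bell numbers (its total size); the recurrences are standard and the code is only illustrative.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind S(n, k): partitions of n elements into k blocks."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def bell(n):
    """Bell number B(n): total number of partitions of an n-element set."""
    return sum(stirling2(n, k) for k in range(n + 1))

for n in range(1, 9):
    levels = [stirling2(n, k) for k in range(1, n + 1)]
    print(n, "largest level:", max(levels), "Bell number:", bell(n))
```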
But at least Canfield, together with Graham's co-author Harper, estimated the size of such an antichain asymptotically: from below together with Harper, as far as I remember, and from above in the latest result by Canfield, already in 1998. The theorem says that the size of the maximum antichain divided by the maximal Stirling number of the second kind lies between these two values, where the constant A is given as follows. This paper also contains the following statement: the symbols C1, C2 here denote positive real constants, and "it would be possible, but distracting, to replace these by explicit values". So "is Canfield right?" is the main question in the title of our paper; it just tells us that we need to have a look at these numbers, at what they actually are. And we are going to use representations of partition lattices as cross tables, or formal contexts, known in an applied branch of modern lattice theory. Here you can see a cross table representing this partition lattice, but instead of the second level, where the partitions of three elements should be given, we have only pairs of elements that are together in one block; this is done deliberately. Okay, what are the problems to address? First, what is the size of the largest antichain for a given N? The problem was solved by Canfield asymptotically, and a few beginning numbers we counted explicitly. As for the number of antichains and maximal antichains, we can also count them explicitly, not asymptotically, although asymptotic estimates are also possible. The first proposition tells us what the constants are: the constants are obtained from the first-order conditions for this function; on the left it has its maximum, and the minimum is attained at a point which is not an integer if we compute it directly. If a little bird told us where the discrepancy arises, we could refine the coefficients nicely; you may think of this not as a little bird but as an oracle. We can also find these coefficients in terms of inequalities, using the first-order conditions. Moreover, since Canfield used a certain substitution for N in the original computations, we can use it as well, together with the principal branch of the Lambert W function, which is shown here as a graph, and refine these coefficients even for N greater than 1: these are the coefficients C1 and C2 with a tilde. The proof is given here: we simply took the final expression from the paper by Canfield and Harper and made the corresponding substitutions, using our knowledge about the maximum value, that is, the first-order condition, and similarly for the upper bound, where we used the form of the function and the Lambert W function as well. Here B_n means the Bell number; it gives us the size of the whole partition lattice, the number of all partitions of a given N. There are two remarks: we can recover this coefficient even for N greater than or equal to 1, that is, Canfield did not consider N equal to 1, but we can do that because of the direct usage of the Lambert W function; and similar propositions can be formulated for the zero-discrepancy interval, that is, up to 20, as Graham reported. The results of our direct computations with algorithms from formal concept analysis both confirmed the known values for the number of antichains in the partition lattice and for the number of maximal antichains in the partition lattice; this sequence was edited by the OEIS editor Neil Sloane.
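Since the principal branch of the Lambert W function is used above to express the refined constants, here is a minimal SciPy sketch showing how that branch can be evaluated; the sample arguments are arbitrary and the actual expressions from the paper are not reproduced here.

```python
import numpy as np
from scipy.special import lambertw

# Principal branch W_0 of the Lambert W function: W(z) * exp(W(z)) = z.
for z in [0.5, 1.0, np.e, 10.0]:
    w = lambertw(z, k=0).real          # k=0 selects the principal branch
    print(f"W0({z}) = {w:.6f}, check W*e^W = {w * np.exp(w):.6f}")
```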
So we somehow extended the state of the art and confirmed the known values. But what was the machinery? We used this binary representation and the resulting concept lattice; the numbers here are just binary codes, and you may think of them as the IDs of the nodes in the lattice. We order them according to this relation and build another lattice, and this is the lattice of antichains; and if we simply remove the equality, that is, the diagonal, from this relation, we obtain the lattice of maximal antichains in the partition lattice. This is the case for N equal to 3. For larger lattices it is a fast-growing sequence and it takes a lot of time: here you can see, for example, that we go from milliseconds for the first few values of N, while for N equal to 6 we had already spent more than a month and our computation had not finished by that time; now we have some progress, but it is not finished yet. Here are some graphs, sorry for the small scale: we compare the number of maximal antichains in the Boolean lattice and in the partition lattice; for N equal to 6 this red dot is somewhere here, which is what we are trying to reach; maybe it is better to have a look at these figures in the paper. We also tried to formulate some inequalities bounding the number of antichains in the partition lattice from below and the number of maximal antichains from above. We used the level-wise partition of the partition lattice into antichains, which is where this quantity mainly comes from, but we also use inter-level antichains, meaning that if we have different levels, the elements of different ranks in the corresponding partition lattice, we can consider antichains spanning different levels and somehow improve those values; actually the improvements are good only for the first few values. Here delta means the discrepancy between the number of antichains and of maximal antichains, and L and L plus are our inequalities; so for 1, 2, 3 the relative error is 0, but for 4 the best that we could get is 0.32, and for 5 it is 0.57, so we need to count more elements. As an illustration I would like to mention the case N equal to 4: here you can see the binary representation of the corresponding partition lattice, or rather two levels of the original partition lattice, and here we consider this relation on its partitions; we can count 30 patterns, or 30 concepts, that is, antichains and maximal antichains, and if we add two more we obtain exactly the number of maximal antichains in the original lattice. Also, using the tool called Lattice Miner, we can build the concept lattice diagram, sometimes called a Hasse diagram, although Hasse was not the first to use such diagrams, so "line diagram" is a bit safer to say here. This diagram can help us to extract all the other antichains, not only the maximal ones, by hand: we inspect those concepts and consider different bipartite graphs, say with three and one vertices in the two parts, extracted from the corresponding concepts of the corresponding levels. Here, for example, you can see that in one part there are two nodes and in the other part there are also two nodes, given by the names of the corresponding partitions, and we can sum up all the counters; we should also work not only with the concepts represented in this lattice but also with proper bipartite subgraphs, not only the maximal bipartite subgraphs, which are the formal concepts, and count the types of antichains represented by bipartite graphs of types 2-1, k-1 and so on. Excuse me, but your time was over two or three minutes ago.
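As a brute-force illustration of the objects being counted, here is a small Python sketch that enumerates the partitions of a tiny set, builds the refinement order, and counts antichains and maximal antichains by exhaustive search; it is only feasible for very small n and is my own toy reconstruction, not the FCA-based algorithms used in the paper (note it also counts the empty antichain).

```python
from itertools import combinations

def set_partitions(elements):
    """Recursively generate all partitions of `elements` as lists of blocks (sets)."""
    if not elements:
        yield []
        return
    first, rest = elements[0], elements[1:]
    for partition in set_partitions(rest):
        for i in range(len(partition)):                    # put `first` into an existing block
            yield partition[:i] + [partition[i] | {first}] + partition[i + 1:]
        yield partition + [{first}]                        # or into a new singleton block

def refines(p, q):
    """p <= q in the partition lattice: every block of p lies inside some block of q."""
    return all(any(block <= other for other in q) for block in p)

def comparable(p, q):
    return refines(p, q) or refines(q, p)

n = 4
parts = list(set_partitions(list(range(1, n + 1))))
m = len(parts)                                             # Bell number B_n

antichains = 0
maximal_antichains = 0
for size in range(m + 1):
    for subset in combinations(range(m), size):
        if all(not comparable(parts[i], parts[j]) for i, j in combinations(subset, 2)):
            antichains += 1                                # includes the empty antichain
            # maximal: every partition outside the set is comparable to some member
            if all(any(comparable(parts[k], parts[i]) for i in subset)
                   for k in range(m) if k not in subset):
                maximal_antichains += 1

print("partitions:", m, "antichains:", antichains, "maximal antichains:", maximal_antichains)
```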
Okay, the last slide. When we sum it all up we obtain 344 patterns, and we also need to include the partition where all elements are in one block and the partition where each element is in a separate block; if you ask me how the single-block case is counted, you can find it here as well. Unfortunately the time is over, and I am ready to answer questions, thank you.

Thank you very much. So, Dr. Ignatov, would you like to ask any question? From my side no questions, but we are expecting questions from the audience. Then I have a question, may I ask it? I like your theoretical work, it is very interesting for me, but you can agree with me that your topic is rather borderline for this conference, so could you please explain in more detail the machine learning applications of your results? Yeah, so I would rather think about this topic and this direction not as theory for machine learning but as machine learning, I would even say data mining and formal concept analysis, for combinatorics: we can take algorithms from machine learning for the enumeration of closed itemsets, and we can compute these values which are not yet known, at least in the OEIS, and extend our knowledge; maybe someday a new Euler will come and give us a nice formula, but so far machine learning, and data mining I would say, helps us to find these numbers, at least to scrape them from the data. Thank you. Thank you very much, such a nice cooperation between two fields of theoretical informatics. Yeah, and it is really my pleasure to contribute to this section, which you have chaired for many years, I believe; thank you very much, a very interesting presentation. Thank you. Are there some questions from the audience? I don't know, Mikhail, whether you can see. I have a question: your title includes a question, is Canfield right, is he right or not? Yeah, it is a bit of a provocative question to attract attention to this kind of problem, because if you check, there has been no progress since the beginning of the millennium: Canfield solved the problem theoretically, asymptotically, but these coefficients, should we know them, should we try to know them? It is a very good question which can stimulate us. Canfield decided at that time that they are distracting, but we are trying to find them out, and maybe, if we can prove theorems about the concrete N where the discrepancy arises, we can refine these coefficients, and at least we know what the gap is, and this is nice; it might be like a competition, a new mathematician or practitioner will come and refine it in a more elaborate way. Okay, Maksim Panov here would like to ask a question. Yeah, so I also have, I don't know, probably a funny question, not really related to the topic: on one of the slides you were referring to a foundational book or paper with three authors, but you struck one out. Why? In this story the original book, Formal Concept Analysis: Mathematical Foundations, is authored by Bernhard Ganter, one of my supervisors from the German side, and by his supervisor Rudolf Wille, who passed away; then it was translated into English, and the translator was Franzke. Some citation systems index Franzke as an author, but his contribution was actually not as an author but as a translator, and in the community we discussed it many times and decided not to give his name; I at least decided to give his name but strike it out. Sorry, colleagues, we should proceed with our program. Thank you very much once again, and our second speaker will be Maksim Panov with his presentation about distributed Bayesian coresets. You are welcome, Maksim, please.
Thank you very much for the introduction, and it is my pleasure today to present this work, which was essentially a master's thesis defended a year ago by my student Vladimir Milusik. Well, it would be more natural if he were presenting this work, but he is currently a PhD student at the University of Missouri, and it is not feasible for him either to travel here or to speak online, because the timing is not very good, so I will be presenting on his behalf, and partially also on my own behalf. Okay, so the inspiration for this topic is the Bayesian approach to machine learning. We consider the standard probabilistic parametric model, where you have an input X, an output Y, and a certain probability distribution parametrized by some parameter vector; it might be linear regression, it might be a neural network, it might be whatever. Usually we consider the situation when we are given a dataset, pairs of x and y, your historical observations, and then people write the likelihood function, which is basically the joint density of y; usually it is convenient to rewrite it via log-likelihood functions, so the total likelihood is the exponent of the sum of the log-likelihoods of the individual data points. I am writing it here because this representation will be important for what follows. In the Bayesian approach you usually assume that there exists some prior distribution on the weights, denoted p0 on my slide, and then, if you have a likelihood and a prior distribution, you can write the posterior. The posterior is given by this formula, the well-known Bayes formula: you have the product of the likelihood and the prior, and then a normalizing constant in the denominator, which is basically the evidence. If you think about some machine learning model behind that, what you are eventually interested in is making predictions, and for this, in this Bayesian context, people usually employ the so-called posterior predictive distribution, where you average the likelihood at a new point over your posterior. In practice, of course, you usually cannot compute this integral explicitly, and people do sampling, a Monte Carlo approach. I should also mention that because the posterior distribution is potentially a very rich object, you are not constrained to just looking at the expectation: you can look at the moments of this distribution, at the variance, at other moments, so basically you can also reason about uncertainty, and that is why the Bayesian approach to machine learning is generally thought to be an interesting thing, because you do not just make a point prediction but can also do a certain uncertainty quantification. However, this formula is a bit problematic for different reasons. The mainstream problem, I would say, which people consider is that the denominator is very problematic to compute: the numerator is given, you have some prior, you have the likelihood, it is all given, but in the denominator you have an integral, and this integral is usually hard to compute. Of course there exist cases, like the case of conjugate distributions, when it is easy to compute, but in the general case it is hard, and people consider various approaches to sample from this distribution or to approximate it; the most well-known are Markov chain Monte Carlo and variational inference.
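In the notation I use here (my own shorthand, consistent with the description above but not taken from the slides), the posterior and the posterior predictive distribution can be written as:

```latex
p(\theta \mid D) \;=\; \frac{p_0(\theta)\,\exp\!\Big(\sum_{i=1}^{n}\log p(y_i \mid x_i,\theta)\Big)}
{\int p_0(\theta')\,\exp\!\Big(\sum_{i=1}^{n}\log p(y_i \mid x_i,\theta')\Big)\,d\theta'},
\qquad
p(y_* \mid x_*, D) \;=\; \int p(y_* \mid x_*, \theta)\,p(\theta \mid D)\,d\theta .
```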
However, in fact, if you consider modern applications, when you have a lot of data, think of modern image classifiers trained on millions of examples, or of language models trained on trillions of tokens, there is another part which might become complicated to compute, and that is the likelihood itself. Why? Because this summation is very large, you have very many summands, and then imagine you want to do Markov chain Monte Carlo on top. What is Markov chain Monte Carlo? You iteratively try different points until you get something which is your generated sample, and this process requires evaluating the likelihood very many times; if n is a million or a billion or a trillion, it might be very, very hard to compute. That is why the research community considered a certain approach which is called coresets. Basically, you consider the weighted likelihood with weights w_i, and you want to construct these weights in such a way that, first, you have many zeros among the weights, so you reduce the summation, and second, this posterior is close to the initial posterior. So you want to select some subset of points, possibly weighted, so the weights might not be zero-one, they might be something else, but you want to have many zeros: a small subset of points which approximates your initial distribution well. If you succeed with this task, then you will be able to do your computations very quickly, provided the number of non-zero weights is small. Here is a little bit of the same notation again, I am sorry; I just want to mention that although I motivated the problem with supervised learning, in the experiments later we will mostly consider the problem of density estimation and sampling from this density, and that is why I write just x here, but essentially it does not change the formula for the posterior. There exists quite an established literature on the construction of coresets, including Bayesian coresets, and actually all the algorithms we are aware of follow more or less the same structure. You introduce some distance between the local likelihood functions; one possible distance might be the expectation, over the posterior, of the difference of the local likelihood functions. Then you want to find the weights which minimize this distance between the full posterior and your weighted one, under the constraint that you have not more than K non-zero weights. That is a certain optimization problem; depending on the likelihoods it can of course have different complexity, but that is the general approach. One useful notion introduced in the literature, which helps a lot, is the so-called notion of sensitivity: basically, you look at the ratio between the likelihood contribution of one point and that of all the points together, and if this ratio is high, then this point is probably important and you should include it, while if this ratio is low, the point is not important and you can exclude it. Of course you have a free parameter here, the parameter vector itself, so you need to do something with it; you can say that your definition will be via a supremum, like the worst case, depending on how you look at the contribution of the point to the likelihood, and this notion is called sensitivity. The majority of works consider different algorithms based on these sensitivities; of course, it is a big question how to compute them, and in practice you need to do some approximations.
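To fix the notation (again my own shorthand, giving one common form of these definitions rather than the exact ones from the cited papers), the coreset-weighted posterior and a sensitivity of a data point can be written as:

```latex
\tilde{p}_w(\theta \mid D) \;\propto\; p_0(\theta)\,\exp\!\Big(\sum_{i=1}^{n} w_i \log p(x_i \mid \theta)\Big),
\qquad \|w\|_0 \le K,
\qquad
\sigma_i \;=\; \sup_{\theta}\; \frac{-\log p(x_i \mid \theta)}{\sum_{j=1}^{n} \big(-\log p(x_j \mid \theta)\big)} .
```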
The works which currently exist basically fall into two main families. First, you can do iterative methods which gather the coreset step by step: you choose one point, then another point, then a third point, and so on. Or you do sampling; if you do sampling, it is easy, you sample random points with probabilities proportional to the sensitivities, and then you get something. In the sequential approaches you try to choose the point which improves your current score the most, and you do it one by one; the main difference between the iterative methods is whether you try to solve the problem with this constraint exactly or you try to convexify the constraint somehow. The question we asked in this research is the following: the iterative approaches in fact usually give pretty good performance in practice, but they might be very slow, because you choose points one by one, you have huge data, and your final coreset will probably be much smaller than the data but still pretty large, so the process can be slow; can we parallelize it? In the literature there are some works which did this parallelization for non-Bayesian coresets, which consider just the approximation of the likelihood without any posterior, and we wanted to do something similar for the Bayesian approach. The idea of this work is pretty simple: we will be sampling. We have the data and we will be sampling points from it based on the likelihood values: we take some maximum likelihood or maximum a posteriori estimate of the parameters, plug it in, and sample points, without replacement, proportionally to the ratio obtained for this fitted estimate. What you actually do is sample several points for the first computer, so that we distribute the computations, then several points for the second, and so on, and hopefully they will be sampled in a way that they sort of form clusters. So, once again: we have several workers, or processors, or computers; we do coreset selection on each of them separately, so first we sample points, then we do coreset selection for them separately, and then we merge the resulting coresets; each of them has a size equal to the coreset size K divided by the number of workers. So what are the results? In this work we considered pretty simple examples; we start from just density estimation and Gaussian data, which is very simplistic. First let me explain which algorithms are compared. We have three algorithms: one is the baseline iterative approach, I don't know why I call it sequential here, but it is a synonym; and then we have two distributed approaches, one based on random splitting and the other based on what we call the machine learning split, our sampling proportional to the sensitivities. We present two plots: on the x-axis is the size of the coreset, and on the y-axis we compute the KL divergence, so a distance, the smaller the better; on the second plot we show the time, measured in seconds. Basically, what we see for the multivariate Gaussian example, a very simple problem of course, is that here the sequential approach works better: our distributed approaches, when we distribute and merge, work worse, but, as expected, they work much faster; here I think we had seven or eight cores working in parallel.
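To make the distributed sampling scheme described above more tangible, here is a toy Python sketch of per-worker importance sampling and merging; the scoring rule (absolute log-likelihood under a fitted Gaussian), with-replacement sampling and all names are my own simplifying assumptions, not the authors' algorithm.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy density-estimation data: a 1-D Gaussian mixture.
data = np.concatenate([rng.normal(-3.0, 1.0, 5000), rng.normal(3.0, 1.0, 5000)])
n = data.size

# Stand-in point estimate (the talk uses an ML / MAP fit of the model).
mu_hat, sigma_hat = data.mean(), data.std()

# Per-point scores playing the role of sensitivities: here simply the
# magnitude of each point's log-likelihood under the fitted model.
scores = np.abs(norm.logpdf(data, loc=mu_hat, scale=sigma_hat))
probs = scores / scores.sum()

def worker_coreset(indices, k):
    """One worker: sample k of its points with probability ~ score and attach
    importance weights so the weighted sum approximates the worker's full sum."""
    local_p = probs[indices] / probs[indices].sum()
    chosen = rng.choice(indices, size=k, replace=True, p=local_p)
    weights = 1.0 / (k * local_p[np.searchsorted(indices, chosen)])
    return chosen, weights

# Split the data across "workers" and merge their sub-coresets.
num_workers, per_worker = 8, 50
splits = np.array_split(rng.permutation(n), num_workers)
pieces = [worker_coreset(np.sort(idx), per_worker) for idx in splits]
coreset_idx = np.concatenate([c for c, _ in pieces])
coreset_w = np.concatenate([w for _, w in pieces])

# Weighted log-likelihood on the merged coreset vs. the full-data log-likelihood.
full_ll = norm.logpdf(data, mu_hat, sigma_hat).sum()
core_ll = float((coreset_w * norm.logpdf(data[coreset_idx], mu_hat, sigma_hat)).sum())
print(round(full_ll, 1), round(core_ll, 1))
```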
Then more interesting things start to happen. If we look at the Gaussian mixture, first the univariate Gaussian mixture, the distributed methods already start to beat the sequential approach. Why? Because the sequential approach is greedy, and what I expect, although we probably need to do a little more in-depth analysis, is that basically different cores successfully and pretty accurately approximate different modes of our mixture, so each of them becomes good at one mode, while the sequential one is jumping from mode to mode and something weird happens. Here we already see a certain improvement of the ML-based split over the random split, and computationally we see a big difference in speed compared to the sequential approach. Then we can see the multivariate Gaussian mixture: here it is more or less the same story, a pretty good gap between the sequential and the distributed approaches, and a certain benefit from using the better sampling. Probably the final plot I wanted to show is a bit different: that is already supervised learning, there was some classification problem, and we report the resulting accuracy of classification. What you do is make a coreset and then use some model for classification based on this coreset, so instead of the full likelihood you use the weighted likelihood, and what we report here is now not a KL divergence but our downstream quality, the accuracy. What is nice is that the accuracy improves with the coreset size, so we do get more information, and also we see that, interestingly, the distributed approaches do somewhat better than the sequential one, so again we actually benefit from the distribution not only in terms of time but also in terms of quality. And probably the final plot I wanted to show is the dependence on the number of processors used. Generally we see that you can get an improvement when you increase the number of processors; there is some fluctuation here, I am not sure why, but generally we see the trend for improvement, and you also see the improvement in time: the more processors you have, the faster your computations, although apparently with diminishing returns, so it is not linear; everyone would be happy with a linear decrease in time, but in fact it is not, there is a certain overhead. Well, to summarize: as you have seen, we have really only scratched the surface. Generally, my motivation to start this work was to use these Bayesian coresets for actual computations, for example with neural networks, because currently, for Bayesian neural networks, you can find many papers at top conferences, but eventually you see that there is little to no benefit from their usage, because of the size: usually you cannot apply them to really meaningful datasets. You take almost any Bayesian neural network paper and you end up classifying MNIST, not anything like ImageNet or something like that. Why? Well, you need to do MCMC or variational inference on hundreds of thousands of parameters and millions of examples, and that just does not work. I think that here we need a combined approach: we need to use coresets, we need to use parallelization, and we need some advanced variational inference and MCMC algorithms; this is a very small step in that direction. So I think I am done, and thank you for your attention. Thank you very much, colleague. Colleagues, please, your questions to the speaker.
Thank you very much. Colleagues, your questions to the speaker, please.

Thank you for the interesting talk. In the beginning you were talking about the prior distribution of the parameters; how do you choose this distribution? Thank you very much; that is actually the question that I think everyone outside the Bayesian community asks all the time, and there is no final answer to it. There exist different approaches. One approach is to use a distribution which disturbs your likelihood as little as possible; I forget the name in English, but basically something which is very flat, something which doesn't pull you anywhere, and I think this is a reasonable approach for large-scale applications when you really don't have any idea. However, I should also mention that for modern models like neural networks the question of what a good prior should be is very, very open: does putting, say, a Gaussian prior on each of a hundred thousand weights make much sense? It is not very clear. And the final point is that in some smaller applications people sometimes have domain knowledge: if you have, say, a linear or logistic regression, then practitioners might have some idea what the weight of a particular factor should be, and then you can put a Gaussian around this value, or a uniform distribution on some segment, and use that.

Any more questions? Maybe one question on terminology: I like the term "coreset", but do you know about its origin, why these two words? That's a very good question; actually, I don't remember the paper where it was introduced, so I will need to look it up. This area is actually relatively rich, and there are more general definitions of coresets: for example, here I used only weights, but sometimes people want to have synthetic examples, so you start to tune the x's themselves so that you can approximate the whole likelihood with a few points. Generally it is a broad topic; I will do my research and send you where it originated from.

Excuse me, colleagues, may I say a few words in answer to Dr.
Ignatov's question? Go ahead, please. It seems to me that this concept of a coreset was initially introduced in computational geometry, in the context of approximation algorithms for the so-called heat-inside problem, a very interesting combinatorial optimization and computational geometry problem. So it was very interesting for me, as a specialist in combinatorial optimization, to listen to your presentation, because it is one more proof of the deep cooperation between machine learning and combinatorial optimization. Thank you very much. Thank you, Michael. It seems there are no more questions in the audience, so thank you again. And now, sorry, I think we have an online talk. Yes, we can proceed with our final talk, the last but not the least, by Professor Eduard and Alexander, on the problem of finding several given-diameter spanning trees of maximum total weight in a complete graph. It seems to me Alexander will be the speaker. Alexander, can you hear me?

Hello, my name is Alexander and I want to present our work on the problem of finding several given-diameter spanning trees of maximum total weight in a complete graph. First of all, let us formulate the problem. We have an arbitrary edge-weighted complete undirected graph and positive integers which satisfy the following inequality, and we want to find m edge-disjoint spanning trees T1, ..., Tm of maximum total weight of the edges in these trees, each with diameter equal to d. Let us recall that the diameter of a tree is the maximum number of edges on a path in the tree connecting a pair of vertices.

This work is based on two of our earlier works, published in 2022 and 2023. In the 2022 work we considered one maximum-weight spanning tree with a given diameter, and in the 2023 work we considered several edge-disjoint spanning trees, but for the minimization problem. We want to reduce the problem with several maximum-weight spanning trees to the minimization case, and this is done by using a statement from that work: we have two graphs G and G', with weight function w(e) in G and w'(e) in G', and there is a tree with a total weight calculated with w' if and only if there is a tree with the corresponding weight in the original graph G, where the weights are linked by this relation. So we apply the algorithm from the 2023 work, where we solved the minimization problem; we have the input and output of this algorithm and its steps, and I will describe them by an example. In that work the feasibility of the algorithm and its time complexity were proved.

So we have an initial complete graph, and I want to describe the steps of the algorithm; we want to construct two spanning trees, and the diameter of each tree is equal to five. We choose two subsets V1 and V2 of d + 1 vertices each, and all the other vertices we put in V'. On the first step we construct a Hamiltonian path over each subset using the heuristic "go to the nearest unvisited vertex"; of course it is not an exact solution, just an approximate one. Then we divide each path into two halves and connect the first half of the first path with the first half of the second path by the shortest edge through inner vertices, and then the second half of the first path with the second half of the second path, again by the shortest edge through inner vertices; this is done in a parallel manner. Then we connect the paths in a cross manner: the first half of the second path with the second half of the first path, and the second half of the second path with the first half of the first path. This is done to avoid a possible dependency of the random variables during the work of the algorithm: if we connected them in another way, we could use one edge twice, and that would break the independence property of the random variables.
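As a rough illustration of step 1 described above, here is a small Python sketch of the "go to the nearest unvisited vertex" heuristic and the splitting of a path into two halves. The random weight matrix, the choice of d + 1 = 6 vertices, and the function names are made up for the example and are not taken from the paper.

```python
import numpy as np

def nearest_neighbor_path(weights, vertices):
    """Step 1: build a Hamiltonian path over `vertices` with the greedy
    'go to the nearest unvisited vertex' heuristic.  `weights` is a
    symmetric matrix of edge weights; the result is approximate, not exact."""
    vertices = list(vertices)
    current = vertices[0]
    path = [current]
    unvisited = set(vertices[1:])
    while unvisited:
        nxt = min(unvisited, key=lambda v: weights[current, v])
        path.append(nxt)
        unvisited.remove(nxt)
        current = nxt
    return path

def split_in_halves(path):
    """Preparation for step 2: split a path into two halves, which are then
    cross-connected with halves of the other paths through inner vertices."""
    mid = len(path) // 2
    return path[:mid], path[mid:]

# toy usage: one path over a subset of d + 1 = 6 vertices of a random complete graph
rng = np.random.default_rng(0)
W = rng.random((12, 12))
W = (W + W.T) / 2                     # symmetric edge weights
path = nearest_neighbor_path(W, vertices=range(6))
first_half, second_half = split_in_halves(path)
```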
On the third step we add the vertices from V' and connect them, again by the shortest edge, to inner vertices of the corresponding trees. Of course we connect to inner vertices because we do not want to increase the diameter of the constructed tree: if we connect to inner vertices, the maximum distance between vertices is still realized along the path.

That was the description of the algorithm for the minimization problem, while we want to solve the maximization problem. This is done by changing the weight function of graph G to the weight function w', obtaining graph G', and applying algorithm A' to graph G'. So on step 2 we solve the minimization problem, and the constructed spanning trees T1, ..., Tm form a solution of the maximization problem. Again, in this work we prove the feasibility of this algorithm and its time complexity: since the change of the weight function can be done in O(n^2), and, as was mentioned previously, the second step is performed in O(n^2) time, the total time complexity of algorithm A is O(n^2).

There are some notations that must be introduced: F_A(I) and OPT(I) denote, respectively, the approximate value obtained by an approximation algorithm A and the optimal value of the objective function of the problem on input I. We say that algorithm A has performance guarantees (epsilon_n, delta_n) if a certain inequality holds, where epsilon_n is an estimate of the relative error and delta_n is the failure probability, equal to the proportion of cases when the algorithm does not meet the relative error epsilon_n or does not produce any answer at all. We say that an approximation algorithm is asymptotically optimal on a class of input data if epsilon_n and delta_n tend to zero as n tends to infinity; this is quite a common definition for probabilistic analysis, and we want to prove that algorithm A is asymptotically optimal, that is, that epsilon_n and delta_n go to zero as n goes to infinity.
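The inequality on the slide is not reproduced in the transcript; a standard way to write the performance-guarantee and asymptotic-optimality definitions for a maximization problem, consistent with the notation F_A(I) and OPT(I) above, is the following (the paper's exact formulation may differ):

```latex
% Performance guarantees (\varepsilon_n, \delta_n) for a maximization problem:
% algorithm A falls short of the optimum by a relative factor \varepsilon_n
% with probability at most \delta_n (including the cases when it produces no answer).
\[
  \Pr\bigl\{ F_A(I) \le (1 - \varepsilon_n)\, \mathrm{OPT}(I) \bigr\} \le \delta_n .
\]
% Asymptotic optimality on a class of inputs: there exist sequences such that
\[
  \varepsilon_n \to 0, \qquad \delta_n \to 0 \quad \text{as } n \to \infty .
\]
```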
There are some quantities used in the probabilistic analysis. Let us denote by X_k the random variable equal to the minimum of k variables uniform on the interval [0, 1], and let W'_{A'} be the total weight of the trees T1, ..., Tm constructed by algorithm A'. It is clear that W'_{A'} equals the sum of the weights added on the first, second and third steps of algorithm A'. On the first step we just use the heuristic "go to the nearest vertex": there are d possibilities, because the set contains d + 1 vertices, then d - 1, and so on down to k = 1, where the minimum is taken over a single edge, and this is repeated m times. On the second step we consider all the pairs of paths and the edges connecting them, so we have a multiplier C_m^2, and then the connecting edges: the first summand is for the pairs of halves and the second one is for the end vertices. On the third step we have, m times, the connection of each of the n - m(d + 1) vertices from the set V', joined by the shortest edge to inner vertices, and there are d - 1 inner vertices in each path. According to the statement mentioned on the second slide, we obtain an equality which connects the weight of the result of algorithm A and the weight of the result of algorithm A'.

We prove the first lemma, which postulates the epsilon_n and delta_n for our algorithm A for the maximization problem. The crucial statement for our analysis is a theorem by Petrov, which considers independent random variables X_1, ..., X_n and introduces constants T, h_1, ..., h_n satisfying the following inequality; if we set S equal to the sum of the X_k and H to the sum of the h_k, we obtain a probabilistic inequality which helps to carry out our probabilistic analysis. There are some lemmas from our work of 2023: in the first lemma we prove that the condition of Petrov's theorem holds, in the second lemma we bound H from above, and in the third lemma we construct an upper bound for the mathematical expectation of the weights obtained by algorithm A'. So we reduce the maximization problem to the minimization problem and use some results for the minimization problem.

The main result of our work is that if d is greater than or equal to the logarithm of n, then we get the following bound; as you can see, the probability immediately tends to 0 as n goes to infinity, but for epsilon_n one must consider the case when d goes to infinity as n goes to infinity, as in the case when d grows at least logarithmically. It must be noted that similar results are obtained for the uniform distribution on [a_n, b_n], since we can always reduce the problem with arbitrary a_n, b_n satisfying these inequalities to normalized random variables distributed on the interval [0, 1]. In contrast to the minimization problem, there is no need to impose an additional condition on the scatter of the weights, as was done in the work of 2023.

As a conclusion: we have generalized the result of the 2022 work for one maximum-weight spanning tree with given diameter to several spanning trees, using the algorithm from that work, which has time complexity O(n^2), applied to our case with the modified weight function. For the continuous uniform distribution of weights on the interval [0, 1] we obtain the result above, and an analogous result follows for the continuous uniform distribution of edge weights on the interval [a_n, b_n], as I said previously. It would be interesting to investigate this problem for discrete distributions. Thank you for your attention.

Thank you very much. Who would like to ask some questions to Alexander? It seems there are no questions; in this case I would like to ask a small question myself. Alexander, your brilliant work seems to me another nice example of the cooperation between machine learning and combinatorial optimization, in the field developed by Professor Gimadi, which is very interesting, also in the context of the millennium problem about P and NP. And the question is: in your proof you use Petrov's theorem, which appears to be a classic measure concentration result; since that result, many stronger concentration results have appeared. Would you like to incorporate them into your framework? It would be very interesting. I am not a professional in this domain, but I think we should look at other approaches. Thank you very much. Thank you, thank you.
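For reference, the exponential inequality of Petrov that the analysis and this question refer to can be stated in its standard textbook form as follows; the exact statement on the slide is not in the transcript, so the constants below follow Petrov's book rather than the paper:

```latex
% Petrov's exponential inequality (standard textbook form).
% Let $X_1,\dots,X_n$ be independent random variables, $S=\sum_{k=1}^n X_k$,
% and suppose there exist constants $T>0$ and $h_1,\dots,h_n\ge 0$ such that
\[
  \mathbb{E}\, e^{t X_k} \le e^{h_k t^2 / 2}, \qquad 0 \le t \le T,\ \ k = 1,\dots,n .
\]
% Then, with $H = \sum_{k=1}^n h_k$,
\[
  \Pr\{ S \ge x \} \le
  \begin{cases}
    \exp\!\bigl(-x^2 / (2H)\bigr), & 0 \le x \le HT, \\[2pt]
    \exp\!\bigl(-T x / 2\bigr),    & x \ge HT .
  \end{cases}
\]
```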
So, other questions, please? Oh, it seems to me that everything is clear, so let us thank our presenter once again; thank you very much. And, dear colleagues, unfortunately I should close our session. Thank you for your very interesting presentations and for your attention. Thank you very much, thank you. Dear colleagues, I would just remind you that the parallel session is still going on, so we can go there, and also in one hour there will be a poster session with some interesting posters, so please attend.