It's a nice one. I guess it should start with a lot of new sections. Is it going to be online?

There is one offline talk and the rest of them will be online; I only understood this at the last minute because it changed.

This one is going to be online?

It's offline, and the rest will be online, so we'll just have a large Zoom call. Yeah, no big question. When there are online talks, will we see the presentation here?

I guess it's manageable. Yeah, you're the only offline one.

So she started her talk with the mention of the GD; that's the most touching moment. Yeah, I mean... Does she come from the U.S., from Washington?

No, no, she's a healthcare professor, and she's been there a few... How many do you have?

In the computer science department we've got... We have four schools, a number of rolling courses, and one humanities faculty that is just teaching English.

How many students are there?

The biggest is in our business school as well. Do you think you said... No, I think we should start more or less now. Is it recording already?

So thanks, everyone, for coming. I would like to start our section on data analysis and machine learning with big thanks to our hosts in Armenia. Armenia is going through hard and dark times now, and I would especially like to thank the hosts for having the strength and courage to still help us organize this event in these hard times. My name is Sivgini Symbalov, I have a PhD in computer science, and right now I am a machine learning scientist at Aptec. Together with Dr. Maxim Panov from TII, the Technology Innovation Institute in the UAE, we will be chairing this session on data analysis and machine learning. This year we had around 17 submissions and only five made it to the final publication, maybe a little bit more. Unfortunately, most of our speakers couldn't come for different reasons, so I would like to thank the speakers who did come, and now we will have our first offline talk. Oh, I can read it from here: it is about detecting design patterns in Android applications with CodeBERT embeddings and CK metrics, and the authors are Djivir Dlamini, Ahmet Usman, Leonel Karkvank and Vladimir Ivanov. Please welcome our offline presenter.

Thank you. Do I have a pointer?

Yes, and you have 18 minutes to speak. I will warn you two minutes before the end, and we will have five minutes for questions.

Can I use a clicker? Is it here? Good. So, limited time, but nevertheless the topic is fascinating for me. First of all, a couple of words about why I am working in this direction. I am head of the NLP lab for software engineering at the university, and we have a big contract supported by the Russian government, for about seven years, to develop tools for software engineering automation: generative models and a generative pipeline specifically in the software engineering domain. My lab is focused on code and text representation for understanding source code, for summarization and similar tasks, and for analysis of textual information. This talk is motivated by specific problems that software engineers, and not only software engineers but also managers of software projects, may have in real life.
Many applications are being developed and just stored somewhere, but sometimes program managers or project managers have trouble understanding whether the requirements and the structure of a project (specifically, we will talk about Android applications here) follow some predefined agreements about the architecture and the design patterns that developers agreed to use when they implemented the system. The motivation of this talk is that we can try to solve this task using machine learning tools. There are typical architectural design patterns on the right part of this slide. Raise your hand if you don't know what MVC, MVP and MVVM are. Basically, they are different approaches to splitting the internals of an Android application: separating the visualization part from the data management part and from the modeling part, and making these three parts decoupled in order to make software development easier, more robust and more maintainable.

Classical metrics for source code were used maybe 20 or 30 years ago to analyze what is going on in the software. We will talk about object-oriented programming here, and there is a well-known approach called CK metrics. They can be calculated from almost any piece of object-oriented code; they deal with the structure of methods and classes, how they are organized, how many methods are in a class, and so on. These metrics are widely employed not just for code quality but also in pattern detection. There are several families of approaches; I will not spend too much time on this. There are approaches devoted to analysis of the bytecode (we discuss Android applications here); there are approaches that analyze semantic features of the source code, like training a Word2Vec model to extract semantic features from the source, putting them into the form of an embedding, and then training a classifier based on this Word2Vec representation; and there is a third approach using only the CK metrics, where you run through the project, calculate all the CK metrics for the classes, and then make a decision about the pattern.

Our research questions were related to these two things. The first: is it actually important to add something on top of the CK metrics, that is, is it possible to use a pre-trained machine learning model that knows something about source code from many projects to extract features, on top of the CK metrics, that will be useful? And if it is possible to improve, then how well should such a model work? I will talk about these two contributions: the approach, which is very simple (maybe you will have some questions in the end, but it is not hard to understand what is going on), and the analysis of whether the embeddings and the classical features are useful for pattern detection.

The methodology may look a bit complicated in this picture, but nothing fancy is happening. You have a project, an Android application with many files. You can calculate the CK metrics, and you can also calculate the embeddings, the vectors that correspond to each file: for each Java file a pre-trained model outputs a fixed-size vector.
The second step is to pool, to combine and aggregate, those vectors into one single vector, and then again to train a classifier on top of this. So we have a first path using only the CK metrics, and a second path going from the source files through the CodeBERT model, a big deep learning model that extracts features. We can then concatenate the two types of features and decide whether to use both of them or only the CK metrics. For the classification part we use CatBoost, which is a state-of-the-art model. The thing is that we cannot really train this feature-extraction part: the usual approach would be to fine-tune CodeBERT or some other transformer, but here we don't have enough data for training, and that is the idea behind just using the pre-trained, frozen layers of CodeBERT.

To show the whole process step by step: the metrics representing one source file end up in one long list of numbers, each number corresponding to a specific metric. You have this array of metrics, around 10 or 20, actually more, because we use an extended package with 82 metrics. Our hypothesis is that these metrics uniquely identify some object-oriented structure; they measure the complexity of the design of the project, of the class, or of whatever the source code unit is, and they can capture coarse-level characteristics related to coarse-level patterns that are relevant to the target task.

As for the CodeBERT embeddings: maybe you have heard about BERT, the pre-trained model from Google. CodeBERT is a similar idea with almost the same architecture, but pre-trained on a huge corpus of source code instead of Wikipedia and internet text, basically all of GitHub or something like that, by training a masked language model on this data. Of course it is not state-of-the-art right now, but it is still an application of natural language methods in the software engineering world to extract information from source code. It seems, and again this is just a hypothesis, that this may capture patterns at a finer level than CK metrics, because it captures token representations. The problem is that for each file you get a set of representations, and the question is how to combine them. We tested several ideas for pooling and aggregation, like taking the maximum or the average; in the end we used summation of these values, which gave better results.

About the experimentation: the methodology just needs data, and the question is where we get it. The data comes from a repository, an already annotated dataset with pattern labels. It is a very small dataset anyway, about 22 to 26 projects, including projects that don't have any design pattern. Still, if you don't consider training, if you consider only the testing part, it is doable to apply this methodology in testing mode. So that is the dataset; I already discussed CatBoost a bit, and the hyperparameters you can find in the paper, of course. And we calculated classification metrics to evaluate the quality.
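To make this pipeline concrete, here is a minimal sketch of the path just described, assuming the publicly available microsoft/codebert-base checkpoint from Hugging Face, sum pooling over per-file [CLS] vectors, and default CatBoost settings; the helper names and the feature layout are illustrative, not the authors' code.

```python
# Minimal sketch of the described pipeline: frozen CodeBERT file embeddings,
# sum-pooled per project, concatenated with CK metrics, classified by CatBoost.
# The model name, pooling choice and helper names are assumptions for illustration.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from catboost import CatBoostClassifier

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
codebert = AutoModel.from_pretrained("microsoft/codebert-base").eval()

def file_embedding(source_code: str) -> np.ndarray:
    """Fixed-size vector for one Java file ([CLS] token of frozen CodeBERT)."""
    enc = tokenizer(source_code, truncation=True, max_length=512,
                    return_tensors="pt")          # inputs longer than 512 tokens are cut
    with torch.no_grad():
        out = codebert(**enc)
    return out.last_hidden_state[:, 0, :].squeeze(0).numpy()

def project_features(java_sources: list[str], ck_metrics: np.ndarray) -> np.ndarray:
    """Sum-pool per-file embeddings and concatenate with the 82 CK metrics."""
    pooled = np.sum([file_embedding(src) for src in java_sources], axis=0)
    return np.concatenate([ck_metrics, pooled])

# X: one row per project, y: design-pattern label (e.g. MVC / MVP / MVVM / none)
# X = np.stack([project_features(srcs, ck) for srcs, ck in projects])
clf = CatBoostClassifier(verbose=False)           # hyperparameters as in the paper
# clf.fit(X, y)
```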
Five-fold cross-validation was used because the classes are somewhat imbalanced, not hugely imbalanced, but still, so we can sample from the data and estimate the error. The results: the first table shows not very fascinating numbers, but still, for some patterns the approach achieves better results than the original paper, which used the CK metrics only, compared to the model that combines the metrics. You can see that for some patterns there is a drop in quality, for some it is the same, but in most cases there is an improvement, a slight one; it can be considered an improvement, although I am a bit pessimistic about these numbers because they are still far from good quality. Here we compare, by pattern type, how the CK metrics compare to CodeBERT embeddings plus CK. In most cases the plain CK metrics without any embeddings are outperformed by the embeddings, but sometimes you have an example like this one where the CK metrics are better.

Now the discussion of our initial research questions. We also analyzed the importance of different metrics, and this is where it comes to practical application. You can train a CatBoost model that will predict the design-pattern classes, but when it comes to real-world practice it is more important to get some interpretation of why the model predicts this or that. This part of the work was about finding the top five metrics that contribute to the decision, and the analysis confirms that some of those 82 metrics are actually not relevant, not responsible for the design pattern, while these particular metrics are the ones you as a manager should pay attention to. Whether you use machine learning or not, you can compute these metrics from the project's source code and say: this is an indicator of this or that design pattern, or of the absence of some design pattern. I will not go deep into this, because it requires domain knowledge; you should be an Android developer to understand what it means, and maybe it is not that interesting here.

The improvement that ML gives us on top of the CK metrics is, as I said, not that big; we call it moderate, and you could say it is maybe not even an improvement, but it opens a big question about the applicability of machine learning models of this kind to source code analysis. It shows clearly that this is not an easy task. Current models, like large language models applied to code, also struggle with the analysis of bigger contexts: large contexts, large projects and so on. They usually process function-level or class-level context sizes, and it is still an open question and ongoing research. I will conclude with the limitations. The major limitation, of course, is the size of the data, and also the interpretability of the embeddings themselves, which are just a list of numbers: CK metrics at least mean something, while BERT-style embeddings may be much less meaningful. So we are going to continue this work and develop a more robust and interpretable approach.
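A hedged sketch of how the evaluation and interpretation step could look: stratified 5-fold cross-validation over the imbalanced pattern labels plus the top five CatBoost feature importances; the macro-F1 metric and the helper layout are assumptions for illustration, not the authors' exact protocol.

```python
# Sketch of the evaluation described above: stratified 5-fold cross-validation
# for the imbalanced pattern labels, plus the top-5 most important features
# reported by CatBoost. The metric choice (macro F1) is an assumption.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score
from catboost import CatBoostClassifier

def evaluate(X: np.ndarray, y: np.ndarray, feature_names: list[str]):
    scores = []
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(X, y):
        clf = CatBoostClassifier(verbose=False)
        clf.fit(X[train_idx], y[train_idx])
        scores.append(f1_score(y[test_idx], clf.predict(X[test_idx]), average="macro"))
    # Feature importances from a model trained on all data (for interpretation only)
    clf = CatBoostClassifier(verbose=False).fit(X, y)
    top5 = sorted(zip(feature_names, clf.get_feature_importance()),
                  key=lambda p: p[1], reverse=True)[:5]
    return np.mean(scores), top5
```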
We also plan to do the ablation study without dimensionality reduction: for the sake of performance we reduced the dimensionality of the embeddings, but maybe we can do this on a more granular basis to understand the exact impact of CodeBERT. Do I have time?

One minute.

OK, so this is just the conclusion. You can find the code, source code and data here. This work was supported by the Russian Science Foundation and the university, and I am ready to answer any questions.

Thanks for your really interesting talk. There is going to be more and more code, and we need to process it in a more automated way. The preference now is for offline participants to ask questions. Do you have any questions, anyone? OK, please use the microphone so our online participants will also hear us.

I just want to ask: is CodeBERT available in only one size? I mean the size of the model, the number of parameters.

Sorry, I just understood: the question is about the number of parameters. We used only one set of parameters, one pre-trained model. It is not like BERT itself, which comes in several sizes, small, large and so on. But the problem here is maybe not the number of parameters, which is not a big number anyway compared to large language models; the problem is the input length. It is constrained to 512 tokens, and if you want to process the whole project you have to squeeze all these source files into 512 tokens, and some of them are just too big. So we cut all the comments and some other parts; there was a preprocessing stage. Even after cutting, some files still do not fit into the input of CodeBERT, and that is another limitation of this approach. One idea could be to move to bigger models that have a larger input window, a larger context size, which could help, or to switch to models based on a different principle that can process the whole repository, not just file by file. Did I answer your question?
Thank you for your presentation. It is not a question but more of a point I want to discuss a little bit. I think it would be cool if you extended your work to finding anti-patterns in code. That might be helpful if such a system could say, when new code is published and changed, that changes in the pattern appear, so not measuring the whole pattern on the project but maybe on a small part of the code compared to the whole project.

I think it is a good suggestion, thank you. There are such works on classifying and finding anti-patterns in source code, and as I understand it, your suggestion is about classifying commits that can break some pattern. There are such works, and definitely, coming back to the first slide about the Innopolis group, not only my lab at Innopolis is working on this; other people are trying to approach different problems. These days it is a bit hard even to find a good idea that is still available; there are a number of problems, of course, both for practice and for research. But thank you for the suggestion, even if it was not a question.

Are there any questions from the Zoom meeting? We don't really have any time for that, so let's thank the speaker again for the talk. The following talks will be held online, and our next talk is "A Data-Driven Approach for Identifying Functional State in Hemodialysis via Entropic Complexity and Formal Concept Analysis" by Ekaterina Zvorikina, Yuri Bishasnov and Vasily Gromov; they will be talking via Zoom. Dear online speakers, please share your screen or report to me if you have any problems with that. You have 18 minutes for the talk and 7 minutes for questions, thanks.

I think you have to enable screen sharing, then I can share. Could my colleagues please help me with that?

Is it really problematic? You don't have the host rights, sorry. OK, sorry, we have some technical problems; we'll be back in a couple of minutes. Since we have some technical issues now, you have some time to grab a tea and prepare for listening. Thanks for your understanding. Ekaterina, can you please try to share your screen now?

It still doesn't work.

Stepan, can you please enable the screen-sharing option so Ekaterina can open the presentation? Thank you. Try now, please.

Thank you. Two minutes before your time. Can you see the slides? Can you hear me well? OK, so I would like to start with a big sorry that I am not present offline, but let's try to maintain good quality online. Our work is devoted to time series analysis in a very interesting case: arteriovenous fistula bruit diagnostics. Let me make it full screen so it will be visible better. Is this better?

Not exactly, maybe try a different button.

Now it gets complicated. OK, maybe like this?

No, no, it's the whole slide in presenter mode; can you just try Ctrl+L?

Yeah, I'll try again from another screen.

Right in the middle, yeah, now it's great, thank you.

So I would like to give some context about the problem with the arteriovenous fistula bruit. As you know, chronic venous diseases are among the most common pathologies in humans now, and there is a growing use of AI methods for diagnosing these problems. In our work we decided to tackle this issue only for one group of people, those with chronic kidney failure, especially dialysis patients. The prevalence of chronic renal failure in these patients is growing now,
especially after the COVID pandemic, because people who go to dialysis face many issues, and if they cannot get to the hospital in a timely manner and have to stay in unsuitable conditions, it of course results in many other problems.

So what is this arteriovenous fistula bruit? The fistula is a special insert into the blood vessels in the hand of dialysis patients which helps to maintain an always-available vessel for the dialysis procedure: a small tube that connects the artery and the vein and stays there for some time while the patient is on dialysis, which can be for years. It is a very well-researched method with very good results in dialysis, but the main problem is that patients can develop thrombosis in this small channel, because it is of course not a natural vessel; it is easy for it to accumulate waste from other cells and to develop thrombosis, especially in elderly patients. The issue for the patient is that, when he stays at home, the development of this condition can be very fast, for example within a few hours or one day; he cannot diagnose it by himself, and it can lead to lethal conditions. The good thing is that the fistula makes a very characteristic sound that is called a bruit, and usually, when the patient visits the doctors for dialysis, they also check how this small channel sounds. The sound is a result of the blood going through it, so it somewhat resembles a heartbeat, like a pulse, but a bit different, and an experienced doctor can usually easily diagnose whether something is wrong and whether the patient needs a replacement of this channel. Of course, in many situations we do not live in an ideal world: the patient cannot go to the doctor every day, and as I said, this condition can develop very fast at any time. That is why we decided to use mathematical approaches to tackle this issue. In this study we analyzed 290 patients from different sites in Russia, with the help of doctors.

Our motivation was to propose two mathematical methods for analyzing these fistula bruit sounds as time series and to try to distinguish them, based on the idea that a normally functioning fistula makes the sound of laminar blood flow, while in pathology it is closer to turbulent flow, and we can distinguish these two cases. During this project we decided to first investigate existing approaches, then compute our metrics, an overlay of the entropy and complexity metrics of each sample on the entropy-complexity plane, then apply two different clustering methods to classify the results of these two metrics, and of course interpret our results compared to the real situation given to us by the medical professionals. So how do these fistula bruit noises look?
If it is a normal fistula, we can see a more or less non-chaotic time series, and if it is a pathological one, it looks very different. In our case, unfortunately for us, but fortunately for the patients, of course, most of the patients had normal fistula sounds: only 50 patients out of 700 records had a dysfunctional fistula, and we also had 61 cases where the doctors were not sure whether it was a pathology or not. For the doctor it is easy, he can see the patient the next day and decide whether something is wrong; for us it wasn't so easy, and this is a problem for any method analyzing these noises. We also decided to work with a short segment of sound, because we do not see a big difference in the results of our methods between taking a very long record and a short one.

For the entropy-complexity analysis we use the Shannon entropy and the Jensen-Shannon divergence for complexity. Then we used a well-known clustering algorithm to cluster the results: to each time series we assign the two metrics, entropy and complexity, and then basically cluster them on the plane. The other approach involved constructing an attribute-object graph based on the same two metrics: each time series was turned into a binary matrix based on the analysis of the record, and this binary matrix was organized into an attribute-object graph, where each element was an object, and for each object we checked whether the binary matrix values were aligned. This is a simple example of such a graph with, I think, six objects; I will show later how it looks for the time series that we actually used, and it will be much bigger and less easy to follow.

Now the most interesting part, the results. For the first method, the entropy-complexity analysis, this is how our clustering results look, and we can clearly see three groups of clusters. If the complexity and entropy are low, we think this is a very organized, non-chaotic time series and that this fistula works well. The yellow cluster means high complexity and high entropy, a very chaotic time series, and we suspect this fistula is dysfunctional. We also have a relatively big cluster in the middle for which we cannot give a certain answer whether it is a working or non-working fistula; this cluster is the question for the next part of our work, how we can analyze this data. We also got a very small cluster at the top where complexity and entropy are very, very high, and I think this is due to bad recordings of the fistula sound: it is recorded with a dictaphone or a microphone, and usually the doctors speak during the procedure, so sometimes we cannot eliminate that sound.

For the object-attribute approach, these are the clusters that we got. As you can see, for an organized time series we have a very neat scheme, while for a disorganized one it looks more like a mess, because most of the objects and clusters had a lot of alignments, and this is considered a bad one. We also show how it can look in the XY plane, so it is easier to follow, and in this case this method actually gave better, more distinguishable results: we can see two clusters, one meaning healthy fistulas and the other meaning that the patient should go to the doctor. These two methods align, but the question remains how to interpret the third cluster, which is neither very chaotic nor very organized, and why we do not see it here.
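As an illustration of the entropy-complexity step, here is a minimal sketch. The talk only states the Shannon entropy and the Jensen-Shannon divergence as the complexity measure; the ordinal-pattern (Bandt-Pompe) distribution used to turn a recording into a probability distribution, and KMeans for the clustering step, are assumptions of this sketch, not necessarily the authors' choices.

```python
# Sketch of placing each bruit recording on the entropy-complexity plane and
# clustering the points. The ordinal-pattern (Bandt-Pompe) distribution is an
# assumption about how entropy/complexity are computed; the talk only names
# Shannon entropy and the Jensen-Shannon divergence.
import numpy as np
from itertools import permutations
from sklearn.cluster import KMeans

def ordinal_distribution(x: np.ndarray, d: int = 4) -> np.ndarray:
    """Probability distribution of ordinal patterns of embedding dimension d."""
    patterns = {p: 0 for p in permutations(range(d))}
    for i in range(len(x) - d + 1):
        patterns[tuple(np.argsort(x[i:i + d]))] += 1
    p = np.array(list(patterns.values()), dtype=float)
    return p / p.sum()

def shannon(p: np.ndarray) -> float:
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

def entropy_complexity(x: np.ndarray, d: int = 4) -> tuple[float, float]:
    p = ordinal_distribution(x, d)
    n = len(p)
    u = np.full(n, 1.0 / n)                      # uniform reference distribution
    h = shannon(p) / np.log(n)                   # normalized Shannon entropy
    js = shannon((p + u) / 2) - (shannon(p) + shannon(u)) / 2   # Jensen-Shannon divergence
    js_max = -0.5 * ((n + 1) / n * np.log(n + 1) - 2 * np.log(2 * n) + np.log(n))
    return h, (js / js_max) * h                  # statistical complexity C = Q_J * H

# points = np.array([entropy_complexity(rec) for rec in recordings])
# labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(points)
```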
When we tried to combine the results of the two methods by a simple linear regression, we got, I think, five clusters, but the tendency looks the same: we can distinguish chaotic and organized time series quite well, but we have a few big clusters in the middle that we cannot assign as healthy or not. After talking with the doctors, this also aligns with reality, because they too sometimes cannot diagnose these patients. When we compared our results with the doctors' analysis we got a pretty good alignment, even though we had fewer than 300 samples to discuss with them.

As a conclusion, I can say that we developed these two new methods, not only for time series analysis but also for fistula bruit diagnostics. Both approaches worked and aligned with the doctors, individually and in combination, and we observed a significant correlation between the classification results. We also compared them not only with the doctors' analysis but also, for some patients, with Doppler ultrasound, and it worked well. Regarding the limitations of the study: of course, the relatively small sample size of fewer than 300 patients; also, the data was anonymized, so we could not compare, for example, male patients, we could not know whether they were overweight, which can also result in different sounds, and we don't know the age of the patients, which can also make a big difference. It is also still a black box for us: we cannot tell whether our clusters are the same functional categories that doctors assign to the patients. For the future development of this project, our perspectives are of course a bigger sample group and knowing more about the patients, maybe the age, the weight or other comorbidities they have, for example comorbid illnesses. Also, during the first months after the fistula construction procedure, the fistula goes through a few stages of growth, and we can assume that the third cluster that we see in the middle may be due to the fistula still being, as the doctors call it, young, so that we cannot compare it with other fistulas. This could also help to see whether a postoperative fistula will be healthy and work for a long time, or whether the patient will need another operation soon. That's it, thank you for listening, and if you have any questions I am here to answer them.

Thank you, Ekaterina. Special thanks from me, the chair: you made it in less than 14 minutes, so we can try to keep on track. Are there any questions from the offline audience? Yes, OK, I will start with you, Dmitry.

Thank you very much for your work. I am especially happy that you used formal concept analysis, since this is one of our favorite topics in our department, for example. The question is about clustering: whenever you use formal concepts and concept lattices, they form big search spaces rather than a single clustering. What kind of selection tools did you use here to find a good clustering, a good partition? Can you comment on it?

Let me come back to the clusters. Actually, we did not develop a new method; we used one already described in another article. The only thing is that we used an extra tool to go through the alignments of different objects through the lattices a second time, just to see whether they combine into more objects or not, so it looks more like a heuristic way to extract the clustering. I must say that even brute force was faster at going through the lattices than any other method; we
used it only because it was faster for us in this case.

Let me allow one more comment. At the beginning of the millennium I was also part of an international dialysis center, working as a system administrator, and I remember the people who took these procedures three times per week; it is very important to support their lives. And one more comment from a program committee member: he had relatives who also passed away, and he supports your work very much. Thank you.

Thanks, it's great to hear it is not useless.

Thank you, Dmitry. Thank you for a very nice talk; I actually have a couple of questions. The first, maybe, to figure out: was this clustering done because you don't have the ground-truth labels for the samples, or am I missing something? Because on the slide you have labels like 0 and 1. The question is, why didn't you try it as a supervised problem?

Because here we wanted to check, even for this analysis, whether it would be better to use clustering rather than classification, as most people who work in healthcare AI do. We decided to try clustering because we were not sure how many clusters we would get in the first place. Here, yes, we can see the two clusters, but we would also be happy to see more than two, and if I analyze more data, for example another batch with patients from a different hospital site, there can be more than two clusters. And here we are very cautious about giving labels, because we are also not sure whether these clusters correspond to healthy and non-healthy patients or just to chaotic and non-chaotic time series.

OK, thank you. And the second question: as I understand, you analyze time series, and the time series comes from the sound. Am I correct that it is a sound wave, like an amplitude?

Yes, it is an amplitude; usually it is just a sound recorded from the microphone, and then we use this amplitude signal.

OK, but my question then: have you preprocessed it or transformed it to the frequency domain, and if you didn't, why? Maybe there is some motivation behind analyzing the pure amplitude data in your case, but usually people transform it to the frequency domain, doing some preprocessing.

The only thing we did was try to clean up the sound with different methods, but even after cleaning it didn't help very much, because the records were made by different doctors in different conditions. If we try to make the sound ideally clean, we lose the actual signal, and if we try to make it only 20% cleaner, there is no change in the results. But yes, I appreciate this idea; we should try to analyze it in the frequency domain.

Yes, it is kind of a classical approach if you want to cluster or classify sound; maybe it is worth at least trying to convert to frequencies. Anyway, thank you very much.

Yeah, thank you.

Are there any questions from the online audience? OK, it seems we don't have any questions from Zoom. Any questions offline? No? OK, so let's thank Ekaterina again for her talk. The next talk is "An Application of Dynamic Graph CNN and IFICP for Detection and Research of Archaeological Sites" by Alexander Vakhmintsev, Olga Kristo-Dula, Andrey Milnikov and Matvei Ramanov. Dear presenters, are you here?
Yes, we see some screen sharing. Welcome, dear colleagues.

Yes, a rather unpleasant story happened to me just before the conference: it turned out to be a rare variant of the coronavirus...

Could I please ask you, since we have an international conference and some of the speakers do not understand Russian, could you please switch to English?

That is just one of the problems that occurred to me. The head doctor only allowed me so much; I am in a hospital that does not allow me to speak English. If you forgive me, would it be possible to speak in this way?

Well, could we... I don't know, unfortunately... I am so sorry. Yes, OK. So you are forbidden to speak English by the doctor; it is, yes, kind of unusual. Unfortunately, some of the speakers really do not understand Russian. Maybe we could move your talk to the end of the session? If you could please give your presentation last, then the people who are not prepared to hear the report in Russian could leave for another section and we would not waste their time. Would this be acceptable for you, please?

Yes.

Unfortunately, yes. OK, thank you so much for understanding. Then I should ask Vladimir Belikov, who is next. Could you please stop sharing the screen? With us is Vladimir Belikov, who is next. Oh, we have a second offline speaker; could you please give an applause, because we were expecting your talk to be online. The talk is on clustering with heterogeneous transfer learning, by Vladimir Belikov. Do you have your presentation uploaded, or is it not yours? Could the online presenters please stop sharing the presentation, just one minute. Thank you for coming offline; we have some small technical difficulties. At least from my perspective I was surprised, because I had information that you would be giving the talk online; I got false information, sorry for that, and maybe that is the reason I had it like this in the timetable.

Thank you, dear colleagues. Let me introduce my presentation on a combination of clustering ensemble models and domain adaptation, or transfer learning. So what is transfer learning, sometimes called domain adaptation? We have two domains: the first is the target domain, the domain of interest, and besides that we have some additional information, the source domain. For example, we have text in English, and we can use this information to improve the classification or clustering of text in German, a related language. There are different kinds of transfer learning, supervised or weakly supervised, depending on the labeling process: the data may be completely labeled or only imprecisely labeled. In this work we consider so-called heterogeneous transfer learning, where the domains have quite different feature spaces. We have the target domain, for which it is required to obtain a partition of the data set into some number of clusters, and the additional information is the source domain, where objects are described in a different feature space with a different feature dimensionality. The source domain is labeled, and we should perform clustering of the target domain. It is hypothesized that the domains have something in common in their structure, some common regularities, and these regularities can be revealed by cluster analysis and used as additional information to improve clustering; what is meant by improving we shall discuss later. We use the cluster ensemble methodology, where we have a number of clustering results and combine them to obtain a consensus variant of the partitioning. There are some known works on this
problem, but these works have some limitations: for example, they consider common feature spaces, or the time complexity is very high, cubic complexity, which is too much for many applications, or they require multiple additional source domains, which are not easily found in practice.

The proposed method has four basic stages. The first is independent analysis of the data: we perform clustering of both the source and target domains in parallel, independently, and use a low-rank representation of the obtained similarity matrix to decrease the computational cost. The second stage is extraction of knowledge: we use a supervised classification algorithm and find a classifier for predicting the elements of the co-association matrix in the source data, because we know the labeling of the source data, and transfer it to the target data using so-called meta-features, which describe some common regularities in the data structure, for example the number of clusters, the form of the clusters, or some characteristics of that form; they do not depend on the initial feature spaces. Then we use the found regularities to predict the co-association matrix in the target domain, and in the last step we perform the final clustering: we construct the partition of the target data using the predicted co-association matrix.

A few words about ensemble clustering. Here you can see an example: we have several partitions of the data obtained by different algorithms, or by one algorithm with different initializations, different working parameters and so on. Then we compute the averaged co-association matrix, where each element is, for a given pair of objects, the frequency of falling into the same cluster. Using this matrix we find the consensus partition with some algorithm that takes this information about similarities between objects, for example a hierarchical clustering algorithm, spectral clustering and so on. The co-association matrix can be represented in a low-rank form using rectangular cluster-assignment matrices, and this gives significant memory and computational savings, because it is not necessary to keep a large quadratic matrix in memory; a million by million matrix is a huge number of elements. The steps of the algorithm are shown here: we perform independent analysis of the source and target data using this algorithm, and use spectral clustering on the low-rank-represented matrix of similarities.

Now some previous papers. We have considered some probabilistic properties of cluster ensembles: if we suppose there is some ground-truth variable that determines, for each pair, whether it belongs to the same cluster or to different clusters, we can define the conditional probability of classification error, and it is possible to use some regularity assumptions, probabilistic assumptions, to prove that the classification error of the algorithm converges to zero as the size of the ensemble grows; other things being equal, diversity in the ensemble gives a smaller error. However, in practice some, many or all of the assumptions can be violated; for example, we do not have much memory, and so on. So for a small number of ensemble members we use additional data, the source data, to improve the clustering results through meta-features. First of all, we use the frequencies of assignment of objects to the same clusters, that is, the elements of the co-association matrices from the source and target domains; then we use a meta-feature based on the silhouette index, a well-established internal clustering index, which we define for each pair of objects using the additional information.
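A minimal sketch of the ensemble-clustering building block just described, assuming k-means runs with different seeds as the ensemble, a 0/1 membership matrix as the low-rank factor, and average-linkage hierarchical clustering on the resulting dissimilarity for the consensus step; none of this is the author's exact implementation.

```python
# Sketch of ensemble clustering with an averaged co-association matrix. B is the
# low-rank 0/1 membership factor (n x k) for one ensemble member; for very large
# data one would work with the factors directly instead of forming H explicitly.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def coassociation(X: np.ndarray, n_clusters: int, n_members: int = 10) -> np.ndarray:
    n = X.shape[0]
    H = np.zeros((n, n))                              # averaged co-association matrix
    for seed in range(n_members):
        labels = KMeans(n_clusters=n_clusters, n_init=1,
                        random_state=seed).fit_predict(X)
        B = np.eye(n_clusters)[labels]                # low-rank membership matrix
        H += B @ B.T                                  # pair co-occurrence for this member
    return H / n_members

def consensus_partition(X: np.ndarray, n_clusters: int) -> np.ndarray:
    H = coassociation(X, n_clusters)
    D = 1.0 - H                                       # dissimilarity between objects
    model = AgglomerativeClustering(n_clusters=n_clusters,
                                    metric="precomputed", linkage="average")
    return model.fit_predict(D)
```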
Then we calculate the co-association matrix and find the decision function, a classifier for predicting the elements of the matrix depending on the meta-features used. We can use machine learning algorithms such as random forest, support vector machine, artificial neural networks and so on, and some well-known techniques can be used to evaluate the quality of the classifier or to find the important meta-features. It is then possible to transfer the found classifier to the target domain for predicting the co-association matrix, and the final step is clustering based on the predicted matrix. There is a problem: this matrix cannot be used for clustering directly, because some metric properties can be violated for this matrix, so we apply an approximate solution: we start from some initial partition of the target data and then move individual points to other clusters to get the best improvement of the criterion. These are the steps of the algorithm: we perform the four stages, independent analysis, finding meta-features, finding the classifier and transferring it from the source to the target domain, and we find the final partition of the data. Unfortunately, the time and memory complexity are of quadratic order, because we need to consider all pairs of elements. It can be improved using methods such as stochastic gradient descent; with this method it is possible to consider only part of the data, not all pairs of data points, by taking sub-samples, and this reduces memory.

We have performed experiments with artificial data sets using Monte Carlo simulation: we generate data multiple times, perform clustering, measure the quality of the result, and then average the results over all experiments. Here I show some examples of the generated data; we use k-means as the basic algorithm and random forest and support vector machine for classification and knowledge transfer. The next example is more realistic, though I think still an illustration of the method: we used the MNIST data set of handwritten digits and performed classification using a feed-forward artificial neural network with batch normalization, trained with gradient descent. In addition to the above meta-features, we also used additional ones such as normalized pairwise distances between objects and the average distance to the closest centroid. To evaluate the quality of clustering we applied an external cluster validity index. There are different types of external indices; for example, the Adjusted Rand Index estimates the degree of similarity between two partitions, the first being the obtained partition and the second the ground-truth partition, and this index is corrected for chance: it estimates the probability of coincident assignments of object pairs to the same or to different clusters. The formula is given here; the closer to one, the better the matching between the two partitions, and an index close to zero indicates a near-random correspondence.

The results of the experiments: first of all, for artificial data it can be seen that the proposed algorithm gives some improvement of the clustering quality, and this is an example of the decision boundary obtained by the support vector machine algorithm. It can be seen that the silhouette-based index also gives some information on the decision boundary; both of the features are useful for classification, but of course the co-association-based one is more important.
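For the evaluation step, here is a short hedged sketch assuming scikit-learn's adjusted_rand_score; the Adjusted Rand Index is the pair-counting Rand index corrected for chance, ARI = (RI - E[RI]) / (max RI - E[RI]). The consensus_partition helper is the illustrative function from the earlier sketch, and plain k-means serves as a baseline; neither is the author's implementation.

```python
# Sketch of the Adjusted Rand Index evaluation described above: compare the
# consensus partition of the target data against the ground-truth labels and
# against a plain k-means baseline. consensus_partition() is the illustrative
# helper from the previous sketch.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def evaluate_target(X_target: np.ndarray, y_true: np.ndarray, n_clusters: int):
    baseline = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=0).fit_predict(X_target)
    ensemble = consensus_partition(X_target, n_clusters)
    return {
        "ARI k-means baseline": adjusted_rand_score(y_true, baseline),
        "ARI ensemble consensus": adjusted_rand_score(y_true, ensemble),
    }
```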
These are the results of the experiments for real data, the MNIST database. It can be seen that the algorithm also gives some improvement, not ideal, of course, but some improvement of the clustering quality, shown as the average, best and worst results. Here is an example of the clustering: you can see that the first cluster includes correct assignments, and there are some mistakes in the second cluster. In conclusion, we propose a modification of ensemble clustering based on transfer learning. Of course, this is not an ideal algorithm, and some future work is planned: for example, my students from Novosibirsk State University and I are going to look at other types of meta-features and at applications of the method in different fields, for example text document analysis. That is all, thank you for your attention.

Thank you for an interesting talk. Any questions from the audience? Well, if I may, I have one question myself. To the best of my understanding, and please correct me here if I misunderstood, your work adds another layer on top of classification and clustering to solve this problem; you are kind of doing it on the meta level. My question is actually about the interpretation and explainability of the features: when you add more and more layers, even if your underlying classifier is interpretable, is this interpretability passed through your algorithm? What do you think, could some approaches in this direction help those who want to do it?

Yes, I think it is possible to evaluate the importance of the features using methods such as random forest and so on. Yes, I think it is possible to understand this.

OK, so you are proposing to do the interpretability on the produced features, right? OK, thank you. Any other questions? OK, maybe a second question from myself, related to the number of hyperparameters, because you have this rather complicated data analysis and feature engineering on top of it. I am interested in how many hyperparameters your meta layer adds compared to the simpler parts.

Correct. I think about a dozen hyperparameters on the meta-features, following different literature, but we tried varying some of them and the effect was pretty small.

OK. And I assume there are also no online questions; I actually expected people to ask through Zoom, but there may be some from the YouTube stream. So let's thank the speaker again.

Thank you so much.

And now our next talk should be an online talk; the work is named "Metamorphic Testing for Recommender Systems" by Sofia Yakusheva and Anton Hrytankov, sorry if I misspelled something. Let us switch to the Zoom broadcast now. Are the speakers here?

Yes, I'm here, can you hear me?

Yes, we can hear you super clearly. Could you please share the screen?

Just a second. So, hello everyone, can I start my presentation?

Just a second, so that the offline listeners can also enjoy this... yes, OK, please start.

Thank you very much. Hello everyone, I am very happy to see all of you today and to present our work to you. My name is Sofia Yakusheva, I am an assistant at the Department of Algorithms and Programming Technologies at the Moscow Institute of Physics and Technology, and Anton Hrytankov is my supervisor. Today we will talk about metamorphic testing for recommender systems. Recommender systems are a popular topic today, but the problem is that they can badly hurt users if we don't test them carefully, so we should test
them as carefully as we can. But we face a lot of problems: for example, lack of test data or the need for human judgment, and all of these have a very high cost, so we can't use humans a lot. Another problem is the stochastic nature of recommender systems, because not all testing methods can be applied to such systems. And last but not least, the test oracle problem. What is that? A test oracle is a partial function to the set of...

Sorry to interrupt you, could you please... I think we see a panel on your screen; the part of the Zoom window on your side is covering the right part of the slide. Could you please fix it?

Is it better?

Yes, now much, much better, sorry.

I'm very sorry. So, what is the test oracle problem? As I said, a test oracle is a partial function; in simple words, this function says whether the test is passed or not. The simplest example: if we have the right answer for the test, we can just compare the answer of the program with the right answer and say whether the test is passed. The test oracle problem says that sometimes it is computationally expensive or even impossible to get a test oracle for some problems, and testing recommender systems is such a problem: in general, they do not have a test oracle. Fortunately, there are techniques for testing such problems, and one of them is metamorphic testing. The key idea of this method is not to check every single output; instead, we have many inputs of the program and many corresponding outputs, and our task is to check whether there is some relation between these inputs and outputs. The simplest example: if we have a database and we query it twice, using filter A for the first request and filters A and B for the second, then the answer to the second request must be a subset of the answer to the first. We don't check whether the first or second answer is correct in itself; we just check the relation between them.

We try to apply this method to testing stochastic systems. I should say that some articles on metamorphic testing use statistical methods, but they use only criteria and do not pay much attention to a general treatment of stochastic testing, so we make a generalization and propose stochastic metamorphic relations. What is that? A classic metamorphic relation is just a deterministic function from many inputs to the set {0, 1}; we consider this function as a composition of a sampling procedure and a decision function, where the sampling procedure is stochastic and the decision function is something like a statistical criterion. In this case we get much more information about our system than if we use only classic metamorphic relations. We formulate some requirements for recommender systems in general; you can see all of them here: technical reproducibility, ability to learn, response to changes, comparison of different models, homogeneity of parameters, some symptoms (individual features of algorithms), and the response to linear transformations of distribution parameters. You can see the word "bandit" in some places: we applied our method to the multi-armed bandit problem. A multi-armed bandit is a model of a slot machine with several arms; if the user selects an arm, he can gain a reward, and the task of the user is to maximize this reward. This model is used in services like Amazon or Spotify, when we have a list of options and want to show the user only a short sub-list of it. We propose six metamorphic relations for the multi-armed bandit problem.
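A tiny sketch of the classic relation just mentioned, with hypothetical records and predicates: only the subset relation between the two outputs is checked, not the correctness of either output on its own.

```python
# Tiny illustration of the classic metamorphic relation mentioned above: the
# result of filtering with A and B must be a subset of filtering with A alone.
# The record structure and the predicates are hypothetical; no single output
# is checked for correctness, only the relation between the two outputs.
def query(records, *filters):
    return {r for r in records if all(f(r) for f in filters)}

records = [("book", 12.0), ("pen", 1.5), ("lamp", 25.0), ("mug", 7.0)]
filter_a = lambda r: r[1] < 20.0           # price below 20
filter_b = lambda r: r[0].startswith("b")  # name starts with "b"

assert query(records, filter_a, filter_b) <= query(records, filter_a)  # MR holds
```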
I want to show you only two of them; the other four can be derived easily from the requirements, but these two are more interesting, I think. The requirement for the first was the assumption about homogeneity of parameters: if we permute the bandit's arms, the reward should remain the same, which means the algorithm does not pay attention to the number of the arm, and I think that is a very important property of an algorithm. The second is the comparison of less and more profitable bandits: if we apply some linear transformation to the probabilities of getting a reward, the reward will change correspondingly, it will be smaller or bigger, and for an optimal algorithm this reward can be obtained from the linear transformation that we applied to our probabilities. I hope I explained this OK.

We tested a lot of bandit algorithms; some of them were stationary, some non-stationary (for example, f-dsw TS is non-stationary), and for comparison we used a random algorithm and an optimal algorithm. We used a multi-armed bandit model with a Bernoulli distribution on each arm, and these distributions were non-stationary in time. We got some interesting results. For example, we compared different parameters for the EXP3 algorithm and noticed that for some of these parameters the algorithm works a lot worse than for others, so our stochastic metamorphic relations are useful for detecting better algorithm parameters. For FD3TC we noticed that the permutation of arms has a lot of impact, which is not very good.

There are some examples of failures that we detected. This picture shows a failure in configuration: our metamorphic relation was applied only to the first algorithm in the batch of algorithms while the others stayed the same; it was our mistake, and fortunately we corrected it. Another example is a configuration failure too: in our project the configuration files were a bit complicated, and this picture shows almost identical experiments that were supposed to be completely different. And the most interesting result: we applied the sixth metamorphic relation, the one about the linear transformation of probabilities. We expected that if we apply this linear transformation to the first experiment, we will get the same reward as in the second experiment, where the probabilities were already transformed, but we did not get such a result; the values were different. For the random algorithm the values were the same, but for the others they were completely different, and you can see in the picture, for the purple line, that this algorithm does not even multiply the reward like the others do; this was the f-dsw TS algorithm, which is non-stationary. There we used the stochastic part of the metamorphic relations to analyze the error: you can see that for the random algorithm this difference is almost zero, for EXP3 it is small but still not zero, for the Thompson sampling algorithm the difference decreases over time while for EXP3 I think it is rising, and for the f-dsw TS algorithm the difference is significantly bigger than zero and bigger than for the other algorithms. We consider this result an individual feature of the f-dsw TS algorithm, because the model used in this algorithm is more complex than the others. In conclusion: we considered the problem of recommender system specification, proposed stochastic metamorphic relations, formulated requirements for recommender systems in general, derived stochastic metamorphic relations for the multi-armed bandit problem, and found some failures.
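As an illustration of the first relation (arm permutation), here is a hedged sketch with a simple epsilon-greedy agent standing in for the algorithms under test; the agent, the number of runs and the tolerance are assumptions for illustration, not the paper's setup.

```python
# Sketch of the first metamorphic relation described above (homogeneity of
# parameters): permuting the arms of a Bernoulli bandit should not change the
# reward an algorithm collects. The epsilon-greedy agent is an illustrative
# stand-in, not one of the algorithms from the talk; the tolerance is arbitrary.
import numpy as np

def run_epsilon_greedy(probs, horizon=2000, eps=0.1, rng=None):
    rng = rng or np.random.default_rng()
    counts, values = np.zeros(len(probs)), np.zeros(len(probs))
    total = 0.0
    for _ in range(horizon):
        arm = rng.integers(len(probs)) if rng.random() < eps else int(np.argmax(values))
        reward = float(rng.random() < probs[arm])     # Bernoulli reward
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        total += reward
    return total

def check_permutation_relation(probs, n_runs=200):
    rng = np.random.default_rng(0)
    perm = rng.permutation(len(probs))
    original = np.mean([run_epsilon_greedy(probs, rng=rng) for _ in range(n_runs)])
    permuted = np.mean([run_epsilon_greedy(np.asarray(probs)[perm], rng=rng)
                        for _ in range(n_runs)])
    return abs(original - permuted) / original < 0.05  # stochastic decision rule

print(check_permutation_relation([0.2, 0.5, 0.8]))
```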
The code of our experiments is accessible on the internet; you can see the links on the screen. And a small addition from experiments that we did not include in the paper: we tested some contextual bandits, which use additional context information, for example the season, the time of day or the day of the week. We considered two algorithms, C2-UCB and LinUCB, tested them on a constant context, and discovered that the LinUCB algorithm learns very well while C2-UCB does not learn at all; we concluded that the C2-UCB algorithm is much more focused on the context than LinUCB. So our stochastic metamorphic relations are useful for discovering such things as well. Thank you for your attention, my talk is over, and I will be happy to answer your questions.

Dear Sofia, thanks for your interesting talk, and additional thanks from me, the chair: you made it in less than 12 minutes, so we will be just on time, unlike the German trains. Do we have any questions from the audience? I have a question about this approach. To the best of my understanding, you propose, in a way, another way of doing A/B testing, another approach to the multi-armed bandit problem. I am curious whether you have thought about a combining strategy, combining your algorithm with other known algorithms in the field, such as epsilon-greedy or Thompson sampling, whether it could yield better results, whether a combination of algorithms would work, for example switching from one mode of decision making during the test to another. Thanks.

So, metamorphic testing is more about offline testing of the algorithm, when we test models, a verification of requirements, rather than A/B testing, but the idea of these stochastic metamorphic relations is quite similar because we use some statistics. For future work we will try to make compositions of these metamorphic relations for complex systems: we can propose a relation for a component of the system, but for the whole system it is much more complex, hard, complicated. So this is a direction of our future work. As for A/B testing, maybe it is a good idea, but unfortunately for A/B testing we would need users, and this is quite expensive for our small research; we just ran some experiments and got results. But your idea is very interesting, thank you.

Yeah, thank you. We have a question from the audience.

Thank you for the talk. Am I right that you tested everything with Bernoulli rewards in the model at this stage?

Yes, just bandits with Bernoulli rewards; the reward is 0 or 1 at every step.

I see. The question is, do you plan to extend it to different models of reward, different distributions?

Of course, we plan this.

Good, thank you. Any other questions? Well, it seems we are out of questions now, but we still have some additional time, so let's thank the speaker again.

Thank you very much for your attention.

Yeah, thank you very much. Our next talk is titled "Machine Learning for Image Recommendation Systems" by Mikhail Fanyakov, Anatoly Berdukov and Delya Makarov. Are the presenters here?

One moment... can you hear me?

Yes, we can hear you clearly.

One moment... are you seeing it?

Yes. If you could just... wow, now it's just great, thank you. You have 18 minutes.

One moment. Good morning, my name is Mikhail Fanyakov and I am a graduate of the Higher School of Economics, the Master of Data Science program. My supervisor is Anatoly Berdukov. The topic of my paper is the application of multimodal machine learning for image recommendation systems.
The topic of my paper is the application of multimodal machine learning to image recommendation systems. We live in an era of an abundance of information, and it can be very difficult to find the necessary information or important content. People use online stores, content platforms and marketplaces, and all this information needs to be systematized. People upload images of everything they prefer, be it a product, a landscape or a funny photo, so images have become a basis for building recommendations and one of the main data types of recommender systems. What is a recommender system? It is a class of machine learning algorithms that use data to help predict and find what people are looking for among an exponentially growing number of options. Such systems have one basic goal: to solve the problem of information overload for users searching for goods. However, traditional recommendation algorithms are not perfect. There are many additional types of data that can be used for recommendation purposes: additional properties of the goods, descriptive facts about them, or metrics we can compute from these properties. Such characteristics are called multimodal information, and systems that use this information are called multimodal. What is multimodality? It is the application of multiple approaches within one medium. What is multimodal data? It is data of different types, such as different embeddings, text descriptions and various metrics; it allows us to add features which simplify the training process and help obtain high-quality recommendations. To build my recommender system I needed a dataset for training the model. I used a dataset with images from Yandex, and it has the following structure: an initial image, which can be a completely random image; a candidate image, the picture which is or is not recommended for the initial image; and the target, a binary label for the pair of images, one if the images are similar and zero if they are not. More than 75,000 pairs have target one and about 30,000 have target zero, so the dataset is imbalanced and needs some processing. To build a multimodal system we should use different features for each image. An important part of the recommender system is the description text of the picture; I parsed it from Yandex Images using the BeautifulSoup Python library, which is the simplest and most accessible method, and obtained texts in different languages for each picture. I processed all images using CLIP, a very useful technique for image embeddings with many available models. For text processing I used BERT, namely a multilingual model covering 104 languages. The next step of data processing is computing the metrics: for each pair of images and texts I used two metrics to measure pairwise distances between the two CLIP vectors for the images and the two BERT vectors for the texts, namely cosine similarity and Euclidean distance. I chose among four models: a decision tree classifier, a random forest classifier, an XGBoost classifier and a CatBoost classifier, with and without query groups; I used groups of candidate images, with the number of groups equal to the number of initial images. We see that the CatBoost classifier gives the best results. So the dataset has the following structure, ten features in total: two CLIP embedding features, two BERT embedding features, four numerical metric features, one binary target variable, and one categorical query feature with the number of the related initial image.
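As an illustration of how such pair features can be assembled, here is a sketch assuming the sentence-transformers package; the model names and the scraping pipeline are placeholders and not necessarily the ones used in the paper.

```python
from PIL import Image
from scipy.spatial.distance import cosine, euclidean
from sentence_transformers import SentenceTransformer

# Hypothetical model choices: a CLIP model for images and a multilingual sentence
# encoder for the scraped captions (the talk mentions a 104-language model).
image_model = SentenceTransformer("clip-ViT-B-32")
text_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def pair_features(initial_img, candidate_img, initial_caption, candidate_caption):
    """The four distance features computed for one (initial, candidate) pair."""
    clip_a, clip_b = image_model.encode([Image.open(initial_img), Image.open(candidate_img)])
    bert_a, bert_b = text_model.encode([initial_caption, candidate_caption])
    return {
        "clip_cosine": 1.0 - cosine(clip_a, clip_b),   # scipy returns a distance
        "bert_cosine": 1.0 - cosine(bert_a, bert_b),
        "clip_euclid": euclidean(clip_a, clip_b),
        "bert_euclid": euclidean(bert_a, bert_b),
    }
```

Together with the raw embeddings, the binary target and the query identifier, these four metrics would give the ten columns listed above.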
Here we see that the optimal learning rate is 0.001. My multimodal recommender system is based on the CatBoost algorithm, an algorithm for gradient boosting on decision trees which was developed by Yandex engineers. I used the CatBoostClassifier from the catboost package, the classifier variant of the CatBoost algorithm, because my dataset has only two target values, 0 and 1. For model construction the data needs some preprocessing: the metric features were scaled using the standard scaling method. I trained the model with the following hyperparameters. Iterations: 2000, which is the optimal number of training iterations for such data. Early stopping rounds: 20, so the algorithm stops training if the evaluation metric stops improving. Learning rate: 0.001; I need a small gradient step size, as it helps the model learn more accurately. Max depth: 6, the optimal tree depth for a classification model. And scale_pos_weight: 2.64; since I have imbalanced data, this parameter is set to the ratio of the majority class to the minority class. Now the model results. As the loss function I used the logarithmic loss, the standard training metric for a binary classification model. As the evaluation metric I used AUC; it is an effective way to assess the performance of the model. After training, the AUC is 0.83, which means that with probability 0.83 a randomly chosen positive example receives a higher predicted score than a randomly chosen negative one. The log loss of the model is 0.29. However, I will mainly use the predicted probabilities of the positive class, the output of the model: they will help to rank the results in the future experiments. The ranking metric, which shows the quality of the ranking of our images, is 0.77. One moment, sorry, one moment. Now we need some experiments with our data, and for the experiments the data needs another round of preprocessing. We set up three experiments: using only CLIP embeddings, using only BERT embeddings, and using our model with both CLIP and BERT embeddings. For the experiments, the dataset needs some preparation. I clustered it using the KMeans class from the scikit-learn library; this is the most popular clustering method. I separated the dataset into 1000 clusters according to the CLIP and BERT vectors and also obtained the cluster distances for each image. Then, using these data, I found the three nearest clusters for each image, and I split the dataset into pairs according to membership in those three clusters. For each pair I computed the cosine similarity and the Euclidean distance for the CLIP and BERT vectors. So I have a dataset with the following structure: the CLIP vector of the related image, the CLIP vector of the candidate image, the BERT vector of the related image, the BERT vector of the candidate image, the cosine similarity between the two CLIP vectors, the cosine similarity between the two BERT vectors, the Euclidean distance between the CLIP vectors, and the Euclidean distance between the BERT vectors. In each experiment we select, for a given image, a certain number of candidates ranked by higher cosine similarity and lower Euclidean distance of the CLIP vectors or BERT vectors, or by the predicted probability of the positive class for each image; in this case the number of candidates is equal to 5.
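Before moving to the experiment results, here is a minimal sketch of the training configuration quoted above, on synthetic stand-in data; the feature layout is hypothetical, and only the hyperparameter values come from the talk.

```python
import numpy as np
from catboost import CatBoostClassifier

# Synthetic stand-in for the real table (which has CLIP/BERT embedding features,
# four distance features, a binary target and a query id).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

model = CatBoostClassifier(
    iterations=2000,
    learning_rate=0.001,
    depth=6,
    loss_function="Logloss",
    eval_metric="AUC",
    scale_pos_weight=2.64,     # majority-to-minority class ratio quoted in the talk
    verbose=False,
)
model.fit(X[:800], y[:800], eval_set=(X[800:], y[800:]), early_stopping_rounds=20)

# Predicted probabilities of the positive class, later used to rank candidate images.
scores = model.predict_proba(X[800:])[:, 1]
print(scores[:5])
```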
For the first experiment, for the CatBoost model and the CLIP vectors the system selected the correct images without any mistakes, so the accuracy is equal to 1. For the BERT embeddings the system made mistakes, in particular in the first and the second images; its accuracy is 0.4. In the second experiment, the CatBoost model has a mistake in the last image, so the accuracy is 0.8; the CLIP model has mistakes in the fourth and the fifth images, accuracy 0.6; the BERT model has no correct images, so of course the accuracy is 0. In experiment number three, the CatBoost, CLIP and BERT models have no mistakes, so the accuracy is 1. In experiment number four, the CatBoost model has no mistakes, accuracy 1; the CLIP model has a mistake in the second image, accuracy 0.8; the BERT model has mistakes in the third, fourth and fifth images, accuracy 0.4. In experiment number five, for the CatBoost model and the CLIP vectors the system selected the correct images without any mistakes, so the accuracy is 1; for the BERT embeddings the system has a mistake in the third image, so its accuracy is 0.8. I also analyzed 2,025 images and obtained the following results: for the CatBoost model the average accuracy is more than 0.8, the accuracy of the CLIP vectors is less than 0.8, and the accuracy of the BERT vectors is less than 0.5. I also ran experiments on the Flickr image dataset. The Flickr dataset is simpler than the Yandex dataset, so in the first experiment all models have accuracy equal to 1, and in the second as well. In the third experiment there are no mistakes for the CatBoost model; for the CLIP vectors there is one mistake in the fourth image, so the accuracy is 0.8; for the BERT embeddings the system has mistakes in the third, fourth and fifth images, accuracy 0.4. I analyzed 50 images and got the following results: for the CatBoost model the average accuracy is more than 0.97, for the CLIP vectors 0.95, and for the BERT vectors it is less than 0.85. So this is a unique multimodal recommender system for images: the model is based not only on image features but also on text features, namely embeddings and metrics. However, there is room for improvement: additional training of the BERT model to build better vectors, looking for another NLP model, getting additional texts for the pictures, and finding additional features. The multimodal system needs to be improved constantly, there will be more research on these topics, and the system has a promising future. Thank you for your attention. Mikhail, thank you for your talk, extra points for making it short and precise. We have quite a few questions from the audience. Yes, let's start with you. Thank you for your talk. I just want to clarify: you mentioned that for the CatBoost algorithm your depth was six, and you mentioned that it was the optimal one. Can you please explain how you determined that it was optimal? Did you perform some grid search, or did you fix the other parameters and vary the depth only, or something else? Thank you. Could you please repeat your question? Yes, in simple words, why did you choose the depth to be six in CatBoost? A depth of six is the typical optimal depth for many decision tree algorithms: a larger depth is not needed, and a smaller depth does not perform well. Thank you. Another question? Thank you, Mikhail, thank you for the talk. If you could open the slides, the examples were... one moment... very interesting. It seems that, using the query, you find similar images. One moment. Yeah, with the pictures. So you consider this as the criterion for quality.
The more similar the retrieved image, the higher the accuracy here, right? Yes, of course: the more similar the image, the better; the first column is the related image. Yes. The question is... OK, I have two questions. Why is this actually a recommender system? To me it looks like a similarity search, looking for similar images or something like this. Can you comment on this? And the second question: in practice, if you recommend to users only the images that are similar to what they like, your system will get positive feedback and the quality will degrade. I mean, if you show nice cats to a user and the user likes them, the next time you show more nice cats, and after several iterations the user will get only nice cat pictures in his recommendations. How does your approach solve such a problem? Yeah, and the first one is just: why do you call it a recommender system if it is just a similarity search? As I said, this recommender system is based on CLIP and BERT embeddings, so it uses these features and recommends according to two main parts, the CLIP embeddings and the BERT embeddings. The future work on this recommender system is to train the BERT model further: the results show that the BERT model has lower accuracy than CLIP and the CatBoost model, and the BERT model will be improved. Additional training of the BERT model is the best way to improve this recommender system. Okay, I am not sure I got the answer, but okay; and what about the degradation of the model if, in practice, you recommend the same images to the user all the time? Degradation, yeah. As I said, the model has a good future, a promising future, when the model is retrained regularly together with the BERT model; and going forward it may be better to use a ChatGPT-like model instead of BERT, or a more complex model than BERT, of course. Yeah, okay, thank you very much. Any other questions from the audience? Yeah, maybe I would also like to comment on this work. This multimodal approach is kind of new territory, and one of the reviewers also had this question, why it is a recommender system and not just a similarity system. It is really hard to account for all the problems and complications of a newer approach, but I think we could all agree that multimodal approaches, like in this work, yield better results and should be considered when building parts of a recommender system. Yes, I guess we have no more questions from the audience, so let's thank the speaker again. Yeah, so, dear online participants, unfortunately one of our speakers is unable to deliver the report in English, given the dark and difficult times we are in, and unfortunately we also have some responsibilities as an international conference, in terms of staying international, so the next report will be off the record, but we will give the speaker a chance to disseminate the knowledge among our participants. So, once again, unfortunately, the next report will be in Russian only. We would like to thank our international guests. And just a second, we need to figure out how to handle the recording, if we could... You see the presentation, right? How soon can we start? Sorry, we are having some technical difficulties again.
It's an extreme situation to be in. Okay, that's not really possible. Okay, okay. The next report will be in Russian; sorry to all international colleagues who do not understand Russian. The report is called "Application of Dynamic Graph CNN and FICP for Detection and Research of Archaeology Sites", and Alexander will now present it to us. Thank you. Hello again, colleagues. I apologize for the situation in which I personally, and our whole research team, have found ourselves. I present a report on the application of dynamic graph convolutional neural networks and a combined iterative closest point algorithm to such applied tasks as the detection and study of archaeological monuments. I should say that the results we obtained are related to a Russian Science Foundation grant that our team received this year. Within this project we have to solve two scientific tasks: the first is to create methods for the detection of archaeological monuments, and the second group of methods is for the study of archaeological monuments. If we talk about the mathematical basis of these methods, they rely on machine learning, geophysics and cartography. It should be said that until recently archaeologists detected archaeological monuments manually; for example, they used aerial and satellite photography for this task. In the left part of the slide you can see an aerial photograph, and in the right part the result of its interpretation. In the last 15 years, remote detection methods have become widely used in the study of archaeological monuments, including geophysical and machine learning methods, and a lot of work is being done in this direction. I will mention only one study, conducted by a group of German archaeologists. They studied medieval settlements, which today are represented by burial mounds, and detected them using depth data obtained from a depth camera installed on unmanned aerial vehicles. Here you can see a visualization of these mounds. This work was interesting to us for the simple reason that the archaeological monuments located in the southern Urals and these mounds located in Germany have similar interpretation features. Let me say a little about what our monuments are. These are archaeological monuments on the territory of the southern Urals. Most of them belong to the Bronze Age and are associated with the migrations of the Indo-Europeans across the Urals and south-western Siberia, where they built a whole network of settlements, many of which belong to the so-called Sintashta culture. The most famous monument of this complex is the settlement of Arkaim; today about 20 such fortified settlements are known near Arkaim. In the 1980s Professor Zdanovich published a fundamental work in which he studied all these settlements, and it contains the interpretation features of these archaeological monuments that we used in our project. We work with 9 classes of interest, 5 of which are designated K1 to K5; these are classes of kurgans with different surface textures. These are burial grounds, and some of them belong to the Bronze Age.
Other burial grounds are connected with the later development of these territories. There are also settlements, fortified and unfortified, corresponding to the classes P1 and P2. On this slide some interpretation features are shown. In fact there are plenty of them, and many cannot be used in machine learning methods, such as the size of an object or the position of objects relative to each other; nevertheless, usable features exist. While the German archaeologists used only one data source in their work, we rely on six sources of information. The first source is aerial photography produced in the last century, before the economic development of these territories; it is very important because many of these archaeological monuments are still visible on those photographs, whereas after later land development these objects became much harder to see. The second source is satellite remote sensing data; we use imagery from the Sentinel, Landsat, Resurs-P and Kanopus-V satellites. We have a fairly large collection of such images covering a long period, several hundred photographs, and it is important that they have a high spatial resolution per pixel. There are also surveys made with telemetric sensors, in particular Trimble instruments, from which about 40 digital models describing these Bronze Age settlements of the southern Urals were built. There are other photographs which archaeologists obtained for these areas in 2006; they were also used in the project. In addition, there is a rather interesting source with which we began to work only this year: colleagues from the Institute of Geophysics came to our help and, using the AMC-14 instrument, performed a survey of three settlements, shown on the left and on the right. Also, this year, for two settlements we carried out a survey with a drone and a depth camera and obtained detailed three-dimensional models of these settlements. All data sources can be divided in principle into two types. For processing two-dimensional data we use residual neural networks, that is, ResNet-type networks; in general, here we follow the path of our German colleagues.
For the analysis of three-dimensional data a new method was proposed, which I present in this report; we call it DGCNN with a star, or MGCNN, a multimodal graph neural network. Methods for the classification and segmentation of three-dimensional data can be divided into two large groups, direct and indirect methods; I am primarily interested in direct methods, typical representatives of which are DGCNN and FGCNN. However, for the tasks we set, these methods are not quite suitable because of their drawbacks. The first drawback is a limitation on the point cloud size: these methods work well with industrial design models, where the cloud contains a few thousand up to ten thousand points; if the cloud becomes bigger, and ours are on the order of 13,000 to 22,000 points, these methods do not work very well. The second problem is the shape of the three-dimensional models: these methods work well for objects with pronounced shape, whereas archaeological monuments of the Bronze Age are located in the steppe or even in semi-desert, so visually these structures hardly stand out from the surrounding terrain. There is also the problem that these methods mainly use one or two modalities, and colour information is not used at all. To overcome these drawbacks we proposed a new architecture, which we called a dynamic multimodal graph neural network. Its input contains 12 features per point: the coordinates, the normals, the colour features, and in addition the normalized coordinates, so every point carries 12 features. I will add that, since the RGB channels are not independent features, in our work we also convert RGB to HSV, and this improves the quality of the segmentation. The essence of our approach is that we represent the point cloud block as a dynamic graph described by an adjacency matrix; this matrix is rebuilt in every convolutional layer of the network, and these layers operating on dynamic spatial graphs are the graph convolution layers. In addition, a metric classifier was added to the architecture, based on fully connected layers and a radial basis function classifier. On this slide we present the algorithm illustrating the main steps. I will add that an exact solution for the spectral filter exists, but high performance is also important, so we use a Chebyshev polynomial approximation of the spectral filter. The main processing cycle in our graph convolutional network is the following: construction of the adjacency matrix, normalization of this matrix, approximation of the graph signal by a Chebyshev polynomial, in this case of the third order, execution of the graph convolution and formation of the output; then there are some nuances of concatenating local and global features, after which the final output is generated. One more thing to note: another processing layer was added to the architecture, which increases the density of the point cloud. Why is this done? Because depth cameras, as a rule, produce quite noisy data, which affects the segmentation quality in the most negative way, so using a nearest-neighbour method we add three extra points around each point.
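As a rough, generic illustration of the filtering step just described (a k-nearest-neighbour graph, a normalized Laplacian, and a third-order Chebyshev approximation of the spectral filter), here is a NumPy sketch; it is not the authors' implementation, and the identity weight matrices stand in for learned parameters.

```python
import numpy as np
from scipy.spatial import cKDTree

def chebyshev_graph_filter(points, features, k=16, order=3):
    """One spectral filtering step on a point-cloud k-NN graph, approximated
    with Chebyshev polynomials of the rescaled graph Laplacian."""
    n = len(points)
    # 1. Dynamic graph: connect every point to its k nearest neighbours.
    _, idx = cKDTree(points).query(points, k=k + 1)
    A = np.zeros((n, n))
    for i, nbrs in enumerate(idx):
        A[i, nbrs[1:]] = 1.0                      # skip the point itself
    A = np.maximum(A, A.T)                        # symmetrise the adjacency matrix
    # 2. Normalised Laplacian L = I - D^{-1/2} A D^{-1/2}, rescaled to roughly [-1, 1].
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(n) - D_inv_sqrt @ A @ D_inv_sqrt
    L_hat = L - np.eye(n)                         # assumes lambda_max is close to 2
    # 3. Chebyshev recursion: T_0 = X, T_1 = L_hat X, T_m = 2 L_hat T_{m-1} - T_{m-2}.
    thetas = [np.eye(features.shape[1]) for _ in range(order + 1)]   # placeholder weights
    T_prev, T_curr = features, L_hat @ features
    out = T_prev @ thetas[0] + T_curr @ thetas[1]
    for m in range(2, order + 1):
        T_prev, T_curr = T_curr, 2.0 * L_hat @ T_curr - T_prev
        out += T_curr @ thetas[m]
    return out

pts = np.random.default_rng(0).normal(size=(200, 3))
feats = np.hstack([pts, np.ones((200, 9))])       # 12 features per point, as in the talk
print(chebyshev_graph_filter(pts, feats).shape)
```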
You can also see that our graph neural network has two outputs: the first output solves the classification of objects, that is, the global decision about which object is depicted, and the second output solves the segmentation of the data. Now a few words about our loss function. It is largely a multi-class cross-entropy, but a second auxiliary term is added which is associated with the smoothness of the signal on the graph; this term makes the predictions for neighbouring points in the cloud more similar. Let me also dwell on one point: these three-dimensional point clouds can be obtained in different ways. We can take a hemispherical depth sensor and, since the surface is fairly flat, sometimes obtain the whole model of the archaeological monument in a single shot; but if we use a LiDAR-type device this will not work, and to capture the monument we need several shots from different positions. The same is true when the surface is uneven or rocky. In such cases, to align the data from different views, computer vision solves the task of data registration, also known from simultaneous localization and mapping. We proposed a new algorithm, which we call the combined iterative closest point algorithm, and it was applied, among other things, within the research I am presenting, for the registration of these archaeological monuments. I will mention only the two main properties of this algorithm. Besides the point clouds themselves, it uses special feature points to solve two problems of the iterative closest point method. The first problem is the choice of the initial value of the rotation matrix and the translation vector: at the beginning of the registration we select these parameters from the visual data, and this works much better than choosing R and T in some empirical way. The second property, as you can see from the form of the functional, is that it contains two terms, that is, we solve the task jointly with respect to the visual features and the three-dimensional data. Experiments were conducted under so-called controlled and uncontrolled conditions; by uncontrolled conditions we mean various kinds of noise, such as sensor noise, uneven lighting and so on. We can see that our algorithm, marked with this line on the plot, has advantages over the known ICP registration methods with the point-to-point metric and the point-to-plane metric.
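For reference, here is a sketch of the classic point-to-point ICP step that the combined algorithm builds upon; the talk's contribution, choosing the initial rotation and translation from visual features and optimizing a joint two-term functional, is only hinted at here through the R0 and t0 arguments.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_point_to_point(src, dst, iters=30, R0=np.eye(3), t0=np.zeros(3)):
    """Align the source cloud to the destination cloud with point-to-point ICP.
    In the combined algorithm, R0 and t0 would come from visual feature matches."""
    R, t = R0.copy(), t0.copy()
    tree = cKDTree(dst)
    for _ in range(iters):
        moved = src @ R.T + t
        _, idx = tree.query(moved)                 # nearest-point correspondences
        matched = dst[idx]
        # Kabsch / SVD solution for the best rigid transform of this iteration.
        mu_s, mu_d = moved.mean(axis=0), matched.mean(axis=0)
        H = (moved - mu_s).T @ (matched - mu_d)
        U, _, Vt = np.linalg.svd(H)
        d = np.sign(np.linalg.det(Vt.T @ U.T))
        R_step = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
        t_step = mu_d - R_step @ mu_s
        R, t = R_step @ R, R_step @ t + t_step     # compose with the running estimate
    return R, t
```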
Now let's talk a little about computer modeling. Digital models obtained from drone surveys look like this. As I already said, our research has two main tasks. The first task is the detection of archaeological monuments with the help of a residual neural network, and it is important for us to determine the places where a monument may be, but not necessarily is; here, with this red highlighting, the neural network found a potential archaeological monument, and on another picture you see two settlements which are correctly highlighted. As for kurgans, of course, there may be many errors, as the results of the computer modeling will show. After such an object is detected, we can highlight the area of interest, and if a suitable object is present in this area of interest, various research methods can be applied to it, including geophysical methods. Alexander, I am sorry, we are running a little over time; tell me how many minutes you need. I will finish in 5 minutes. Thank you very much. This year we carried out field work: we went to the Verkhneuralsk area. I should say that, in general, 22 fortified settlements were known before this year, and this year our methods allowed us to discover two new fortified settlements, which we have now taken up for study. One of them is near the village of Verkhneuralsk; here are the data and a photograph of this site, and the digital terrain model obtained from the depth sensor looks like this. This is the archaeological monument, these are the corresponding aerial photographs of this monument, of different quality, taken in past years, in the 50s and 60s of the last century, and these are the results of the magnetometric survey. We also plan to use machine learning methods to work with these data, but for now this is the near-term perspective. You can see the distribution of the magnetic field in the surface layers, and here a little deeper, at a depth of a few metres. After this monument was discovered, we also applied the known methods, including our own method described above, for the segmentation of its area; we are interested in two types of classes, the presence of wells and of dwelling chambers. These are the segmentation results for this monument on this slide; here you can clearly see the point cloud and the gradations. Such a remote, non-invasive method of studying archaeological monuments allows archaeologists to determine, without excavation, the location of these features and to understand the internal structure of the monument: where, for example, a well is and where a dwelling chamber is, and so on. We also conducted some experiments in terms of comparative analysis. We compared both direct and indirect methods: as an indirect method we took a multi-view CNN, and as direct methods we took DGCNN and some combinations of DGCNN with a YOLO detector; we used two channels of information. As a result of these experiments we found that for some classes of kurgans, K1, K2 and K4, the results are quite impressive, and the method proposed here, shown in the lower part of the table, has advantages. But classes such as K3 and K5, the kurgans with specific surface textures, are definitely not detected well, neither by the known methods nor by ours, so there is still a lot of room for further work. On the next slide there are the results of detecting fortified and unfortified settlements, and we see that applying our method for the segmentation and classification of settlements gives good results which we could not obtain with the other methods. Well, that's all I have. Thank you for your attention. Once again, I apologize for the problem that arose: I planned to come to Armenia, I had already bought tickets, but I fell ill and ended up in the infectious diseases department of the hospital, where I am now. Questions? Let's discuss the report.
We are, of course, running a little over time and eating into the lunch break, but if there are questions from the audience in the hall, please. Yes, I have one quick question, I know what everyone wants to ask. Thank you for the interesting research. I am interested in the satellite data that you used. As far as I understand, most satellites image at various wavelengths, not only in the visible spectrum but in others as well. Please tell me, is this used in your work? Because as far as I can see, everything is in RGB, that is, everything is computed from the visible spectrum. Yes, I will answer now: we use all the channels, including the infrared ones; we only do not use the satellite data shot at night or in the evening, and such shots happen, though not often. And of course we do not take shots that contain clouds; since we did not try to remove the cloud cover, nothing good came of that. So we take essentially all available channels, but we exclude shots that contain atmospheric phenomena, cloudiness, which sometimes happens if the shooting time is unfortunate. I understand, thank you. Thank you very much. So I would like to thank everyone. Ah, yes, let's thank the presenter again. I would like again to thank everyone for participating in our section, both online and offline presenters. Thank you for your contribution, and please enjoy the rest of the conference. Cheers. Hello everyone, one more time. This session is on theoretical machine learning and optimization, and since, unfortunately, we do not have the session chair offline, let me start this session with my talk, which is scheduled first. My name is Dmitry Ignatov, I am from HSE University in Moscow, I work at the data analysis and artificial intelligence department, and I also chair a lab which does all sorts of machine learning. But this talk will be mostly about theoretical matters, I would say. I'm sorry, Dr. Ignatov. Aha. Do you hear me? Yes, yes. I am the chairman of this session. Yes, that's correct, but you are online. If you are able to introduce me in that manner, it is my pleasure to restart. Okay? Okay, colleagues. So we can start our session on theoretical machine learning and optimization methods. We have only three presentations in this session, so we can be a bit more flexible with our time limits, but I ask all the authors and speakers to keep the regular time limit of about 25 minutes, including questions and answers. Our first speaker is my old friend and colleague, Dr. Dmitry Ignatov, from the Higher School of Economics, Moscow. I think it is a very interesting theoretical talk about mathematical logic, maybe formal concepts; you can see the title. So, Professor Ignatov, the floor is yours, welcome, please. Dear Mikhail, thank you for the introduction. Let me start the talk. This talk will be about partition lattices and a specific problem, maximum antichains in partition lattices, their size and enumeration. I cannot say that these results are breakthrough results, but at least they add some bricks to the current state of the art on the problem. Since our referees asked for an applied introduction to the problem, I decided to include motivation from a data mining perspective and omitted some theoretical material. Here is the outline of my talk.
After that, I will go on to the motivation from a combinatorics perspective on the announced problems, and then we will go through partition lattices as concept lattices. Then we will consider the problem statement, actually three closely interrelated problems, and I will talk about our solutions to these problems, both theoretical and practical, meaning that we added some new numbers counting the respective patterns to the Online Encyclopedia of Integer Sequences. The motivation from the data mining perspective may be well familiar to you. If you go to the supermarket, you buy some items there, and some of them are bought frequently, by many other customers, on a daily basis. One famous collection of such product items was diapers and beer, and here you can see two researchers from the area, one of them giving a gift of diapers and beer to the developer of the best algorithm, in terms of, say, running time, able to find such patterns in data. What kind of data do we mean here? These are transaction data, and they can be represented as binary tables, in a transaction database format or in a vertical database format; in essence, we have transaction IDs, one to six, customers for example, or the same customer on different days, it doesn't matter, and five items, A, B, C, D, E; a one means that a particular transaction contains a particular item. Here is the same reformulation: we have transaction IDs and an operator I, which returns the set of items bought in a particular transaction; similarly, for vertical databases we have an operator T, which tells us in which transactions a particular item was bought. This is closely related to inverted indices in information retrieval. Let's have a look at the result of the search performed by one of the algorithms for finding such frequent patterns, such as Apriori. For example, we have item B, which was bought six times in this transaction database; we also have itemsets bought five times, so at least five customers bought B and E together; four times for this collection of itemsets; ABE, for example, was bought three times. The minimal threshold for the number of such purchases is called minimal support. The fundamental structure that lies behind this is the lattice of closed itemsets, or the concept lattice, as we call it in formal concept analysis; I will talk about it a bit later. Here we have some itemsets with the same support, like A, which was bought in four transactions, and ABE, which was bought in the same four transactions; such classes form a partition, they are equivalence classes. Each class has a unique representative, the closed itemset, which is maximal in its class: it cannot be extended without decreasing the support. There are also the so-called minimal generators, which are not necessarily unique; you may think of them as proper subsets of the closed sets which cannot be diminished further. What you can read in the textbook by Zaki and Meira on data mining is that the concept of closed itemsets is based on the elegant lattice-theoretic framework of formal concept analysis by Ganter and Wille, and we will use it as the main tool for enumeration later.
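As a toy illustration of closed itemsets, their supports and the two derivation operators just mentioned (the transaction database below is made up, not the one on the slides):

```python
from itertools import combinations

# Toy transaction database in horizontal form: tid -> set of items.
db = {1: {"A", "B", "E"}, 2: {"B", "C", "D"}, 3: {"A", "B", "E"},
      4: {"A", "B", "D", "E"}, 5: {"A", "B", "C", "E"}, 6: {"B", "D"}}
items = sorted(set().union(*db.values()))

def t(itemset):
    """Operator T: tids of the transactions that contain every item of the itemset."""
    return {tid for tid, tr in db.items() if set(itemset) <= tr}

def i(tids):
    """Operator I: items common to all the given transactions."""
    return set.intersection(*(db[tid] for tid in tids)) if tids else set(items)

# An itemset X is closed iff X = i(t(X)); its support is |t(X)|.
for size in range(1, len(items) + 1):
    for X in combinations(items, size):
        tids = t(X)
        if len(tids) >= 3 and set(X) == i(tids):   # minimal support of 3
            print(set(X), "support:", len(tids))
```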
I would also like to mention some related works where not only concept lattices are used but also partition lattices, and we can show that partition lattices can be represented as concept lattices. Partition lattices may be considered as search spaces for clusterings, if we solve the problem of partitioning or of community detection in social network analysis; they can also be used for granular computing, for building functional dependencies in relational databases, and even for a variation of binary data analysis called independence analysis. Here are some of the references. But let us go to the theoretical motivation, which dates back to the problem of Rota, published in the Journal of Combinatorial Theory. He states the problem as follows: it is well known that for a Boolean lattice the largest size of an antichain, that is, of a family of sets which are pairwise not subsets of each other, is given by the middle binomial coefficient, and he proposed to prove or disprove the following generalization of this theorem: whether for the partition lattice the size of the largest antichain coincides with the largest Stirling number of the second kind. That was his proposal. And since the name of Emanuel Sperner was mentioned, it is a good time to say that Sperner studied Boolean lattices and formulated a theorem which proves that the central binomial coefficient gives the size of the maximum antichain of sets in the Boolean lattice; here you can see two such antichains for the case of three-element sets, in red and in blue. In its turn, the problem studied by Sperner dates back to the problem posed by Richard Dedekind on the number of antichains in the Boolean lattice; the former dates to the end of the 19th century and the latter to the beginning of the 20th century. As for partition lattices, a lot had already been done by the end of the 1970s, and in the paper by Ronald Graham, for example, the co-author with Donald Knuth of the famous book Concrete Mathematics, he summarized the state of the art by that time. You can see the partition lattice on four elements from his paper. Here you can see, for example, the level antichains, where the sets of elements of a certain rank are shown; we will also use this terminology, in a slightly modified manner, later. What he told us in this paper is the following: at least for n less than or equal to 20, the size of the largest antichain coincides with the largest Stirling number of the second kind. Unfortunately, we do not know where a discrepancy arises, but such a discrepancy exists. I'm sorry, I don't know how to go back to full screen mode with the touch screen here; can you help me? Rodney Canfield showed that the largest antichain is actually not a level antichain, so it does not coincide with the longest level in such a lattice. According to Canfield, the discrepancy may arise for n which is very big, and Ron Graham says that we will never know where it happens. But at least Canfield and Harper, one of his and Graham's co-authors, showed the asymptotic size of such an antichain from below, as far as I remember together with Harper, and from above; that is the latest result by Canfield, but already in 1998.
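For reference, the two statements just discussed can be written as follows; this is the standard formulation in my own notation, not copied from the slides.

```latex
% Sperner's theorem for the Boolean lattice of subsets of an n-element set:
\max_{\mathcal{A}\ \text{antichain}} |\mathcal{A}| \;=\; \binom{n}{\lfloor n/2 \rfloor}.
% Rota's proposed generalisation for the partition lattice $\Pi_n$:
% is the largest antichain given by the largest Stirling number of the second kind,
\max_{\mathcal{A}\subseteq \Pi_n,\ \mathcal{A}\ \text{antichain}} |\mathcal{A}|
  \;\overset{?}{=}\; \max_{k} S(n,k),
% where $S(n,k)$ is the number of partitions of an n-element set into exactly k blocks.
```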
So the theorem says that the size of such a maximum antichain, divided by the maximum Stirling number of the second kind for the given n, lies between these two values, where the constant A is given as follows. And this paper contains the following statement: the symbols C1 and C2 denote positive real constants; it would be possible to find them explicitly, but distracting to replace them by explicit values. So "is Canfield right?" is the main question in the title of our paper; it simply tells us that we need to have a look at these numbers and find out what they are. We are also going to use representations of partition lattices as cross tables, or formal contexts, known in the applied branch of modern lattice theory. Here you can see a cross table representing this partition lattice, but instead of the second level, where the partitions of three elements should be given, we have only pairs of elements that are together in one block; this is done deliberately. Okay, what are the problems to address? First, what is the size of the largest antichain for a given n? The problem was solved by Canfield asymptotically, and a few of the first numbers are known; we counted them explicitly. As for the number of antichains and maximal antichains, we can also count them explicitly, not only asymptotically, although asymptotic estimates are possible as well. The first proposition tells us what the constants are: they are obtained from the first-order conditions for this function, which attains its extrema at the points given by these conditions, and the optimal value of the argument is not an integer if we compute it directly. And if a little bird would tell us where the discrepancy arises, we could refine the coefficients nicely; you may think of this little bird as an oracle. We can also bound these coefficients by inequalities using the first-order conditions. Moreover, since Canfield used a particular substitution for n in the original computations, we can use it as well, together with the principal branch of the Lambert W function, shown here as a graph, and refine these coefficients even for n greater than one; these are the coefficients C1 and C2 with a tilde. The proof is given here: we simply took the final expression from the paper by Canfield and Harper and made the corresponding substitutions using our knowledge about the maximum value, that is, the first-order condition; similarly for the upper bound, where we used the form of the function and the Lambert W function as well. Here B_n denotes the Bell number; it gives the size of the whole partition lattice, the number of all partitions for a given n. There are two remarks: we can recover these coefficients even for n greater than or equal to one, a case Canfield did not consider, because of the direct use of the Lambert W function; and similar propositions can be formulated for the zero-discrepancy interval, that is, up to n equal to 20, as Graham reported. The results of our direct computations with algorithms from formal concept analysis both confirmed the known values and added new values for the number of antichains and the number of maximal antichains in the partition lattice; the sequence we edited was accepted by the OEIS editors, including Neil Sloane. So we somewhat extended the state of the art and confirmed the known values. But what was the machinery?
We used this binary representation and the resulting concept lattice. The numbers here are just binary codes, and you may think of them as IDs of the nodes in the lattice. We order them according to this relation and build another lattice: this is the lattice of antichains. And if we simply remove the equality part of the relation, so that the diagonal of the table becomes empty, we obtain the lattice of maximal antichains in the partition lattice; this is the case for n equal to 3. For larger lattices it is a fast-growing sequence, and the computation takes a lot of time: here you can see, for example, that starting from milliseconds for the first few values of n, for n equal to 6 we have already spent more than a month, and our computation had not finished by that time; now we have some progress, but it is still not finished. Here are some plots, sorry for the small scale: we compare the number of maximal antichains in the Boolean lattice and in the partition lattice; for n equal to 6 the red dot is somewhere here, which is what we are trying to reach; maybe it is better to look at these figures in the paper. We also tried to formulate some inequalities bounding the number of antichains in the partition lattice from below and the number of maximal antichains from above. We used the level-wise partition of the partition lattice into antichains, which is why we have this quantity here, mainly; but we also used inter-level antichains, meaning that if we take different levels, the sets of elements of different ranks in the corresponding partition lattice, we can consider antichains combining elements from different levels and somewhat improve those bounds. Actually, the improvements are good only for the first few values. Here delta means the discrepancy between the true numbers of antichains and maximal antichains and our bounds L and L plus: for n from one to three the relative error is zero, but for four the best we could get is 0.32, and for five 0.57, so we need to count more elements. As an illustration, I would like to mention the case n equal to 4. Here you can see the binary representation of the corresponding partition lattice, or rather two levels of the original partition lattice. We consider this relation on its partitions, and we can count 30 patterns, 30 concepts, that is, maximal antichains; if we add two more, we obtain exactly the number of maximal antichains in the original lattice. Also, using the tool called Lattice Miner, we can build the concept lattice diagram, sometimes called a Hasse diagram; but Hasse was not the first to use such diagrams, so "line diagram" is a safer term here. This diagram can help us to extract all the other antichains, not only the maximal ones, by hand: we inspect the concepts and consider different bipartite graphs, say with three and one vertices in the two parts, extracted from the corresponding concepts of the corresponding levels. Here, for example, you can see that in one part there are two nodes and in the other part there are also two nodes, given as the corresponding partitions, and we can sum up all the counters. We should also work not only with the concepts represented in this lattice, which are the maximal bipartite subgraphs, but with all proper bipartite subgraphs, and count the types of antichains represented by bipartite graphs of the form K_{k,1}.
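Purely to make the counted objects tangible (the actual computations used the FCA-based machinery above, not brute force), here is a small exhaustive sketch that counts antichains and maximal antichains in the partition lattice for n = 4; counting conventions, such as whether the empty antichain is included, may differ from those used in the paper.

```python
from itertools import combinations

def set_partitions(elems):
    """All set partitions of a list, as tuples of frozensets."""
    if not elems:
        yield ()
        return
    first, rest = elems[0], elems[1:]
    for part in set_partitions(rest):
        for k in range(len(part)):                      # put `first` into an existing block
            yield part[:k] + (part[k] | {first},) + part[k + 1:]
        yield part + (frozenset({first}),)              # or into a new singleton block

def refines(p, q):
    """p <= q in the partition lattice: every block of p lies inside a block of q."""
    return all(any(b <= c for c in q) for b in p)

def comparable(p, q):
    return refines(p, q) or refines(q, p)

n = 4
P = [tuple(sorted(p, key=min)) for p in set_partitions(list(range(1, n + 1)))]

antichains, maximal = 0, 0
for size in range(len(P) + 1):
    for A in combinations(P, size):
        if all(not comparable(p, q) for p, q in combinations(A, 2)):
            antichains += 1
            # A is maximal if every partition outside A is comparable to some member of A.
            if A and all(any(comparable(p, a) for a in A) for p in P if p not in A):
                maximal += 1
print("n =", n, "antichains:", antichains, "maximal antichains:", maximal)
```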
But your time was over three minutes ago. Okay, the last slide here. When we sum everything up, we obtain 344 patterns, and we also need to include the partition where all elements are in one block and the partition where each element is in a separate block. And if you ask me about a single formula to compute it, you can find it here as well. Unfortunately, the time is over, and I am ready to answer questions. Thank you. Thank you very much. First of all, I would like to ask the host to turn on my video, because it is not easy for me to have no video of my own on Zoom, please. And second, I would like to ask the colleagues to ask one or maybe two questions to Dr. Ignatov about his very, very interesting talk. So, please. I'll open your video, it's not here. Even the microphone. It might not hear you because you didn't talk into the microphone. I'm sorry, your video is not on, we can't see you here. I'm sorry, I cannot switch on my video because the host has prohibited it. Oh, wait. Or maybe we should go to the participants list and try to enable it there. This is weird. The reason is that this is not the host computer. Okay. Okay, thank you. So, Dr. Ignatov, would you like to ask any other question? From my side, no questions, but we are expecting questions from the audience. Then I have a question; can I ask my question? Hello, do you hear me? Yes, yes, we can hear you. Very nice. So, I like your theoretical work, it is very interesting for me, but you may agree with me that your topic is somewhat borderline for this conference. So, please, could you explain in a little more detail the machine learning applications of your work here, in two words? Yeah. I would rather think about this topic and this direction not as theory for machine learning, but as machine learning, and I would even say data mining and formal concept analysis, for combinatorics. We can take algorithms from data mining for the enumeration of closed itemsets, compute these values, which are not yet known, at least in the OEIS, and extend our knowledge. Maybe someday a theoretician will come and give us a nice formula, but so far machine learning, and data mining, I would say, help us to find these numbers, to scrape them from the data. Thank you, thank you very much. Such a nice cooperation between the two fields, in favour of theoretical mathematics. Yes, and it is really my pleasure to contribute to this section, which you have chaired for many years, I believe, and thank you very much. Thank you very much. A very, very interesting presentation, thank you. So, there are some questions from the audience; I don't know, Mikhail, whether you can see them. Uh-huh, okay. I have a question. Your title includes a question: is Canfield right? So, to summarize, is Canfield right or not? Yeah, it is a slightly provocative question, meant to attract attention to this kind of problem, because if you check, there has been no progress since the beginning of the millennium: Canfield solved the problem theoretically, asymptotically, but these coefficients, should we know them, should we try to find them? It is a very good question, which can stimulate us. Canfield decided at that time that they were distracting, but we tried to find them out, and maybe, if we can prove theorems about the concrete n where the discrepancy arises, we can refine these coefficients; and at least we now know what the gap is, and this is nice; it might be like a competition, a new mathematician or practitioner will come and refine it in a more elaborate way. Okay.
Maxim Panov here would like to ask a question. Yeah, I also have, I don't know, probably a funny question, not really related to the topic. On one of your slides you were referring to some book or foundational paper with three authors, but you struck one out. Why? It is also a funny story. The original book on formal concept analysis, Mathematical Foundations, is authored by Bernhard Ganter, one of my supervisors on the German side, and by his supervisor, Rudolf Wille, who has passed away. It was then translated into English; the translator was Franzke, and some of the indexing systems list Franzke, but his contribution was actually not as an author but as a translator. We have discussed it many times in the community and decided not to list his name as an author, but at least I decided to give his name and strike it out. Sorry. So, colleagues, we should proceed with our program. Thank you, thank you very much, once again. And our second speaker will be Maxim Panov, with his presentation about distributed Bayesian coresets. You're welcome, Maxim, please. Yeah, thank you very much for the introduction, and it is my pleasure today to present this work. It was essentially a master's thesis, defended a year ago by my student, Vladimir Milusik. It would be more natural if he were presenting this work, but he is currently a PhD student at the University of Missouri, and it is feasible for him neither to travel here nor to speak online, because the timing is not very good. So I will be presenting partially on his behalf and partially on my own. Okay, the topic here, basically the inspiration for this topic, is the Bayesian approach to machine learning. We consider a standard probabilistic parametric model, where you have an input X, an output Y, and a certain probability distribution parametrized by some parameter vector theta; it might be linear regression, it might be a neural network, it might be whatever. Usually we consider the situation when we are given a dataset of pairs of X and Y, our historical observations, and then people write the likelihood function, which is basically the joint density, the joint probability, of the Ys. It is usually convenient to rewrite it via log-likelihood functions, so the total likelihood is the exponent of the sum of the log-likelihoods of the individual data points; I write it this way because this representation will be important for what follows. In the Bayesian approach you usually assume that there exists some prior distribution on the parameters, denoted p_0 on my slide, and then, given the likelihood and the prior, you can write the posterior. The posterior is given by this formula, the well-known Bayes formula: you have the product of the likelihood and the prior, and then a normalizing constant in the denominator, which is just an integral over theta. And if you think about the machine learning model behind this, what you are eventually interested in is making predictions; for this, in the Bayesian context, people usually employ the so-called posterior predictive distribution, where you average the likelihood at a new point over the posterior. In practice, of course, you usually cannot compute this integral explicitly, and people use sampling, a Monte Carlo approach.
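In the notation of the talk, the two central formulas just described are the following; this is my transcription of the standard expressions and should match the slides.

```latex
% Posterior over the parameters given the data D = \{(x_i, y_i)\}_{i=1}^n:
\pi(\theta \mid D) \;=\;
  \frac{\exp\!\big(\sum_{i=1}^{n} \log p(y_i \mid x_i, \theta)\big)\, p_0(\theta)}
       {\int \exp\!\big(\sum_{i=1}^{n} \log p(y_i \mid x_i, \theta)\big)\, p_0(\theta)\, d\theta}.
% Posterior predictive distribution at a new input x^*:
p(y^* \mid x^*, D) \;=\; \int p(y^* \mid x^*, \theta)\, \pi(\theta \mid D)\, d\theta .
```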
I should also mention that, because the posterior distribution is potentially a very rich object, you are not constrained to looking only at the expectation: you can look at the variance and other moments of this distribution, so you can also reason about uncertainty. That is why the Bayesian approach to machine learning is generally thought to be interesting: you not only make a point prediction, you can also do a certain uncertainty quantification. However, this formula is problematic for several reasons. The mainstream problem, I would say, which people consider is that the denominator is very hard to compute: the numerator is given, you have the prior and the likelihood, but in the denominator you have an integral, and this integral is usually hard to compute. Of course there are cases, like conjugate distributions, when it is easy to compute, but in the general case it is hard, and people consider various approaches to sample from this distribution or to approximate it; the most well known are Markov chain Monte Carlo and variational inference. However, if you consider modern applications where you have a lot of data, think of image classifiers trained on millions of examples, or of language models trained on trillions of tokens, there is another part which may become expensive to compute, and that is the likelihood itself. Why? Because the summation is very large: you have many, many summands, and then imagine you want to run Markov chain Monte Carlo on top of that. What is Markov chain Monte Carlo? You iteratively try different points until you get something which is accepted and becomes your generated sample, and this process requires evaluating the likelihood very many times; if n is a million or a billion or a trillion, this may be very hard to compute. That is why the research community considered an approach which is called coresets. Basically, you consider a weighted likelihood: you introduce weights w_i, and you want to construct these weights in such a way that, first, there are many zeros among the weights, so you reduce the summation, and second, the resulting posterior is close to the initial posterior. So you want to select some subset of points, possibly weighted, so the weights need not be zero or one, they may be something else, but you want many zeros, and you want a small subset of points which approximates your initial distribution well. If you succeed with this task, then you will be able to do your computations very quickly, provided the number of non-zero weights is small. Basically, a bit of the same notation again; I just want to mention that in the experiments later we almost never consider the supervised learning problem with which I motivated the approach; we mostly consider the problem of density estimation and sampling from this density, which is why I write just x here, but essentially it does not change the formula for the posterior. Well, there exists a quite established literature on the construction of coresets, including Bayesian coresets, and actually all the algorithms we are aware of follow more or less the same structure.
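In formulas, the coreset idea just described replaces the full posterior with a sparsely weighted one; again, this is the standard formulation in my notation.

```latex
% Weighted (coreset) posterior with at most K non-zero weights:
\pi_w(\theta) \;\propto\; \exp\!\Big(\sum_{i=1}^{n} w_i \log p(x_i \mid \theta)\Big)\, p_0(\theta),
\qquad w_i \ge 0, \quad \|w\|_0 \le K \ll n,
% with w chosen so that \pi_w stays close to the full posterior \pi.
```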
So what you do is introduce some distance between the local likelihood functions. One possible distance might be the expectation, over the posterior, of the difference of local likelihood functions. Then you want to find the weights which minimize this distance between the full posterior and your weighted one, under the constraint that you have no more than K non-zero weights. So that is a certain optimization problem. Depending on the likelihoods, it can have different complexity, but that is the general approach. One useful notion introduced in the literature, which helps a lot, is the so-called notion of sensitivity. Basically you look at the ratio between the likelihood of one point and that of all the points together. If this ratio is high, then this point is probably important and you should include it; if this ratio is low, the point is not important and you exclude it. But of course you have a free parameter theta here, so you need to do something with it. You can say, okay, my definition will be via a supremum, so the worst case or the best case, depending on how you look at it: what is your contribution to this likelihood? This notion is called sensitivity, and the majority of works consider different algorithms based on these sensitivities. Of course, it is a big question how you compute them; I will not touch on that in this talk, but you need to do some approximations. Basically the works which currently exist have two main flavours. In the first one you do iterative methods which gather the coreset step by step: you choose one point, you choose another point, you choose a third point, and so on. Or you do sampling. If you do sampling, it is actually pretty easy: you sample random points with probabilities proportional to the sensitivities, and then you get something. In sequential approaches, you try to choose the point which improves your current score the most, and you do it one by one. The main difference between the iterative methods is whether you try to solve the problem with this constraint exactly, or you try to convexify the constraint somehow. Okay. The question which we asked in this research is the following. The iterative approaches in fact usually give pretty good performance in practice, but they might be very slow, because you choose points one by one. You have huge data; your final coreset will be much smaller than the whole data but still pretty large, so the process can be slow. Can we parallelize it? In the literature there were some works doing this parallelization for non-Bayesian coresets, which considered just approximation of the likelihood without any posterior, and we wanted to do something similar for the Bayesian approach. Basically the idea of this work is pretty simple. What we want to do is sampling: we have the data and we will be sampling points from it based on the likelihood values. We take some, say, maximum likelihood or maximum a posteriori estimate of the parameters, plug it in, and sample points without replacement, proportionally to the ratio which we obtain for this fitted maximum likelihood estimate.
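A minimal sketch of the sampling step just described, assuming the per-point log-likelihoods at a fitted estimate theta_hat are already available; the function name and the use of the likelihood contribution as a stand-in for the sensitivity are my own simplifications, not the paper's exact procedure.

```python
import numpy as np

def sample_coreset_indices(log_lik_at_theta_hat, k, seed=None):
    """Sample k data indices without replacement, with probabilities
    proportional to each point's likelihood contribution at a fitted
    estimate theta_hat (a crude proxy for the sensitivity)."""
    rng = np.random.default_rng(seed)
    # Shift before exponentiating for numerical stability; the shift
    # cancels out after normalization.
    scores = np.exp(log_lik_at_theta_hat - np.max(log_lik_at_theta_hat))
    probs = scores / scores.sum()
    return rng.choice(len(probs), size=k, replace=False, p=probs)

# Usage sketch:
# log_lik = np.array([...])  # log p(x_i | theta_hat) for every data point i
# idx = sample_coreset_indices(log_lik, k=100)
```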
And basically we then expect that, since we do distributed computation, you sample several points for the first computer, then several points for the second, and so on, and hopefully they will be sampled in a way that they sort of form clusters. First you will probably sample the most probable points, then for the second worker the less probable points will go, and so on. So, once again: we have several workers, or processors, or computers. We do coreset selection on each of them separately. First we sample points, then we do coreset selection for them separately, and then we merge the resulting coresets. Each of them has a size of K divided by the number of workers, where K is the coreset size. So, what are the results? In this work we considered pretty simple examples. We start from just density estimation on Gaussian data, which is very simplistic. First let me explain which algorithms are compared. We have three algorithms. One is the baseline iterative approach, called, I do not know why, sequential here, but it is a synonym. Then we have two distributed approaches: one is based on a random split and the other one is based on our, well, we call it the machine learning split, our sampling proportional to the sensitivities. We present two plots. On the X axis we increase the size of the coreset, and on the Y axis we compute the KL divergence, so the distance; the smaller, the better. On the second plot we report the time, measured in seconds here. Basically what we see is that for the multivariate Gaussian example, a very simple problem of course, the sequential approach works better. Our distributed approaches, distribute and merge, work worse; however, as expected, they work much faster. Here I think we had seven or eight cores working in parallel. Then more interesting things start to happen. If we look at the Gaussian mixture, first the univariate Gaussian mixture, then the distributed methods already start to beat the sequential approach. Why? Because the sequential approach is greedy. What I expect, although we probably need to do a little more in-depth analysis, is that different cores successfully and pretty accurately approximate different modes of our mixture. Each of them becomes good at one mode, while the sequential one is jumping from mode to mode and something weird happens. And here we already see a certain improvement of the ML split over the random split. Computationally, we see a big difference in speed compared to the sequential one. Then, for the multivariate Gaussian mixture, it is more of the same story: a pretty good gap between the sequential approach and the ML and distributed approaches, and a certain benefit from the better sampling. And probably the final plot which I wanted to show is a bit different, it is already supervised learning. There was some classification problem, and what we do is report the resulting accuracy of classification. You build a coreset and then you use some Bayesian model for classification based on this coreset, right? So instead of the full likelihood you use the weighted likelihood. What we report here is not a KL divergence anymore but our downstream quality, the accuracy. What is nice is that at least the accuracy improves with the coreset size, so we do get more information.
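A hedged sketch of the distribute-and-merge scheme described above; `select_coreset_on_worker` is only a placeholder for whichever single-worker coreset routine is actually used, and the uniform-weight merge is an assumption of mine rather than a detail taken from the talk.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def select_coreset_on_worker(x_chunk, k_local, seed=0):
    """Placeholder per-worker coreset step: subsample k_local points and
    give them equal weights that preserve the chunk's total mass."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(x_chunk), size=min(k_local, len(x_chunk)), replace=False)
    weights = np.full(len(idx), len(x_chunk) / len(idx))
    return x_chunk[idx], weights

def distributed_coreset(x, k_total, n_workers):
    """Split the data across workers, build a coreset of size
    k_total / n_workers on each worker, then merge the per-worker coresets."""
    chunks = np.array_split(x, n_workers)
    k_local = k_total // n_workers
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(select_coreset_on_worker, chunks, [k_local] * n_workers))
    points = np.concatenate([p[0] for p in parts])
    weights = np.concatenate([p[1] for p in parts])
    return points, weights

# When run as a script, call distributed_coreset(...) under
# `if __name__ == "__main__":` so the process pool can start cleanly.
```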
We also see that, interestingly, the distributed approaches do somewhat better than the sequential one, so again we actually benefit from distribution not only in terms of time but also in terms of quality. And probably the final plot I wanted to show is the dependence on the number of processors used. Generally, what we see is that you can get an improvement when you increase the number of processors. There are some fluctuations here, I am not sure why, but generally we see a trend towards improvement. You also see the improvement in time: the more processors you have, the faster your computation, but apparently with diminishing returns. So it is not a linear decrease in time; everyone would be happy if it were linear, but in fact it is not, so there is a certain overhead. Well, to summarize: as you have seen, we have really only scratched the surface. However, my motivation to start this work was actually to use these Bayesian coresets for real computations, for example with neural networks. Because currently, for Bayesian neural networks, you can find many papers at top conferences, but eventually you see that there is little to no benefit from their usage, because of the size: usually you cannot apply them to really meaningful data sets. You take almost any Bayesian neural network paper and you end up classifying MNIST, not anything like ImageNet or something like that. Why? Well, you would need to do MCMC or variational inference on hundreds of thousands of parameters and millions of examples, and that just does not work. So I think that here we need a kind of combined approach: we need to use coresets, we need to use parallelization, and we need some advanced variational inference or MCMC algorithms. This is a very small step in that direction. So I think I am done, and thank you for your attention. Thank you very much. Colleagues, please ask, sorry, thank you very much. Dear colleagues, please ask your questions to the speaker. Thank you for the interesting talk. In the beginning you were talking about the prior distribution of the parameters. How do you choose this distribution? Yeah, thank you very much. That is actually the question that, I think, everyone not from the Bayesian community asks all the time, and there is no final answer to it. There exist different approaches. One of the approaches is that you try to use a distribution which disturbs your likelihood as little as possible. I forgot the name in English, a non-informative prior, but basically something which is very flat, something which does not pull you anywhere. I think it is a reasonable approach for large-scale applications when you actually do not have any idea. However, I should also mention that for modern models like neural networks, the question of what a good prior should be is generally very, very open. Why? Because putting a prior on each of, say, a hundred thousand weights, some Gaussian, does it make much sense? It is not very clear. The final point here is that in some smaller applications, people sometimes have domain knowledge. If you have, I do not know, a linear or logistic regression, then the practitioners might have some idea of what the weight for a particular factor should be. Then you can use some Gaussian around this value, or, I do not know, a uniform distribution on some segment. Any more questions?
Maybe one question on terminology. I like the term coreset, but do you know anything about its origin? Why this word? That is a very good question. Actually, I do not remember the paper where it was introduced, so I will need to look. This area is relatively rich and there are more general definitions of coresets. For example, here I used only weights, but sometimes people want to have synthetic examples, so you start to tune your X so that you can approximate the whole likelihood with a few points. Generally it is quite a broad topic. Well, I will do my research and send you where it originated from. Excuse me, colleagues, can I say a few words in answer to that question? Go ahead, please. It seems to me that this concept of a coreset was initially introduced in computational geometry, in the context of the Bronnimann-Goodrich approximation algorithms for the so-called hitting set problem. It is a very, very interesting combinatorial optimization and computational geometry problem. So it was very interesting for me, as a specialist in combinatorial optimization, to listen to your presentation, because it is one more proof of the deep cooperation between machine learning and combinatorial optimization. Thank you very much, very interesting. Thank you, Mikhail. It seems there are no more questions in the audience. Okay, thank you again. And now we, sorry, I think we have an online talk, right? Yes, yes. We can proceed with our final talk, the last one on the list, by Professor Eduard Gimadi and Alexander Shtepa: the problem of finding several given-diameter spanning trees of maximum total weight in a complete graph. It seems to me that Alexander will be the speaker. Alexander, over to you. Yes, can you hear me? Yes. Okay. So you're welcome, please share your screen. Okay. Do you see it? Yes, yes. Okay.

Hello. My name is Alexander Shtepa, and I want to present our work with Eduard Gimadi about the problem of finding several given-diameter spanning trees of maximum total weight in a complete graph. First of all, let us formulate the problem. We have an arbitrary edge-weighted complete undirected graph and positive integers m and d which satisfy the following inequality, and we want to find m edge-disjoint spanning trees T1, ..., Tm of maximum total weight of the edges in these trees, with the diameter of each tree equal to d. Let us recall that the diameter of a tree is the maximum number of edges on a path within the tree connecting a pair of vertices. This work is based on two of our earlier works, published in 2022 and 2023. In the work from 2022 we consider one maximum-weight spanning tree with given diameter, and in the work from 2023 we consider several edge-disjoint spanning trees, but for the minimization problem. We want to reduce our problem with several maximum-weight spanning trees to the minimization case, and this is done using a statement from that work. We have two graphs, G and G prime. In graph G there is a weight function w(e), and in graph G prime a weight function w prime(e). If there is a tree with a certain total weight calculated with w prime, then there is a tree with the corresponding weight in the original graph G, provided the weight functions are linked by this relation. So we apply the algorithm from the work of 2023, where we solve the minimization problem. We have the input and output of this algorithm and some steps; let me describe them with an example. In that work, the feasibility of the algorithm and its time complexity were proved.
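The statement linking the two weight functions is only gestured at in the recording; one standard way such a reduction can be set up is sketched below. The specific transformation w'(e) = b - w(e) is my assumption, and the paper's exact relation may differ in constants.

```latex
% Assume the edge weights satisfy w(e) \in [a, b].  Define complementary weights
w'(e) \;=\; b - w(e) \qquad \text{for every edge } e.
% Every spanning tree has exactly n - 1 edges, so m edge-disjoint spanning trees T_1, \dots, T_m satisfy
\sum_{j=1}^{m} W'(T_j) \;=\; m\,(n-1)\,b \;-\; \sum_{j=1}^{m} W(T_j),
% hence minimizing the total w'-weight over the feasible trees is equivalent to maximizing the total w-weight.
```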
So we have the initial complete graph, and I want to describe the steps of the algorithm. We have two spanning trees which we want to construct, and the diameter of each tree is equal to 5. We choose two subsets V1 and V2 with d plus 1 vertices each, and all other vertices we put into V prime. On the first step we construct Hamiltonian paths using the heuristic "go to the nearest unvisited vertex". Of course this is not an exact solution, just an approximate one. Then we divide each path into two halves and connect the first half of the first path with the first half of the second path by the shortest edge between inner vertices, and the second half of the first path with the second half of the second path, again by the shortest edge between inner vertices. So we do it in a parallel manner. And then we connect these paths in a crosswise manner: we connect the first half of the first path with the second half of the second path, and the second half of the first path with the first half of the second path. This is done to avoid a possible dependency of the random variables during the work of the algorithm: if we connected them in another way, we could consider one edge twice, and that would break the independence property of the random variables. On the third step we add the vertices from V prime and connect them, again by the shortest edge, to inner vertices of the corresponding trees. Of course we connect to inner vertices because we do not want to increase the diameter of the constructed tree: if you connect to inner vertices, the maximum distance between vertices will still be realized along the path. But this was the description of the algorithm for the minimization problem. We want to solve the maximization problem, and this is done by changing the weight function of graph G to the weight function w prime, obtaining graph G prime, and applying algorithm A prime to the graph G prime. So on step two we solve the minimization problem, and the constructed spanning trees T1, ..., Tm are a solution of the maximization problem. Again, in this work we prove the feasibility of the algorithm and its time complexity: since the change of the weight function can be done in O(n squared) time and, as mentioned previously, the second step is performed in O(n squared) time, the total time complexity of algorithm A is O(n squared). There are some notations that must be introduced. By F_A(I) and OPT(I) we denote, respectively, the approximate value obtained by some approximation algorithm A and the optimal value of the objective function of the problem on input I. We say that algorithm A has performance guarantees (epsilon, delta) if the following inequality holds, where epsilon is an estimate of the relative error and delta is the failure probability, equal to the proportion of cases when the algorithm does not achieve the relative error epsilon or does not produce any answer at all. And we say that an approximation algorithm is asymptotically optimal on a class of input data if epsilon and delta tend to zero as n tends to infinity. It is quite a common definition for probabilistic analysis, and we want to prove that algorithm A is asymptotically optimal, which means that epsilon and delta go to zero as n goes to infinity. There are some equalities which are used in the probabilistic analysis. We denote by X_k the random variable equal to the minimum of k independent random variables uniformly distributed on the interval [0, 1], and let W' of A prime be the total weight of the trees T1, ..., Tm constructed by algorithm A prime.
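A minimal sketch of the "go to the nearest unvisited vertex" heuristic used on the first step, assuming the complete graph is given as a symmetric matrix of edge weights; this covers only the path construction, not the later splitting and joining of halves.

```python
import numpy as np

def nearest_neighbor_path(weights, start=0):
    """Greedy Hamiltonian path on a complete graph: repeatedly move to
    the cheapest unvisited vertex.  `weights` is an n x n symmetric
    matrix of edge weights."""
    n = weights.shape[0]
    visited = np.zeros(n, dtype=bool)
    visited[start] = True
    path = [start]
    current = start
    while len(path) < n:
        candidate = np.where(visited, np.inf, weights[current])  # mask visited vertices
        current = int(np.argmin(candidate))
        visited[current] = True
        path.append(current)
    return path

# Usage sketch: run this on each vertex subset V_i of size d + 1 (restricted to
# the rows and columns of that subset), then split each path into halves and
# join the paths crosswise as described above.
```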
It is clear that W' of A prime is equal to the sum of the weights added on the first, second and third steps of algorithm A prime. On the first step we just use the probabilistic "go to the nearest unvisited vertex" heuristic, so we have d possibilities, because the set contains d plus 1 vertices, then d minus 1, and so on down to k equal to 1, where we take the minimum over k edge weights, and this is repeated m times. On the second step we consider all the pairs of paths and construct the edges connecting them, so we have a multiplier C(m, 2), and then the possibilities for the connecting edges and vertices: the first summand is for the paths and the second one is for the end vertices. On the third step we repeat m times the connection of each of the n minus m(d plus 1) vertices from the set V prime, connecting them by the shortest edge to inner vertices, and there are d minus 1 inner vertices in each path. According to the statement mentioned on the second slide, we obtain an equality which connects the weight of the result of algorithm A with the weight of the result of algorithm A prime. We then prove the first lemma, which gives the epsilon and delta for our algorithm A for the maximization problem. The crucial statement for our analysis is a theorem by Petrov, which considers independent random variables X1, ..., Xn and introduces constants T, h1, ..., hn satisfying the following inequality; if we set S equal to the sum of the Xk and H to the sum of the hk, we obtain a probabilistic inequality which helps to carry out our probabilistic analysis. There are also some lemmas from our work of 2023. In the first lemma we prove that the condition of Petrov's theorem holds, in the second lemma we bound H from above, and in the third lemma we construct an upper bound for the mathematical expectation of the weight obtained by algorithm A prime; so we reduce the maximization problem to the minimization problem and use some results for the minimization problem. The main result of our work is that if d is greater than or equal to a logarithm of n, then we get the following failure probability and relative error. As you can see, the failure probability immediately tends to 0 as n goes to infinity, but for epsilon one must take into account that d also goes to infinity as n goes to infinity, as in the case of d greater than or equal to the logarithm. It must be noted that similar results are obtained for the uniform distribution on an interval [a_n, b_n]: one can always reduce the problem with arbitrary a_n and b_n satisfying these inequalities to the problem with normalized random variables distributed on the interval from 0 to 1. And, in contrast to the minimization problem, there is no need to impose additional conditions on the scatter of the weights, as was done in the work of 2023. As a conclusion, we can say that we have generalized the results of the 2022 work on one maximum spanning tree with given diameter to several disjoint spanning trees, using the algorithm from that work, which has time complexity O(n squared), applied to our case with the modified weight function. And for the continuous uniform distribution of edge weights on the interval [0, 1] we get this result, and the analogous result follows for the continuous uniform distribution of edge weights on the interval [a_n, b_n], as I said previously. It would also be interesting to investigate this problem for discrete distributions. Thank you for your attention. Thank you very much. So, who would like to ask some questions to Alexander?
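The Petrov inequality itself is not shown in the recording; the version usually cited in this line of probabilistic analysis is, as far as I recall it, the following, so treat the exact constants as an assumption rather than a quotation.

```latex
% Let X_1, \dots, X_n be independent random variables and S = \sum_k X_k.
% Suppose there exist constants T > 0 and h_1, \dots, h_n > 0 such that, for k = 1, \dots, n,
\mathbb{E}\,e^{t X_k} \;\le\; e^{h_k t^2 / 2} \qquad \text{for all } 0 \le t \le T .
% Put H = \sum_k h_k.  Then
\Pr\{S \ge x\} \;\le\;
\begin{cases}
\exp\!\big(-x^{2}/(2H)\big), & 0 \le x \le HT,\\
\exp\!\big(-Tx/2\big), & x \ge HT .
\end{cases}
```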
It seems there are no questions. In that case, I would like to ask a small question myself. Yes, yes. So, Alexander, your brilliant work seems to me another nice example of the cooperation between machine learning and combinatorial optimization, and this field, developed by Professor Gimadi, seems to me very, very interesting in view of our millennium problem about P and NP. So that is very interesting. And the question is as follows. In your proof you use Petrov's theorem, which appears to be a classic measure concentration result. Since that result, there have been many more, stronger measure concentration results. Would you like to incorporate them into your framework? It would be very interesting. I am not a professional in this domain, but I think we should indeed look at other approaches. So thank you very much. Thank you. Thank you. Any other questions, please? Oh, it seems to me that everything is clear. So we should thank our presenter once again. Thank you very much. Thank you. And, dear colleagues, unfortunately I should now close our session. Thank you all for the very interesting presentations and for your attention. Thank you very much. Thank you, Mikhail. Thank you. Dear colleagues, I have just been reminded that the parallel session is still going on, so we can go there. And also, in one hour there will be a poster session with some interesting posters, so please attend. Thank you. Thank you very much. Thank you.