Okay, good morning, dear colleagues. I suggest we slowly start. We are very happy to see so many people relatively early in the morning; I guess many of you had a great evening yesterday. It is our great pleasure to have our third keynote speaker, Professor Samuel Horváth. Samuel is currently working at the Mohamed bin Zayed University of Artificial Intelligence in Abu Dhabi, and before that he did his PhD at KAUST in Saudi Arabia. Sam is a very well-known expert on distributed and federated learning, and he will be talking exactly about some aspects of these topics today. So Sam, go ahead.

Thank you, Maxine, for the very generous introduction, and thanks to all the organizers for inviting me. What I want to talk about today is federated learning, and let me briefly outline my talk. First, I will try to give you a gentle introduction to what we mean by federated learning and how it developed. Then I am going to discuss several practical and research challenges that need to be addressed in order to deploy federated learning in real-world scenarios. Finally, I will talk a bit about our recent work on decomposable models, which are designed to address some of these challenges.

Let me start with the motivation for federated learning. In traditional machine learning, you would have a single machine, put all your data there, and train your model. As the field progressed, models got bigger and required more and more data, so we moved from local solutions to cloud-based solutions. The issue with this ever-increasing data collection is that the majority of the data comes from clients, and collecting it can have negative privacy implications. On top of that, many countries have privacy regulations, such as GDPR in the European Union or CCPA in the US, that essentially restrict direct data collection. With this in mind, if we want to make further progress in machine learning by collecting ever more data, the standard centralized approach may simply not be feasible: by not respecting those privacy regulations, we can lose access to the data almost entirely.

This is where federated learning comes to the rescue, by bringing training to the edge, to the clients that own the data. The main premise of federated learning is the following. We assume there is an orchestrator, which you can think of as a central server, that coordinates the whole training process. It asks the clients to compute updates to the model; those updates are supposed to be very focused and are intended for immediate aggregation, so the orchestrator only ever sees aggregated information, in order to prevent data leakage. Federated learning, in its basic definition, therefore gives us at least the hope of training on these large decentralized datasets.
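To make this premise concrete, here is a minimal sketch of a federated averaging (FedAvg-style) round. The client objects, with a `local_update` method and a `num_examples` attribute, and the flat weight vector are assumed placeholders for illustration, not part of the talk.

```python
# Minimal sketch of one federated averaging (FedAvg) round.
# Assumes the global model is a flat numpy vector and each client exposes
# `num_examples` and a `local_update(weights)` method returning its locally
# trained weights.
import numpy as np

def fedavg_round(global_weights, clients):
    """Server sends the model out, clients train locally, and the server
    only ever sees the (weighted) aggregate of the returned updates."""
    total = sum(c.num_examples for c in clients)
    aggregate = np.zeros_like(global_weights)
    for c in clients:
        local_weights = c.local_update(global_weights.copy())
        aggregate += (c.num_examples / total) * local_weights
    return aggregate  # new global model; raw client data never leaves the devices
```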
On top of that, there is an increasing demand for what is called the data locality paradigm, meaning that data should be processed where it was collected. And there are several recent studies showing that, if you care about the carbon footprint of your model and you design your federated learning algorithm well, you can actually end up with a lower carbon footprint than with standard centralized learning.

Some of the success stories of federated learning, where it is already commercially applied, come from the big industry players. For instance, Apple uses it for "Hey Siri" and QuickType, and Google does the same for "Hey Google" and Gboard. Where we see federated learning as the next game changer is in essentially all the applications that historically never had much data, the reason being very strong privacy constraints. These might be smart health applications; there are already several startups working on that, such as doc.ai or Owkin. There is also a lot of interest in fintech applications, where privacy is key, and several banks are already working on decentralized fraud detection, for instance WeBank. In fact, you can think of the applications of federated learning as any machine learning task you would like to solve that requires collecting a lot of private data.

One reason federated learning is getting more and more widespread is the availability of several open-source frameworks. Among the most popular you have Flower and FedML; Meta has its own federated learning simulator, and the same goes for Microsoft. Also quite popular is NVIDIA FLARE, which is an SDK for federated learning. Let me also do a bit of advertisement: if you are looking for resources to learn more about recent research on federated learning, we have been running, for roughly three years now, ever since COVID, an online seminar with more than 100 talks on different aspects of federated learning, all available on YouTube. If you are interested, just google "FLOW seminar", register, and you will have access to all the talks. Other great resources are some recent review articles, and there is even a recent book on federated learning as well.

All right, so when we talk about federated learning, there are two main settings that we usually consider. The first is called cross-silo federated learning. In cross-silo federated learning we have different organizations, which you can think of as big institutions that want to collaborate. The example I have here is hospitals that hold patient records: they would like to train some smart health assistant based on all of these records, but the data are private, so what you care about here the most is the privacy of the data. Here we usually assume that the number of institutions trying to collaborate is relatively small, and the main concern is privacy. The setting that I will consider more in this talk, however, is cross-device federated learning.
In cross-device federated learning, the clients are devices, you can think of them as mobile phones or various IoT devices, that want to collaborate to solve some machine learning problem, for instance a recommendation problem, and the whole process is orchestrated by a central server, which you can think of as the service provider. The goal is to train the model in this federated way, where we only communicate very focused updates intended for immediate aggregation. Another thing to respect here is that training happens entirely on those devices, and the model we eventually obtain is going to be deployed back to those devices. One of the main challenges here is the various forms of heterogeneity that come with the clients, and also the fact that the number of clients, think of the number of phones, can be in the millions or hundreds of millions.

That brings me to the second part of the talk, where I would like to discuss the challenges that, if we can address them, would make federated learning practical. The first, and one of the most prevalent, is the communication bottleneck. The example I list here is from distributed training: those of you familiar with it know that simply adding more and more GPUs does not necessarily lead to perfect linear scaling, because at some point you hit the fact that communication is much slower than computation. In the example here, if you run, let's say, DeepLight, which is a very communication-heavy model, on 8 P100s with a still relatively fast network, you end up with more than 90% of your training time spent on communication. In many applications this is the major limiting factor.

If we move to federated learning, this issue is even more pronounced, because now the clients are not sitting in a single data center: they are all connected through wireless links or other end-user internet connections, which are even slower than data-center links. On top of that, you operate in a system with a very large number of clients, so even the capacity to aggregate the updates becomes a bottleneck. And what we see in real-world applications is that even the model download, when the clients download the model from the master, the orchestrator, is still somewhat slow, but the upload is the key limitation of the system.

Thankfully, there are several remedies one can employ. One is communication compression: the updates from the clients are compressed before they are sent back for aggregation, and there is a nice line of research showing that if you design your compression well, you can even add more privacy into the system.
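One widely used compressor of this kind is top-k sparsification, sketched below as an illustration; the talk does not prescribe this particular scheme, and the function names are placeholders.

```python
import numpy as np

def topk_compress(update, k):
    """Keep only the k largest-magnitude coordinates of a client update.
    The client then sends (indices, values) instead of the dense vector."""
    idx = np.argpartition(np.abs(update), -k)[-k:]
    return idx, update[idx]

def topk_decompress(idx, values, dim):
    """Server-side: scatter the sparse update back into a dense vector
    before aggregating it with the other clients' contributions."""
    dense = np.zeros(dim)
    dense[idx] = values
    return dense
```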
Another possibility is not to communicate very often: you define a local problem that is somewhat harder than just producing a single local update. By local update, you can think of just computing the gradient with respect to your local data; instead, you do more local work, so that you reduce the amount of communication. Or you can try, in some smart way, to limit the number of devices that communicate, where "smart" means figuring out which clients might have more important updates than others, while still respecting the privacy of the clients.

Another issue, and something we will look at in more detail later, is system heterogeneity. When you train in this federated network, many of the clients might be unreliable or very heterogeneous, and this vast heterogeneity is one of the key challenges when you want to deploy and train these models in the real world. Remedies you can incorporate here are algorithms that are straggler-resilient, meaning that clients that fall behind the training schedule are still allowed to update the global model later. Asynchronous updates are also very popular, but the challenge with asynchronous updates is that it is usually much harder to guarantee privacy. And when devices are not able to compute, or even to store or work with the model, you can simply drop them.

Another challenge that is quite prevalent in federated learning, and not very standard in centralized learning, is client availability. In cross-device federated learning, the standard consideration is that, in order not to degrade the user experience, a device only participates in training when it is connected to a fast network, connected to a charger, and when a large number of other devices are connected at the same time, because we want to hide your updates so that your privacy does not suffer. This creates quite non-trivial constraints and a lot of variation that can be undesirable, in the sense that it can break your optimization algorithm, and this is something that needs to be addressed.

Another issue with deploying federated learning systems onto devices is that the systems are really not very mature. On-device inference has been happening for quite a long time and is already deployed on many devices, but the full federated learning loop is still a work in progress. We are moving there, though, and nowadays a couple of providers already allow running a forward and backward pass on the devices.

Another issue is limited labels. If you think about your mobile phone, you do have a lot of, say, texts and photos that we could perhaps train on, but there are very few labels. That is, for instance, one reason why the first applications of federated learning were next-word prediction, such as for Gboard: there, we do not need labels.
One line of work people are looking at is how to incorporate semi-supervised learning that exploits the data structure within federated learning, and how to actually incentivize clients to label their data.

Then we can talk about personalization. The thing about federated learning is that the clients may hold very different subsets of data, and then the question is whether a single global model, which is the standard way of doing federated learning, is the right thing to do. If we look at an application like next-word prediction, the answer very much depends on the context and on the user, so a single global model may not be the right choice. Possible research directions people are looking at include incorporating meta-learning approaches into federated learning, and there is actually a lot of work trying to link federated learning and meta-learning together; discovering interesting structure within the federated network, for instance by clustering; and finding some balance between local and global models, where by a local model you can think of a model trained purely on the data distribution you see locally, while the global model targets the global distribution. Another popular subfield of federated learning here is split learning.

Another very important issue, which was perhaps not much discussed when federated learning was originally introduced, is privacy guarantees. By construction the data never leave the device, but we did not originally have any formal guarantee that the updates we send, even though they are aggregated and there are many techniques to make them private, do not introduce any privacy loss. And if you think about it, you cannot have zero privacy loss if you learn anything at all. There were also a couple of works showing that even when you do federated learning carefully, with very focused updates averaged or aggregated over many clients, privacy leaks are still possible if the adversary is smart. This is where remedies such as cryptography come in; those make the system safer, but when we talk about true privacy of the clients, that is where differential privacy pops up. For those of you not familiar with differential privacy, the informal definition says that if I train a model with or without your data, the output of that model does not change much, and the guarantee you get is an upper bound on how much it can change; a formal sketch of this statement follows below.

Now, another issue that comes with federated learning is that we can think of it as a collaboration among many, potentially mutually untrusted clients.
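To state the differential-privacy guarantee just mentioned a bit more formally (this is the standard (ε, δ) formulation, not a formula taken from the slides): a randomized training mechanism M is (ε, δ)-differentially private if, for any two datasets D and D' differing in one client's data, and any set of outcomes S,

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S] + \delta .
```

A small ε therefore bounds how much the presence or absence of your data can change anything that is computed from the trained model.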
Because of this open collaboration among untrusted parties, federated learning is a very easy target for poisoning: you might have competitors training a similar model, so there is an incentive to try to destroy your model. When designing a federated learning algorithm that you are going to deploy in the wild, you have to be aware of that, but thankfully there are several defense mechanisms that can at least partially remedy these poisoning attacks.

The last challenge I want to discuss is incentives to participate. Why would I even give you my data or participate in this federated training? That is something that has to be clearly defined, also with respect to labels: your model would improve if I labeled my data, but why would I do that? Local training also incurs costs in energy, privacy and so on, and there is the question of whether I even benefit from the federation. That is the client's point of view; from the orchestrator's point of view we would also like to know each client's contribution: does having your data actually benefit the federation at all? Other issues come with the ownership of the model: who actually owns it, the community or the global service provider? This opens up very interesting questions, especially if you work on economics. When we consider the triangle of computer science, economics and statistics, we already have well-developed disciplines at each pairwise intersection: economics plus computer science gives algorithmic game theory, statistics plus computer science gives the foundations of machine learning, and statistics plus economics gives econometrics. But the middle, the intersection of all three, is where federated or collaborative learning sits once we address all of these issues.

With this, I hope I have at least given you an appreciation that federated learning presents a new realm of unique, complex challenges. To overcome them we must devise system-aware, efficient optimization techniques, and the key areas to focus on are optimization theory, networking and scheduling. The hope is that these techniques will help streamline data processing, improve performance and thereby enhance the overall effectiveness of federated learning models.

All right, that was the first part. The second part I want to focus on is decomposable models, and I would like to start with the motivation. Our motivation comes from very popular linear algebra techniques that are widely used in machine learning: principal component analysis and the singular value decomposition. These are popular because they give you dimensionality reduction, noise reduction and feature extraction, and they can significantly improve computational efficiency. The question we started with is: can we make a PCA or SVD version of neural networks? To answer that, let me first look at the SVD and how we could represent it as a neural network.
If you look at this decomposition, we can represent it as a neural network with a single hidden layer, no activation and no bias, that implements the same linear mapping. The nice thing to notice is that if we prune within this hidden layer, what we get is the reduced SVD, where we only keep the first, say, two singular vectors, and so on. So this at least gives us a good example that matrix factorization can be represented within neural networks.

Maybe the most interesting part is whether we can actually learn this. And it turns out we can: if we define a general matrix decomposition, we can write it as the following optimization problem. If you want to turn it into the SVD, the SVD essentially solves this problem individually for each rank K, where the notation means the first K columns of the matrix U. The nice thing is that you can put it all together as a summation across all the ranks, and a summation can be represented as an expectation. Once you see this, and you are familiar with what we do in machine learning when we have many terms, you can simply apply SGD, and that leads to our construction, which we call ordered dropout. Essentially, we show that if you define the problem in this way and apply SGD, not over the data points but over the submodels, you can actually learn the SVD with a standard machine learning training loop. If you are interested in optimization, you can also view this as an over-parameterized problem, albeit an unusual one, because the gradient at the optimum is zero for each of the subproblems.

How does this look in practice? If we were about to learn, say, the SVD, we would apply this ordered dropout technique in the hidden layer: we have some distribution over the width, in each step we sample a width, and that is the number of neurons we keep. We call it ordered dropout, not just dropout, because it preserves the natural order of the neurons. The nice thing is that with this construction you can recover the SVD as a special case: if the mapping between the data and the output is linear and you sample your data from the uniform ball, then the standard training loop for this network with ordered dropout recovers the SVD, and that is what I have here. You can see that each of these submatrices, here the rank-K matrix, converges to the best rank-K approximation of the matrix A within a single training loop, and this holds even with sampling: we do not need to evaluate every single rank, we can just sample one at a time.
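As a rough reconstruction of the objective being described (the notation is assumed here, since the exact slide formulation is not reproduced): for a target matrix A in R^{m×n} and factors U in R^{m×r}, V in R^{n×r}, the per-rank problems can be folded into a single expected loss over a sampled rank k ~ D,

```latex
\min_{U,V} \; \mathbb{E}_{k \sim \mathcal{D}}
\Big[ \big\| A - U_{:,1:k}\, V_{:,1:k}^{\top} \big\|_F^2 \Big]
\;=\;
\min_{U,V} \; \sum_{k=1}^{r} \mathcal{D}(k)\,
\big\| A - U_{:,1:k}\, V_{:,1:k}^{\top} \big\|_F^2 ,
```

which is exactly the form on which one can run SGD by sampling a single rank, i.e. a submodel, at each step instead of summing over all ranks.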
You can also recover PCA in a very similar scenario; the only difference is that now I sample uniformly at random from my dataset, while the mapping between the features and the labels stays the same. In the example here, the data effectively come from a three-dimensional space, and you can see that the network quickly discovers this: it gives you the three principal components and zeros out the rest. In the general case, which I will not discuss much, this linear network essentially performs PCA on the dataset transformed by the mapping A.

How do we generalize this to neural networks? That boils down to the two works I am going to present. The first is titled "FjORD: Fair and Accurate Federated Learning under heterogeneous targets with Ordered Dropout"; this is the work where we introduced the notion of ordered dropout. Let me first acknowledge my collaborators: Steve, Mario, Stelios, Elias and Nick.

Just to remind you, the problem we are trying to solve is heterogeneous devices: we want to support different tiers of devices in training, but we want to avoid the standard construction where, in a typical federated learning loop, the accepted norm is that all the local models you deploy must have exactly the same architecture as the global model. To satisfy that, you either drop low-tier devices or you limit the global model size so it can be accommodated by all your clients. The drawbacks of these approaches are the following. First, if you drop low-tier devices, you might get very limited participation, and since federated learning trains not only on heterogeneous devices but also on very heterogeneous data, by dropping low-tier devices you might lose a very important part of your dataset and therefore suffer a lot of bias due to unseen data. Second, by limiting the global model size, you might degrade performance for high-tier devices.

Our goals with this work are fairness in participation, meaning that every single client can participate, and competitive performance, meaning that all devices should get performance as good as possible given their local constraints. How do we achieve that? The first, fairly natural step is to simply drop the assumption that every client needs to run or store the same global model: for the lower-tier devices we deploy thinner models, I will show you how in a moment, where the width we deploy depends dynamically on the local constraints: memory, computational capabilities, load, battery level, or, say, limited bandwidth. And we achieve this exactly through the ordered dropout I just discussed. So if we have a general network like the one on the left, we can apply our ordered dropout technique in the following way.
What we do is define a set of relative submodel widths, numbers between zero and one, and a distribution over those widths. In each step we sample a width from the distribution and apply that width restriction to every layer except the input and output layers, to keep the structure of the network unchanged.

To give you a brief comparison with standard random dropout: for ordered dropout the motivation is not primarily regularization; the motivation is that this technique pushes as much knowledge as possible towards the left of the network, whereas for standard random dropout the motivation is to prevent co-dependence of the neurons. Another difference is that our inference is exact, while random dropout has inexact inference, where you approximate the ensemble by a simple average of the weights. Those are the main differences.

Just to show you one example: consider this network, and say we sample 0.4; that means we keep 40% of each hidden layer, which here yields two neurons per hidden layer. Once we deploy the trained model, we can adapt the width of the network to the tier of the device we are deploying to.

Now, how do we plug this into a federated learning training loop? The construction is the following. We have a set of devices, we take the architecture, a set of widths that we want to train our model on, and a distribution over those widths to sample from. An example: for CIFAR-10 on ResNet we might have five widths to train on, and you can see that this leads to different numbers of MACs and parameters per network across the different widths. We split the devices into several tiers, and each tier is assigned a p_max value, which is the maximum width those devices can work with while respecting their constraints.

Then, for a single communication round, a single training step, of federated training: as usual, we select the devices to participate, based on the constraints I discussed. To each device we send the submodel corresponding to its p_max, the maximum width it can work with. The device performs local steps, but with ordered dropout, so that it trains across all the widths it is capable of training on; this is what gives us the proper solution, the decomposable network I described before. Once this is done, the device communicates its model update back, and the server aggregates in a non-uniform manner, in the sense that for a given set of widths we aggregate only over the clients that actually updated them. For inference, you simply deploy based on the device's capabilities.
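Below is a minimal sketch of the ordered dropout width selection on fully connected layers, as an illustration of the mechanism just described; the helper names and shapes are assumptions, not the FjORD implementation.

```python
import math
import numpy as np

def sample_width(widths, probs, rng):
    """Sample a relative submodel width p in (0, 1] from a discrete distribution."""
    return rng.choice(widths, p=probs)

def ordered_submodel(weight_matrices, p):
    """Keep only the first ceil(p * d) neurons of every hidden layer,
    preserving the natural neuron order; input and output dimensions
    stay untouched. Each W has shape (out_dim, in_dim)."""
    sub = []
    last = len(weight_matrices) - 1
    for i, W in enumerate(weight_matrices):
        out_keep = W.shape[0] if i == last else math.ceil(p * W.shape[0])
        in_keep = W.shape[1] if i == 0 else math.ceil(p * W.shape[1])
        sub.append(W[:out_keep, :in_keep])
    return sub
```

In a FjORD-style round, the server would extract the submodel for a tier's p_max before sending it out, and during local training the device itself would re-sample a width p ≤ p_max at every step.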
One of the nice things about having a model trained this way, which is truly decomposable, is that even after deployment, if we take a device from a higher tier, we can scale the model dynamically during inference: if that device's battery level drops or its load increases, we can simply decrease the size of the model to stay within the constraints, because the model was trained in this decomposable manner.

Okay, so let me show you some of the experiments. Okay, that rendering is a bit strange. On the right, I will focus on CIFAR only. What we did is use the same training setup as for this baseline model and train it with ordered dropout. The red squares are models optimized and trained from scratch at each width, so the red points correspond to five separate trainings, while the orange curve is ours, trained in a single loop using the hyperparameters of the biggest model. You can see the performance is more or less on par, and we believe that with a bit of hyperparameter tuning, once the regularization effect kicks in, you could actually get even better performance than the models trained from scratch.

Another question one might ask is whether we really learn a meaningful decomposition of the model. One way to check is to take our model trained with ordered dropout and compare it to a model trained only at full width with random dropout, from which we then extract submodels with the same number of parameters as ours, just to double-check that this is not some implicit property of ordinary training, that any submodel of that width would have the same performance. And here you see a very large drop: even at 40% of the network width, the random-dropout model is pretty much at the level of random guessing, while we still achieve more than 90%.

Apologies, it seems the Mac-to-Windows conversion does not like some of the figures. The next thing I want to highlight, another nice property of ordered dropout, is that you can increase the granularity of the widths you train on: when we go from five uniform widths to ten uniform widths, the performance at the intersecting widths essentially matches.

Then, when we actually deploy this in federated learning, we see the biggest increase in performance. The reason is that the only available baseline at the time of this work accommodated clients that cannot run the given model by applying random dropout during local training on those clients, and that is how it matched each client's constraints.
With that baseline, as you try to train larger and larger models you may actually see a decrease in performance, while for us you see a steady increase, meaning that the more compute power and parameters you can afford, the better the model you get. This also double-checks that the scalability transfers to federated learning.

What I am going to discuss quickly now is the next work, which is also based on ordered dropout, although the motivation and the usage of ordered dropout are slightly different. Again, let me acknowledge my collaborators: Steve, Shashank and Hongyi. The main challenge we are trying to overcome here is how to train large deep models, or rather how to make training smaller deep models as good as training large deep models. If you look at the standard problems with training large deep models, particularly those with millions or billions of parameters, there is the energy consumption, the resource demands and the data requirements.

Why do we actually train large models? Because they perform so well. The reason for that is still an active area of research: it may be due to implicit regularization, meaning the model is biased towards simple solutions, and also due to smoothness of the loss landscape, so it is actually easier to optimize a larger model than a smaller one. Our goal is to keep the benefits of training larger models while actually training smaller ones, and the aim is to design models that maintain high performance while significantly reducing the size and computational requirements. How are we going to do it? Through ordered dropout: we try to discover low-dimensional structure via an efficient decomposition, and by low-dimensional structure here I mean low rank of the weights, based on prior work that observed this to be one good notion of low dimensionality.

How do we do it concretely? We already saw that ordered dropout is connected to decompositions. If we have an original M-by-N mapping, we transform it into a factorized mapping and try to decompose it, essentially removing the zeros, and that is how we get the low-rank approximation. The way we make this low-rank approximation nicely trainable is by designing it with ordered dropout. Take our original network: within each layer we insert a factorized layer whose inner dimension is the minimum of the two outer dimensions, and we apply ordered dropout in this factorized layer. The factorized layer has no bias and no activation. For training, our sampling is now a bit different: we sample one layer at a time, say the first one, and then we sample its rank; that is the network we evaluate in that step. The reason is to learn an efficient factorization of this layer that we can later prune.
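As an illustration of the factorized layer just described (the shapes, the rank sampling and the helper function are assumptions for the sketch, not the actual Maestro code):

```python
import numpy as np

rng = np.random.default_rng(0)

def factorized_forward(x, U, V, r):
    """Replace a dense layer W (m x n) by U (m x r_max) @ V (r_max x n),
    with ordered dropout over the inner rank dimension: only the first
    r rank-1 components are used in this step."""
    return x @ V[:r, :].T @ U[:, :r].T   # x: (batch, n) -> (batch, m)

m, n = 64, 128
r_max = min(m, n)                        # inner dimension = min of the two
U = rng.normal(size=(m, r_max))
V = rng.normal(size=(r_max, n))

x = rng.normal(size=(8, n))
r = rng.integers(1, r_max + 1)           # sample this layer's rank for the step
y = factorized_forward(x, U, V, r)
```

The group-lasso penalty and the adaptive rank pruning discussed next are what eventually allow the trailing rank components to be dropped.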
The factorization we learn is one that accounts for the data and for the structure of the network as well. These are the main components of the method we propose here, called Maestro. Let me walk you through the algorithm. The first part here is just ordered dropout: sampling and obtaining the pruned network. Then, something I did not discuss yet: in order to actually get low ranks, we have to enforce them somehow, and one good way to enforce low rank is a lasso penalty; because we have a natural ordering, we can use group lasso. The last ingredient is adaptive pruning, so that we prune as we train: we have a condition that says, once the network discovers that it does not need certain ranks, we simply drop them.

Now let me quickly walk you through the experiments. What this table tells you is that, compared to other methods that do low-rank approximation with plain SVD, rather than a decomposition tailored to the data and the network, we do much better, and we can obtain much sparser networks as well. In the middle, we display the ranks per layer for different group-lasso penalties, and you can see that the network may learn not only a decomposition within each layer but also some global decomposition, because by increasing the group-lasso penalty you always obtain a subset of the original network. Another sanity check: since we obtain a decomposition, we can do post-training pruning of the already decomposed network and compare it to SVD, which is the naive linear approximation, and we do much better there. Another interesting observation is that we do not even need to search for a good pruning schedule: we can simply look at what ranks the other networks found, replicate that, and it works reasonably well.

Okay, I have two more slides just to summarize. We discussed two of the key challenges in federated learning, one being efficient training and inference on heterogeneous devices. We looked at ordered dropout as a technique that enables the decomposition of networks, and we introduced two techniques that exploit ordered dropout, FjORD and Maestro. Some other interesting applications of ordered dropout that we are looking at right now are automatic rank selection for LoRA, model alignment across architectures, and network consolidation. With this, let me conclude, and thank you for your attention.

Thanks a lot, Sam, for this amazing talk. Colleagues, do we have questions?

Thank you for a great presentation. My question is about cybersecurity and privacy of federated learning. What is the future of federated learning? Is it, for example, integration with AI and quantum cryptography, or something else? What will make it more efficient, more private and more secure?

So, yeah, thank you for your question. That's a very good question.
I mean, the main thing, when you ask about privacy, is that we can make it as private as we want, right? The point is that what you actually want to achieve is the best accuracy-privacy trade-off, and that is one of the main challenges, because you could always simply not communicate at all: I keep my data private and never communicate anything, which gives perfect privacy but essentially zero utility. So the main challenge people are looking at right now is, first of all, that it is already highly non-trivial to train in a federated network even without any privacy constraints, just federated learning as I introduced it; then, when you add formal privacy guarantees on top of that, it introduces an extra layer of difficulty, and there is a lot of ongoing work on finding the optimal trade-off between utility and privacy. That is the main thing, but you can deploy it essentially anywhere where you collect private data, or rather where you do not collect private data but want to train on it.

Hi, thank you again for a great talk. I have a question. Could we expect, at some point, a unified framework for doing this, like Hugging Face, that combines all of this, so that from the user perspective you just pick the model? Could we expect something similar for federated learning? Because you mentioned a lot of frameworks and they are fairly independent of each other. Is it still work in progress?

Yes, pretty much. I mean, those are the two main frameworks, and both of them are also startups, and one of their main goals is to provide a unified, one-stop shop for federated learning.

That's good. Thank you so much for a very interesting talk. My question is the following: as I understand, there are several approaches that help make the training process more efficient, and you showed some plots on ResNet, as far as I remember, where we have approximately no drop in accuracy, no drop in the score. My question is whether this good score is the result of a combination of these approaches, or of just one of them, for example the dropout or the low-rank approximation of the layer?

Yeah, so essentially those are the cases where you do not lose any accuracy. What we show here, I think this is the plot you are pointing at, is that, first of all, if you do things well, you can even improve a tiny bit; that is where regularization kicks in, the standard thing you would expect from a decomposition that removes noise. And then we show that if you really care about pushing the number of parameters smaller and smaller, you can still do that by increasing the sparsity penalty.

Okay, so do I understand correctly that this table shows the effect of regularization? Yes. So it is regularization plus pruning. Thank you.

Colleagues, I think we are running out of time because the next sessions should already start.
So I suggest that if you have any more questions for Sam, you approach him during the coffee break or so. Let us thank Sam again. Thank you. Yes. And let us proceed. Let me remind you that we again have two parallel sessions: the NLP session is going to be here, and in the second venue we have, I guess, computer vision, right? Okay.

So thank you very much. I am going to present joint work; most of the work has been done by Özge Sevgili, who is a student at the University of Hamburg, and this work is also with collaborators from Hamburg and the Indian Institute of Technology. The work is about the task of ultra-fine entity typing.

What is entity typing, and ultra-fine entity typing? Consider this example: you have "Olympic National Park", and this is a mention. Many of you know about the entity linking task: you take some mention and you need to link it to a knowledge graph, a knowledge base, say Wikidata or Wikipedia. You assume that in Wikipedia there is a specific page about Olympic National Park, and linking is quite useful because you can then harvest all the information from Wikidata about this national park: attributes, description and so on.

Now, the reality is that even the biggest knowledge graphs, due to the inherent power-law distribution and the scarcity of the data, cannot cover everything. I remember a case with the BabelNet knowledge graph: if you look up the word Python, there are about 50 senses of the word Python, and among them two roller-coaster parks, one in Germany and one in the United States, in Florida somewhere. That gives you an idea that, no matter how hard you try, there will always be this gap, a long tail of entities that either nobody cared to enter into the knowledge graph or that never made it in. This is where entity typing comes to the rescue: you label the mention with a certain hypernym, an is-a relation, and then you still get some idea of what the mention is, even if it is not inside the knowledge graph.

Within this area there are different granularities, because you can say this is a park, or a location, or a geographic area, or maybe you can say this is an Olympic park, right? So there is a genuinely hard question about what granularity you need to choose, and in this talk we speak about the case with pretty high granularity.
So we are talking about tens of thousands of senses or labels. The task is relatively easy when you take, say, 50 or 100 different types: you can easily collect a lot of data, but that information may not be specific enough to be useful for different applications. The problem with ultra-fine typing, where you deal with a large vocabulary, is that you again hit the problem of data scarcity: you simply do not have that amount of data, and people try to cope with this problem in different ways. One of the approaches, maybe not the most successful one but still interesting, I hope, and the one I will present today, is an unsupervised approach: we do not use manually annotated data, but rather distributional semantics and a bottom-up approach.

How else do people approach this problem? People try to use distant supervision: for entity typing, they ask where the data can be obtained for free. You can take entity linking datasets, and for each linked entity you can look in the knowledge base, see its hypernym, and perhaps generate several hypernyms; this is a common way to automatically generate such datasets. People also use Hearst patterns, a rule-based approach to extract these is-a relations from text: say I write a sentence like "such cars as Mercedes, BMW and Audi are expensive and luxurious". That would be a second approach, and there are also approaches using zero-shot and unsupervised techniques, again to avoid this bottleneck.

What we are trying to do is leverage unsupervised, automatically induced word senses from the JoBimText framework, which is based on distributional semantics and contains not only distributional representations of words but also distributional representations of word senses labeled with hypernyms. What we actually do in this work is examine how useful these hypernym labels are for the task of entity typing: we try to disambiguate the context with respect to these induced senses.

How does it actually work? At the core of this approach is a repository of distributionally induced senses. These come from the JoBimText framework, and you can look at these two papers for the background; this is not what is proposed in this work, but rather the infrastructure that we are using. For every word, in this case the word "Rennes", you get sense clusters: Rennes might be a city in France, and in this first cluster you see Lille, Montpellier and other similar cities, but there is also the football club, and that football club cluster has a different hypernym. Here you see these labels, the is-a labels: "club" is the common is-a label for one sense, and "city" is the hypernym label for the other sense.

The magic of this is that none of it is done using human labor. How is it actually obtained? First you obtain distributionally related words, something like word2vec neighbors, then you perform clustering and group these words into clusters of related words. But you still do not have the hypernyms; those are obtained as follows: every word is assigned a list of automatically induced hypernyms, say using these patterns, and the counts are aggregated so that the common hypernyms for the cluster emerge. Of course, this is a noisy procedure.
Any single term will be assigned a lot of hypernyms, but the common hypernyms pop up at the top, and that is the trick that gives you relatively clean hypernyms at the top: "city" or "club" will be common hypernyms across many of these distributionally related words, because they are all cities, right? There may still be some noisy words, but this is what we are using.

Now, how does the method work? You have the input, and you always have a certain mention to which you need to assign a hypernym in context, so you need to disambiguate it, and the repository becomes your sense inventory. The rest is actually very simple: you vectorize the context using an SBERT representation, you vectorize the mention, which here also contains the word "Rennes", and you vectorize the sense clusters with the same vectorizer, so that similarity computations can be done. The rest is just picking the most relevant cluster and taking its hypernyms as labels.

There are a few additional steps. If you go to real entity typing datasets, you see not just a single word but something like this: mentions can be really long and elaborate, and even though the sense inventory is obtained bottom-up from a text corpus, there may be no sense representations for such multi-word expressions. So a lot of additional preprocessing is done: headwords, different keywords from the mentions, are collected, and sense candidates are obtained from these rather than only from the exact mention; there are also other steps like singularization of the hypernyms and the mentions. But essentially, as soon as all these linguistic normalizations are done, you vectorize the context, compare the context vector with the candidate prototypes of the different senses, and pick the most appropriate hypernym labels.

Experiment-wise, we took the setup of Choi et al., this paper here, and compared against it and against some other baselines. The first baselines were to pick the first cluster for a given word, or to pick a random cluster. Picking the first cluster is the analogue of the most frequent sense baseline, which is very important in every word sense disambiguation task: because of the distribution, most words are used in their dominant sense, and the largest, or first, cluster will correspond to that dominant sense, so this is considered a strong baseline; the random cluster baseline just picks a random sense. Choi et al. is an approach that relies on encoding with a bidirectional LSTM and a CNN, trained with a multitask objective, and we also considered some other approaches based on masked language models and NLI.

There are quite a few moving parts in this method: you can select headwords differently, you can singularize these words and mentions differently. Here, as in entity linking, the way you generate candidates is arguably more important than the kind of neural network you use; many studies show this, and since this is a similar kind of study, the same holds here: in any entity linking or entity typing business, the way you select and match the mention and generate candidates is very important.
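To recap the matching step in a minimal sketch (the encoder model, the toy sense inventory and the preprocessing here are illustrative assumptions, not the authors' code):

```python
from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the SBERT model used

# Hypothetical slice of a JoBimText-style sense inventory for the headword "Rennes":
# each sense is a cluster of related words plus ranked hypernym labels.
senses = [
    {"cluster": ["Lille", "Montpellier", "Nantes"], "hypernyms": ["city", "town"]},
    {"cluster": ["Marseille FC", "Lyon FC", "Monaco FC"], "hypernyms": ["club", "team"]},
]

def type_mention(context, top_k=2):
    """Pick the sense cluster closest to the context and return its top hypernyms."""
    ctx_vec = encoder.encode(context)
    proto_vecs = [encoder.encode(" ".join(s["cluster"])) for s in senses]
    sims = [np.dot(ctx_vec, p) / (np.linalg.norm(ctx_vec) * np.linalg.norm(p))
            for p in proto_vecs]
    best = senses[int(np.argmax(sims))]
    return best["hypernyms"][:top_k]

print(type_mention("Rennes beat Lyon 2-1 in the league match on Saturday."))
```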
Of course, some parameters correspond to JoBimText itself: there is a fixed granularity of the clustering, which can be finer or more coarse-grained, so for a given word, say "Jaguar", you can have three senses or ten senses, and they may also be more or less noisy. The last parameter we take into account, and an important one, is the number of predictions: you can take only the first hypernym, or the first five, and if you return to this example, the first, second and third labels seem relevant, but if you go down the list, which is created automatically, at some point you hit very noisy hypernyms that correspond to very generic or irrelevant senses.

All right, so here is the table with the results. First of all, we see that the first cluster and the random cluster are indeed relatively strong baselines, but the method itself outperforms them, so it does perform a certain amount of disambiguation consistently. However, the results from the literature, based on those other approaches, are pretty strong. In the end, the contribution this work managed to make is that, in combination with the approach of Choi et al., the method yields a certain improvement. By itself, the method shows that the clusters are pretty noisy, so using it alone gives quite noisy results, but it provides additional information, and if you combine it with the method of Choi et al., especially taking only a few predictions, it helps: you can see that precision drops quite significantly as you take the first three, five or seven predictions, while recall of course increases.

What worked much better was also to drop pronouns. Why does it not work well with pronouns? The reason is very simple: think of a pronoun like "it". Even if you have several senses of the word "it", or "she", or "he", the hypernyms of those senses will be completely meaningless, and if you take a mention in context that is just "it", there is no way to generate candidates for it. If you take the setup without pronouns, a boost in the ultra-fine setup can be obtained, and that is summarized in this figure: if you consider pronouns, the results stay more or less the same, but the additional information coming from the distributional labels is useful in the setup without pronouns. So in this case we get a slight improvement of the score.

Of course, there are certain errors in the method. You can see here what was generated, these are two example predictions: for instance, the method generated something like "violation" and "difficulty" for this mention, but the true label was "crime" or something similar. Again, you can see the task is pretty challenging for real datasets, as you have very long mentions, and in this case the method seems to generate quite plausible labels which, according to human judgment, might be somewhat relevant, but they are not in the gold standard.

Okay, so the summary of this talk is the following: we explored how information from JoBimText, from unsupervised induced word senses, can be used for the task of entity typing. It seems that if you do not consider pronouns, the results can be improved when combined with the method of Choi et al., which means the word senses contain helpful and complementary information.
However, word senses induced just from text are pretty noisy, and you need to deal with them with extreme care, because you also need to be very careful about how you communicate them. So that's pretty much it; in case you have any questions, I will be answering, and Osge might also be on the Zoom link, so in case, Osge, you want to say something...

Yes, I am also here, thanks for the talk. Okay, Amalia, if you can maybe turn on the camera, that would be also great. Okay, no, I mean, I think it's something different. Yeah, well, if you start Zoom, then everyone in Zoom will see themselves, which is probably not... Yes, yes, we do; if you can hear me, testing. Yes, yes, we can hear, okay, perfect. So yeah, just one very simple question: it seems that all the experiments were done with English data, and I'm not familiar with this JoBimText framework. How easy would it be to extend it to other languages, and does it exist for other languages?

Yes, it actually exists for multiple languages. I'm sure it has support for German, for Russian, for Italian; I'm not sure how long this list is, but I think maybe 5 to 10 languages are supported. And yeah, you can look at the website, they have a nice web demo: you can just enter a word and look up all of these word senses. Actually, this is a snapshot of that demo: you enter a certain term, select maybe a model and a language, and you will see something like this, the automatically induced sense clusters and their labels.

Just in case, I can't see the screen. Okay, if you cannot see... okay, so, yeah, right now what we see and hear in Zoom is only the video stream full screen. But anyway, thanks. Thank you very much.

Good morning, everyone, my name is Maria Maslova, and today I'm going to present a project which is called RU-CAM, a comparative argumentative machine for the Russian language, and my colleagues are Irina Nikishina, Stefano Vrabrikov, as well as Chris and Sebastian. What is wrong? That's nice. So I'm going to speak about quite an important topic, the problem of choice. I'm sure that all of you have faced the necessity to choose between iOS and Android, holiday places, car models, and so on, and one more topical issue is the choice between cats and dogs. So who is for cats in this auditorium, raise your hands; and for dogs? As you can see... So at the end of the presentation we will see how our system responds to this question. It is quite logical to create a sort of system that will help to solve this problem, the problem of choice, with the support of some reliable arguments; however, this task is quite complicated, as it lies concurrently in the fields of question answering and argument mining. Still, one of the most known and prominent pieces of research in the field is CAM, a system which can answer a user's comparative input with the support of arguments extracted from a text corpus. However, CAM is English, and there is no analogue for the Russian language. So now we present RU-CAM, a system aimed at comparing two objects from a general domain in Russian with an argumentative explanation. Compared to its predecessor, RU-CAM has the following differences: it allows working with comparative questions in natural language, it has a component for object and aspect identification from comparative questions, and it uses an Elasticsearch index of the Open Super-large Crawled Aggregated coRpus, abbreviated OSCAR. We do not only develop a similar system and a pipeline from an engineering perspective; we also try to pose and answer the following research questions: what are the main peculiarities of CAM that need to be taken into account when adapting it to other languages, and one more
question, a more specific one: what are the main challenges when adapting CAM specifically to the Russian language? To start answering these questions, let's look at the system design. The process can be split into two steps: question analysis and argument retrieval. The first step is about identifying the interrogative and comparative nature of a sentence; it also includes the object and aspect identification process. The second step consists of the search for relevant arguments for the input objects, their classification, and ranking. Let's consider each step in detail. The processing of a request starts with identifying the question type, whether it is comparative or not. It can be done in different ways, including a rule-based approach: here we stick to the idea of special patterns in comparative questions, which include comparative forms, explicit mentions of comparison, similarity, difference, etc. To implement some machine learning approaches, we first compile a data set from the sources shown in the table, and here is an example taken from this data set; then we use a RuBERT and a fine-tuned BERT from another study. As you can see in the table, the latter shows the best results, but comparative questions are quite a specific kind of question that can be identified with good quality even using rule-based methods. After identifying that the question is comparative, we need to extract objects and aspects to further provide them to the argument retrieval stage. At this step we also implement several approaches, including a rule-based one. It is founded on the idea that all requests have a certain structure, namely that they contain two compared objects and a connective of comparative nature between them. We consider the following cases: two nouns, two verbs, the combination of a noun and an adjective, and the combination of a noun and two subordinate adjectives; also, we expect a connective from a list of conjunctions and syntactic words expressing comparison between these two objects. In order to create a data set for the task, we take 6,000 sentences from the previous step that have been labeled as comparative and manually annotate them. Three experts in computational linguistics were asked to label the first and the second object and, optionally, an aspect and a common object; a common object is a specific structure with a noun subordinating two adjectives, so, for example, in a phrase like "green or black tea", "tea" is an example of a common object. The level of annotation agreement is shown here; when creating the final data set, we use the annotation version supported by the majority of annotators, and that's one more example. Considering our example, we use fine-tuned transformer encoders and a few-shot approach with generative transformers to solve the object and aspect extraction task. The table presents the results for each model: we see that generative models perform on par with or even slightly better than the baselines and significantly worse than transformer encoders; still, they may perform much better after proper fine-tuning, as we have shown them only 5 examples. Regarding the near-zero scores for the common object and aspect labels, we claim that there are two problems: the first is the inconsistency of annotation, and the second is the complex nature of these labels in a semantic sense.
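To make the rule-based question-type identification mentioned above a bit more concrete, here is a minimal sketch of what such a pattern check could look like; the marker list, function name, and examples are purely illustrative assumptions, not the actual RU-CAM patterns.

```python
import re

# Illustrative comparative markers for Russian questions (not the actual RU-CAM list):
# comparative forms ("лучше", "хуже", "быстрее"), explicit mentions of comparison or
# difference ("сравнени-", "разниц-", "отлича-"), and the "X или Y" construction.
COMPARATIVE_PATTERNS = [
    r"\bлучше\b", r"\bхуже\b", r"\bбыстрее\b", r"\bнадежнее\b",
    r"сравнени", r"разниц", r"отлича",
    r"что выбрать", r"\bили\b.*\?",
]

def is_comparative_question(question: str) -> bool:
    """Return True if the question looks interrogative and comparative by surface patterns."""
    q = question.lower()
    return q.endswith("?") and any(re.search(p, q) for p in COMPARATIVE_PATTERNS)

print(is_comparative_question("Что лучше: кошки или собаки?"))   # True
print(is_comparative_question("Когда был основан Петербург?"))   # False
```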
In order to retrieve arguments in favor of one or the other object, we use the Open Super-large Crawled Aggregated coRpus, OSCAR; we use OSCAR instead of Common Crawl, as it is claimed to be its filtered version. We store and index this data with Elasticsearch. When indexing documents, we decided to create two indexes: the first one is for storing document information, like the number of sentences, the web link, and so on, and the second is for storing the sentences themselves. We decided against lemmatization because of the time constraints and instead apply wildcards to be able to find all word forms; we send a Boolean JSON query and require that the clause must appear in matching documents. We consider this step to be the most challenging one in the whole RU-CAM pipeline, as the Russian language has a highly fusional morphology, which makes it much more difficult to retrieve sentences than in English, because query words may occur in any form. And here is the Elasticsearch output for our objects, just as an example. After the candidate sentences with possible arguments are found, it is necessary to understand whether a sentence argues in favor of the first or the second object. Again, we have a rule-based approach that relies on a list of keywords, adjectives and adverbs with the meaning of superiority or inferiority of the first object over the second; we also take into account negation cases, when the meaning of the sentence is reversed. We collected a data set and annotated it using the Yandex Toloka system for data crowdsourcing. To do this, we selected the same or similar pairs from the same domains as in the English research, like programming languages, car manufacturers, food, drinks, and so on, and made queries to Elasticsearch to extract all sentences matching the query. Then we created a system of tags; there are three tags: the better tag means that the first item wins over the second, the worse tag means that the first item loses, and the tag none means there is no comparison between the objects we are interested in. Unfortunately, the annotated data set is highly imbalanced: 75% of the sentences belong to none, and, for example, only 9% belong to the worse tag. At this step we also implement several transformer encoders and few-shot approaches with generative transformers. The results for comparative sentence classification are inconsistent and relatively low for all the models due to the class imbalance problem; it is interesting that the rule-based approach produces quite a decent result on better sentences, and it outperforms the large models on worse sentences and the generative transformers on none sentences. The process of sentence ranking is identical to the one in CAM: we score comparative sentences by combining the classifier confidence and the Elasticsearch score. When displaying the arguments in RU-CAM for a certain object, we sum up not only better arguments where the current object is the first item but also worse arguments where the object is the second one in the sentence; for instance, both the sentence "cats are better than dogs" and the sentence "dogs are worse than cats" are used in favor of cats when comparing them with dogs. The main outcome of our research is the final system where we integrate the parts described above. The evaluation of the system is currently work in progress: we plan to evaluate RU-CAM analogously to the CAM evaluation pipeline, by asking whether users perform faster when searching for something of a comparative nature with RU-CAM compared to keyword search, and also we can ask some users just to play with the system to collect their feedback. That's it about the pipeline, and it's time to answer the research questions.
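To illustrate the retrieval step described above, skipping lemmatization and relying on wildcards plus a Boolean query in which both objects must occur, here is a rough sketch; the index name, field name, and truncated stems are assumptions made for the example, not the actual RU-CAM index layout.

```python
from elasticsearch import Elasticsearch  # assumes the official Python client (8.x API shown)

es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster

def comparative_query(obj1_stem: str, obj2_stem: str) -> dict:
    """Boolean query: a sentence must contain some word form of both objects.
    Wildcards stand in for morphological variants, e.g. 'кошк*' matches 'кошка', 'кошки', 'кошек'."""
    return {
        "bool": {
            "must": [
                {"wildcard": {"text": {"value": f"{obj1_stem}*"}}},
                {"wildcard": {"text": {"value": f"{obj2_stem}*"}}},
            ]
        }
    }

resp = es.search(index="oscar_sentences", query=comparative_query("кошк", "собак"), size=10)
for hit in resp["hits"]["hits"]:
    print(round(hit["_score"], 2), hit["_source"]["text"])
```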
So in general, when transferring CAM to other languages, you should take the following peculiarities into account: the difference in the notion of comparative sentences in different languages; the differences in the syntax and morphology of the languages when implementing rule-based approaches; and the existence of relevant data sets and pre-trained large language models for training for the different subtasks, as well as of large text corpora containing comparative sentences for search in the target language. Nevertheless, as has been shown for Russian, it might be quite smooth if at least some of the required tools are available. What do we have now? We have RU-CAM, the first instrument which helps to answer general-domain comparative questions in Russian. Inspired by the CAM system, we created a similar pipeline, adding new steps for comparative question identification, object and aspect identification, and sentence classification. We also present several new data sets in Russian that might be further used for the fine-tuning of language models for each subtask. And from the performed experiments we can see that rule-based approaches show decent results on the subtask of comparative question identification, as do few-shot generative transformers, and this subtask needs to be further investigated. And finally, let's look at our comparison, at our small research question: according to our system, cats win over dogs, and do that quite confidently. That's interesting. And here is an example of some top-ranked sentences extracted from Elasticsearch. As future directions, we plan to incorporate a summarization system that would be able to produce a coherent answer from the two lists of arguments for each object; it will allow us to compare the results of various instruction-tuned models for Russian and ChatGPT with the RU-CAM pipeline. Thank you for your attention. Any questions?

Thank you for your work. Can you elaborate a bit more on how you plan to evaluate the system outputs? Your experiments were based on different components, so whether the classifier or argument identification works well, whether sentence classification works well, but ultimately a user has a certain information need and is presented with these outputs, whether a cat is better or not; how would you judge whether the system satisfies this information need well or not?

So for users there will be a frontend to use and to give input, and we will measure whether they work slower or faster... and what else should I say?

Okay, yeah, that's a human study. Do you think it's possible to do it automatically, in a reproducible way, so that tomorrow somebody develops another system and it can be compared as well, or is it just not obvious? As I understand, for now the main way of evaluation is supposed to be human-based, but maybe we should think more about some ways of automatic evaluation as well; that is just sort of future work, I suppose. Okay, and maybe... okay, I will pass.

Thank you for the talk. My question is related to what happens if you try to ask questions comparing something that is incomparable; how can the system be used with such a thing? I mean, it's partially due to lack of data, partially due to some strange query; what is the default behavior, what is the intended behavior in your system, if I compare cats and audio or something like this? Well, of course, with this data set, if you train it on this data set, cats are always better. Yes, in fact, I suppose that there will be not enough output from the corpus if we pose such a question. Clearly, but what is the default behavior in this case, or what should the system do? Yeah, sure, sorry. For now, this is handled in the request: if we identify that the question is comparative, then the two objects will go to the Elasticsearch
system, and it will retrieve some sentences. In my opinion, there will be almost no sentences comparing cats with such an object. Sometimes we have "cat" as a machine, and then this is a disambiguation problem, but mostly it's for future work, because this is one of the limitations. So there might be some incomparable objects, but in future work we are planning to apply some taxonomic structures, to look for hypernyms and to look at how closely they are related, or maybe they are in different parts of the graph, so they cannot be compared. Not compared, yeah. Yeah, that could be done with, for instance, some kind of unsupervised baselines using a taxonomy or WordNet, in my opinion.

Thank you for the talk. Have you considered using more classes? It seems to me that sometimes the answer to a comparative question is that things are equally good or equally bad; have you thought of such classes? We've never thought about that, but I think that as a final result we should receive a sort of score for the first object and the second object, and that output requires only the classification into better and nothing more. Okay, thank you. Let's thank the speaker again. Now it's time for the third talk.

Okay, hello everyone, I'm Maxim Savkin, and I would like to present our work called Tuning-Free Discriminative Nearest Neighbor for Few-Shot Intent Detection with Consecutive Knowledge Transfer. So let's start with an introduction to the task. Intent classification is the task of identifying the user intent given an utterance; naturally, it appears in dialogue systems and comes along with the task of out-of-scope detection: if an utterance does not belong to any of the predefined in-scope intents, it belongs to the out-of-scope class. I would like to emphasize the importance of out-of-scope detection, as it is crucial for generating an appropriate response. We solve both of these problems simultaneously. The motivation behind this work is that most of the existing methods for intent classification require expensive fine-tuning and have high training requirements, especially the state-of-the-art models, and also most of them are focused on in-scope classification, completely missing out-of-scope detection. In our approach, on the other hand, we try to create a model which can work as a service, so it doesn't require any task-specific fine-tuning: it takes a few-shot training set and a set of unlabeled utterances as input and produces a set of intent labels. We inherit the discriminative nearest neighbor architecture: it utilizes standard k-nearest neighbors and replaces the distance function with a deep cross-encoder RoBERTa model. This model takes an input utterance and a training utterance and predicts the probability that the two utterances belong to the same intent class, so it is some sort of similarity function, but based on a deep cross-encoder model.
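As a rough illustration of this nearest-neighbor scheme with a cross-encoder as the similarity function, here is a minimal sketch rather than the authors' actual code: the checkpoint name is just a placeholder for any pre-trained pair scorer that returns a single similarity score, and the threshold value is made up.

```python
from sentence_transformers import CrossEncoder

# Placeholder checkpoint: any cross-encoder returning one "these two utterances match" score
# per pair (for example one pre-trained on paraphrase or similarity data) could be plugged in.
scorer = CrossEncoder("cross-encoder/stsb-roberta-base")

def classify(utterance, support_utterances, support_labels, oos_threshold=0.5):
    """Nearest-neighbor intent prediction: score the input against every few-shot example,
    return the best-matching example's label, or out_of_scope if even the best score is low."""
    pairs = [(utterance, example) for example in support_utterances]
    scores = scorer.predict(pairs)  # one similarity score per pair
    best = max(range(len(scores)), key=lambda i: scores[i])
    return support_labels[best] if scores[best] >= oos_threshold else "out_of_scope"

support = ["block my credit card", "what is my account balance"]
labels = ["card_block", "check_balance"]
print(classify("please freeze my card", support, labels))
print(classify("how do volcanoes form", support, labels))
```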
In the original paper, strong capabilities of this similarity function were achieved by fine-tuning it on pairs of examples from the target data set; however, we suggest completely skipping the fine-tuning step and focusing more on creating a strong pre-trained similarity function which can differentiate between unseen intents. So further I'll focus on pre-training this similarity function. We consider using several binary classification tasks. The first one is natural language inference; it is popular for pre-training strong binary discriminators. The second one is paraphrasing; it suits similarity prediction a bit better. And the final one is consecutive pre-training, where the model first trains on a large natural language inference data set, so that it can learn some utterance relations, and then tunes on a smaller paraphrasing data set, to better match similarity prediction. For natural language inference we also merge the last two classes, neutral and contradiction, into non-entailment, so it becomes a binary classification task. One problem with paraphrasing is that it lacks large high-quality data sets, so to mitigate this issue we tried using a small high-quality data set and augmenting it with non-paraphrases. For generating non-paraphrases we use a sort of clustering: you can notice that paraphrasing is an equivalence relation, so a paraphrasing data set can be divided into equivalence classes; utterances from the same class have the same meaning and can be considered paraphrases, and utterances from different classes can be considered non-paraphrases, as they possibly have different meanings. So all those missing connections you can see on the slide will become our newly generated non-paraphrases, which we use to improve the results on the paraphrasing data. As for the metrics for the final model for intent classification, we use in-scope accuracy and out-of-scope precision and recall, which are defined as standard precision and recall where the positive class is out-of-scope and the negative class is the combination of all in-scope classes. We randomly sample 10 few-shot data sets from the original data sets and report the average and standard deviation for all the metrics. The data sets we use are the following: we use the large CLINC150 data set, which contains 10 domains and a wide variety of intents. So let's move on to the results of pre-training. As you can see on the left plot, the natural language inference task, despite being so popular, achieved the worst results, due to its directional nature, and the best results we were able to obtain so far are with consecutive pre-training, where a model first trains on a large natural language inference data set and then tunes on a paraphrasing data set with the newly augmented non-paraphrases; this is the third column. I would also like to note that the augmentation really helped to increase both in-scope accuracy and out-of-scope recall by introducing new non-paraphrases. So let's compare our model against other tuning-free methods. The first one is TF-IDF KNN classification, and the second one is vanilla embedding KNN, which is actually just KNN based on a bi-encoder RoBERTa model pre-trained on a natural language inference task only; pre-trained, no fine-tuning at all. Here you can see that on the CLINC data set and the banking subset of CLINC we achieve the best results so far; moreover, our approach is much more stable to threshold selection than the other approaches, as it has a much larger area under the curves of in-scope accuracy and out-of-scope recall. We also thought it was important to compare our model with some fine-tuned methods; DNNC is the state-of-the-art model for out-of-scope detection, and here you can see that, as expected, the fine-tuned methods are better in in-scope accuracy; however, for the standard RoBERTa model you can see a huge drop in accuracy, which means that it has a lot of low-confidence predictions, and our model has a larger area under the curve of in-scope accuracy, so it is more stable to the selection of the threshold.
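For completeness, here is a small sketch of the non-paraphrase augmentation described a moment ago, treating each paraphrase cluster as an equivalence class and pairing utterances across classes as negatives; the toy classes and the sampling cap are invented for illustration.

```python
import itertools
import random

# Toy equivalence classes: utterances within one class are paraphrases of each other.
classes = [
    ["how do I reset my password", "help me change my password"],
    ["what is the weather today", "tell me today's forecast"],
    ["play some jazz music", "put on a jazz playlist"],
]

# Pairs inside a class are paraphrases (label 1).
positives = [(a, b, 1) for cls in classes for a, b in itertools.combinations(cls, 2)]

# Pairs drawn from two different classes are treated as non-paraphrases (label 0).
negatives = [(a, b, 0)
             for c1, c2 in itertools.combinations(classes, 2)
             for a in c1 for b in c2]

random.seed(0)
negatives = random.sample(negatives, k=min(len(negatives), 2 * len(positives)))  # cap the imbalance
print(len(positives), "paraphrase pairs,", len(negatives), "generated non-paraphrase pairs")
```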
These results are not included in our paper, but we decided that it would also be important to see how our model stacks up against ChatGPT. The whole CLINC data set didn't fit in the prompt, so we used only the banking subset of CLINC, and with the standard out-of-scope examples you can see that ChatGPT with a zero-shot prompt achieves nearly ideal results; we suppose that this is due to the fact that it has memorized this data set quite well. So we decided to replace all text labels with indexes, and we also replaced the standard out-of-domain out-of-scope examples with harder in-domain out-of-scope examples, and as you can see, our model still attains relatively high recall and accuracy, while ChatGPT really struggles with these harder out-of-scope examples; only in the usual setup is ChatGPT able to produce relatively good results. Yeah, so, summarizing our paper, I would like to say that we've developed a model that doesn't require any task-specific fine-tuning, so it can be applied to any data set for intent classification; it supports out-of-scope detection, has the best performance on the CLINC data set, and is robust to threshold selection. Thank you.

Thank you. Can you elaborate on this difference between setups, in-domain and out-of-domain? You mentioned... yeah, just the next slide... yeah, just the one before the conclusion. You mentioned that you removed some identifiers and converted them to numerical ones; what does that actually mean? I mean that ChatGPT has probably memorized the data set quite well, so we decided to reduce the effect of this memorization and replaced all text labels with just indexes, so it couldn't do so well just from memorizing the prompt. Would it be fair to ChatGPT, in the sense that maybe your system memorized these indices? No, it doesn't take labels at all: it just takes input utterances and compares them between themselves only, and doesn't take a label as input. Totally fair. Do we have more questions in the audience? Maybe we have some time for one Zoom question, if we have one. No, we don't have questions in Zoom. Let's thank the speaker. Now it's time for the fourth talk.

Hello, my name is Vasily, and we are going to talk about whether it is possible or not to find the number of topics in a natural language processing data set. A couple of words as an introduction: what is topic modeling about? A topic model receives as input a huge unlabeled text collection, and as output it produces topics as probability distributions over words; we can see what a text is about, and we can locate the places in the text where each topic is covered. The problem of topic modeling can be viewed as a matrix decomposition problem: a topic model receives as input a matrix of word-in-document frequencies, and it decomposes this matrix into a product of two matrices, the first a matrix of word probabilities in topics, and the second a matrix of probabilities of topics in documents. Well, but it is not clear how one should select this hyperparameter, the number of topics: is it crucial or not, and how do we find it? It seems that in some text collections, at least, the number of topics can be well defined beforehand. For example, if we take some articles from Wikipedia, they are labeled, they are split into sections, so if we take some articles from, let's say, the art section, some from the biology section, some from history, we get a data set, and we can say, obviously, that there are three topics in it. But in real life it may appear more complex, because these topics labeled by humans are something that just helps to simplify the categorization process, so in real life there may well be many more topics in this text collection:
these bigger topics may be split into smaller ones, and furthermore, topics may combine and produce some new topics. So is there a number of topics in a text collection or not, and can we find it or not? We are trying to find an answer to this question. And what are we going to do? We are going to train a topic model for a text collection with a varying number of topics, from low to high, and we are going to track some topic model quality measures and look at the plots, at the dependence of a quality measure on the number of topics; if we, let's say, see some local minimum or maximum or plateau, it may be a sign that the number of topics corresponding to this interesting point is an optimal one. This is the picture which we, well, expect to see. Before going further, I think it is important to say a couple of words about similar projects: we are not the first to try to track a lot of quality measures while training a topic model, we are not even the first to try to find the number of topics using these quality measures, but we believe that our research is one of the most extensive ones. So what quality measures are we going to use? First, perplexity, maybe the most common measure when training topic models: the lower, the better. The second block of measures is diversity measures: they compare topics with each other, computing distances between pairs of topics, because if a topic model produces topics which are all similar, it is bad. The next block is clustering measures, because topic modeling can be viewed as a soft clustering problem: words are split into topics, which are soft clusters, so we can adopt several measures from clustering analysis and use them as measures of topic model quality. And the last block on this slide is stability measures. Well, topic models are unstable; it means that if we train a topic model on the same text collection with different random initializations, we can get different results, so we compare the topic models obtained with different random initializations with each other, compare the topics. Well, this is not all the measures which we use: the next block is information-theoretic measures; we use several of them, but the idea is the same, to compute the difference between model complexity and model likelihood. The bigger the likelihood, the better, but the bigger the complexity, the worse, so these metrics try to find a balance between the two model characteristics. The next one is entropy: there are works where the authors propose to use this metric, drawing an analogy between topic modeling and a complex physical system, where we have several possible states, the topics, and particles, that is, the words, that can occupy them; the optimal number of topics according to this measure is the number of topics which gives an equilibrium state to the system. The last block of measures is top-tokens measures: we compute coherence and lift scores. That's it. Well, the methodology of our experiment is roughly described here: for each data set, we train a topic model with a varying number of topics, from a minimum to a maximum, and compute the topic model quality measures; the minimum and maximum number of topics which we vary depend on the data set.
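The sweep itself can be pictured with a few lines of code; the sketch below uses scikit-learn's LDA and tracks only perplexity on a public English corpus as a stand-in, whereas the actual study covers many more measures and other model families, so the corpus, range, and step are placeholders.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

texts = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data[:2000]
counts = CountVectorizer(max_features=5000, stop_words="english").fit_transform(texts)

# Train the same kind of model with a varying number of topics and track a quality measure;
# a local optimum or a plateau on the resulting curve is what would hint at a "natural" number.
for n_topics in range(5, 55, 5):
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(counts)
    print(n_topics, round(lda.perplexity(counts), 1))
```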
The models which we use are the following: PLSA, the simplest model, which has only one hyperparameter, the number of topics; LDA, the best-known topic model; and a couple more, the correlated topic model, which is trained in order to produce topics which are distinct, and a sparse topic model, which splits its topics into two groups, background topics, which are smooth and about nothing, and specific topics, which are sparse and exact; sparse means that the largest probability mass is spread over only a small group of topic words. Well, and these are the data sets: we use several data sets in English and several in the Russian language, and for each data set we know, at least approximately, the expected number of topics in it; this is our ground truth. Well, I should probably say a couple of words about the last data set, RuWiki-Good: this is our data set, composed by our research group; it consists of good articles from the Russian Wikipedia. And this is the result, the results table with three numeric columns; each column is a score which we assign to each quality measure. What do they mean? The first column is a Jaccard metric: it tells how consistent the predictions of the same topic model are with each other when it is trained with different random seeds; the lower, the better. In this table it is called Jaccard because it is computed the following way: we take the predictions of topic models with different random seeds and we make two sets, the first set is the union of the predictions and the next set is the intersection of the predictions, and we compute the Jaccard distance between these two sets. The next column is informativity: it tells how readable by a human the plot of the quality measure against the number of topics is, whether it contains a local minimum, maximum, or plateau, or whether it just goes up and down randomly without any possibility of making a prediction out of it. The last column is called expected: it tells whether the quality measure succeeded in finding the exact number of topics, whether it corresponds to the ground truth from the table with the data sets. Well, what can we see from here? The best values in the table are colored blue, but these best values are obviously far from good, so it may be an indicator that there is no such notion as a natural number of topics in a data set. And there are several illustrations to support some other conclusions. First, we found that the optimal number of topics depends on the topic model used: for example, on this plot, oops, on this plot we can see three curves, and each curve corresponds to a sparse model with a different sparsity hyperparameter, so one is more sparse than another, and as we can see, these models, which are the same sparse model with different parameters, give different results as the optimal number of topics, as the local minimum. What is more, all lines are averaged over random seeds, and each random seed also produces a slightly different number of topics, so this also impacts the result. And on this plot we can see... well, not on this plot; the other finding is the following: different quality measures produce different results for the number of topics; this is the general case. However, sometimes, on this plot exactly, we can see that different quality measures give the same result: for example, here is one topic model, one data set, and several quality measures, and they point at roughly the same number of topics, 7; but this is not a rule, this is an exception. And that is probably it: we found out that the number of topics is probably not a natural characteristic of a data set, it is just another hyperparameter of a model, and it is also dependent on the quality measure which is used to find it; perplexity and coherence, maybe surprisingly, failed to give any decent results, while the information-theoretic criteria and the Rényi entropy gave the best results.
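As a side note, the Jaccard-style consistency score described above, comparing what a measure suggests under different random seeds, boils down to something like this sketch; the example predictions are invented.

```python
def jaccard_inconsistency(predictions):
    """predictions: for one quality measure, the set of topic numbers it suggests
    under each random seed. 0 means the seeds fully agree; values near 1 mean they share little."""
    union = set().union(*predictions)
    intersection = set(predictions[0]).intersection(*predictions[1:])
    return 1.0 - len(intersection) / len(union) if union else 0.0

# Each run may propose several candidate topic numbers (e.g. several local minima).
runs = [{10, 15}, {10, 20}, {10, 15, 25}]
print(jaccard_inconsistency(runs))  # union has 4 values, intersection has 1, so 1 - 1/4 = 0.75
```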
Well, as a final remark, we want to say that probably finding the optimal number of topics is not such an important task; what matters is finding a way of training a topic model which has all topics interpretable and good, whatever topic number you assign at the beginning. Well, that's it, thank you for your attention.

And Zoom, it's fine, you're listening to me from the mic. So I have one small question: could you please summarize what we should do if we need to select the optimal number of topics? So basically, run as many models as possible and then select, as I understood correctly, the best one? We think that you should know beforehand, at least roughly, what number of topics you have in your collection; that is the best way. If you don't know how many topics you have, then it is probably best to just train topic models, conduct many experiments, in order to find a topic model which best describes your collection. Well, another way is to train many topic models and collect good interpretable topics, put them aside, and, thinking about how many topics you collected, you can see how many topics you have; so something like this: you start with some number of topics and start experimenting, making your topic model better, or collecting topics. Okay, do we have more questions? Yeah, thank you. Did you consider some classification-based experiments, where you apply the obtained representations to measure quality that way? No, we concentrated on just intrinsic quality measures, without trying to assess topic models by experts or by secondary tasks, because, well, if we try to select the number of topics based on secondary tasks, it obviously would produce better results, because we just find the number of topics which gives us the best results; we wanted to find out whether there is some natural number of topics which could be found by intrinsic quality measures. Yeah, I think that might be different for different classification tasks, or for different applications different granularities might be needed, so that might be kind of at the heart of this issue: you have a hierarchy of different granularities, maybe for retrieval you need one granularity, for authorship identification or something you need very specific things, so there might be just this general idea that everything in computer science depends on the application, and there is just no universal representation, at least in terms of topics, which always works well; like in clustering, you might have different views on the data. If you are trying to solve some secondary task, it is best to search for the number of topics based on that task. Do we have more questions? Yes, we do. No, no, Andrei, we have one question from Adrian and then yours. Thank you for the insightful talk. I have a couple of small technical questions. The data sets were really good, and, as I believe, in all those data sets you had the true number of classes, the topics; so how did you obtain the topics for RuWiki-Good? These are categories, probably, but what was the label set? Well, yes, it was crawled from Wikipedia; we knew the categories of each article, so that's how we found the number of topics. I mean, what were the topics, some categories from Wikipedia? Some large categories, yes, which are the topics we divided it into: there are articles which are marked as good, which are checked and, well, big, and there are several categories which these good articles belong to. So there was some manual post-processing for the label set? Yes, yes. The second question is also technical: what motivated the choice of the models in question,
right, PLSA, LDA, and different flavors of ARTM; I mean, the kingdom of topic models is rather large, so why these? Well, our main idea was to take just several topic models in order to exclude some, well, biases about particular topic models; we just wanted to take more than one topic model, so we took PLSA and LDA as the best-known and simplest approaches, and several other variations just to make it broader. I mean, wouldn't neural topic models contribute something else, or was it not really that important for the study? No, we didn't consider neural topic modeling because, as far as I understand, there is also a hyperparameter T for these models. Well, what we excluded is the models which also try to find this number of topics as a result, for example the hierarchical Dirichlet process; we excluded these models because they introduce additional hyperparameters which need to be optimized, and they are also unusual in the sense that they are assessed differently from the majority of topic models, which have just two matrices, phi and theta. But some neural topic models have T as a parameter, like ETM, for example. But anyway, thanks for the answer. Thanks for the question. Let's thank the speaker again, and now we have the last talk of the session.

Thank you. Everybody, welcome to this talk. This work was done in collaboration with Kazan Federal University; it is a part of a project on studying text complexity at different levels. This part is related to text complexity at the sentence level, and the other parts of this big project done in KFU are related to academic text complexity and also lexical complexity. The context of this project is that we want to do the prediction at the sentence level and at different levels, as I already discussed. Sentence complexity is a well-studied task, but in Russian it is not well studied, and there are limitations to the classical measures, so we tried not only the classical approach with features but also deep neural networks, to train them and measure the performance on this task. And one more thing: there are other languages, like Italian, English, and others, that already have data sets for sentence complexity; in Russian there was no such data set. A brief discussion of the work that people did before us: they collected data in different languages and trained some classical approaches, on lexical features and syntactic features as well, and deep learning methods, like pre-trained language models, and collected a lot of different data sets in this domain, including English and Italian. The first part of our project was collecting the data in order to run the evaluation. For the data set, we followed the methodology of the English and Italian data set collection, so we used the Toloka crowdsourcing platform and asked the workers to annotate the sentences according to seven levels of complexity. But we also wanted to experiment with different features, so we sampled the data set from the SynTagRus corpus, because it has syntactic annotations and other interesting things that we can make use of later. The sampling of the sentences was related to the frequency of the lexemes in the sentence: we tried to sample sentences of average frequency, not to sample too complex, too rare words and so on, as our colleagues did before, and each bin contained 200 sentences, so overall it's about 1,200 sentences. This is just a sample of the interface, of the UI: people were simply asked to pick one of these scores. There were 10 assessments
per sentence. And here is just an example of a sentence, an example of this thing. We collected data from people who are native speakers, with no other restrictions or anything like that. We collected the data; there is a slight imbalance in it. Here you can see the distribution of the scores in the data set, and it is biased towards more complex sentences, which kind of contradicts what we have in the other data sets, English and Italian, but it's still an interesting observation. Regarding the assessment of agreement, we have this distribution of complexities: the x-axis here is the sentence length, and the y-axis is the distribution of the average score, and of course there is clearly a correlation between the two parameters. We also measured the average number of people who agreed about the score of a sentence: out of 10 assessors, 4.3 on average agreed about the score. Of course, sentence length is a crucial parameter, it's important for assessing the complexity or the readability of a sentence, but it's not the only one. Previously, in our earlier work, we analyzed which features can be important, and, based on the features that are most correlated with the target, we built a simple linear regression; this work is a step further, where we try to push the performance, to get more quality out of this data, and in the second part of the talk there will be some modeling, classical and deep learning approaches. Here you can see the difference, the discrepancy, between the Italian, English, and Russian data: in Italian and English you have a relatively small number of complex sentences; the y-axis is the complexity score and the x-axis is again just the length of the sentence, and in those two data sets there are relatively fewer such instances. The same you can see here, a distribution of the complexity scores, where pink is Russian, this one is English, and this one is Italian: the average complexity is also different for the different languages. It can be due to sampling, of course, but maybe it's also related to some other properties, such as the average length of a word or the average length of a sentence, and maybe also to the annotators. Here you can see the distribution of these properties for the Russian, English, and Italian data sets, and here the frequency of lemmas and the length, so it's more or less similar; a discrepancy appears, but since we took the logarithmic frequency, after the normalization they look the same, so these three data sets are comparable, and you can run several experiments, several models, on this data: simple approaches based on linear regression, decision trees, and SVMs over classical features like TF-IDF matrices, and then we also tried a modern approach based on the BERT model, fine-tuning BERT, and also a graph-based neural network. The initial idea was to select features: we select three features from the feature set and build a linear model with three parameters; it gives not very nice results, the quality is quite low, the R2 score is quite low, and the MSE and MAE are not good either. The next idea, about BERT, is quite obvious, and I don't want to spend more time on it: just fine-tuning pre-trained RuBERT, Italian BERT, and an English BERT-base model.
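As an aside, that kind of simple feature-based baseline has roughly the shape sketched below; the two features (sentence length and mean log lemma frequency) and the toy numbers are assumptions for illustration only, not the feature set actually used in the study.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Toy data: (number of tokens, mean log frequency of lemmas) -> average complexity score (1..7)
X = np.array([[5, 4.2], [9, 3.8], [14, 3.5], [21, 3.1], [27, 2.9], [33, 2.6]])
y = np.array([1.5, 2.0, 3.1, 4.2, 5.0, 6.1])

model = LinearRegression().fit(X, y)
pred = model.predict(X)
print("R2:", round(r2_score(y, pred), 3), "MAE:", round(mean_absolute_error(y, pred), 3))
```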
Regarding the GNN, we make use of a model that takes the syntactic dependency tree, augmented with additional edges, and we also use features of the nodes of this dependency tree, provided by fastText. So it's a convolutional model: it takes the node features and applies graph-based convolutions in order to find a representation of each node, then we do pooling over the whole tree, and a final linear layer decides how complex the sentence is. The results of the fine-tuned BERTs: of course, as was expected, they were quite decent, and these are just good numbers, partly because of the large number of epochs. When we compare them to the GNN, SVM, and other models, we can see that, yeah, BERT-based models are much better for all languages, and linear regression sometimes just doesn't provide any reasonable result, maybe because of the number of features, who knows; and the GNN model is actually not that bad, compared to some other languages, maybe. The data set is available, and we are going to continue these studies in both cross-lingual and monolingual settings, maybe building a model that can be applied to several languages. And there is an interesting direction of research, where a group is working on lexical complexity: they want to measure the complexity of a word in context, and then, using this, you can measure the complexity of the context itself, not just sentences with simple or complex words, but simple or complex sentences with the same word, analyzing its complexity in context. That's probably it; I think I have some time for questions.

Please go ahead. I have one small question: why do the English and Italian data sets use a 7-point scale? Why do they use a 7-point scale instead of a 5-point one, for example? For me, 5 points would be the simplest choice if you are not satisfied with binary classification, because it's not the kind of task for binary classification; why not 5, why did you decide on 7? That's a good question; actually, I don't remember whether they explain it well in the paper. It's called the Likert scale, and that's usually from 1 to 5, and in this case it was 7, maybe because more fine-grained is better, I don't know; really good question.

Thank you for the great talk and for the data set, finally. The question that I want to ask is, when I put myself in the shoes of the annotator, yet again about the scale, well, it's sort of hard to choose, I believe, however well you explain each button, right? So I wonder if there are any gamified things, or some procedures where people annotate the complexity of the text implicitly, so that the data set can be derived from the procedure; is there anything of this sort? Something that comes to my mind is this eye-tracking thing, where people measure something indirectly, like how long and how many times you have these saccades or something like this when you read the text; there are such measures, and also I think in KFU there is an investigation of this kind, of how students in school read a text and how they perceive it, using eye-tracking devices. But for this crowdsourcing, I think the only thing to do is to increase the number of annotators and somehow control the output quality, like measuring how much time people spend annotating each sentence; it should be not less than, like, several seconds, maybe; so that's the kind of general approach. Thanks.

Okay, we have one more question in Zoom. Yes, please. Thanks for the talk. My question is also about the data set and also about putting myself in the shoes of the annotators: the obvious strategy for an annotator is, of course, to just label
long sentences as complex, since short sentences are not difficult. So did the guidelines for the annotators include any specific instructions about taking into account sentence length, or not taking into account sentence length? Yes, good question. There was a short guideline; actually, we provided examples of sentences that are hard and not hard, and we completely rely on the intuition of the annotator in this case. I think any guideline in this area, when you try to gather an assessment of complexity or difficulty, which is a purely perceptional thing, cannot be formulated or defined objectively, in my opinion, so we rely on the linguistic intuition of the assessors and minimize the guidelines as much as we can; and then, again, if you have enough assessors per sentence, on average you will get a good assessment, a good score, for the sentence. The guideline was very short, yeah. And the guidelines didn't mention the sentence length anywhere? No, no, no, it was not bound to any parameters or whatever; we just didn't want to push the assessors or bias them towards some attribute. Thanks, that answers the question.

Thank you for the talk. I have one small question: have you considered any relation of text complexity to linguistic acceptability, whether there is some connection between these two notions? Maybe text complexity could be assessed with the perplexity of models and other methods used for acceptability. No, we didn't measure this abstract relationship between these two things; what I was thinking about is the connection with grammaticality, or with some syntactic features, and that's why we are looking into syntax trees, because you can compute some syntax-based parameters from the sentence; the grammaticality, or some quality of the sentence, is of course related to complexity. Acceptability, in this case, well, I don't think we have a lot of data for Russian; maybe I just don't know about it; there is the RuCoLA dataset, RuCoLA, so it's maybe possible to do this; it's a good idea.

Just a very short comment question: do you think it makes sense, when annotators are given different sentences, to provide sentences of the same length, so that they are never biased by this very strong attribute? Maybe a model can learn better if you provide that. Yes, that's a good point; actually, it needs a different scheme of annotation: you give two sentences of the same length and ask the annotator to answer which one is more complex; maybe they then pay attention to more non-trivial features. We tried a similar thing in previous work: there was a classifier trained on pairs of sentences of the same length; but for the annotators here, we collected only the scores per sentence, so there could be a bias towards the length of the sentence. To avoid this, yes, we didn't do this, but the idea is quite obvious, and maybe we'll try to do it; right now the goal of the project is to develop a model that will just sample complex sentences and simple sentences. I'll try to make it short, because we're a little bit out of time: so, 10 minutes for a coffee break and then the next session. And it'll be... yeah, I see. Okay, do you hear me? Okay.

Hello everyone, my name is Alexey Andutov, I'm from Lomonosov Moscow State University, and this is my colleague. My name is Alexey Ukeshevich; we will present our research on document-level relation extraction in Russian. Excuse me, can I use this?
One, two, three, okay, let's start. Okay, what are we going to talk about today? First of all, why the task of information extraction and its sub-tasks is relevant; we will briefly review the relation extraction task and what the difference is between sentence-level and document-level relation extraction; I will show the problem of nested entities in this task; and we will discuss what models for document-level relation extraction are used. Okay, traditionally one of the most important tasks in NLP is information extraction, and one of its sub-tasks is relation extraction. This task has a broad range of applications, from creating and updating knowledge bases like Wikipedia and WordNet to structuring documents. Is it working? Sure, yeah, let's start; okay, can I continue? Yeah, okay, let's continue. Why is this work important? First, we address relation extraction with nested entities, without ignoring these entities and without information loss, and next, we focus on document-level relation extraction: it's a complex problem, it's crucial for understanding an entire document, and to date no studies focusing on the Russian language have been published. What does the task of relation extraction look like? Look at this example: given the sentence "the previous film festival was held from May 19 to 26", with the entities festival and date given, we should predict the type of relation, point in time in this case; a relation can also be represented as a triplet of subject, object, and relation type. But in this work we consider document-level relation extraction, so let's look at an example; we use an article about Konstantin Robinov. We consider two entities, a music band and Konstantin Robinov; at the same time, we see that Konstantin Robinov has several mentions, like Kuzio, which is a nickname of Robinov, and we should recognize the relation type, founded by, and also recognize the evidence, the supporting sentences that can help you to understand the relation type in this case. And what is the difference between the sentence level and the document level? First of all, in the first case you have a single sentence and usually only two entities involved, but in document-level relation extraction you have one entire document, entities can be mentioned in various forms, and you should also predict the evidence. Can someone from Zoom hear us? Yes. So we can continue, right? Or you haven't heard the three slides before; was it okay?
It was okay. Let's continue. We use the NEREL data set: each document contains annotations of entities and relations between spans of words, and there are about 30 different types of entities, like person or place, and about 50 different types of relations, like workplace, alternative name, or works as; importantly, some of these relations occur across sentences in the text. Okay, how can this task of relation extraction be solved? The baseline approach treats it as a classification task for a pair of entities in the text: for example, given a sentence, you mark two entities and apply some classifier. But we have some problems with more advanced approaches: for example, when we mask our entities, we have a problem with overlapping entities; such masking is used in some models like SpanBERT. One of the approaches solves this problem, and it's about tagging. In my previous work I showed that this problem can be solved by several approaches: first of all, decomposition into sub-tasks, each solved by a separate model; and the main approach is joint extraction, where we use a single model to extract all of the information in the text. In short, our results were that using a single model provides better quality, as it allows incorporating all knowledge into a single model. Okay, we took this into account and considered approaches that address relation extraction in a general sense. What approach did we use to solve the task at the document level? We started from the baseline BiLSTM that was mentioned in the DocRED paper: the text, with features like GloVe vectors, entity type vectors, and the coordinates of the entities in the text, is processed by a BiLSTM, and finally we aggregate the entity mention representations and classify with a multi-class classification layer. But in this task it's important to consider all entities together, and that seems a good idea: in the DocuNet-style approach, after encoding the text we build a relation matrix and use a U-Net architecture from computer vision approaches, and finally we can process this matrix and classify the output. In addition, we previously mentioned the concept of evidence, of supporting sentences, that can help to recognize the target relation; in our task, in the extended NEREL dataset, such labeling is also provided. This approach allows us to use the additional markup: beyond predicting the relationship, we also model the importance of each sentence in the text, and finally we can keep only the most important sentences and make predictions based on this reduced document; this approach is called the fusion approach. So we see that the best result was from the DREEAM and fusion approaches using a ruRoBERTa-large encoder. And also let's look at some of the main problems we encountered when working with models on the Russian dataset. In relation extraction at this level, you should process an entire document, and we use window-based text processing with overlapping windows. Okay, next, the second problem is the quadratic complexity with respect to the number of entities: this is because the number of relationships in the document is the square of the number of entities, and a lot of relations in the text are labeled as no relation, which is important; in cases when we are using some large models, it can help to randomly remove a fraction of these negative relations within the document. Okay, finally, we conducted the first benchmarks on document-level relation extraction in Russian, specifically on the NEREL dataset; we also evaluated models that achieve state-of-the-art results on the English benchmark; we explored the issue of nested entities and implemented some enhancements to process longer texts.
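The window-based processing mentioned above can be sketched roughly like this; the window size, overlap, and toy tokens are placeholders, not the values actually used in the experiments.

```python
def sliding_windows(tokens, window_size=512, overlap=128):
    """Split a long token sequence into overlapping windows, so a relation whose entity
    mentions fall near a window boundary is still seen together inside at least one window."""
    stride = window_size - overlap
    windows, start = [], 0
    while start < len(tokens):
        windows.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break
        start += stride
    return windows

doc = [f"tok{i}" for i in range(1200)]
print([len(w) for w in sliding_windows(doc)])  # [512, 512, 432], neighbouring windows share 128 tokens
```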
And we also see that balancing the negative relations helps optimize the training process in the task of document-level relation extraction. Okay, that's all. Any questions?

Your experiments were done only on the NEREL dataset; how would you estimate how generic this model is if you just go and start applying it to new text, let's say from newswire or internet websites, just for harvesting a very large database of relations about persons? Would it be accurate, or would the domain shift be very severe? Yeah, I'm sure that it's a general approach for relation extraction. Where we use these models, yeah, first of all we checked all the models that have the best results on the English benchmarks, like DocRED; we used an approach that can be applied to another dataset, like NEREL, and to another language, like Russian, so I think that it can be used for any domain or any area, but in some cases you can run into problems; I mean that you may need to optimize your model to process, for example, in real time, and so on. Okay, thank you very much. A second question? Let's then thank the speaker again, and now it's time for the second talk of the session.

Hello everyone, can you hear me? So, hello once again, my name is Marie, and today my colleague Alona, who is here via Zoom, online, and I are going to present our joint work on the way to controllable text summarization. However, today I'm more like a presenter, because the main research was performed by Alona, and she's the main contributor, so all credit goes to her, and I was more like a tutor and advisor. Unfortunately, Alona couldn't come today, but she's here with us, and after the talk she will be ready to answer your questions with me. So today we are going to talk about the application of a new promising approach, namely the HydraSum approach, to controllable text summarization in Russian, and the structure of the talk is the following: first we will talk about the motivation, then briefly discuss automatic text summarization approaches, then I'll give a brief overview of the original HydraSum method, and then we will switch to our research, talk about the data and the experiments, and of course we will discuss the results. Basically, the concept of controllable text generation brings an additional layer of flexibility and customization to the summarization process, and this controllability allows users to specify particular attributes of the desired text, such as length or style, for example; this customization enhances the user experience by providing summaries that align more closely with the information needs and individual preferences; moreover, it allows users to specify the level of compression, to make it easier to summarize the desired text. As for the objectives of this study, the main one was to investigate whether the multi-decoder architecture utilizing a transformer-based model, called HydraSum, is applicable to the Russian language, because it has shown great results for English, and whether HydraSum could produce more stylistically diverse or higher-quality summaries than the classic approach of fine-tuning a language model. This research lies in the field of natural language processing, with the focus on automatic text summarization.
Traditional methods of text summarization are mainly divided into two big groups, namely extractive and abstractive ones, and they provide no control abilities, producing texts which are not stylistically diverse. The HydraSum method, on the contrary, introduces some control. Briefly speaking, HydraSum is a mixture-of-experts architecture with multiple decoders, built on top of a pre-trained language model. As the base model the authors of the original paper used Facebook's BART-large, but they claim that it can be applied to any transformer model. In the HydraSum architecture the base model is extended to contain multiple decoders; the authors experiment with two- and three-decoder architectures, where each decoder captures different stylistic features of the input text. Each decoder has the full number of decoder blocks, but the parameters of the bottom layers are shared among decoders; this is done to minimize the number of additional parameters introduced into the model architecture. The top layers of the different decoders are trained independently, so each decoder can specialize and learn distinct representations to suit its specific style. An important component is the gating mechanism: basically, it produces a weighted sum of the K decoders' outputs and dynamically determines how much each decoder's output contributes to the overall result, enabling flexible decisions based on the weighted contributions. After the gating mechanism, the combined outputs are fed into a feed-forward layer followed by a softmax activation, which outputs the overall next-token probability; these steps assign weights to the decoder outputs, determining their relative importance. The authors provide three inference strategies: sampling from individual decoders, where one decoder is more abstractive and the other one more extractive; a mixture of decoders using the learned gating mechanism; and a mixture using a manually specified gating mechanism. To adapt the HydraSum architecture to the Russian language, we chose the classic summarization dataset known as the Gazeta dataset; recently it has been one of the most popular datasets for summarization tasks in Russian. The dataset consists of news articles and their summaries from the Gazeta news website, together with titles of the articles, dates, URLs and additional information; we also introduced two additional binary columns, the gate column and the sent column. As the base model we took mBART, a multilingual language model pre-trained on a massive corpus. Besides mBART we also trained three baselines, namely standard fine-tuning of transformer models like ruT5-base and ruGPT-3 small, and fine-tuning of mBART itself, because of course we wanted to compare the performance of the fine-tuned mBART with the mBART incorporated into the HydraSum architecture. To compare the performance we evaluated it using the classical metric, namely ROUGE scores, measuring the quality of the generated text with respect to reference summaries. Apart from ROUGE scores, we also measured summary-level metrics such as abstractiveness, specificity, length and readability. In two words, abstractiveness is measured with the help of two additional tools: coverage, which counts the proportion of words present in both the input text and the summary, and density, which counts the average length of the longest contiguous fragments copied from the input text.
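To make the abstractiveness metrics concrete, here is a small sketch of extractive coverage and density in the spirit of the Newsroom paper; the greedy fragment matching and whitespace tokenization below are simplifications, not the exact evaluation code used in this work.

```python
def extractive_fragments(article_tokens, summary_tokens):
    """Greedily find maximal token spans of the summary that are copied from the article."""
    fragments, i = [], 0
    while i < len(summary_tokens):
        best = 0
        for j in range(len(article_tokens)):
            k = 0
            while (i + k < len(summary_tokens) and j + k < len(article_tokens)
                   and summary_tokens[i + k] == article_tokens[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(summary_tokens[i:i + best])
            i += best
        else:
            i += 1
    return fragments

def coverage_and_density(article, summary):
    a, s = article.lower().split(), summary.lower().split()
    frags = extractive_fragments(a, s)
    coverage = sum(len(f) for f in frags) / max(len(s), 1)      # share of copied words
    density = sum(len(f) ** 2 for f in frags) / max(len(s), 1)  # avg length of copied spans
    return coverage, density
```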
These metrics for evaluating generated summaries were suggested in the paper "Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies". To measure the specificity of summaries, a dedicated specificity metric was used, and the length of the summaries is measured by two additional metrics, the length difference and the compression rate. Now on to the results. We see a significant difference between the performance of the two individual decoders: the first decoder, called decoder 0 here, provides longer summaries and has lower coverage and density; the second decoder, called decoder 1 here, is more extractive than decoder 0 and shows bigger coverage. The most extractive summaries were produced by ruT5 and mBART, and the most abstractive appear to be the reference ones, showing low coverage, which is predictable because these are the true answers. The mixture of decoders produces more abstractive summaries than each decoder individually. In terms of specificity, all generated summaries have results which are quite close to each other; this can be explained by the fact that all models were fine-tuned on the same dataset and therefore share the same vocabulary. However, the individual decoders of the HydraSum architecture have also shown different results on the specificity metric: decoder 0 generated summaries with the lowest specificity score among all summaries, whereas the highest results on this metric were shown by the mixture of decoders and the ruT5-based model. To sum up, in this work we studied the application of the HydraSum method to the Russian language. We found that the first decoder is more abstractive, the most extractive summaries were provided by ruT5, the mixture of decoders provided more abstractive summaries than the other models, and all models showed close results on the specificity metric. Oh, the time is up, and I am practically finished. In our experiments the HydraSum approach proved to be promising for the Russian language, and as part of future research we plan to train this model with more decoders and on a bigger and more diverse dataset to try to capture more stylistic features; moreover, it is important to try manually specifying the gating mechanism during the inference stage. So now my colleague Alona and I are ready to answer your questions. Alona, can you hear us, are you here? Yeah, I'm here, this is Alona. We have questions. Thank you for the talk, my question is about these metrics that you use for evaluating the decoders' behavior: is it possible to somehow adjust the behavior of the model, or fine-tune the model, towards more abstractive or more specific output, or something like this? As I understand from the last table, you just measured these metrics, but is it possible to change the output according to the required criteria? Is it clear? Alona, can you answer? Yeah, in fact, yes, because this is part of the future work which we are going to do: when you manually specify the gating mechanism, you can do it yourself, so you can assign weights and thus change the output to be more abstractive or more specific by adjusting the weights, by assigning the weights yourself. So it's a manual approach? Yeah, you can do it manually. Thank you very much for your talk. My question is about the mixture of two decoders: as I have seen from the results table, its results are lower, so can you please elaborate how this mixture of decoders was done, maybe I missed it, and why do you think the results are lower?
Just a second... yeah, so the mixture of decoders: it's when the model decides which output to generate, so it kind of samples from both decoders simultaneously. Why the result was lower... well, I may add: in the mixture of decoders we basically combine the probabilities from the two decoders, so we obtain a weighted probability, and here, as far as I remember, we used just the average of the probabilities from the two decoders. So maybe we should have experimented, and this is our future plan, with different decoder weights, so that for example the first or the second decoder is more important in the result. Yeah, thanks for the talk, just a quick question about this mixture of decoders: what about its computational requirements? I guess it's more compute-intensive than just using one decoder, right, so how much of a problem is it? I'm sorry, I didn't catch it. Yeah, so when you are using a mixture of decoders, when you sample from two decoders instead of one, I guess it's more compute-intensive; is it like twice as expensive, or is the dependency not linear, just how much of a problem is it? Well, it is expensive, it took a long time, much longer than training with one decoder, but it's not really a problem, because I did it in my Colab notebook; I spent some resources on it, but it was okay to do that. Well, I guess it depends on the size of the training or inference data, but my question is essentially whether the dependency is linear: is it the case that when you sample from two decoders it takes twice the amount of time or compute as when you sample from one, or is it more complicated? Well, the dependency is more complicated, because as we mentioned in the beginning, the decoders share the bottom layers, and that's why inference from the two decoders is not twice as expensive as sampling from one decoder. But I'm not ready to write out the exact dependency; it's more complicated, but it's less than twice, because they have shared parameters exactly to save computational resources during fine-tuning and inference. Right, thanks. Hello everyone, my name is Anna, and my work is dedicated to machine translation for the Russian-Khakas language pair. My goal is to present the results that we were able to achieve and also to try to explain how we did it, so that you can maybe repeat it on another language pair of your choice. The Khakas language is spoken in Russia by about 40,000 people and has a very limited amount of digitized data, so it is considered low-resource. There exist about 60,000 sentence pairs in Russian and Khakas, and they come from the TIL corpus of Turkic languages. You can see the results of training the baseline model exclusively on the Russian-Khakas data, and as you can see in the picture, everything below 10 BLEU is hard to get any sense of, while everything above 50 is considered high-quality, good translation. So the basic approach to improve the results of the model is transfer learning, which means you initialize the weights randomly, pre-train the model on a resource-rich language pair, then take these weights, initialize another model with them, and fine-tune it on the low-resource language pair.
So one of the biggest questions was which language to choose for pre-training. We wanted it to be Turkic as well, because we thought that might help the model to train well, and we also wanted it to be in Cyrillic script, because we wanted to use a shared vocabulary between the parent model and the child model. Here you can see the languages that meet these requirements and the sizes of the available corpora. As you can see, Kazakh has the largest corpus, but the problem with these corpora is that the data is mainly web-scraped and is not of very good quality in terms of translation; it also tends to lean towards certain domains, like news or government documents. So we decided to stick to the Chuvash language, mainly because it is manually aligned and checked, and it is actually the second largest, so for quantitative reasons as well. The preprocessing of the data didn't involve much for the parent data, because it was of good quality, and we just did some additional shuffling. The child data from the TIL corpus contains some junk symbols and random numbers; the Russian sentences were of good quality, but the Khakas sentences for some reason lacked punctuation altogether and also had some mistakes. Luckily, we had another corpus of 30,000 pairs of good quality, and they were the same sentences as in TIL, so we replaced the translations with those we were able to get from the electronic corpus of the Khakas language and thus improved the quality of half of the dataset. The next step was tokenization, and this was done by byte-pair encoding with dropout. For those who are not familiar with the technique, I will briefly explain: you split the sentences into characters and then set the number of merge operations, first gluing together the symbols that appear together most often in the text. On the left side you can see traditional byte-pair encoding, and the dropout technique is the same except that you skip some merge operations: for example, in the picture on the left, in the middle, the "r" and "e" are glued together, while on the right this merge is skipped and the word starts with "a t". This is a good way to augment the data, because when you apply the dropout several times to the source data and get different tokenizations of the sentences, you can assign the same translation to all of them and thus increase the amount of data you can train on. Here is some short information about the setup I used: I used the Transformer model in its classic form from the original article, with the standard optimization setup, and a shared vocabulary between the Chuvash and Khakas data. The experiments included increasing the share of the Khakas data in the vocabulary, from the same amount as Chuvash up to 1.5 times larger; another experiment involved dropout in byte-pair encoding; and the third was adjusting the maximum sequence length parameter, because it turned out that 99% of the parent data was about 100 tokens long and the child data about 75 tokens long, so we played with this parameter a little. It turned out that all of these experiments combined show the best results in the metrics, with quite a big improvement compared to the baseline. I also computed the chrF metric, which compares character n-grams rather than word n-grams and is sometimes considered better.
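As an illustration of the augmentation just described, here is a toy sketch of BPE with merge dropout; the merge table and dropout rate are made up for the example, and the real segmentation in this work was produced with standard BPE tooling trained on the parallel data.

```python
import random

def bpe_encode_with_dropout(word, merges, dropout=0.1, seed=None):
    """Apply BPE merges to a word, randomly skipping merges with probability `dropout`.

    `merges` is an ordered list of symbol pairs learned beforehand. Skipping merges
    yields different segmentations of the same word, which acts as data augmentation
    for the low-resource side of the corpus.
    """
    rng = random.Random(seed)
    symbols = list(word)
    for left, right in merges:                      # apply merges in learned order
        i = 0
        while i < len(symbols) - 1:
            if (symbols[i], symbols[i + 1]) == (left, right) and rng.random() >= dropout:
                symbols[i:i + 2] = [left + right]   # merge the adjacent pair
            else:
                i += 1
    return symbols

# Hypothetical merges; real ones come from BPE training on the corpus.
merges = [("r", "e"), ("a", "t"), ("re", "at")]
print(bpe_encode_with_dropout("repeat", merges, dropout=0.0, seed=0))
print(bpe_encode_with_dropout("repeat", merges, dropout=0.5, seed=1))
```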
Here I compared my results to other works on low-resource languages, but this is maybe not very representative, because there are a lot of factors that affect the result of the model, especially for different languages; it's just so you can see how it generally works. What is more representative are examples of the sentences that the model gives us. In the top example you can see that the meaning of the resulting translation is essentially the same as the reference translation, and in the bottom example you can see that the model still makes mistakes sometimes; here it basically missed the word "to study". Future work may include adjusting the number of byte-pair encoding merge operations, because there is a hypothesis that the more morphological the tokenization is, the better the model may train; the traditional way is also to expand the corpus, and we may also try another Turkic or non-Turkic language for the parent model. So this is all, I think, and I'm ready to answer your questions. Do we have a question, or do we have pre-submitted questions? Did you consider pre-training on some other languages to enable better transfer learning, let's say pre-training on other languages in an unsupervised or supervised way, which might boost performance for Khakas? Yes, of course, this is future work; we will try different languages and maybe just the NLLB model, so that we can compare the results. Do you have any idea which languages are more or less beneficial, because some might hurt and some might improve? Yes, actually I read some articles about this, and the main idea, I think, is to be similar in morphological structure, because for example when you translate from English to Russian, in English it will be the same word form for all the cases and things like that, and it is difficult for the model to catch those differences. So the more morphologically complex the language of the parent model, the better for the child model, since Khakas is morphologically very rich. But not necessarily the same linguistic family? No, actually there are studies that show that the size of the corpora makes the biggest difference, so people even pre-trained on, I don't know, Finnish and fine-tuned on Turkish, something like that, and they still managed to achieve good results. Yes, some colleagues had similar experiments, and that was counter-intuitive to me, that you can train on a completely different language but with a similar morphological structure and that actually gives a boost. Thank you. So, thank you for the talk, I have a small question: are you familiar with the Khakas language, or does one of your colleagues know it; how did you evaluate the results? I am not familiar with the Khakas language, but I know some Tatar and it is quite similar, so I could actually evaluate it visually somehow, but for the sake of science my translations were evaluated by native speakers whom I asked to do it. Thank you for the talk, I have a couple of technical questions; could you please, maybe I missed it, elaborate a bit more on what kind of model you used? Just the Transformer architecture with 6 encoder and 6 decoder layers. I see, OK. And for BLEU evaluation, what kind of tokenization did you use? It was the same tokenization as for training, the BPE one. Well, that is interesting, because maybe for the sake of getting a different view on the quality, you should try some raw tokenization, or, I am not sure about the existence of a morphological analyzer for the Khakas language,
but something like stemming or lemmatization and raw tokens, because this can give a different perspective on the evaluation; it's the same issue that was widely discussed at the time when character-level machine translation was popular, so it's a bit of a different thing and definitely worth comparing, I think. Yes, I think we can do that, thank you. Thank you for the talk, I have not a question but rather a suggestion: you said that one of the corpora was in another script, in Latin probably; maybe you could use some transliteration scheme and augment the data that way. Yes, as I said, we wanted it to be in Cyrillic because we used the shared vocabulary, but transliteration did cross our minds. I actually didn't find a good transliteration tool, because for example for the Turkish language it is very difficult, it has many rules, and transliterating it to Cyrillic is a complicated separate task, so I didn't find a tool and didn't go for this idea. As far as I know, for some languages specific transliteration schemes exist, so if someone has developed one it could be used. I will have to look at it, thank you. OK, more questions? Let me check Zoom. OK, we don't have questions in Zoom, so thank you very much, and the next talk will be online. Yes, that's true. Nice to meet you, I am Zaid Zavadek Sey, and I will be talking today about how difficult it is to make an adversarial attack on machine translation models. I have two co-authors who did most of the job, most of the experiments and the writing, and they worked with Sanxom, Pavel Burnyshov and Yelizmet Kostunov; we are participating in this project as well. So let's start with what we are doing and why. Basically, most of the models we see today in NLP are neural networks, and we often see vivid examples of how, if you know the right way to attack a model, you can make it do what you want, for example something prohibited by the model authors. So it seems like a good goal to examine how we can find the weak points of a model. Can you hear me? We can see the slides... let's wait a little bit... we in Zoom can see the slides, but we can't see them in the audience; let's wait a couple of minutes, this is where the hybrid conference starts. We are still waiting for the slides; the problem is on our end, we can hear you, but we just can't see. I guess it's some connection problem; I can try to share one more time, maybe it will be better this way. I suppose the problem is maybe that the presentation is not displayed on the screen for some reason, maybe something needs to be moved somewhere; we can all move to Zoom, I guess, we all have separate laptops, let's open it and join Zoom. Let's wait, probably it will be fine; I believe it has something to do with how the Zoom setup is arranged there on site. I think the problem is the display here: I can see the room people are in, and they see the screen... we see the screen now, you can see from my laptop what I see... yes, and both ways I see that this is not my presentation for some reason, I don't know. Then someone should check how Zoom looks on the laptop in the hall, in the room. Let's wait for the team to sort it out. OK, Lena, is the on-site team already working on it? Yeah, they are trying. OK, they are trying, cool, so there is still some hope; the work is going on, maybe they are restarting the computer, I don't know, sounds like a good option, but I can't see the videos of the online participants on the watch screen
and there we still see the same slides, the conference placeholder. Yes, we have the placeholder, basically. I don't know what the problem can be, but okay, let's probably leave that to the local team. What can I do for now, show my webcam, if you want? Yes, another camera; we can see the videos on the large screen. But maybe indeed it's easier if for now all the on-site participants simply join the Zoom room; we have 10 participants here... wait a moment... oh yeah, we see the slides now, very good. Can I continue now? Yes, please continue. So what is the problem: we want to find the vulnerable points of a model, and we can try to find them with adversarial attacks. Basically, adversarial attacks are a way to find these vulnerabilities: we try to find, in an efficient way, how we can break the model, how we can alter the output of the model with a small enough change of the input. For example, you can see the following idea here, an example of an attack at the bottom. We have some model that maps this input, "flat misguided comedy", to a label; in this case you can see that it is definitely a negative review. We try to change this input with as small a modification as possible: in this example we change the word "flat" to a similar one, which is basically a minimal change, almost the same text, and in this case the model outputs the positive label, meaning that we succeeded in breaking the model. The label is now incorrect, because it is definitely still not a positive review, but the model thinks it is positive. So the question is how we should design such attacks on common classification models for NLP, or, for example, on translation models. Maybe the main problem in this area is that we don't have differentiability: the search space for NLP is discrete, and we somehow have to move in this discrete space looking for adversarial changes of the input. Basically, what we can do is, for example, as in HotFlip, calculate the gradient with respect to a particular token in the input; with this first-order approximation of what is going on, we can find what to change, and how, to increase the loss the most, and this is a pretty successful strategy. We can also try to use generative language models; this was proposed in our previous paper. It is based on the idea that we should modify the loss function when designing an adversarial attack. Basically, we have an input sequence x1, x2, x3 and so on up to xD, where D is the length of the sequence, and we want to train a generative model to modify it, so the output is some x', that is x1', x2' and so on. What we want from this sequence we can express formally in the loss function: C here is a classifier, and we want the label of the output to be different from the label of the initial sequence, which we encode with the second term, the score of the classifier. We also want to stay close to the original: using a Levenshtein-type distance, or whatever other distance or similarity measure we have, we want the similarity of x' to the original to be high, in terms of how many tokens we change or in terms of some semantic metric. So we design this loss function, and we can backpropagate it through the softmax trick into the generator and make it generate adversarial examples.
This also transfers the problem from a discrete optimization problem to a continuous one, so we can try to adjust it. This idea is followed by another one that allows us to do practically the same thing, but for machine translation models. Basically, we try to find a change in the embedding space that leads to a small change of the input but a significant change in the output. We look at the dot product of the gradient of the adversarial loss and the embedding, try to find an embedding that is good, and using this embedding we decode back the adversarial sequence, hoping that it changes the output significantly. We can proceed in different ways: for example, we can use another loss function focused on BLEU, where BLEU for the initial sequence should stay high, so we should not change x significantly, but BLEU between the original translation y and the adversarial translation y' should be significantly different. We can then directly organize the flow of gradients through all parts of this model and train it to find good candidates for the adversarial attack. So it seems like a very natural approach, but does it work? Let's look at the results for the classification problem first: can we change the input with our generative model and obtain good results? The answer is yes. What we are looking at in this table: we have four problems and the accuracy for these problems before the attack; the accuracy scores are pretty high, for example 0.95 for AG. Then we see what happens after different attacks: there are some prior attacks, and the last two rows are different variants of our attack. We see that if we use our top-performing attack, DILMA with the deep Levenshtein distance, for some problems we get almost 0.5 accuracy, which means that we can almost break the model there, and we significantly decrease the quality of the model for the AG and FT problems. What else we should usually check is that the change in meaning is small and we don't change the semantics of the sentence much, and we can try to verify this by looking at the scores in the right part, the score of a discriminator. What is this discriminator classifier: we generate a sample of adversarial sentences with label 1, take natural sentences from our sample with label 0, and train a classifier on them. We see that in many cases this classifier is not very good against our attacks, meaning that the discriminator, even after training, can hardly distinguish adversarial examples from normal ones; it means that our generative attack works pretty well when we attack the classification problem. But let's go to the main goal of the paper and check whether we can do the same with an adversarial attack on machine translation models. Basically, we run the attack with different hyperparameter values that correspond to the power of the attack, how much we corrupt the output or the input, because these are two correlated things, how much we change the input and the output. What we do next is assess the quality of the attack: we look at a similarity score between the initial sentence x and the changed sentence x' after the attack, and between the original translation y and the translation y' after the attack.
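A rough sketch of the attack objective described here, using sentence-level BLEU from sacrebleu as the similarity on both the input and the output side; the candidate-generation step, the `translate` callable and the trade-off weight are placeholders rather than the authors' exact procedure.

```python
from sacrebleu import sentence_bleu

def attack_score(x, x_adv, y, y_adv, alpha=1.0):
    """Higher is better for the attacker: keep the input close, push the output away.

    sim_in  = BLEU(x_adv vs x)  should stay high (small input perturbation)
    sim_out = BLEU(y_adv vs y)  should become low (large change in the translation)
    """
    sim_in = sentence_bleu(x_adv, [x]).score
    sim_out = sentence_bleu(y_adv, [y]).score
    return sim_in - alpha * sim_out

def best_candidate(x, y, candidates, translate, alpha=1.0):
    """Pick the perturbed input whose translation changes most while x' stays close to x.

    `translate` is the attacked MT model (a callable str -> str); `candidates`
    are perturbed versions of x produced by whatever attack generates them.
    """
    scored = [(attack_score(x, c, y, translate(c), alpha), c) for c in candidates]
    return max(scored)[1]
```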
We want the first similarity to be large and the second one to be small for a successful attack, so we need to plot both of them. Let's do this and look at what is going on. We try different sorts of attacks; each point here is a different hyperparameter setting, and different colors correspond to different sorts of attacks. Basically, we want to be here, where we have high similarity, for example a BERT-based similarity, between the initial sequence x and x' after the attack, but low similarity between the original translation y and the translation after the attack. As you can see, for all metrics, although we want to be in that corner, we get only a small improvement: for all the classical methods the points lie only slightly below the diagonal, so the change in the input and the change in the output are pretty much the same. An additional experiment shows that sometimes we can still be pretty successful: we added another type of attack, character-based attacks, where we swap characters, and we also combine this with the gradients. We can see a new group of points here, again corresponding to different hyperparameters, where we generally change the input less than we change the output; it means that in this case we have actually attacked the machine translation models. We think this effect comes from the idea that we need to find blind spots of deep translation models, and this is not an easy task: we should focus on something that lies outside the training sample of such models, and swapping characters gets us there and lets us improve over the baselines. So basically it means that if we want to change a classification label, we can still find some change in the input that flips the label with a very small perturbation of the input. If we look at machine translation models, the situation is different: typical approaches don't work here, and what we can do is go down to the character level, and in that case we are pretty successful. So we can say that machine translation models are pretty robust, like many other sequence-to-sequence models, we think. That's all for my talk, do you have any questions? Any questions?
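As an aside, a toy version of the character-swap perturbation mentioned in the talk; in the paper the positions to change are guided by gradients, whereas here they are chosen at random purely for illustration.

```python
import random

def swap_characters(sentence, n_swaps=1, seed=0):
    """Swap n_swaps pairs of adjacent characters inside words of the sentence."""
    rng = random.Random(seed)
    chars = list(sentence)
    positions = [i for i in range(len(chars) - 1)
                 if chars[i].isalpha() and chars[i + 1].isalpha()]
    for i in rng.sample(positions, min(n_swaps, len(positions))):
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

print(swap_characters("flat misguided comedy", n_swaps=2))
```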
We don't have questions in the audience... actually, we have one question in the audience. I don't know if I understood you right, but do you check that the sentence that you change, for example the review that you showed in the beginning, that its actual meaning stayed negative while the label changed? I can show some examples from the paper, and you can see that typically the change is pretty small; if we showed this to a human, the human would still identify it as the same thing. In the first example, maybe, the word changes to something that is not very meaningful; a human would say it's a misprint, but after translation we see that the translation is quite different and the meaning is not the same, even for machine translation, for some successful examples. So you check that the meaning stays the same while the label changes? Actually, if we are talking about the first paper, we even did a human evaluation, and in most cases the meaning is pretty much the same. Okay, thank you. You're welcome. Okay, let's thank the speaker again, and the next talk will be online as well. So let me share my screen, just a second; let's hope it will be better than last time. Yeah, here, so can you see my screen? Can you hear me? Yes, we can hear you and see the screen. Good, everyone, let's start. The prevalence of online shopping has made it the foremost method for purchasing various goods, and online customer reviews play a crucial role in providing valuable insights into customers' interests and knowledge of the product. But how can we discover these insights automatically and formulate them? Our paper on user review summarization in Russian investigates this question. So what is user review summarization? The logical extension of text summarization is multi-document summarization, where we summarize multiple documents, and from multi-document summarization stems opinion, or user review, summarization, which uses the specifics of how human opinions are presented in various sources on the internet. Here on the slide you can see an example of user review summarization: the input consists of reviews covering different features of the entity, shown in different colors, and the summarization model should analyze the given reviews and produce a summary which covers all the reviewers' opinions. Researchers in the field explore both supervised and unsupervised settings, but while the supervised setting is widely used due to its effectiveness, it requires golden summaries for training, which can be extremely difficult and resource-intensive to produce. Moreover, the majority of the existing studies focus solely on English data and neglect non-English languages because they lack publicly accessible resources; this hampers the opportunity to broaden our understanding of the cross-linguistic properties of opinion summarization, and we try to change that by collecting a dataset. In recent years researchers have suggested a lot of methods for opinion summarization: some of them use automatic data aggregation, while others use weak supervision in the form of seed words for identifying the major topics; some articles suggest using information about such topics or aspects to build the summary. We experimented with the best-performing contemporary abstractive and extractive models from the articles on the screen, employing different methods which use sentiment and aspect information to create a summary. As our work focuses mainly on the data
and training, and not on the architectures, let us overview only the main ideas of these methods. The first method we review is PlanSum: it extracts sentiment and aspect distributions from the data, then fuses these distributions with the token embeddings, and the result of the fusion is fed into an LSTM decoder with attention, which composes the actual summary. Another method is AceSum; it does not have a colorful scheme, but essentially it uses a multiple-instance learning model from the authors' previous work to induce aspect controllers and then feeds them into the original transformer model; the controllers are calculated using weak supervision in the form of manually collected seed words. Another method is the Quantized Transformer, which uses a transformer encoder and decoder to quantize the sentences of the reviews, finds the average opinions as the most valuable aspects, and formulates the summary using existing sentences. The last work which we review in this project is the semantic encoding approach, which continues the work of the Quantized Transformer but builds a distribution over aspects, ranks them, and chooses the top elements as the basis of the summary. Now let's talk about the data. The standard datasets for opinion summarization in English are the Rotten Tomatoes, Yelp, Amazon, OpoSum and SPACE datasets, covering different domains. Unfortunately, to our knowledge there are no publicly available datasets for user review summarization in Russian, and the majority of Russian-language services do not allow the use of their data. Therefore the data was collected from an open internet source, TripAdvisor, with the help of web scraping, resembling the collection of the SPACE dataset. We collected around 1 million reviews of hotels from 11 cities and structured the data in a convenient way, as requested by one of the reviewers. The slide also shows the statistics of the collected dataset and some of the data distributions; as you can see, the statistics are fairly similar among the train, test and validation splits. The labeling process for the validation and test splits of our dataset involved manual annotation by us, the authors; similar to the SPACE dataset's 25 summaries per split, we created 28 summaries per split. We wanted to compare the performance of models trained on the collected Russian data and on the existing English datasets, which is why we followed the annotation procedure of the SPACE dataset. The set of aspects was taken from the AceSum paper, translated to Russian, and fixed for manual summary construction. Every hotel was labeled based on 50 reviews stratified by rating, to ensure that both low- and high-rating reviews are present in the data. We manually filtered the concepts and phrases corresponding to specific aspects and chose the most repetitive of them as the summary basis. We are well aware that human-written summaries might impose some restrictions on the operation of the models, and we leave this for further research. We chose models with the highest metric values on the SPACE and Amazon datasets which utilize aspect information, for a correct comparison on the Russian-language dataset. During our experiments we trained abstractive and extractive methods and compared their performance; some methods were not taken into consideration because they either utilize self-supervised rather than unsupervised learning, or propose novel algorithms which do not consider sentiment and aspect information. For our experiments we employed the ROUGE metric, because it is the standard metric in this area.
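For completeness, a minimal example of the kind of ROUGE evaluation used here, based on the rouge-score package; the sentences are invented placeholders, and note that the package's default tokenizer keeps only Latin alphanumerics, so for Russian summaries it has to be adapted or the texts preprocessed accordingly.

```python
from rouge_score import rouge_scorer

# Placeholder English sentences; for Russian the default tokenizer must be replaced,
# since it drops characters outside [a-z0-9].
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "the staff was polite and the rooms were clean"
generated = "rooms were clean but the breakfast was poor"

for name, s in scorer.score(reference, generated).items():
    print(name, round(s.fmeasure, 3))
```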
We also experimented with changing the methods' preprocessing and seed-word extraction steps, but as this did not lead to any improvements, we won't dwell on it here. Here you can see the results of evaluating the different models. All the models were trained on the collected data, but the models with the FT postfix were additionally fine-tuned on a small translated part of the SPACE dataset. While the evaluation on the collected dataset shows a clear dominance of fine-tuned AceSum, the evaluation on the SPACE dataset shows several well-performing models: among them fine-tuned AceSum, which shows the highest bigram ROUGE metric, and variations of PlanSum, which show higher unigram and longest-common-subsequence metrics. Having conducted a manual analysis of the produced summaries, which can also be found in the paper, we found that abstractive summaries contain more specifics from the reviews, while extractive summaries contain more general information; on the other hand, the Quantized Transformer and the semantic encoding extractive summarizers included personal information in their summaries, and PlanSum, which produces more human-readable texts, catches indirect relations to aspects, while AceSum described the fixed aspects more precisely but with limited lexical variety. So, in this work we explored the application of modern solutions to Russian-language user review summarization. We managed to collect around 1 million hotel reviews and annotated parts of the data for evaluation; we compared models from unsupervised and weakly supervised settings; the best-performing models among the approaches were adapted for Russian-language analysis and trained and fine-tuned to summarize opinions in the hotel domain. Our conclusion is that abstractive approaches outperform extractive approaches on the collected Russian dataset, in contrast to the findings for English data presented in the recent article proposing the semantic encoding model. Specifically, the best-performing model on our data is AceSum, and on the SPACE dataset it is the PlanSum variations. Further research may focus on the stylistic limitations of human-written summaries for better model performance, on coherence and readability analysis of the summaries and ways to improve them, and on the self-supervised methods which were excluded from the comparison, as some of them perform well on the standard datasets. So thank you very much for your attention; if you have any questions, you're welcome. We have one question. Thank you for the talk. About three slides ago you provided quite a cool summary of the specifics of each model and of their performance on the reviews you collected, and I wonder: some of these specifics clearly are measurable, so is it possible to measure all of these effects, and were you going to do that? First of all, these are the findings of the work we have done, so they may differ from more general conclusions; and secondly, it was measured manually, as I mentioned before, because the test and validation splits are actually not that big, so we managed to assess it manually. Could you please remind me what was the size of the dataset, the part of the dataset on which you were basing the conclusions? So, we followed the annotation of the SPACE dataset, which achieves good results; they had 25 hotel summaries, so for every hotel we have a review set of around 100 or 200 reviews, and for each hotel we created a
summary. So they had 25 summaries for the validation split and 25 for the test split, 50 summaries overall, and we had 28 summaries for each of the two splits, so 56 summaries in total. Thank you. You're welcome. Thank you very much for your talk; I was wondering, do you consider only hotel reviews now, or do you plan to work on other topics, or is it just hotels for now? I kind of missed it from the talk. So, as you can see, the standard datasets cover different domains, but we chose the SPACE dataset because it stated most clearly the procedure of collecting and annotating the reviews, and hotels were the easiest domain to start from; but of course it is a promising direction of the research to try different domains, which has actually been done for English data in various articles. OK, more questions? We don't have questions in the audience. The Zoom part? OK, I guess we don't have questions on Zoom... well, we do have questions. Yes, so, just a quick question, thanks for the talk. As far as I understand, you annotated the reviews, well, you wrote the summaries for the reviews, and the set of aspects was taken from one of the prior works; of course the choice of aspects affects the resulting summaries quite a lot, so I think if you chose another set of aspects your findings could be quite different. We actually explored this avenue: we tried to find seed-word extraction methods in order to compose a different set of aspects, different from the aspects from AceSum, and we tried different word embeddings and different methods described in different papers, but it didn't produce any better results according to the summarization metrics, which is why we stuck to the AceSum set of aspects. But if anyone could find a better way to extract aspects, this could be another avenue of the research. I guess that was not exactly what my question was about, because you evaluate your models on the summaries which were written by humans taking into account some set of aspects, so your gold standard would be different if the aspects were different, right? Yes, of course, the production of the summaries was based on the given set of aspects; the set of aspects was taken from the AceSum method, and they provided it because they collected it manually, from human-written summaries, and extracted the most valuable aspects. But if we had another set, we could also experiment with that. Yes, so my question was about your expectation: if, for example, your current set of aspects includes things like the cleanliness of the apartment, the food quality and so on, but you chose some completely different set of aspects, say the ease of finding the hotel, do you think your ranking, your results, would still be the same; what is your expectation? I understand it is quite difficult, because the model generates the summary based on what it was trained on, so I guess it could find different aspects, and it would give even lower metric values but perhaps a better summary, if that makes sense; so probably for that purpose ROUGE is not the best metric, but it's the one we used, because we had a fixed set of aspects. Thanks. Have a good lunch break. Yeah, and I guess we should announce that, please don't forget, we will announce the best paper during the closing session, so please make sure to attend it to learn which is the best paper in the NLP track. Yes, it's true. See you soon after the lunch break. See you. Thank you.
Hello? Can you hear us? Hello? Hello. Can I try to share the screen? Yes. Perfect. We can see you, we can see your presentation. I think it's time, so are you ready to start? OK, so let me just first introduce you. Thank you for coming. Artem was supposed to be here, but unfortunately, for some personal reasons, he had to cancel his trip at the last minute. Nevertheless, it's a great pleasure for me to present another very interesting speaker from the Emirates today. I have known Artem since 2019, and I should say that he has always amazed me by his energy and dedication to NLP research and to the topic he is going to present today. Over the years I have seen how Artem emerged as a leader in this field: he has worked a lot on active learning and uncertainty estimation techniques, published extensively at the very top conferences, and continues to do so, so his track record in NLP is really impressive. Today we will learn about this very important topic of uncertainty quantification for generation in NLP and, more specifically, for LLMs. So without further ado, Artem, please go ahead. Yeah, thank you very much for the kind words. I really apologize that I cannot be there in person because of health issues, but never mind, let's dive into a very interesting topic. I currently mostly work on the safety of NLP models, in several ways, and yeah, let's go ahead. So we'll have some introduction and some uncertainty estimation background; this talk will be focused on uncertainty estimation, and especially on uncertainty estimation for language models. I will also present our latest effort on summarizing and systematizing these diverse works on uncertainty estimation for language models in one package, in one Python library, and finally we will conclude with some suggestions. All right, so first of all, we know that language models are becoming very good at multiple tasks, and basically, regardless of what task you pick, you'll find something like this on Papers with Code: we approach or exceed the human baseline. You can see that almost every task might be considered solved, or at least most of the middle-complexity tasks are solved. But plain performance such as accuracy is not the only thing that we want to pursue; we also want safety, and this safety is related to two other aspects: bias and reliability. So let's start with bias.
So what we usually have: we have some data with some problems, maybe some garbage and maybe some biases, spurious correlations between the target variable and some feature that shouldn't really be taken into account. We train the model and we get, of course, a biased model that picks up on these particular dependencies and uses them to make its predictions, which is, of course, not good. We can think about many biases: for example, if you classify a person's mood from a picture and you pick up the color of their eyes as a feature, that's an example of a bias which you shouldn't take into account, because of course the color of the eyes doesn't matter for predicting mood; but due to your data, you can pick it up as a viable feature and use it in the future. So what we want in debiasing and fairness is: we have garbage data with some biases, and we want to end up with an unbiased, good model. How can we do this? One important case of bias in NLP which is usually considered is social biases and stereotypes, for example depending on gender, race or age. Here is one example where we try to determine the occupation of a person, whether a person is a surgeon or a nurse, and we can find that sometimes classifiers just look at the gender of the person and put all women into the nurse category and all men into the surgeon category, which is completely unfair and should of course be avoided in practice, because if we do this, we will reinforce the biases in the data and our social stereotypes. So how can we deal with this? Well, there are debiasing techniques that reduce the effect of bias in the original training data. First, we can filter out or reweight such instances in the training data, so that the signal from the training loss does not carry this spurious correlation between the protected attribute and the target variable. Another option is to try to unlearn, or forget, some attributes in our feature representations: here, for example, we have an additional loss component that helps the model forget about gender, race and other attributes, so that these protected features cannot be predicted from the feature representations, and eventually the model becomes less biased, more balanced. All right, that's how we deal with bias; now let's go to reliability. So what is reliability and why should we care? First of all, let's look at this example. As you can see, there is a banking application where a person asks for their balance, and this banking application works perfectly fine when the questions are related to the topic. But when a person asks something unrelated, for example about sports, it tries to answer the request the way it was trained to, and it fails: it answers gibberish, something that shouldn't be answered. This is an example of an out-of-distribution question, where a person asks something from a domain which is not part of the training domain of the model. Instead, the model should probably say: sorry, I can't answer your question, because it's not my domain of knowledge. So that's one example of out-of-distribution input. Another example, a little more safety-critical, is diagnosis in medicine, where we have, for example, some symptoms or the anamnesis of a person, and your model has to make a diagnosis.
This might affect the treatment of the patient in the future. Of course, there might be some out-of-domain inputs that the model should try to avoid, asking a person for help instead. But there can also be situations where we have very close diseases, like SARS and COVID, and it is very difficult to distinguish between them with the particular amount of information we have; we should probably defer to a physician, a real person, or another more complex system to make the decision, instead of making a diagnosis that might harm the patient. So those are two examples. Let's go ahead. In reliability, we always have to remember that model capabilities are very limited: first, due to the limited amount of data the model was trained on; there is always a training dataset, it is limited, and going beyond this dataset always carries the risk of making the model's answers completely unreliable. There are also situations where our task has ambiguities, like SARS versus COVID, where it is very difficult to distinguish the two diseases given the lack of information, the lack of features. So this second thing, ambiguity in the task, is another concern: we have to spot these areas of ambiguity and also abstain from making decisions in them. What we need to do to achieve reliability is to develop mechanisms that take these limitations into account when the model is deployed, and reliability is basically the capability of our system to apply these mechanisms well: to detect out-of-distribution instances, detect ambiguity, and many other things, maybe including adversarial attack detection. OK, so how to handle reliability, how to make your model reliable? This slide is a little bit of marketing, because basically we are promoting some of our work published at ACL in 2022 and 2023 and at other conferences: there are some solutions to the reliability problem in terms of selective classification and out-of-distribution detection. The last one will be published at ACL this year; it is at the intersection between debiasing and uncertainty estimation. But of course there are many more works on uncertainty for NLP, and we will discuss them later; this is just a small slide about our work. OK, so what new concerns do we have today? Today we are in the generative AI era, and in the generative AI era models can generate something inappropriate, something that shouldn't be generated. In this particular case with ChatGPT, there is an example where it gives an incorrect answer: we ask how many letters there are in the word "nineteen", and the answer is incorrect; then we try to confuse the model even more, and it becomes even more confused. This happens all the time with language models, and be sure, I checked this example before this lecture: the first issue was fixed, but the second one is not; you can still convince the model that there is any number of letters in the word "nineteen". Another example, and I apologize to Alex Panchenko for this, is a question about his biography: tell me about Alex Panchenko, and the model hallucinates some facts about his biography and very confidently says that he is a professor. I mean, he probably will be in the near future, but right now he is not a full professor in Hamburg; yet the model is pretty convincing in stating false facts about Alex. And that happens all the time with these kinds of models. So what concerns do we have in the area of generative AI?
So, models can generate unacceptable output, and we should first consider hallucinations, the false facts they generate. We should also consider toxicity, because toxic texts can leak into the training data and the model can reproduce them, and we should avoid that. We should also care about truthfulness, because sometimes the model cannot do the task and it should be quite explicit about that: I cannot do this; it should be truthful. One more thing which is commonly taken into consideration is personality: the model should not pretend to be somebody else, it should be explicit that it is a model, an assistant, not some sort of human professor giving you the answer. So how do we deal with these problems? In general, there are a few approaches that people currently apply to all these language models, GPT, ChatGPT and others. There is reinforcement learning, of course, where the model is trained on a large number of assessor scores, and these scores are essentially aimed at reducing bias, toxicity and many other issues with the model. There are other approaches, like contrastive learning; you can also just fine-tune on good answers that people craft themselves; and you can do some filtering of the input data or output results with a stop list and similar tools. One of the colleagues who was visiting us was from Microsoft and said that a stop list is the way to go with language models; it's a very good thing, it works much better than anything else. Fact checking is another stream of work, where people look at the output of the model and do some fact checking against databases. And finally, what I think is one of the most important and most interesting directions for me, is uncertainty estimation, and the question of whether we can apply uncertainty estimation to language models. All right, first let's go through the uncertainty estimation background, I hope pretty fast. Again, let's consider one small example: consider we have a classifier trained to distinguish cats and dogs, and it works perfectly when it has a perfect picture of a cat or a perfect picture of a dog; it predicts what it was intended to predict. But what we have here is something weird that cannot be classified into the cat or the dog class. Here the model should probably output something around 0.5, to demonstrate that this is something weird, something highly uncertain, and maybe you should call a person or something else to deal with this image. But unfortunately, the probability scores obtained through softmax classifiers are not very good for this. The reason is here: the softmax probability can be used as an uncertainty score, but it gives you uncertainty like this: in most of the space you are completely certain, here and here, and you are only uncertain in this small decision region between the two classes. That is what we get from a softmax classifier. But in our example the weird cat-or-dog image can appear anywhere, for instance here or here, and in those areas the model will be completely certain and will predict a particular class. So overconfidence is the problem of softmax probabilities in uncertainty estimation. What we really want is something like the picture on the right: wherever we have no particular evidence and no training data, we are uncertain, and we are also uncertain in the area between the two classes. That's what we want to have.
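A small synthetic sketch of the overconfidence effect just described: a linear softmax classifier trained on two well-separated blobs assigns a near-certain label to a point far from all training data; the data here is invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# two well-separated training blobs ("cats" and "dogs")
X = np.vstack([rng.normal([-2, 0], 0.5, (100, 2)),
               rng.normal([2, 0], 0.5, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

clf = LogisticRegression().fit(X, y)

near_boundary = [[0.0, 0.0]]   # between the classes: low confidence, as expected
far_away = [[50.0, 50.0]]      # nothing like the training data

print(clf.predict_proba(near_boundary))  # roughly [0.5, 0.5]
print(clf.predict_proba(far_away))       # close to [0, 1]: confident despite no evidence
```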
and where it is also uncertain in the area between the two classes. That is what we want. So that is why we cannot simply use softmax probabilities; well, we can, but often they do not do what we want. OK, so what is uncertainty exactly? There is no single unified way of specifying uncertainty scores. It can be anything that works well for our task, such as out-of-distribution detection or selective classification: some sort of distance, a probability, entropy, or an estimate of the error. In Bayesian theory, however, there is a particular way of defining uncertainty, namely the entropy of a probability distribution. We have training data, and we can say that uncertainty is the entropy of the predictive distribution, where this predictive distribution is parameterized by a neural network. If we are Bayesian, we also put a prior on the parameters, so they have a distribution as well, and then we can rewrite the predictive distribution by marginalizing over the parameters. We substitute that into the predictive entropy formula and obtain the total uncertainty. Of course, we could try to do everything properly in a Bayesian way, with variational inference and so on, but usually we do not have a practical way to do this: Bayesian neural networks are very hard to train and have several drawbacks, such as storing additional weights. So let's go back to the types of uncertainty and look at our classifier again. In one situation we have plenty of data and a vanilla classifier that tries to separate two classes, but in a small area between the classes they overlap. This is an example of high aleatoric uncertainty: there is noise in the data, and we cannot separate the classes simply by drawing a line. So aleatoric uncertainty is related to noise and ambiguity in the data. In another situation we lack data, and we have the freedom to draw the linear classifier in various ways that all separate the two classes perfectly. But the situation is still ambiguous, because this is only a training set; at test time a point can fall in the region between them and could be attributed to either class. In this case we say we have high epistemic uncertainty, which is related to lack of knowledge: we do not have enough data to pin down the decision boundary. By definition, the whole predictive uncertainty is the sum of epistemic and aleatoric uncertainty, and from the previous formulas we can derive expressions for both. First, epistemic uncertainty. Epistemic uncertainty is, as I said, a lack of information, so we define it as the information we gain about the model parameters after we see the target variable y for the object x, that is, how our knowledge about the model parameters changes after we get this particular instance.
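Since the formulas on the slides are only described verbally here, this is a reconstruction of the standard Bayesian quantities being referred to (the notation is mine, not taken from the slides):

```latex
% Predictive distribution: marginalise the parameter posterior
p(y \mid x, \mathcal{D}) = \int p(y \mid x, \theta)\, p(\theta \mid \mathcal{D})\, d\theta

% Total (predictive) uncertainty: entropy of the predictive distribution
\mathrm{TU}(x) = \mathcal{H}\!\left[ p(y \mid x, \mathcal{D}) \right]
  = -\sum_{c} p(y = c \mid x, \mathcal{D}) \log p(y = c \mid x, \mathcal{D})

% Decomposition into epistemic (mutual information) and aleatoric (expected entropy) parts
\underbrace{\mathcal{H}\!\left[ p(y \mid x, \mathcal{D}) \right]}_{\text{total}}
 = \underbrace{\mathcal{I}\!\left[ y, \theta \mid x, \mathcal{D} \right]}_{\text{epistemic}}
 + \underbrace{\mathbb{E}_{p(\theta \mid \mathcal{D})}\, \mathcal{H}\!\left[ p(y \mid x, \theta) \right]}_{\text{aleatoric}}
```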
So if we rewrite this a little, we end up with a formula where we take the predictive uncertainty and subtract a term that is the expectation of the entropy for each particular realization of the model parameters. And what is aleatoric uncertainty then? By definition, predictive uncertainty is the sum of epistemic and aleatoric uncertainty, so we can simply derive the formula for aleatoric uncertainty: it is the expectation, over realizations of the model parameters, of the entropy of the predictive distribution. Again, in Bayesian theory we have exact formulas for these quantities, but in practice we cannot estimate them directly. There are several ways to do it in Bayesian modeling, such as Bayes by backprop or variational inference, but usually people do not, because it adds complexity and overhead, and Bayesian models are hard to train. So, looking at this formula again: predictive uncertainty is the sum of epistemic and aleatoric uncertainty, epistemic uncertainty is the mutual information between the model parameters and the target variable, and aleatoric uncertainty is the expectation of the entropy over realizations of the model parameters. To see how this looks, take the two-moons dataset. Aleatoric uncertainty concentrates along the decision boundary between the two classes. Epistemic uncertainty covers the whole area around the training data. Summing the two, we get the predictive uncertainty, which is high both away from the training data and between the two classes. A few words about where the different types of uncertainty are applied. Aleatoric uncertainty can be used to clean data, to spot noise in datasets. Epistemic uncertainty can be used to detect outliers, of course for out-of-distribution detection, and also as a selection criterion in active learning. Total uncertainty is used for selective classification, which is the most critical thing for safety-critical applications where we need to abstain from making a decision, for example in medical applications. All right, let's have a very broad overview of methods for uncertainty estimation; we will get to particular realizations when we discuss methods for language models. In general, a very strong approach to uncertainty estimation is to use an ensemble of models and look at the diversity of predictions across the ensemble members: if the diversity is high, the uncertainty is high. You can do the same thing with Monte Carlo dropout. Monte Carlo dropout is a technique where you make not one but multiple predictions with the dropout layers enabled, so each prediction differs slightly from the others. You can think of it as a cheaper version of an ensemble: you do not need to train multiple models, you have one model but apply different dropout masks and get slightly different predictions. Again, we look at the diversity of predictions: if the diversity is high, the uncertainty is high.
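To make this concrete, here is a minimal sketch of the Monte Carlo dropout recipe just described, assuming a PyTorch classifier with dropout layers; the function and names are illustrative, not the speaker's actual implementation.

```python
import torch
import torch.nn.functional as F

def mc_dropout_uncertainty(model, x, n_samples=20):
    """Approximate total/aleatoric/epistemic uncertainty with MC dropout.
    model: any classifier with dropout layers that returns logits; x: a batch of inputs."""
    model.train()  # keep dropout active at inference time
    with torch.no_grad():
        # (n_samples, batch, classes): one stochastic forward pass per dropout mask
        probs = torch.stack([F.softmax(model(x), dim=-1) for _ in range(n_samples)])

    mean_probs = probs.mean(dim=0)                                            # predictive distribution
    total = -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(-1)         # entropy of the mean
    aleatoric = -(probs * probs.clamp_min(1e-12).log()).sum(-1).mean(0)       # mean of per-sample entropies
    epistemic = total - aleatoric                                             # mutual information term
    return total, aleatoric, epistemic
```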
There are also density-based methods, where we approximate the training distribution with some model built on the latent feature representations. We essentially model the distribution of the training data and check whether our instance belongs to it or not. And finally, there are techniques based on training loss regularization: you add a regularizer to the loss, which helps calibrate the model a little and can also help in selective classification. OK, so why is uncertainty estimation for language models hard? First, we have not one but multiple predictions in a sequence. These decisions depend on the sampling algorithm; we may obtain the final sequence not by greedily taking the maximum-probability token but by keeping multiple beams and selecting one at the end. And we should remember that the predictions of language models are hard to normalize: you can generate an infinite number of possible sequences, so how do you estimate the probability of each sequence and normalize over them? This is a very hot, ongoing research topic right now. OK, finally we get to uncertainty estimation for language models. The simplest techniques are very similar to what we have for standard classification models. We can look, for example, at the maximum sequence probability: we estimate the probability of the sequence by multiplying the probabilities of each decision in the sequence, and subtract it from one to get an uncertainty score. Another very common score is the average of the log probabilities of the tokens; sometimes it is called perplexity, sometimes the normalized log probability, depending on whether you exponentiate or not. You can also compute the entropy at each token position and average it. You can think of many other ways of aggregating the individual token-level predictions, but these are the most common ones in the literature. Now to slightly more interesting stuff: pointwise mutual information. This is a way of addressing the fact that the model sometimes outputs generic text, and the uncertainty of this generic text may not be that important: if the output is very generic, maybe we should not be so concerned about it. So there is an approach that corrects for this with an additional term: we run the model with the query and run the model without the query, and look at the probability of the sequence without any particular input, that is, how generic it is. A slightly more elaborate variant is conditional pointwise mutual information. The idea is to mostly fall back to simple perplexity, but when we are very uncertain, we also look at the probability of the sequence under the generic, input-free model. So most of the time we use the first term, and only when the entropy is high enough do we add the term from the generic model. This is faster than plain PMI and also works a little better.
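As an illustration of the information-based scores just listed, here is a minimal sketch assuming we already have the per-token log-probabilities of a generated sequence (from any autoregressive LM); the function and field names are mine.

```python
import math

def sequence_uncertainty_scores(token_logprobs, token_distributions=None):
    """token_logprobs: list of log p(y_t | y_<t, x) for the generated tokens.
    token_distributions: optional list of full per-step probability vectors.
    Higher score = more uncertain."""
    n = len(token_logprobs)
    log_p_seq = sum(token_logprobs)

    scores = {
        # one minus the maximum sequence probability (product of token probabilities)
        "msp": 1.0 - math.exp(log_p_seq),
        # length-normalised negative log-likelihood; its exponent is the perplexity
        "mean_nll": -log_p_seq / n,
        "perplexity": math.exp(-log_p_seq / n),
    }
    if token_distributions is not None:
        # mean per-token entropy over the vocabulary
        scores["mean_token_entropy"] = sum(
            -sum(p * math.log(p) for p in dist if p > 0) for dist in token_distributions
        ) / n
    return scores
```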
Okay, ensembling. I will not spend too much time on ensembles here, because compared with classification, ensembling in generation does not work as well. In classification, ensembling is one of the most reliable techniques: usually, if an ensemble does not give you any improvement, probably nothing will. But in sequence generation, according to our experiments, ensembles do not perform that well. Anyway, with an ensemble we can do the same things as before: look at the average probability across the ensemble members, or the average log probability. There is also a stronger technique suggested by Malinin and Gales, reverse mutual information, where we also look at the log probabilities of each individual ensemble member. Note that you could use some form of Monte Carlo dropout here as well, but MC dropout in sequence generation works a bit worse, and you probably need to keep the same dropout mask across the generation of a sequence to keep the scores meaningful. All right, density-based methods. These are quite strong methods, especially for out-of-distribution detection, and there are two papers that propose these techniques for out-of-distribution detection in sequence generation tasks: our work, published at ACL 2023, and very similar concurrent work published at ICLR 2023. The idea of using density-based techniques in sequence generation is basically the same as in classification, where we use the so-called Mahalanobis distance as an uncertainty score. The idea is to estimate the conditional probability of a particular instance belonging to some class in the training data; for example, the probability of x belonging to the yellow class. We say that this probability is basically a normal distribution centered at the centroid of the class, with a covariance matrix sigma that can be computed either from that class alone or from the whole dataset; there are different ways of computing it. If you do the derivation correctly, you end up with a distance score: essentially the distance between the centroid of the class and the data point whose uncertainty you are estimating. The higher this distance is, the more uncertain you are, because the point lies far from the training data. Of course, in sequence generation tasks you do not have classes, so we use a single class containing all the training data: we compute its centroid and covariance matrix, and then estimate the distance between the centroid and the instance in question. So what is h here? h is the representation of the instance. In sequence-to-sequence models we have two options, the encoder and the decoder, and we tested both; they work pretty well. You can take the representations of the input tokens and average them, which gives you an embedding of the input query, or you can do the same with the output: take the embeddings of all output tokens and average them to get an embedding of the output sequence.
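As a sketch of the Mahalanobis-distance score just described, assuming the averaged encoder or decoder hidden states have already been extracted (function names are mine, not the paper's code):

```python
import numpy as np

def fit_mahalanobis(train_embeddings):
    """train_embeddings: (N, d) averaged hidden states of the training data.
    For generation there is a single 'class', so one centroid and one covariance matrix."""
    mu = train_embeddings.mean(axis=0)
    centered = train_embeddings - mu
    sigma = centered.T @ centered / len(train_embeddings)
    sigma_inv = np.linalg.pinv(sigma)            # pseudo-inverse for numerical stability
    return mu, sigma_inv

def mahalanobis_score(h, mu, sigma_inv):
    """Uncertainty of one instance embedding h: distance to the training centroid.
    Larger = further from the training data = more uncertain."""
    diff = h - mu
    return float(diff @ sigma_inv @ diff)
```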
So then we compute the covariance matrix and the centroid from our training data and determine whether an instance is out of distribution or not. Pretty simple. There are two modifications of this. One is basically Mahalanobis distance plus PCA plus the minimum covariance determinant. PCA helps reduce the dimensionality of the representations and the effect of outliers, and the minimum covariance determinant helps filter out noise when computing the covariance matrix. The other option is the relative Mahalanobis distance, where we compute the distance for our particular point and subtract the distance to the centroid of some background collection. In this case we take a large background collection, for example the C4 dataset, compute its centroid, and look at how close our instance is to it. The idea is that we do not want to be highly uncertain for very common sequences, very common queries, very common outputs. So we are uncertain only when we are far both from the training data and from the pre-training-like data. Okay. All right, semantic entropy. But first let me wrap up the density-based methods: they are very good for out-of-distribution detection, and they are task-specific uncertainty estimation techniques. Especially for sequence-to-sequence models, they perform well on machine translation, summarization, and question answering; you can find experiments in our paper and in the concurrent work by our colleagues. All right, semantic entropy is more interesting. You probably know that a model can generate multiple sequences that are similar in meaning but very different in surface realization. You can ask who the president of the United States is, and the model can answer "George Bush", or "the president of the United States is George Bush", or "George Bush is the United States president", and so on. These are essentially the same thing, and we want to take that similarity into account. Semantic entropy does exactly that. It samples several predictions from the language model, say A1 to AM, clusters them into meanings, and then estimates the entropy over these meaning clusters rather than over the individual sequences. Each meaning cluster can contain multiple generations; in the president example, they can all be put into one cluster. To compute the probability of each meaning, the authors simply sum the probabilities of the individual sequences in the cluster, and then compute the standard entropy formula over the probabilities of the meanings. This works pretty well, but, surprisingly, black-box methods work even better. Semantic entropy needs access to the logits and probabilities of the language model, but often, actually usually, models like ChatGPT, GPT-4, and many other APIs do not give you access to embeddings or logits. Then all you can do is analyze the outputs. So what you do is again sample multiple outputs from the language model, and then compute pairwise similarities between these responses.
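Before moving to the black-box scores, here is a minimal sketch of the semantic entropy recipe just described; the sampled answers, their sequence probabilities, and the entails(a, b) check (a stand-in for an off-the-shelf NLI model) are all assumed inputs.

```python
import math

def semantic_entropy(answers, seq_probs, entails):
    """answers: sampled generations; seq_probs: their (possibly unnormalised) sequence probabilities;
    entails(a, b): placeholder for an NLI check that a entails b.
    Answers are grouped into meaning clusters by bidirectional entailment, then entropy is
    computed over cluster probabilities rather than over individual sequences."""
    clusters = []                                  # each cluster: (representative answer, total probability mass)
    for ans, p in zip(answers, seq_probs):
        for i, (rep, mass) in enumerate(clusters):
            if entails(ans, rep) and entails(rep, ans):   # same meaning in both directions
                clusters[i] = (rep, mass + p)
                break
        else:
            clusters.append((ans, p))

    total = sum(mass for _, mass in clusters)
    cluster_probs = [mass / total for _, mass in clusters]
    return -sum(q * math.log(q) for q in cluster_probs if q > 0)
```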
Then you can compute an uncertainty score on top of that. First, the authors of this paper tried several pairwise similarity measures. One of them is simply Jaccard similarity, where you look at the bag-of-words overlap between two answers; surprisingly, it works. Another, more elaborate option is to use an off-the-shelf NLI model, such as DeBERTa, for entailment: you take two outputs and determine whether one entails the other. If each entails the other, you say that they are essentially the same thing, that they are similar. That is the second, more effective way of computing similarity. Now to the uncertainty scores. Again, a surprisingly simple score is the number of semantic sets: you compute the pairwise similarities, cluster the outputs into semantic clusters, and count them. If the number of clusters is high, you are uncertain; if it is low, you are certain. That's basically it. This works, but there are better versions. The second version is to run spectral clustering on top of the similarities: instead of a hard adjacency matrix you build a soft one, with the similarity score of each output sequence to every other, and then you essentially look at the number of clusters that spectral clustering produces. Another, simpler method, which in our experience is actually a bit more effective, is the average pairwise distance: you construct the same matrix, look at the pairwise similarities between output sequences, and average them to get an uncertainty score. If the average pairwise distance is high, you are uncertain; if it is low, you are certain. These two things work pretty well, and, again, I want to emphasize that in this case you do not need access to the model internals: you can apply these methods to ChatGPT, GPT-4, anything that has an API. All right, finally, a method called P(True), from the Anthropic paper, which basically says: why not ask the model directly? And that is what they did. They evaluated it on multiple-choice question answering: they provide the question and the model's own answer, and then ask the model whether the answer is true or not. Surprisingly, the model's answer to this second question can be used as a proxy metric for the uncertainty of its original answer. As you can see, for wrong answers the model is usually uncertain, with scores below 0.5 or closer to 0.4, and for right answers it is pretty confident; you can see the distributions. So this is an interesting way of looking at uncertainty, just asking the model itself, and probably an interesting direction for future work. But I want to emphasize that, although it works in this particular setting in the work of these researchers, in our experiments and the experiments of other people it did not work that well.
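To illustrate the black-box scores just mentioned, here is a small sketch that needs only the sampled answers; the greedy clustering with a similarity threshold is a simplification of my own, not the exact procedure from the papers.

```python
def jaccard_similarity(a, b):
    """Bag-of-words overlap between two sampled answers."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

def black_box_uncertainty(answers, threshold=0.5):
    """Returns (1 - average pairwise similarity, number of 'semantic sets') from sampled
    answers only; no logits or embeddings are needed. Higher values = more uncertain."""
    n = len(answers)
    sims = [jaccard_similarity(answers[i], answers[j])
            for i in range(n) for j in range(i + 1, n)]
    avg_pairwise = sum(sims) / len(sims) if sims else 1.0

    clusters = []                               # greedy clustering into rough 'semantic sets'
    for ans in answers:
        for cluster in clusters:
            if jaccard_similarity(ans, cluster[0]) >= threshold:
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    return 1.0 - avg_pairwise, len(clusters)
```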
For example, for machine translation and other tasks, even question answering, it sometimes does not really work. All right, let's summarize what works and what does not. For out-of-distribution detection, I would go with density-based methods, of course, and first of all robust density estimation. For selective classification, I would look at the black-box methods, average pairwise distance and lexical similarity. It is also probably a good idea to combine density-based methods with perplexity: one ICLR paper showed that this helps, and we also have experiments where it helps, in text classification. The P(True) method sometimes does not work, and ensembles also sometimes fail. Right, now let's get to the final part of this talk, our framework LM-Polygraph. It basically helps you to know what LLMs do not tell you. LM-Polygraph is a Python library that accumulates state-of-the-art uncertainty estimation techniques. It supports state-of-the-art GPT-like models and has wrappers for ChatGPT and for the Hugging Face API. You can use just a few lines of code to add uncertainty estimation to your language model. It also provides a benchmark to evaluate novel uncertainty estimation techniques, possibly on your own data. We also plan to provide a live demo soon, maybe at EMNLP, maybe at AAAI. Here are some demonstration examples from our demo. For example, we ask the model to translate into a made-up, non-existent language, and you can see that the model is completely uncertain: our confidence score is essentially zero. For French, it is easy, and the model is quite certain about its output. Another example concerns the knowledge of the model: it does not know songs by Russian singers, but it knows the songs of the Beatles pretty well. In the Russian case it tries to make something up; for Irina Allegrova it tries to predict something similar, but essentially it fails, and now we can detect that it fails. Another example is asking simple and complex questions. If we ask a complex question, say how to cure a dinosaur of pneumonia, assuming dinosaurs were back on Earth, the model produces a list of suggestions, but we can see that it is completely uncertain. When we ask the same thing about a human, it shows pretty decent confidence. The same with kidney surgery: surprisingly, the model gives you a fairly good plan for performing a kidney surgery with high confidence, but if we ask how to perform a kidney surgery with only one arm, the model becomes completely uncertain. It also notes that doing kidney surgery with one arm is not a good idea, but you can see that the uncertainty scores work here and show that this is an unreliable answer. So, finally, some words about our team. I want to acknowledge our great team for developing this library: Maxim Panov, who is at this conference, Alexander Panchenko, who was also part of this initiative, and our many authors, who come from different organizations.
There are many others, but these are the main organizations involved in our work. So, in conclusion, some takeaways. There are several things we should consider beyond just accuracy, beyond performance metrics: debiasing, fairness, and reliability. Uncertainty estimation is a crucial component of machine learning systems, including language models. For out-of-distribution detection, consider density-based methods such as robust density estimation; for selective generation, try black-box methods, because they work very well; and you can also try combining density-based methods with perplexity. Overall, don't trust LLMs: try LM-Polygraph to reveal what LLMs do not tell you. I also want to note that we have a very strong team at MBZUAI working on debiasing and fairness, led by Professor Tim Baldwin, and I have the honor to be one of his colleagues in his group. Regarding fact-checking, we have a very strong team led by Professor Preslav Nakov, and I work in both directions as well as in uncertainty estimation. Maxim Panov is also one of the colleagues working on uncertainty estimation right now. All right, that's it. Thank you very much for your patience. If you have any questions, I will be happy to answer. Here are my contacts, and here is our GitHub link; please give us feedback. The library is still, I would say, in an alpha version, but let's see how it goes. Artem, thank you very much for this most insightful talk. First of all, I would like to ask if there are any questions from the audience. If yes, you can come here or go to the mic. Okay. Thank you, Artem, for the very nice talk. My question is maybe just clarifying something I am not understanding. Have you tried to analyze how the uncertainty of the model correlates with the error rate, with the wrong decisions made by the model? Well, actually, this is the task of selective classification, where we want all uncertain instances to be incorrect and all certain instances to be correct. We want to sort the dataset in this way: very uncertain instances should contain most of the errors, and instances with correct answers should be confident. Then, if we abstain on, say, 10 percent of the instances and give them to a human expert or another system, we get a better overall outcome. So yes, we did this. We have a couple of papers on text classification, and we also have a paper on selective generation, where, for example, we solve question answering with multiple-choice questions. So yes, we analyze the quality of the uncertainty estimates by looking at how well they separate correct from incorrect answers, for example in multiple-choice question answering. Okay, thank you. And a short, just a funny question: have you tried this LM-Polygraph on real humans, to analyze what humans generate? Well, unfortunately, we cannot ask a human to write the answer to the same question multiple times, right? It is a polygraph for language models, not for humans: it analyzes the distribution of predictions of a language model. And fortunately, for humans we already have the ordinary polygraph, right?
Artem, I would not fully agree with you, because if you have ever taken the sociological tests that companies use, this is exactly what they do: they ask you the same question again and again, effectively sampling, and then they check your answers. So that actually works. Well, maybe; I have not tried that. How consistent are humans, and how consistent are LLMs, actually? Yes, these tests have something like a hundred questions, but many of them repeat more or less the same question with different paraphrasing, and then they basically check whether you answered the same way. So it is actually not that far from how they do it. Okay, thank you. Okay, more questions from the audience? Okay, Artem, maybe let me ask you this question: the fact that ensembling does not work so well has been discovered empirically, but do you have some general explanation or insight? Why is this so for LLMs, and why does it work so differently from other machine learning setups? Well, maybe we just have not designed the ensembles very well. There are several ways to design ensembles: we can train them on different datasets, we can use multiple seeds, but essentially, if the models are pre-trained, they are very, very similar. That is the problem. Maybe if we used an ensemble of genuinely different models, for example with techniques for building ensembles like LLM-Blender, that would help a little. I think we need to add more diversity to the ensemble. That is my answer, but maybe there are other issues, other quirks in ensembling. So am I understanding your answer correctly: the variation introduced by current typical decoding mechanisms is smaller than the variation in ensembles of classifiers obtained, say, through Monte Carlo dropout? Well, in simple text classification models you can build an ensemble just by using different random seeds, and the members will be different enough. In language modeling, the models are pre-trained on much bigger datasets, so if you fine-tune them on a small dataset, they will probably not be that different from each other. That is the problem: the more data is used in pre-training, the more similar the ensemble members will be, and the more similar the answers they will give you. I think this is one possible answer to the question. Thank you. Any more follow-up questions from the audience? Okay, maybe let me ask one last, short question. You presented the approach where you just ask the language model whether its answer is correct. That is interesting, but have people tried some kind of chain-of-thought elaboration of this idea? Say, you answer multiple times and sample several responses, or you repeat the question about certainty or correctness in different ways, not just once, obtaining a kind of sampling in a dialogue style. Yes, I think the reflective power of language models is very strong, and I saw a work where chain of thought was, of course, used to improve the quality of the answer.
Of course we know that chain of thought improves the quality of the answer, and the setup where the model assesses its own answer, decides it is not good, and revises it has also been used in some works. I think there was a paper with essentially this idea just a few days ago; it is impressive how quickly people get to these ideas. I unfortunately do not remember the title exactly, but the idea was similar: they queried the model multiple times, had it assess its own output, and then tried to correct it repeatedly. Really great; we are living in an interesting age. And if you are interested in this research, get in touch with Artem or me, and you might get involved in this line of work as well, since, as you can see, there is a lot of room for improvement. Okay, we are now well over time. Unfortunately, Artem, thank you very much for your insightful talk, and now we need to switch gears for the next speaker. Thank you. Thanks. Mohamed, hello, can you hear? Yes, I can hear you, can you hear me? Perfect, yes, we can hear you perfectly. Can you turn on your video and try to share your slides? Okay, great. Fine, I can see you. Can you see my slides? Yes, yes, thank you very much. So now we switch gears to our next speaker, Dr. Mohamed Malik. He is a postdoctoral fellow at the Higher School of Economics in Moscow, at the computer science faculty, and he is also a former assistant professor in Islamabad, Pakistan. He has extensive research experience, with teaching and research contributions spanning almost 20 years. Today he will be speaking about threatening content and target identification in low-resource languages, so that is another NLP talk today. Without further ado, please go ahead. Okay, thank you so much. Good afternoon, everybody. Thank you to the organizing committee for providing me this opportunity to deliver a talk on this social media mining topic and to share my findings with you. The title of the talk is Threatening Content and Target Identification in Low-Resource Languages Using NLP Techniques, and this is the outline. First I want to introduce some terms and terminology related to this domain, then the challenges, and after that the problem definition related to my contribution for the Urdu language. I will discuss a case study with results, and then I will conclude by summarizing the findings and the future prospects. Okay, coming to the first point: what is hate speech? In the literature, several definitions of hate speech exist, because researchers have tried to define hate speech according to their own understanding, knowledge, vocabulary, and perspective. These are a few hate speech definitions, with references added; you can see that they vary slightly according to the authors' perspective and understanding. But let me give you the definition on which the majority of researchers have a consensus: hate speech is toxic speech that attacks a person's individuality and is likely to result in violence when targeted against groups based on specific grounds such as ethnicity, religion, race, place of birth, personal background, language, residence, caste, community, and so on.
So where there is hate speech, there is a target, because hate speech always targets someone. That is the basic conclusion we can draw from this definition. The next point is the various forms hate speech can take, how it can be delivered and formulated. A few common terms are presented here: cyberbullying, flaming, profanity, abusive language, discrimination, and toxic comments. These are some forms of hate speech. Let me share the definitions of these forms so we can distinguish them from hate speech in general. For example, abusive language seeks to diminish or humiliate a person or group, and hate speech is a type of abusive language. So abusive language can be considered the parent of hate speech: hate speech is always at least abusive language, and, the other way around, abusive language is a related form of hate speech. Similarly, toxic language: by definition, it conveys content that is disrespectful, abusive, unpleasant, and harmful. But not all toxic comments contain hate speech; toxic comments can be general, without targeting anyone. Some people simply have a habit of using a toxic style of language; they use words that do not directly target anyone. That is toxic language without hate speech. When we talk about hate speech, there must be a target. Okay, the next point is the landscape of twelve major languages shared by the Washington Post in 2022, which describes the proportion of speakers of each language. Chinese dominates, with 1.39 billion speakers worldwide, including all dialects of the language. Some languages have more than one script; Urdu, for example, has more than one script, and so do Arabic and Hindi. As far as Urdu and Hindi are concerned, 588 million people globally use Hindi or Urdu for their communication. The statistics for Arabic are also there, the proportion of speakers of Bengali and of Russian is almost equal, and Italian has the lowest number of speakers in this landscape. Of these, Hindi, Urdu, Chinese, Arabic, Bengali, Russian, Italian, Portuguese, German, and Japanese are low-resource languages compared with English. So, coming to the next point: what are the challenges with these low-resource languages when designing an identification or detection system for them? The first and most basic challenge is the lack of annotated datasets. We have to crawl the information, clean it, and then go through an annotation process. This is the first challenge.
The next challenge is that for some low-resource languages, such as Urdu, Arabic, and even Russian and Bengali, which I know because I work on these languages, some essential resources and accurate text-processing toolkits are missing or not available, compared with high-resource languages such as English. So that is the second challenge, related to resources and preprocessing toolkits. The third challenge is that some languages use multiple scripts. Urdu, for example, uses two: people write either in the Arabic-based script, also called Nastaliq, or in the Roman script, that is, Roman Urdu versus Arabic-script Urdu. Similarly, Arabic has more than one writing style, whereas English has only one script, the Roman one. Social media users often mix scripts: Urdu users frequently use both the Arabic-based and the Roman script in the same post. That is the problem of code mixing, another challenge with low-resource languages. And the last challenge I list here, though these are not all of them, is that the latest pre-trained language models are hardly available: not every model supports low-resource languages. Coming to the problem definition, the task on which I want to focus and for which I will discuss the proposed methodology and results. The problem definition is given on the left side of the slide. At the top we have user language, then hate speech in general, and that general hate speech can be categorized into threatening content, incitement to violence, and other categories. Today I am discussing one type of general hate speech: threatening content identification in the Nastaliq (Arabic-based) script. On the right side, the hierarchical classification of the problem is shown: a tweet or social media comment is first categorized as threatening or non-threatening, and the threatening content is then further considered for target identification, that is, whether an individual or a group is being victimized in the threatening post. The difference between the individual and group labels is that when a single person is being addressed, the instance belongs to the individual class, and when more than one person is targeted, it belongs to the group class. My contribution here covers only the Urdu language, although I am also working on Russian and other low-resource languages. For Urdu, I have these contributions. First, in 2022, offensive content identification in Urdu: one paper has already been published and the next one is in progress. Then hate speech and targeted community detection, which is related to communities.
Today I am only going to describe the framework for the general target, that is, individual versus group, but in the hate speech and targeted community detection work we target the community: whether it is a religious community, a political community, the media, the judiciary, or the army that is being targeted by the hate speech. Threatening content and target identification is today's topic, and I also have a contribution on a multilingual model for threatening text identification in English and Urdu. So today I want to share the design of the methodology for Urdu, concerning threatening content and target identification. Why is there a need to design such an identification system for Urdu? Urdu is the national language of Pakistan, and if we consider the Asian region, around 170 million people express their opinions and views in Urdu on social media. Globally, there are around 300 million Urdu speakers, and Urdu is spoken not only in South Asia but also in the USA, the UK, and Canada. That is why there is a need for an identification framework for threatening content and target identification in Urdu. On the right side, I have added the alphabet: these are the 39 letters used in Urdu to form words and sentences and express opinions. Urdu is more similar to Arabic and Persian than to any other language. What objectives did we address while designing the framework? The first was to design an automated identification system for Twitter data that accurately classifies tweets as threatening versus non-threatening. The second objective was, for threatening tweets only, to design an effective framework to identify the type of target: whether an individual or a group is being targeted. The third objective, which we set as a preliminary requirement, was that the proposed framework should be based on automated feature generation rather than handcrafted features. Now, the proposed methodology; the block diagram is shown here. Tweets were collected from Pakistani Twitter accounts, because Urdu is the national language of Pakistan, so we had the chance to build a large corpus. After that, preprocessing techniques were applied; I will describe these steps individually in the next slides, here I just want to give an overview. Then feature extraction: different techniques were applied, such as word n-grams, character n-grams, semantic features, word embeddings, fastText, topic models, latent semantic analysis, and a language model. After that came the machine learning models and the fine-tuning of RoBERTa, specifically the Urdu RoBERTa model.
There is a small mistake on that slide. So, at the first level, the content is classified into threatening versus non-threatening, and at the second level the threatening content is considered for target identification, individual or group. Before applying the proposed methodology, we need an annotated dataset. That is the first challenge I described for low-resource languages: annotated datasets are always a problem. So we crawled the data. Hello, is there any question? Okay. So we crawled the relevant tweets from Twitter and processed them, and then we annotated the dataset according to the annotation guidelines we designed. Before starting the crawling: whenever in NLP we want to crawl data related to a specific topic or research area, we need to design a lexicon, because without a seed-word lexicon we cannot crawl relevant data. If we crawl everything, how do we decide which data is related to our domain and which is not? This lexicon has to be designed manually, by looking at the type of content for which we are going to build the corpus; it is essentially a seed-word lexicon. We designed a lexicon of 250 keywords so that we could easily crawl the relevant tweets from Urdu Twitter accounts. Here I have added examples of a few keywords: you can see the Urdu keyword and its translation, with unigrams, bigrams, trigrams, and also some four-gram keywords. After designing this lexicon, anyone crawling the data is in a position to retrieve the relevant posts or comments from whatever social media platform they are crawling. The next point is the time range of the data crawling, which depends on the situation. For threatening content and target identification, we considered a period of 24 months, up to August 2022, because in this period the political situation in Pakistan was very unstable. People were aggressive, sometimes sad and worried, and they expressed their opinions on Twitter and Facebook, so this period gave us an opportunity to collect relevant content and design a good annotated dataset. By applying this lexicon and the time frame, we crawled data covering both threatening and non-threatening content. After that, we applied a cleaning process, because the data should be clean before it is given to the annotators; if the data has inconsistencies, the annotators will have problems. The cleaning steps included removal of empty tweets, duplicate tweets, and tweets containing words from other languages, for example Arabic, Hindi, or Bengali words, because manually translating those words into Urdu would take too much effort, so we removed those tweets. After that, we had a clean dataset. We then designed annotation guidelines so that the annotators could easily annotate the data; a rough sketch of the crawling filter and cleaning step follows below.
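This is a minimal sketch of the keyword-based filtering and cleaning just described, under my own simplifying assumptions: the seed lexicon and raw tweet list are given, and a Unicode-range check stands in for the manual removal of tweets containing words from other languages.

```python
import re

ARABIC_SCRIPT = re.compile(r'[\u0600-\u06FF\u0750-\u077F]')  # rough check for Urdu/Arabic-script content

def filter_and_clean(tweets, seed_keywords):
    """tweets: raw crawled texts; seed_keywords: the manually built lexicon (~250 entries).
    Keeps tweets that contain at least one seed keyword, then drops empties, exact
    duplicates, and tweets with no Urdu-script content (an approximation of the
    manual removal of other-language tweets described in the talk)."""
    seen, cleaned = set(), []
    for t in tweets:
        text = t.strip()
        if not text or text in seen:
            continue
        if not any(kw in text for kw in seed_keywords):
            continue
        if not ARABIC_SCRIPT.search(text):
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned
```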
For the annotation, we hired Pakistani annotators, because Urdu is the national language of Pakistan and they have an advantage compared with annotators from other countries. The basic criteria for choosing annotators were that they should be native Urdu speakers, have at least a master's-level education, and have prior experience annotating Urdu data, because the annotation is done at two levels: first they categorize a tweet as threatening versus non-threatening, and then, for threatening content, as targeting an individual or a group. After the annotation, we compiled the data and computed the inter-annotator agreement, which is above 80 percent, namely 83 percent. Here I have added a few samples from the annotated dataset. The first and second are threatening tweets, annotated at the second level as targeting an individual or a group. One column shows the Urdu text and the next one the translated version. For the third and fourth tweets, which are non-threatening, the second level of annotation is not applied, because we only care about the target for threatening tweets. The fourth tweet is not threatening, but it is abusive or toxic content, because the person is being referred to as a dog: the person is being abused, insulted, treated with disrespect, but not exactly threatened. As I described earlier, cleaning was done before annotation; after annotation, once we had the final dataset, we applied preprocessing. Here the main difficulty was the stop-word list, because Urdu stop words are very different from English ones: for one stop word there can be many written variants. For example, the word "ka" is a stop word, and there are multiple versions of how it can be written. That was the main hurdle for stop-word identification and removal. Some stop-word lists were already available from earlier work, but we compiled a larger lexicon and shared it. Along with stop-word removal, for the feature-engineering models (not for the transformer model) we also transformed the emojis and emoticons present in the tweets into their corresponding text, so that the context is preserved: if we simply removed them, the context could be broken. The other preprocessing techniques are the usual ones: removing irrelevant information. Here is a demonstration of the preprocessing: first punctuation removal from the Urdu text, with the translated tweet also shown, and then stop-word removal, which is quite interesting: you can see how the sentence reads and is pronounced after the stop words are removed. A small sketch of these preprocessing steps follows below.
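Here is a minimal sketch of the stop-word removal and emoji replacement just described, assuming the compiled Urdu stop-word list and an emoji-to-Urdu-text mapping already exist; the function and variable names are illustrative.

```python
def preprocess_tweet(text, stop_words, emoji_map):
    """stop_words: compiled Urdu stop-word lexicon (covering spelling variants of words like 'ka');
    emoji_map: mapping from emoji/emoticon to its Urdu text description, so context is preserved."""
    # replace emojis and emoticons with their textual description instead of deleting them
    for emoji, description in emoji_map.items():
        text = text.replace(emoji, f" {description} ")
    # drop hashtags, mentions, links and other non-content tokens
    tokens = [tok for tok in text.split()
              if not tok.startswith(("#", "@", "http"))]
    # remove stop words (used for the n-gram / embedding baselines, not for the transformer input)
    tokens = [tok for tok in tokens if tok not in stop_words]
    return " ".join(tokens)
```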
And for the replacement of emojis, an example is also demonstrated here: in this tweet the person is angry, so we replace the emoji with the corresponding text, and after that hashtags and other irrelevant information are removed. I have also added some sample stop words, with their translations, so you can get an idea of what kind of stop words are used in Urdu. Here a word-count and word-cloud representation is presented, showing the most dominant keywords used to threaten somebody. Now let me introduce the part of the proposed methodology related to feature engineering and machine learning. We searched for the latest pre-trained language models available for Urdu, and after an exhaustive search we found only two transformer models pre-trained on Urdu: one is Urdu RoBERTa and the other is a small BERT model. We used both for feature extraction, but in this talk I only include Urdu RoBERTa, because its performance is the most effective and promising. So we have a transformer model pre-trained on a big corpus, and if we want to use it for our specific task in Urdu, even though it is already pre-trained on Urdu, we need to fine-tune it carefully with appropriate hyperparameters. Fine-tuning is not an easy task: if we fine-tune a pre-trained model blindly, it can even lose its prior knowledge. There are two issues when fine-tuning any transformer model: catastrophic forgetting and overfitting. Catastrophic forgetting means that the model has already learned knowledge during pre-training, and if we fine-tune it on another corpus without handling it properly, it can lose that previous learning, because we unfreeze all layers of the transformer while learning the new task. The other problem is overfitting, a general deep learning issue: choosing the number of training epochs is a trade-off. If we choose too few epochs, we underfit; if we choose too many, we overfit. So we dealt with both problems, catastrophic forgetting and overfitting, explicitly. This is the list of hyperparameters we used for fine-tuning the Urdu RoBERTa model, with their ranges; we applied grid search to find the optimal values so that fine-tuning gives the best performance. The overfitting problem was handled using validation: we split the data into three parts, training, validation, and test. The split is the common one that the majority of researchers use for fine-tuning: 80 percent of the data for training and 20 percent for testing, and from the 80 percent, 90 percent is actually used for training and 10 percent for validation.
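This is a minimal sketch of such a fine-tuning setup using the Hugging Face Transformers library, not the authors' actual code; the checkpoint name and the tokenized train/validation datasets are assumptions.

```python
from transformers import (AutoModelForSequenceClassification, TrainingArguments,
                          Trainer, EarlyStoppingCallback)

def finetune_urdu_roberta(train_ds, val_ds, model_name="urduhack/roberta-urdu-small"):
    """train_ds / val_ds: tokenized datasets from the 80/20 (then 90/10) split described above.
    The checkpoint name is an assumption; substitute the actual Urdu RoBERTa checkpoint used."""
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    args = TrainingArguments(
        output_dir="urdu-threat-detector",
        learning_rate=2e-5,               # small learning rate to limit catastrophic forgetting
        per_device_train_batch_size=8,    # batch sizes 8/16/32 were explored in the talk
        num_train_epochs=5,               # validation loss started rising after ~3 epochs
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,      # keep the checkpoint with the lowest validation loss
        metric_for_best_model="eval_loss",
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_ds, eval_dataset=val_ds,
                      callbacks=[EarlyStoppingCallback(early_stopping_patience=2)])
    trainer.train()
    return trainer
```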
We used the validation part of the dataset to analyze the validation loss: when we apply the trained model to the validation set, we obtain the validation loss, and we monitor when it decreases and when it starts to increase and keeps increasing. From this we concluded that five to six epochs are enough for this problem; I will show how in the next slides. The fine-tuning process itself is standard: tokenization, then training the transformer model with a classification layer. As far as catastrophic forgetting is concerned, as I described, when we unfreeze the layers of the transformer during fine-tuning, we have to carefully preserve what the model has already learned. So we had to choose the initial learning rate from which to start fine-tuning, so that the model keeps its previous knowledge and is still able to learn the new knowledge related to threatening content and target identification. We tried several learning rates, given here, 3e-4 and above, but found that with those rates the fine-tuning of Urdu RoBERTa leads to convergence failure. We obtained the best performance with a learning rate of 2e-5, which helps handle the problem of catastrophic forgetting. The next point is which baseline and comparable models we used, so that we can compare our proposed methodology fairly. There was only one prior study addressing the same problem of threatening content and target identification, but the problem with that approach was its annotated dataset: it was actually an offensive-language dataset that they used for the threatening versus non-threatening identification problem. As I already described, we designed a new dataset annotated at two levels, threatening versus non-threatening and target identification, and for a fair comparison we reproduced their method's results on our dataset. We also designed new comparable models so that the proposed methodology is compared against enough baselines: latent semantic analysis and bag-of-words feature-engineering approaches, as well as word n-gram, character n-gram, and fastText embeddings. We also used standard machine learning models, because these models have already demonstrated significant performance on related NLP tasks. And these are the performance measures used to evaluate the classifiers. Coming to the next point: here I show the training and validation results obtained by fine-tuning Urdu RoBERTa for the threatening versus non-threatening task. The two sequence lengths are 64 and 128, with three batch sizes, 8, 16, and 32, and results for five epochs are shown: validation loss, training loss, validation accuracy, and time. We can see that the training loss keeps decreasing.
This means that during training the model keeps learning. But when we apply the trained model to the validation part, we see that for the first three epochs the model behaves appropriately, and after that the validation loss starts increasing; this is the common behavior across all combinations. Looking at the training and validation losses separately: for sequence length 64 with batch sizes 8, 16 and 32, and for sequence length 128 with batch sizes 8, 16 and 32, the training loss decreases continuously, so on every epoch the model keeps learning, while the validation loss starts increasing after the third epoch and keeps increasing up to the fifth. So we concluded that five epochs are enough, even four, before testing the model on the test part of the dataset, because after that point the validation loss only increases: if we trained for more epochs the model could overfit, and while it might even look better on the validation part, it would then perform badly on the test part. So we applied only five epochs, for the reason I just described. That was for the threatening versus non-threatening task. Before coming to the test results, we first trained, then validated, and only then tested the fine-tuned RoBERTa model on the test part of the dataset. Let me now describe the results obtained from the baseline and comparable models. We applied five machine learning algorithms: linear regression, logistic regression, SVM, K-Nearest Neighbors and Naive Bayes. These five models were applied to every type of features: word unigrams, bigrams, trigrams and their combinations, then character unigrams, bigrams and trigrams; I only included the results that crossed a certain threshold. We can also see the fastText, bag-of-words and latent semantic analysis performance. Here we see that logistic regression gives the best performance compared with the other applied machine learning models; this is one of the findings related to the baselines and comparable models. The other point is that, for the threatening versus non-threatening classification, the best baseline performance was obtained with character bigrams. The performance reported here is accuracy, and our dataset is balanced; in addition we also computed other metrics, precision and macro F1. Then the proposed framework for threatening versus non-threatening classification is compared against the baselines. Among the classification models, the proposed approach achieved its best performance with sequence length 64 and batch size 8: 87.5% accuracy, and 87.8% if we consider the micro F1, while with character bigrams the baseline performance is 85.83%.
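For reference, the character-bigram plus logistic regression baseline just mentioned could be reproduced with a short sketch like this (a rough illustration, assuming train and test DataFrames with text and label columns; the vectorizer settings are assumptions, not the paper's):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.metrics import accuracy_score, f1_score

    # character-bigram features + logistic regression, the strongest baseline reported in the talk
    baseline = make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(2, 2)),
        LogisticRegression(max_iter=1000),
    )
    baseline.fit(train["text"], train["label"])
    pred = baseline.predict(test["text"])
    print(accuracy_score(test["label"], pred), f1_score(test["label"], pred, average="macro"))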
So the proposed framework, which is based not on handcrafted features but on automated feature generation, outperformed the comparable models and even the baseline. The next part is target identification for threatening tweets: whether the target being addressed is an individual or a group. Here I again show the training and validation results of fine-tuning the RoBERTa model, now for target identification: training loss, validation loss, validation accuracy and the number of epochs. If we analyze the validation loss closely, then just as for the threatening versus non-threatening task, the validation loss starts increasing from the third epoch and keeps increasing up to the fifth when the trained model is applied to the validation part of the dataset. That again relates to overfitting: if we tried more epochs, say ten, the model would definitely overfit, because the validation loss is not decreasing. Then the baseline and comparable models were compared with the proposed methodology for the target identification task. Compared with the threatening versus non-threatening classification, here the configuration with sequence length 128 and batch size 8 performed best and provided 83.20% micro F1, outperforming the other models. We only have a couple of minutes left, so please try to wrap up. OK, it is just two slides. Concluding the talk, I can say that we designed an effective threatening-content and target identification framework using contextual semantic embeddings produced by a transformer model, fine-tuned to handle the ambiguity and complexity of the Urdu language. The proposed framework demonstrated benchmark performance in comparison with the comparable models and the baseline. On top of that, it is based on automated feature generation rather than handcrafted features, and the transformer model can capture the actual context of the language being used to threaten someone. These findings could be helpful for law-enforcement organizations in identifying this kind of unwanted material, threatening content and its targets, in the Urdu language. As for future prospects, I would summarize them in three points. First, interpretability: if we try to interpret models trained and tested on low-resource languages we will face problems, because each low-resource language has its own way of creating context to express an opinion. Second, the definitions of the different forms of hate speech overlap, so we need an appropriate categorization of these various types for low-resource languages; otherwise a classification system cannot be effective or efficient. Third, as I described earlier, people often use multiple scripts for a single language, so there is a code-mixing problem, and designing an efficient code-mixed content identification system is not an easy task; that is also a challenging direction.
I am not talking about simple code-mixed content identification, I am talking about efficient identification. So that is all from my side; are there any questions from the audience? Thank you very much for this nice presentation on a very important topic for modern NLP. First, are there any questions from the audience? Let me then start. The formulation of your task is classification, right? You cast the toxicity problem as classification, but what about alternative formulations? You mentioned that hate speech always has a target: what about detecting precisely what the target is and which insults or other attributes of hate speech were used? Or what about generation of hate speech with LLMs and preventing it? Could you comment on these alternative directions of work? I used handcrafted features and also a language model. Basically the problem is not hate speech in general, it is a specific type of hate speech, namely threatening content, and I applied both types of models, handcrafted feature generation techniques and the language model. Yes, but the question is not about how you solve the problem; it is whether these automatic machine learning systems should perhaps also act as tagging systems over every token, so technically not as classifiers but as sequence taggers, or rewrite toxic speech into something non-toxic, which would be a kind of machine translation or sequence-to-sequence task. What do you think? My topic was just classification; I was not considering sequence-to-sequence approaches. That is why I used the annotated dataset, so that the classifier has the opportunity to learn the exact context of the language. It is a language-specific classifier, for Urdu in Arabic script, and we cannot say that this particular classifier can be applied as-is to other low-resource languages; for those we have to consider the language context, what preprocessing and other resources are available. And it is a monolingual approach, not multilingual; although I have also designed multilingual approaches, for low-resource languages usually monolingual approaches are designed, so this one is specific to this language and script. I am sorry to interrupt, but we are really out of time because of a schedule shift, so I suggest you show your contacts, and those who are interested can contact you about this work and ask questions directly. OK, thank you very much; let's thank Mohammed for his presentation and proceed to the talk chaired by Andrei Kruz. OK, thank you so much, and thank you to the organizing committee for providing me this opportunity. Right, everyone, I am not sure whether there are any people present on-site at this NLP session, because I am also participating remotely due to flight delays. So if there is anyone in Yerevan who is now sitting at the NLP track session and watching us, please join the Zoom room or turn on a camera, because it is a bit difficult to chair a session without knowing whether anyone is there on-site. Let's just wait a couple of minutes for some reaction. Andrei, can you hear me? Yes. We have this shift now.
So the first talk is by Anna Marshalova, Elena Bruches and Tatiana Batura. Yes, they are there; my question was whether there is anyone in the room on-site. No, I think the last session was intentionally designed to be online, so all the speakers will be online; yes, I checked, all the speakers are online. OK, then let's just proceed. So this is our fully remote session, and it is also streamed on YouTube, so it is not limited to those present in Zoom. The first talk is on automatic aspect extraction from scientific texts, by Anna Marshalova, Elena Bruches and Tatiana Batura, and as far as I understand Anna will be presenting; is that true? Can you unmute yourself? Yes, can you hear me? Yes, and can you try to share your screen? Yes, we can see your screen, so you have 15 minutes and the floor is yours. Hello everyone, my name is Anna and I am going to present our paper, which is called Automatic Aspect Extraction from Scientific Texts. As the number of published research papers increases, there is a growing need for tools that automatically extract information from them. For example, we might need to extract the task of the research, the author's contribution, the methods used, and the conclusions of the study; we suggest calling these main points of a paper its aspects. However, even though Russian is among the languages most commonly used in science, there is only a sparse set of aspect-extraction tools for Russian, and most of them focus on particular domains such as medicine or computer science. To address this, in our research we aimed to create a cross-domain dataset of Russian-language scientific texts and to propose a tool for automatic aspect extraction from Russian scientific texts of any domain. Let's start with the dataset. It contains 200 abstracts of papers from ten scientific domains, namely psychology, physics, medicine, mathematics, computer science, linguistics, journalism, pedagogy, law and history. In these texts we identified four types of aspects: task, contribution, method and conclusion. Here is an example of an annotated text: as you can see, aspects do not cover the whole text, and aspects can be nested; in this case, task and method are nested inside the contribution aspect. Overall we identified 836 aspects, almost half of them being the contribution aspect, which might be because abstracts are written to give an idea of the author's contribution. However, in some domains the conclusion aspect prevails, for example medicine, with papers describing the results of clinical studies, or history, with papers describing the results of archaeological expeditions. As for the task aspect, we discovered that in some domains, especially the humanities, authors talk not about tasks as such but rather about research problems or objects of study, so it was decided to attribute those to the task aspect as well. The method aspect is most often mentioned in papers on the natural and exact sciences. The average length of an aspect is 12 tokens, but it strongly depends on the aspect type: tasks and methods are rather short and are expressed as terms or short phrases, whereas conclusion and contribution are long and are expressed as full sentences or clauses. Let's move on to the algorithm. For this task we fine-tuned a BERT model on a multi-class, multi-label token classification task: for each token we select up to two of the most probable aspects whose probabilities are higher than a threshold, and if none of them exceeds the threshold, the token is not assigned to any aspect at all.
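A minimal sketch of that per-token selection rule (assuming per-token logits from a token-classification head; the tensor names and the threshold value are hypothetical, not from the paper):

    import torch

    # logits: (seq_len, num_aspects) output of the token-classification head
    probs = torch.sigmoid(logits)                 # independent probability per aspect label
    top_p, top_idx = probs.topk(2, dim=-1)        # up to two most probable aspects per token
    threshold = 0.5                               # hypothetical value
    keep = top_p > threshold                      # drop candidates below the threshold
    # tokens where no candidate survives get no aspect at all
    token_aspects = [idx[mask].tolist() for idx, mask in zip(top_idx, keep)]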
After that, neighboring tokens assigned to the same aspect are united into spans, and we apply some heuristics to the spans to improve aspect boundary detection; these heuristics usually remove unnecessary words from the extracted aspects or add missing ones. Finally, aspects expressed as nominal phrases are put into the nominative case. Here is what we get as results: on the left there is an example of automatic aspect extraction, and you can compare it to the manual annotation on the right. In this case the model performed quite well, but not perfectly; for example, the extracted conclusion is rather incomplete, but it still expresses the main point. To find the best model we conducted a number of experiments, which included using different pre-trained models, freezing weights, and putting additional layers on top, but the best results were obtained by multilingual BERT fine-tuned on our data with just a linear layer for classification. The fact that a multilingual model outperformed monolingual specialized models is quite surprising, and to find the reasons some new experiments are needed, which we plan to conduct in the future. These are the metrics for the best model: the best extracted aspect is contribution, as it is the most frequent aspect in the dataset, and the worst extracted aspect is task, which might be due to its heterogeneity. Apart from metrics for individual tokens, we used the exact match ratio, which is lower than the other metrics, so we still have some problems with aspect boundary detection. Finally, we conducted cross-domain experiments to see how our model performs on unseen domains: for each experiment we used nine domains to train the model and one to test it, and the obtained results show that our model is able to generalize to new domains. So, as a result, in our study we created a cross-domain dataset of Russian-language scientific texts with manual aspect annotation and proposed a tool for automatic aspect extraction from Russian scientific texts of any domain; the code and the dataset are available at this link. Thank you for your attention. Thanks a lot; since I hear some applause from Yerevan, are there maybe some questions from the audience on-site before we get to questions from the online audience? At least I do have a question. Again, thanks for the talk. My question is mostly about the dataset you created, especially the aspects you chose to use. I am looking at the dataset right now in your GitHub repository, and it is not quite clear what the difference is between contributions and conclusions, because often the conclusion of a paper contains the contributions; I see some examples where I myself would not be entirely sure whether something is a contribution or a conclusion. How did you choose between these two labels when annotating? We mostly annotated as contribution something that the authors have done, when they write that they have proposed or researched something, and we annotated as conclusion the exact conclusions that were reached during the research. For example, in the linguistics subset there is a sentence labeled as contribution.
Then comes a sentence labeled as conclusion, where the authors have shown something, and then another description of what the authors have done is again labeled as contribution. For the conclusion aspect we also used word markers such as pokazano ('it is shown'), and we mostly labeled as conclusion the clause that follows such a marker in the main sentence, so in those cases it is more obvious that it is a conclusion. As for the first sentence you mentioned, the way I see it, in that sentence the authors propose to consider the problem in a particular way, so it is a contribution. And you report the average inter-rater agreement in the paper, but do you maybe remember what the agreement was for these two aspects, contribution and conclusion? We did not measure agreement for pairs of aspects; it was only measured for the whole dataset. Yes, but I guess you have the numbers for each specific aspect, because you report the average inter-rater agreement over all four aspects, so if you averaged it you must have four values. Well, I guess I have them, but I did not pay attention to the intermediate results, it was just averaged; and I think maybe I should have, because there might be some interesting discoveries about which pairs of aspects are most often confused. OK, I see, thanks; I believe the dataset you released is definitely going to be very useful. Anyway, are there any questions from Yerevan? You can just come to the microphone and start speaking, because I do not see you. OK, there are no questions from the audience for this one, so let's thank the speakers again. Thanks a lot. The next talk is supposed to be Prompt Tuning for Targeted Sentiment Analysis in Russian by Juliana Salomatin and Natalia Lukashevich. Hello everyone, I am going to present this one. I think I am going to need some help with my presentation on this laptop; the local team usually changes the setup so that the slides are shown in addition to the video stream, so if Alexander could help it would be great. Unfortunately I cannot fix it remotely, but you have the slides on the laptop, right? The help is on its way. Perfect. OK, I am going to present our paper, which is called Prompt Tuning for Targeted Sentiment Analysis in Russian. First of all we need to define what targeted sentiment analysis is and how it differs from general sentiment analysis. It is often important to take into account the relationships between the participants of a situation, for example 'X offended Y' and so on; this is what targeted sentiment analysis is about: we have a target and some attitude that is being expressed towards it. There are very few studies on this topic on Russian-language material. It is also important to mention that targeted sentiment analysis is particularly relevant for news discourse, and news texts are more difficult to analyze in terms of sentiment than, say, reviews, not only because of the targets but also because of the predominance of neutral polarity, since journalists always try to be as neutral as possible; as a result, some sentiments are expressed implicitly, meaning there are no expressive sentiment words.
There are just some underlying meanings, some facts that can be interpreted as sentiments. Earlier this year the RuSentNE competition was organized, and the task was targeted sentiment analysis on Russian data; the current study applies prompt-based learning to this task. As the backbone model we used ruRoBERTa-large, and the experiments were based on the question-answering approach, meaning that the task was formulated as a question in natural language, fed into the model, and the expected output was the class label: positive, negative or neutral. It is important to mention that what worked best was combining fine-tuning and prompt tuning. Let me briefly overview the methods we implemented. We are all very familiar with the fine-tuning approach, but there is also the prompting approach that was suggested more recently. The idea behind it is that when we fine-tune the model for a downstream task, we can formulate a prompt in such a way that the downstream task becomes very similar to the pre-training task, and this can boost the model's performance. A problem with manual prompting is that it is hard to choose the prompt by hand: changing a single token in the prompt can significantly affect the result. That is why we can tune the prompt just like we tune the whole model, or, if we deal with a very large model like GPT-3, tune the prompt instead of fine-tuning the model. After that, many modifications of this approach were suggested. For example, in prompt tuning with rules we mask not only the class label but also some sub-prompts, additional tokens that help explain the task to the model better, and then aggregate the predictions via a conjunctive normal form to get the final class label. Another approach deals with the verbalizer: a verbalizer is basically a list of class names that are mapped to the class labels, and these names are predicted during the masked language modeling step. Knowledgeable prompt tuning means that we predict not only the class labels but also some related words, which are extracted from external linguistic resources. In this study we implemented both manual prompts and the prompt-tuning approaches I just described, and we also implemented mixed templates, where some tokens in the prompt are fixed and others are tuned. The data is the RuSentNE corpus, which was pre-labeled with named entities and then labeled with sentiments and with relations between the named entities. For the competition it was preprocessed: only sentences with non-contradictory cases were used, so contradictory cases like 'X is loved by Y but hated by Z' were excluded, and we dealt only with the sentence level, not the document level. For evaluation in this research we first conducted three-fold cross-validation and then tested our model on the RuSentNE split provided by the organizers of the competition. The baseline model for the competition was RuBERT without any modifications, and the results are presented on the slide. The metrics used are, apart from the F1 score, the F1 score averaged over only the positive and negative classes, since these two classes are particularly interesting for this task.
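Returning to the prompt-tuning idea just described, here is a minimal PyTorch sketch of the soft-prompt mechanism (frozen backbone, a few trainable "virtual token" embeddings prepended to the input); the checkpoint name and all variable names are illustrative, and this is not the authors' implementation:

    import torch
    import torch.nn as nn
    from transformers import AutoModelForMaskedLM

    backbone = AutoModelForMaskedLM.from_pretrained("sberbank-ai/ruRoberta-large")  # illustrative checkpoint
    for p in backbone.parameters():
        p.requires_grad = False                                   # the backbone stays frozen

    n_soft, hidden = 20, backbone.config.hidden_size
    soft_prompt = nn.Parameter(torch.randn(n_soft, hidden) * 0.02)  # the only trainable weights
    # an optimizer would be built over [soft_prompt] only, e.g. torch.optim.AdamW([soft_prompt], lr=1e-3)

    def forward_with_prompt(input_ids, attention_mask):
        tok_emb = backbone.get_input_embeddings()(input_ids)        # (batch, seq, hidden)
        batch = tok_emb.size(0)
        prompt = soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        inputs_embeds = torch.cat([prompt, tok_emb], dim=1)
        prompt_mask = torch.ones(batch, n_soft, dtype=attention_mask.dtype, device=attention_mask.device)
        mask = torch.cat([prompt_mask, attention_mask], dim=1)
        return backbone(inputs_embeds=inputs_embeds, attention_mask=mask).logits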
I would also like to acknowledge the authors of the OpenPrompt paper, since they released a very helpful tool for implementing different prompt-tuning strategies out of the box, which makes it easy to apply any prompt-tuning approach to a downstream task just by loading a pre-trained model from Hugging Face and formulating the prompt template. The first series of experiments is manual prompt-based, and here we tested two variables: the prompt type and the way we handle class imbalance, since, as I already mentioned, news texts contain many neutral polarities and not many positive or negative ones. As a prompt we used either just the target word or a question about the target, along the lines of 'how do they feel about X'. For handling class imbalance we first tried not using anything, then we tried class weights in the loss function, and also augmenting the data via back-translation and via replacing some tokens with contextually close ones. The mixed template is shown on the slide: 'soft' marks the tunable tokens and 'text' the fixed tokens, which stay the same during the whole training stage; the verbalizer consists of just the class names. Then we implemented prompting with rules, and here we also tested how the initialization of the prompt affects the result: in the first template we tried to focus on the fact that there is a participant of the situation mentioned in the sentence, from which the model can derive the attitude, and in the second template we emphasized the fact that the sentiment can be expressed both implicitly and explicitly. The last approach is knowledgeable prompt tuning, for which we utilized the RuSentiLex and RuSentiFrames lexicons. In the first strategy we collected words for the verbalizers in such a way that the negative and positive classes could overlap, since some words have different meanings in different contexts and can carry different sentiments in different contexts. In the second strategy we made sure there is no overlap and the positive class is always prioritized, since in the previous experiments we saw that the results for the positive class are always a bit worse than for the negative one; so, for example, words like 'shame' or 'murder' have a negative connotation and therefore go to the negative class, and so on. The results are presented here, and we can see, first of all, that no method of handling class imbalance gave good results, and some of them, like augmentation, are very time-consuming and do not give any performance boost, so they are not really suitable for the task. The prompting approach works much better, and the best model in our experiments is knowledgeable prompt tuning with the first verbalizer strategy. We selected this best model and then evaluated it on the data split from the competition, and the results are comparable to the third place; however, the models at the top of the leaderboard leveraged ensemble methods, which are extremely computationally intensive because they are ensembles of transformers. So the prompting and prompt-tuning approach works really well not only in terms of quality but also in terms of computation. To sum up, in the current study we researched the task of targeted sentiment analysis in Russian and tested different strategies for both hard and soft prompts.
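As an illustration of the verbalizer idea described above, here is a toy sketch of how multiple label words per class could be aggregated at the mask position (the label words, names and single-token assumption are all illustrative, not taken from RuSentiLex or the paper):

    import torch

    # each class is mapped to several label words; their mask-position logits are averaged per class
    label_words = {"positive": ["хорошо", "поддержал"],
                   "negative": ["позор", "осудил"],
                   "neutral":  ["сообщил"]}

    def class_scores(mask_logits, tokenizer):
        # mask_logits: (vocab_size,) logits of the MLM head at the [MASK] position;
        # for simplicity we assume each label word is a single token in the vocabulary
        scores = {}
        for cls, words in label_words.items():
            ids = [tokenizer.convert_tokens_to_ids(w) for w in words]
            scores[cls] = mask_logits[ids].mean().item()
        return max(scores, key=scores.get)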
We also saw that prompt tuning surpasses vanilla fine-tuning and manual prompting, and the best model, which showed really good results, was the model based on knowledgeable prompt tuning. That's it, thank you for your attention; if you have any questions please feel free to ask them. Thanks a lot, Juliana. Are there any questions? For a change let's start with the online audience: you can raise your hand or just unmute your microphone, and, as I said, if there are any questions from the on-site audience, just come up to the mic. Meanwhile, maybe a very silly question from me: since you use prompting anyway, why did you decide to use BERT-like encoder models and not generative decoder or encoder-decoder models? Don't you think that using generative models would improve the performance of your approach? Thank you for your question. I was considering that, but I started with BERT models because this was before the whole ChatGPT wave, BERT models were more suitable for classification tasks, and also because of computational constraints; and then the experiments showed that ruRoBERTa gave even better results. Later ChatGPT appeared and I also ran some experiments with it, which were not very successful; why they were not that successful is a topic for another conversation. But yes, I think it is possible that decoder-only models, when prompt-tuned, could be very suitable for this task. I also wonder about this because, as far as I know, none of the participants of the competition used decoder models; maybe they conducted some experiments but did not report them. So I think there is room for improvement and for further experiments. Thanks, and what about encoder-decoder models like T5 and so on, did you try them? No, I did not try them. OK, thank you. I guess we are way out of time, so we should probably move on with the next talk, but thanks again, Juliana, that was very interesting. The next talk is supposed to be The Battle of Informational Representations: Comparing Sentiment and Semantic Features for Forecasting Market Trends, and I guess Andrei Zaichenko is here in Zoom. Yes, can you hear me? Yes, we can hear you and we can see your slides, so please go ahead, you have 15 minutes, the floor is yours. Greetings everyone; today, on behalf of my team, I will introduce our paper. Something is happening; Andrei, are you still here? Yes, but we do not see your slides anymore. Andrei, I guess he was disconnected. Andrei, are you here? Yes, I am here, sorry, Zoom crashed for me. May I also ask you to turn on your camera, because it is much better if we see your face while you are presenting? Sorry, but I do not have access to my notebook; I am presenting from a PC and it does not have a camera. OK, then just please share your screen again. Yes, OK, let's start again. So, greetings everyone, I will introduce our paper, Comparing Sentiment and Semantic Features for Forecasting Market Trends. In recent years, several studies have applied deep learning and NLP techniques to financial data, including news articles, social media posts and quarterly financial reports, in order to predict price movements. Despite its significance, most researchers rely on sentiment analysis as the primary additional feature, leaving little exploration of the potential of the semantics and context hidden in the text. In this paper we aim to fill this gap by testing the hypothesis that semantic features are important for stock price prediction.
Our approach uses sentence embeddings extracted from Twitter data to capture extra information and contextual relationships with financial market trends, and we then compare this approach to traditional sentiment-based solutions to evaluate its performance. In the introduction of the paper we made the claim that there is contextual information inside the text that can be utilized and retained using the embedding approach. To demonstrate the validity of this claim, we created a vector representation of the text using a state-of-the-art sentence-transformer model, mapping each sentence to a 384-dimensional dense vector space. After creating the embeddings we proceeded with vector clustering and used the BERTopic topic-modeling technique to create a hierarchy. Just a second, Andrei, are you supposed to still be showing only the title slide? No? I think there is some problem with Zoom, because we only see the first slide. It says Zoom quit unexpectedly; maybe so, let me stop the sharing. Can you see the slide now? Yes, now it works. OK, can you try to change the slides, just to make sure that it works? Now it crashes right when I switch the slides; I do not know, I will try to share the PDF alone. Maybe it is easier if you just send us the slides, for example in the chat, and I can show them. Yes, I sent the presentation as a file in the chat; it is the first time for me presenting like this, I am sorry. That is fine, I guess that is how things go at hybrid conferences. I am downloading it; these should be your slides, right? Yes, correct; just tell me when to go on to the next slide. OK, we are currently on slide 3. As you can see on the slide, this is the resulting output of the topic clustering, in this case for the Google tweets: it produced a list of 20 topics, denoted by the distinct circles, which were later grouped into 4 main large clusters of topics. Next slide. Now we connect these topics with the market data and observe multiple time periods with high and low user activity. During low-activity periods we can see proportionally equal spikes of the clustered topics reacting to volatility changes, and these are caused simply by the increase in the number of tweets. On the other hand, high user activity leads to great diversity in topic reactions: each topic has its own peaks and lows, meaning that a larger sample helps us distinguish topic trends, resulting in a correlation that is at least twice as large, as you can see in the tables below the graphs. This further speaks in favor of our hypothesis that a great amount of extra information is hidden inside the text semantics and can be used to predict stock market volatility. Next slide. Now we proceed to the overall scheme of the conducted experiment, which consists of four main steps. The first is data retrieval: we collected a five-year dataset published on Kaggle containing tweets regarding the Tesla, Apple, Amazon, Google and Microsoft companies. Then data preprocessing: we deduplicated some tweets in order to clean the data. Then covariate aggregation: depending on the experiment we either used a binary sentiment score or created an embedding vector, and used it as a feature; as the target variable we chose the close price. The last step is model prediction: the covariates are fed into either the TFT or the NLinear model, we train it and make a 3- or 5-step-ahead historical prediction on the validation dataset, depending on the experiment.
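As an illustration of the embedding-and-clustering step mentioned at the beginning of this part, a minimal sketch could look as follows (the checkpoint is one common 384-dimensional sentence transformer, and tweets is assumed to be a list of strings; the paper may have used a different model):

    from sentence_transformers import SentenceTransformer
    from bertopic import BERTopic

    # 384-dimensional sentence embeddings (checkpoint name is illustrative)
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(tweets, show_progress_bar=True)

    # cluster the embeddings into topics and build a topic hierarchy
    topic_model = BERTopic()
    topics, probs = topic_model.fit_transform(tweets, embeddings)
    hierarchy = topic_model.hierarchical_topics(tweets)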
In order to prevent overfitting and produce better results, we introduced a custom loss function, DMSE. It is built on top of one of the most popular loss functions for regression tasks, the mean squared error, but we added a directional component. This was done because MSE focuses only on the difference between the true and predicted price, while in stock price prediction the direction of the price movement is actually a more important factor than the value itself; that is why the custom loss was introduced. Next slide. Depending on the experiment, three main groups of past covariates were used, containing market, sentiment and embedding features; further on they are denoted by abbreviations, where HLOV stands for the market features, S denotes the sentiment score, and E stands for the embedding vectors. Next slide. Two models and one baseline approach were used in the experiments to predict the closing stock price: the Temporal Fusion Transformer, NLinear, and a naive seasonal approach. TFT is an encoder-decoder transformer model that makes use of multi-head attention, GRNs and LSTMs, while NLinear is a more recent approach, a simple linear layer introduced in 2022 that outperformed multiple transformer models on time-series benchmark datasets, which is why we used it as a kind of benchmark for our experiment. Next slide. During a detailed exploration of the sentiment score we obtained results confirming that social-network sentiment is a good indicator for stock price movement prediction. In the figures on the left we show the stock price movement as well as the sentiment score calculated in two ways: first as the ratio between negative and positive tweets, and second as the fraction of negative tweets in the total amount. Visually we can observe a great deal of resemblance between the price and sentiment variables; Apple, shown above, has a lower correlation, around 20%, while for Amazon it is about 40%. When we compare stock volatility with the sentiment score in the figures in the middle, we observe the same phenomenon: the correlation is higher for Amazon, at 52%, and for Apple we can clearly see a time lag between the public reaction and the stock volatility, with volatility preceding the public sentiment shift. For Amazon the situation is slightly different: sentiment changes are synchronized with, and in some cases even precede, the volatility shift, making sentiment a better predictor of price movement. Using a scatter plot we can further observe a linear dependence between sentiment and volatility for both companies, although for Amazon the dependence is more prominent. We therefore claim that there is a clear statistical relationship between the observed values. Next slide. Here we present a table of the resulting symmetric mean absolute percentage error (SMAPE) scores for all companies. Overall, the TFT model performed better than NLinear for all companies. The best-performing model for Apple is TFT with embedding vectors as a feature; comparing it with the closest configuration, we observe at least a 30% decrease in SMAPE. It is also important to point out that for Apple the sentiment score did not help to improve the accuracy of either model. For Amazon, on the contrary, we observed different behavior: both NLinear and TFT received a performance boost from the sentiment score, while the embedding vectors actually yielded higher error values. Models with embedding vectors show the best accuracy in only two cases out of five, for Apple and Microsoft, which were the companies with the lowest correlation between volatility and sentiment.
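Returning to the custom DMSE loss mentioned at the start of this part, here is a minimal PyTorch sketch of the idea (the exact directional term and its weight in the paper may differ; alpha and all names are assumptions):

    import torch

    def dmse_loss(pred, target, prev_price, alpha=1.0):
        # MSE plus a directional penalty; pred, target, prev_price have shape (batch,)
        mse = torch.mean((pred - target) ** 2)
        # penalize predictions whose direction of movement disagrees with the true one
        true_dir = torch.sign(target - prev_price)
        pred_dir = torch.sign(pred - prev_price)
        directional_penalty = torch.mean((true_dir != pred_dir).float())
        return mse + alpha * directional_penalty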
As we noted in the previous experiments, for Amazon, Google and Tesla the sentiment score outperformed our embedding-vector approach. Next slide. These results did not allow a definitive conclusion, so we experimented further with a smaller prediction window of three days. Another variation was a different sentence-embedding algorithm created by Microsoft, MPNet, which has twice the number of dimensions compared with the previously mentioned approach; with it, however, the accuracy of the closing-price prediction dropped significantly for our model. In the table we observe that for some metrics the embedding approach still shows better performance, but the difference is not really significant, and the sentiment approach achieved better results for four out of five metrics, as in the case of Apple; another company that had nearly identical results for both features was Google, where the differences are negligible, only 0.04% in MAPE; and for Amazon and Tesla sentiment again proved to be the better feature, scoring higher on all of the metrics. Both of these additional experiments further demonstrated that sentiment as a baseline solution still performs better than the proposed embedding-vector approach. Next slide. The results of this study provide further evidence in support of sentiment analysis as an effective tool for predicting price movements in financial markets: the binary sentiment-polarity extraction approach outperformed sentiment embeddings in terms of accuracy and training time in some cases. The embedding approach proved to be useful on the five-day prediction window, where it outperformed the sentiment baseline solutions. This suggests that the choice between binary sentiment polarity and sentence embeddings as the preferred approach may depend on the specific task and the prediction horizon, as well as on the effectiveness of sentiment as a predictor in the given context. In the majority of the conducted experiments the sentiment approach outperforms the embedding-vector method. This might be counterintuitive, because embeddings seem to encompass more valuable contextual information; however, sentiment tends to represent the information in a more concise way, bringing less noise into the prediction model. Nevertheless, the embedding approach still has the advantage that it does not require an additional model for sentiment extraction and the consequent quality verification of that procedure: the sentence-embedding approach can produce results similar to sentiment extraction while retaining more of the semantic and contextual information contained in the text. On the other hand, the model training time with sentence embeddings is significantly longer. These findings suggest that sentence embeddings could be considered a robust solution after further work, given that their performance is comparable to sentiment extraction. Thank you for your attention; if you have any questions, feel free to ask. Yes, thanks a lot. We have time for probably one question; any questions in the audience, either online or offline, please just speak up. OK, I have a question; thanks for the talk. I am not familiar with the stock price prediction field: what do related works usually use, I mean which additional features? I guess a sentiment polarity score should be somehow standard for stock price prediction? Yes, most of the current approaches use sentiment polarity extraction from multiple sources, like social networks, and some of them use financial reports in order to predict long-term price movements.
But the text itself is not a very well-researched topic, and even if we talk about sentiment polarity extraction, most approaches perform rather badly when tested on real data: when we try to apply these models to real stock market movements, they only perform well on historical data. It is actually a very under-researched topic, because price movements are rather random and it is very hard to find robust solutions for predicting prices. OK, thank you. Thanks; I believe we have to move on to the next and last talk, but thank you again, Andrei. Right, and the next and actually the last talk of this session, in fact the last talk of the NLP track of AIST this year, is 'Do large language models learn at the inference stage?' by Vlad Leonkulikov, Ilya Makarov and Radislav Neyshev, and I see Radislav is here in Zoom. Hello everyone. Can you share your screen? Definitely, here we go; you should be up and running now. Yes, please go ahead, the 15 minutes are yours. Thank you, Andrei. Hello everyone, and thank you for hosting me here today; unfortunately I was unable to join in person, but I hope next time I will. We have a lot of discussions on recent advances in NLP, especially in generative models based on transformer architectures, and with my colleagues we tried to analyze why they actually perform better in some cases when we provide additional information during the inference stage, the so-called learning-and-reasoning effect at inference. The main goal was not to propose some new approach to make it even better, but to understand, explain, and share with other people the pipeline of how to make it work, and to analyze the main reasons why this happens. So, just in case, the main question is: how do these models, especially transformer-based models like the GPT or BLOOM families, 'learn' during the inference stage, learning in quotes of course, that is, incorporate additional, novel information that was not present during the training stage, with no changes either to the architecture or to the model parameters? In addition, we tried to cover two questions that are quite widely known but not strictly defined: first, what is the learning-and-reasoning effect, and second, do large language models actually learn and reason at the inference stage, or do they simply simulate intelligent behavior? There are of course a lot of papers on this topic; I brought only three of them that seem to me most relevant for this particular research, although the original paper accompanying this talk cites more than fifteen sources, all of them important. The first is the paper that introduced chain-of-thought prompting in large language models, and we follow the same approach; the second is the original GPT-3 paper; and the last one, quite old compared with the other papers, is a paper on unsupervised machine translation from back in 2018, which contains a few ideas relevant here. So let me first formulate the main hypothesis of this work, which we will try to support with several experimental and literature results.
It is the following. By large language models we mean all models with a large enough parameter space; I will define what 'large enough' means a little later. We assume that such models create some inner language spaces that contain not only the language itself, like grammar and semantics, but also patterns of rules and reasoning, which are embedded in this space implicitly rather than explicitly. So LLMs, instead of learning something new during the inference stage, simply adjust their state of mind to particular already-learned behavior trajectories. The main idea of this hypothesis, once again, is that the majority of the information is learned and embedded into this space during the training stage, and during the inference stage the examples, reasoning, chain of thought and so on only provide instructions that calibrate the behavior of the model. I will be brief about the problem formulation, because I understand that everybody at this conference is well aware of language models: we are given a sequence of tokens from a finite dictionary that contains a description and maybe one or several examples of the desired behavior on the problem at hand, and we assume that these behavior examples help us solve the problem, be it classification, regression, generation, or whatever. First of all, let us look at the data size and model size for which we actually observe this learning-and-reasoning effect, because if you take a minimal GPT implementation of about 300 lines of code, for example Karpathy's nanoGPT, you will definitely not see any in-context learning effect, and that is fine. According to the data we checked in several sources (the full list is once again available in the paper itself), the training corpus should contain at least 300 billion tokens, because otherwise it is too small, and the models should have billions of parameters, not millions, roughly 33 to 50 billion parameters; however, even 7 billion seems to be enough for certain types of tasks, for example machine translation. Speaking of the paper on unsupervised machine translation and how it is related to this talk: translation systems showed back then that if we have two different models trained for language modeling on two different languages, we can align their internal spaces using only a small amount of labeled data. We can take some language model, or just a word embedding model like word2vec or fastText, for each language; on the slide they are drawn as two clouds of points. The idea is the following: we assume that all languages are ultimately grounded in the same real world out there, so the words 'sun' and 'sky' are usually much more closely aligned and co-occur much more often than 'sun' and some unrelated word, say 'hedgehog'. When we provide a few examples of words that directly translate into each other, for example the word 'gato' in Spanish and 'cat' in English, we simply adjust these clouds of points to each other, and then we obtain almost state-of-the-art translation with almost no data aligned between the two spaces. The assumption is the following: these models have already created their internal spaces, and we simply align them using several labeled examples from the first and second spaces.
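A minimal sketch of that alignment step, using orthogonal Procrustes on a small seed dictionary (this is the standard recipe for aligning two embedding spaces, not necessarily the exact method of the cited 2018 paper; all names are illustrative):

    import numpy as np

    def align_spaces(X, Y):
        # X: (n, d) source-language embeddings of seed words, Y: (n, d) embeddings of their translations
        # orthogonal Procrustes: find orthogonal W minimizing ||X W - Y||_F
        U, _, Vt = np.linalg.svd(Y.T @ X)
        W = (U @ Vt).T
        return W              # map a source vector v into the target space as v @ W

    # usage sketch (assuming L2-normalized rows): translate by nearest neighbour in the aligned space
    # best = np.argmax(tgt_matrix @ (src_vec @ W)); target_word = vocab_tgt[best]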
So the spaces get aligned, as shown on the slide, and the translation can then be performed rather well. That was the idea from machine translation, and later we tried to transfer this idea to chain-of-thought reasoning. When we provide some chain-of-thought reasoning to the model, or just provide an example to make it work better, we can see that the model indeed starts working better, but the reason might be the same: we adjust the internal state, the internal language-and-reasoning space of the model, which is definitely over-parameterized since it contains billions of parameters, and that might be what causes the better behavior. There is an important observation which follows from our experiments and somehow supports this view: when we provide examples of how to act on the upcoming problems, such as regression, text generation in a specific format, or arithmetic problems, even examples with wrong reasoning help. That is, we provide an example showing that the model should follow some path of reasoning, but the example itself is incorrect, for instance it contains arithmetic errors or broken logic, and even such examples help the model perform better and achieve better results. But if we change the order of the reasoning, for example shuffle the sentences within these examples, then the whole sequence is broken, the model's quality drops, and its behavior is not what we expect. To provide some experimental support for our claims, and not only refer to external research papers, we ran an experimental setup with five different prompting scenarios. The first is no prompt at all: we simply ask the model to answer our question, just like in zero-shot learning, for example 'compute how many computers there are'. Then there are four additional prompting scenarios. The first of them is standard prompting, where we include a simple example, like one-shot learning: 'Question: there are 9 computers in the server room, each day ... The answer is 29.' After that we send another question to the model and ask it to continue the sequence and generate the answer, as we usually do in the language modeling approach. The second is chain-of-thought prompting: we provide a question and then a chain-of-thought sequence, so we give not only the answer but the chain of thought as well. The third scenario is invalid reasoning: the same as chain of thought, but the chain of thought itself is incorrect, although it still has the correct sequence of steps; so we still start from the original 9 computers, then for each day we introduce arithmetic errors, and then give the answer. And finally irrelevant prompting: we ask a question about something, for example computers or vegetables, and the answer is just random, taken from another example. So we had five scenarios, and we used a couple of open- and closed-source models, because we wanted to check the behavior both on state-of-the-art models like the GPT family and on open-source models. This work was mostly performed during the spring, so we do not include recent models like Llama 2. Among open-source models we mostly focused on BLOOM; we used models from half a billion up to 176 billion parameters, including BLOOM and mT0-XXL, and of course GPT-4, because it is the current state of the art.
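To make the five scenarios concrete, here is a toy sketch of how such prompt variants could be assembled (the wording is purely illustrative and is not the exact prompts used in the paper):

    # Illustrative prompt variants for one arithmetic question
    question = ("Q: There are 9 computers in the server room. Five more are installed each day "
                "for four days. How many computers are there now? A:")
    demo_q = "Q: A farm has 3 tractors and buys 2 more each year for 2 years. How many tractors then? A:"

    prompts = {
        "no_prompt":        question,
        "standard":         f"{demo_q} 7\n\n{question}",
        "chain_of_thought": f"{demo_q} It starts with 3. 2 years * 2 tractors = 4. 3 + 4 = 7. "
                            f"The answer is 7.\n\n{question}",
        "invalid_cot":      f"{demo_q} It starts with 3. 2 years * 2 tractors = 5. 3 + 5 = 7. "
                            f"The answer is 7.\n\n{question}",   # same step structure, broken arithmetic
        "irrelevant":       f"{demo_q} The answer is 42.\n\n{question}",  # answer taken from another example
    }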
I will show the results table, since it is much easier to see there, but the main point is simple: small models, including BLOOM versions up to 7 billion parameters, could not obtain any useful improvement from chain of thought, either correct or incorrect, so they are not present in the table, and mT0 was also rather inefficient when we provided chain-of-thought reasoning, while BLOOM-176B, Guanaco and GPT-4, and also the earlier GPT-3.5 and GPT-3 models such as text-davinci-002 and 003, did show useful improvements. The first two columns of models correspond to open-source models, while the last three correspond to the OpenAI models, which are closed. We report the percentage of correct answers in the five scenarios: no demonstration, that is no prompting at all, then standard prompting, chain of thought, invalid chain of thought, and irrelevant chain of thought. We can see that BLOOM's behavior on arithmetic reasoning did not improve with any of the examples except standard prompting and chain of thought, which improved it a little. For Guanaco the result seemed a little surprising: without any demonstration it performed much better than with standard prompting, but chain of thought and invalid chain of thought both improved the score; so once again this supports the hypothesis that the structure and order of the prompt is more relevant than its correctness. As for the OpenAI models, we did not evaluate them without any prompting at all; we used only standard prompting and the chain-of-thought variants, and we can see that chain of thought improves the behavior, while invalid prompting does not break it and sometimes even improves it for some reason, which might be a consequence of the tested dataset not being large enough, although for all the other models this size was quite sufficient, because the results were rather stable. Even when we provide irrelevant prompting, it may improve or at least not degrade the results for the GPT-like OpenAI models. And last but not least, GPT-4 provides great results out of the box, and with prompting we either obtain the same behavior or even break it, despite using what should be useful prompting, that is, standard prompting or chain of thought with correct examples. The same picture was obtained with question answering and reasoning on the second benchmark we used: we see exactly the same behavior, chain of thought either improves everything a lot or does not change it for GPT-4, invalid prompting can break it a little but stays really close to correct prompting, and irrelevant prompting breaks it a lot. The conclusion can be structured around the questions we formulated in the beginning. First of all, what is the learning-and-reasoning effect? We assume, based on the research we surveyed, that it is more about finding similar reasoning patterns in the latent space the model has created during the training stage; the examples, whether chain of thought or just few-shot demonstrations, show the model how to calibrate itself, how to find an appropriate projection of its own parameter space for the desired problem, in order to solve it better. So it looks like the application of pre-existing knowledge rather than the acquisition of new knowledge. And regarding the idea of large language models learning during the inference stage: I cannot say that they cannot learn during the inference stage, but so far we have not found any clear evidence that they actually learn something new there.
But we can assume that they are, in a sense, exploiting their already existing knowledge, and that is once again supported by the results we observe when we provide incorrect prompting while preserving the desired sequence of steps: the model achieves better results even when we give it wrong examples. So that is basically it, and if you have any specific questions, I am happy to take them. Thank you. Just from my side: this area is rapidly evolving, and I understand that if some new results appear, parts of this might become a little incorrect, literally any day; we got into a really tricky area of trying to explain why all these models work. So I would be glad to hear any questions, including questions that challenge the correctness of this paper, because I want the discussion to be useful for all of us. You are welcome, thanks a lot. This sort of concludes the NLP session, but maybe we still have time for one quick question; I guess people on-site and online are already anticipating the closing of the conference, but still, any questions? OK, just a brief one from me then. I am in fact a bit confused about your final claim, the statement that large language models do not learn anything in few-shot scenarios. Isn't it sort of trivial? I mean, does anyone actually claim that models learn anything in few-shot scenarios? Of course they don't, because their weights are not updated. OK, may I then refine that formulation a little. You are right, we are not speaking about learning in the classical paradigm, where we update the model weights or add some adapter. By learning we merely meant that the models do not acquire any new knowledge: if they are unable to solve some problems, that is, they were not trained to solve them during the training stage, they will not acquire this ability even if we provide useful prompts, examples and so on. We can only make models solve the problems they have already seen during the training stage, perhaps in a slightly different scenario, but we cannot generalize them to unseen regions of the feature space or the problem space. Well, this claim sounds to me a bit self-supporting, it is a little bit obvious, right? OK, then it might be a little bit obvious. The main reason we actually performed this research was curiosity, to find out whether they do or do not, because when we started, a little less than a year ago, there were several examples, like feeding the Iris dataset to ChatGPT, which was then able to perform the classification at the inference stage given some examples, improving the quality of the answers a lot. So we tried to make a somewhat broader overview, including a couple of different papers and approaches, to confirm or refute the claim that there is no learning during the inference stage. So yes, we are not trying to provide any novel result in the sense of an astonishing result; we are trying to prove a little more formally that no, they are not learning yet. OK, thanks a lot. I guess it is time to thank the speaker again and close the session; thank you, Radislav. And now I step down to leave the floor for the closing session. Thank you, Andrei, for chairing. Now we proceed to the final and still very important part of our conference, which is to try to summarize what we achieved and what the main highlights were.
Okay, thanks a lot. I guess it's time to thank the speaker again and close the session. Thank you, Radislav. And now I step down to leave the space for the closing session. Thank you, Andrei, for chairing.

Now we proceed to the final and still very important part of our conference, which is to try to summarize what we achieved and what the main highlights were in this edition of the AIST conference. First of all, we're very glad that many people made it offline, and many nice and really high-quality talks were presented during these two days. The first nomination, the award in the natural language processing area, goes to Anton Alekseev and Sergey Nikolenko for their work on the Kyrgyz language. With this, I would like to ask Anton to come and make a short one-minute presentation to highlight their work, but first let's thank them.

Well, first of all, thank you. As the winner, I guess I now really have to continue the research on this topic, to make the dataset even more fine-grained and more justified. The whole purpose of this work was to create a topic classification dataset for the Kyrgyz language, which is essentially the first dataset for an applied NLP task in Kyrgyz. The overall idea behind such a task is that there is an urgent need for a dataset to find out whether multilingual models work for the Kyrgyz language, and as we've shown, they do to a certain extent and outperform very basic bag-of-n-grams approaches by a large margin. The work is going to be continued, now for sure, and more interesting works, I hope, are to come, because during this year, or maybe a year and a half, a large community of volunteers got involved in Kyrgyzstan. So this is it. I'm pretty sure there will be a special time for that, but if I may add a personal remark, I would like to thank the organizers who made this conference happen again, and of course our fabulous hosts, who managed to do everything perfectly despite the trying times. Thank you.

Thank you very much. The next section is computer vision, and the award goes to Razan Didoa, Andrei Galitshin, Pavel Astashev, Dmitry Dylov, and Alek Rogov for their work on deep-learning-based bone pathology localization and classification with X-ray images. I'm not sure, but I think Alek arrived just this morning for the first time and already got a best paper award, so please.

This is probably because yesterday I was giving a speech regarding artificial general intelligence, so I think all the credit here goes mostly to Razan Didoa, who is now a PhD student of the Tensor Networks Lab at Skoltech. We all know that attention is all you need, but sometimes you only have to look once, so we decided to combine these approaches, and eventually we found an architectural approach to address the very important medical task of child wrist trauma detection in hospitals. We eventually got as far as preclinical trials, and we developed an approach that combines state-of-the-art approaches in object detection, gradient attention mechanisms, and shifted-window blocks. Well, I think, as Andrei Kolmogorov once said, really new things lie between something trivial and something incomprehensible. Thank you.

Thank you very much, Alek. The next award, in social network analysis, goes to Sergei Sidorov, Sergei Mironov, and Alexey Grigoriev for their work on the limit distribution of the friendship index in scale-free networks. Please say a couple of words about your work.

Thank you, it's a big surprise for me; thanks to the organizers. The friendship index has been studied a lot in social studies, but it hasn't been given enough attention in network analysis, so here we did some extensive research on the friendship index, and it's a continuation of our work.
We've studied a lot of its distributions: how it is distributed and what its limits are. Actually, I think it's now time to put the friendship index away and move on, because it's not a great measure, not an all-solving one. I hope you liked my talk; thank you very much.

Thank you very much, and congrats again. The next award, in machine learning, goes to Vladimir Berikov for ensemble clustering with heterogeneous transfer learning. Please say a couple of words about your work.

Thank you very much; it's quite a surprise for me. In this work, the idea is to use some additional information which can give some insight into the analysis of the target data. The algorithm is based on finding some useful meta-features; the data from the two domains are quite different, the features are different for the domains, so we should find some structural properties of the data and transfer knowledge from one domain to the other. Thank you very much.

Thank you. Last but not least, the award for theoretical work goes to Dmitry, for his work on subject coefficients for the maximal antichains of partitions and related counting inequalities, and I really hope that Dmitry will now decrypt what the contribution is.

Thank you. It is a bit unusual, being an organizing member, to receive this kind of award in the theoretical section with something worthy, and it was a pleasure for me that the committee evaluated my work by its scientific merits. I was also happy to apply data mining and machine learning techniques to problems that were posed by such eminent mathematicians as Gian-Carlo Rota, well known in combinatorics, Ron Graham, Gould, and Kleitman. Here I just added one small brick to our knowledge of the number of maximal antichains, and antichains of partitions in particular, and also somehow helped to reduce uncertainty in some asymptotic coefficients. Thank you.

And let me now say once again a great thanks to our hosts. We have probably all heard about Armenian hospitality, but now we have certainly all experienced the best of it. This conference, this event, would not have happened unless Habed and Amalia had done so much during this half a year or more. First of all, thanks for proposing the opportunity to host the conference and for providing all the resources, all the support, and our every need. So, Habed and Amalia, let's thank them, and the Computer Science and Engineering department of AU. Thanks a lot.

And of course, thanks to all our supporters from Skoltech and the Higher School of Economics, those who basically contributed their people's time or other resources to the conference. With this, I'm glad to conclude this edition of AIST, and I hope to see you next time at the conference. Stay tuned for when and where it will be, and be sure that we are working to make it happen next time.