Good morning, dear colleagues. I suggest we slowly start. We are very happy to see so many people here relatively early in the morning, after what I guess was a great evening for many of you yesterday. It is our great pleasure to have our third keynote speaker, Professor Samuel Horvath. Samuel is currently working at the Mohamed bin Zayed University of Artificial Intelligence in Abu Dhabi. Before that, he did his PhD at KAUST in Saudi Arabia. Samuel is a very well-known expert on distributed and federated learning, and he will be talking today on exactly some aspects of this topic. Samuel, go ahead.

Thank you, Maxime, for a very generous introduction, and thanks to all the organizers for inviting me. What I want to talk about today is federated learning. Let me briefly discuss the outline of my talk. First, I will try to give you a gentle introduction to what we mean by federated learning and how it developed. Then I am going to discuss several practical and research challenges that need to be addressed in order to deploy federated learning in real-world scenarios. Finally, I am going to talk a bit about our recent work on decomposable models that are designed to face some of these challenges.

Let me start with the motivation for federated learning. In traditional machine learning, you would originally have a single machine: you would put all your data in and train your model there. But as the field progressed, models got bigger and required more and more data, so we moved from local solutions to cloud-based solutions. The issue with this ever-increasing data collection is that the majority of the data comes from clients, and collecting it can have negative privacy implications. On top of that, many countries have privacy regulations, such as the GDPR in the European Union or the CCPA in the US, that essentially forbid direct data collection. With this in mind, if further progress in machine learning depends on collecting ever more data, the standard centralized approach may no longer be feasible: by not respecting those privacy regulations, we can almost completely lose access to the data.

This is where federated learning comes to save us, by bringing training to the edge, to the clients that own the data. The main premise of federated learning is the following. We assume there is an orchestrator, which you can think of as a central server, that coordinates the whole training process. It asks the clients to compute updates to the model; those updates are supposed to be very focused and are intended for immediate aggregation, so the orchestrator only ever sees the aggregated information, in order to prevent data leakage. In its basic definition, federated learning therefore gives us at least a hope of training on these large decentralized data sets.
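To make this aggregation premise concrete, here is a minimal FedAvg-style sketch in Python. It is an illustration under my own assumptions (toy least-squares clients, hypothetical function names), not the speaker's exact algorithm; the point is that the server only ever sees the weighted average of the client updates, never the raw data.

```python
import numpy as np

def local_update(global_w, X, y, lr=0.1, steps=5):
    """Client side: a few gradient steps on private data (toy least squares).
    Only the weight delta -- a focused update -- is ever sent back."""
    w = global_w.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w - global_w

def server_round(global_w, clients):
    """Server side: it sees only the aggregate of the client updates."""
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    deltas = [local_update(global_w, X, y) for X, y in clients]
    agg = sum(s * d for s, d in zip(sizes, deltas)) / sizes.sum()
    return global_w + agg

# usage: two clients whose private data the server never touches
rng = np.random.default_rng(0)
clients = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(2)]
w = np.zeros(3)
for _ in range(10):
    w = server_round(w, clients)
```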
On top of that, there is an increasing demand for what is called the data-locality paradigm, meaning that data should be processed where it was collected. There are also several recent studies showing that, if you care about the carbon footprint of your model and you design your federated learning algorithm well, you can actually achieve a lower carbon footprint than with standard centralized learning. Some of the success stories of federated learning, where it is already commercially applied by big industry players: Apple uses it for Hey Siri and QuickType, and Google uses essentially the same approach for Hey Google and Gboard. Where we see federated learning as the next game-changer is in all the applications that historically did not have much data, precisely because of very strong privacy constraints. This might be, say, smart-health applications, where several startups such as Doc.ai or Owkin are already working, and also fintech applications, where privacy is key; several banks, for instance WeBank, are already working on decentralized fraud detection. In general, you can think of the applications of federated learning as any machine learning task you would like to solve that requires collecting a lot of private data. One reason federated learning is getting more and more widespread is the availability of several open-source frameworks: among the most popular are Flower and FedML; Meta has its own federated learning simulator, the same for Microsoft; and NVIDIA FLARE, an SDK for federated learning, is also quite popular. Let me also do a bit of advertisement: if you are looking for a resource to learn more about recent federated learning research, we have been running an online seminar for roughly three years now, ever since COVID, with more than 100 talks on different aspects of federated learning, all available on YouTube. If you are interested, just Google "FLOW seminar", register, and you will have access to all the talks. Some other great resources are recent review articles, and there is even a recent book on federated learning as well.

When we talk about federated learning, there are two main settings that we usually consider. The first is called cross-silo federated learning. In cross-silo federated learning we have different organizations, you can think of them as big institutions, that want to collaborate. The example I have here is hospitals with patient records: they would like to train some smart health assistant based on all of these records, but the data are private, so what you care about most here is the privacy of the data. In this setting we usually assume that the number of collaborating institutions is relatively small, and the main concern is privacy. The setting I will consider more in this talk is cross-device federated learning.
In cross-device federated learning, the clients are devices, you can think of them as mobile phones or various IoT devices, that want to collaborate in order to solve some machine learning problem, for example a recommendation problem. The whole process is orchestrated by a central server, which you can think of as the service provider. The goal is to train the model in this federated way, where we only communicate very focused updates that are intended for immediate aggregation. Another thing we have to respect here is that training happens entirely on those devices, and the model we eventually obtain is going to be deployed back to those devices. One of the main challenges here is the various kinds of heterogeneity that come with the clients, as well as the fact that the number of clients, think of the number of phones, might be in the millions or hundreds of millions.

That brings me to the second part of the talk, where I would like to discuss the challenges that, if we can address them, would make federated learning practical. The first and one of the most prevalent is the communication bottleneck. The example I list here is from distributed training: those of you familiar with it know that simply adding more and more GPUs does not necessarily lead to perfect linear scaling, because at some point you hit the fact that communication is much slower than computation. For instance, in the example here, if you run, say, DeepLight, which is a very communication-heavy model, on eight P100s with a still relatively fast network, you end up with more than 90% of your training time spent on communication, and in many applications this is the major limiting factor. If we move to federated learning, this issue is even more prevalent, because the clients are no longer machines in a single data center: they are all connected through wireless links or other end-user internet connections, which are even slower than data-center links. On top of that, you operate in a system with a very large number of clients, so even the capacity to aggregate the updates becomes a bottleneck. What we see in real-world applications is that the model download from the orchestrator is already slow, but the upload is the key limitation of the system.
Thankfully, there are several remedies one can employ. The first is communication compression, meaning that the client updates are compressed before they are sent back for aggregation; there is a nice line of research showing that if you design your compression well, you can even add more privacy to the system. Another possibility is to communicate less often: you can define a local problem that is somewhat harder than just computing a single local update (think of computing the gradient with respect to your local data), doing more local work so that you reduce the amount of communication. Or you can try to limit, in some smart way, the number of devices that communicate; by "smart" I mean figuring out which clients have updates that are more important than others, while still respecting the clients' privacy.

Another issue, and something we will look at later in more detail, is system heterogeneity. When you train in a federated network, many clients may be unreliable or very heterogeneous, and this vast heterogeneity is one of the key challenges for deploying and training these models in the real world. The remedies here are to devise algorithms that are straggler-resilient, meaning that clients that fall behind the training schedule are still allowed to update the global model later. Asynchronous updates are also very popular here; the challenge is that once you have asynchronous updates, it is much harder to guarantee privacy. And for devices that cannot compute with, or even store, the model, you can simply drop them.

Another challenge that is quite prevalent in federated learning, and not standard in centralized learning, is client availability. In cross-device federated learning, the standard consideration is that, in order not to degrade the user experience, a device only participates in training when it is connected to a Wi-Fi network, connected to a charger, and when a large number of other devices are connected at the same time, because we want to hide each update so that individual privacy does not suffer. This creates quite non-trivial constraints, and the resulting variations can be undesirable, in the sense that they can actually break your optimization algorithm; this is something that needs to be addressed.

Another issue with deploying federated learning systems onto devices in general is that the systems are really not very mature. On-device inference has already been deployed in many devices for quite a long time, but a full federated learning loop is still work in progress; we are moving there, and nowadays a couple of providers already allow running a forward and backward pass on the devices.
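Returning to the first remedy above, communication compression: below is a minimal sketch of one common compressor, top-k sparsification. This is a generic illustration, not necessarily the compression scheme from the line of work the speaker mentions.

```python
import numpy as np

def top_k_compress(update, k):
    """Keep only the k largest-magnitude coordinates of a client update;
    the client transmits k (index, value) pairs instead of a dense vector."""
    idx = np.argpartition(np.abs(update), -k)[-k:]
    return idx, update[idx]

def decompress(idx, vals, dim):
    """Server side: rebuild a sparse estimate of the original update."""
    out = np.zeros(dim)
    out[idx] = vals
    return out

u = np.array([0.1, -3.0, 0.02, 2.5, -0.4])
idx, vals = top_k_compress(u, k=2)       # sends 2 coordinates instead of 5
print(decompress(idx, vals, dim=5))      # [ 0.  -3.   0.   2.5  0. ]
```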
Another issue is limited labels. When you think about your mobile phone, you have a lot of, say, texts and photos that we could potentially train on, but there are very few labels. That is one reason why, if we go back to the first applications, next-word prediction, for instance for Gboard, worked so well: there you do not need labels, and this is why the first applications of federated learning were next-word prediction. One line of work people are looking at is how to incorporate semi-supervised learning while exploiting the data structure within the federated network, and how to actually incentivize clients to label their data.

Then we can talk about personalization. The thing about federated learning is that the clients may hold very different subsets of data, and the question is whether a single global model, which is the standard way of doing federated learning, is the right thing to do. Take the application here, next-word prediction: it depends very much on the context and the user, so a single global model may not be appropriate. A possible research direction that people are looking at is how to incorporate meta-learning approaches into federated learning; there is actually a lot of work on linking federated learning and meta-learning together. Others look at how to discover interesting structure within the federated network via clustering, or at some balance between a local and a global model, where by local model you can think of a model trained purely on the data distribution you own locally, and the global model corresponds to the global distribution. Another popular subset of federated learning is split learning.

Another very important issue, which was perhaps not discussed much at the rise of federated learning when it was originally introduced, is privacy guarantees. While we have, by construction, that the data never leave the device, we did not originally have any formal guarantees that the updates we send, even though they are aggregated and there are many techniques to make them private, do not introduce any privacy loss. And when you think about it, you cannot have zero privacy loss if you want to learn anything. There were also a couple of works showing that even when you do federated learning properly, with very focused updates averaged or aggregated over many clients, privacy leaks are still possible if the attacker is smart. This is where cryptographic remedies such as secure aggregation come in; that makes the system safer, but when we talk about true privacy of the clients, that is where differential privacy pops up. For those of you not familiar with it, the informal definition of differential privacy says that whether or not my data is used in training, the output of the model does not change much, and the upper bound on how much it can change is exactly the guarantee that differential privacy gives you.
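The informal statement above corresponds to the standard (epsilon, delta)-definition of differential privacy; reconstructed here in LaTeX for reference, with the bound e^epsilon being exactly the guarantee the speaker refers to:

```latex
% A randomized training mechanism M is (\varepsilon,\delta)-differentially
% private if, for all datasets D, D' differing in one client's data and
% for every set S of possible outputs:
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon}\,\Pr[\,M(D') \in S\,] + \delta
```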
Another issue that comes with federated learning is that we can think of it as a collaboration among many, potentially mutually untrusted, clients, so it is a very easy target for poisoning. Because everything is open, you might have competitors training a similar model, so there is an incentive to destroy a competitor's model. When designing a federated learning algorithm that you are going to deploy in the wild, you have to be aware of that; thankfully, there are several defense mechanisms that can at least remedy these poisoning attacks.

The last challenge I want to discuss is incentives to participate. Why would I even contribute my data to this federated training? That is something that has to be clearly defined. The same holds for labels: your model would improve if I labeled my data, but why would I do that? Local training also incurs costs in energy, privacy, and so on, and there is the question of whether I benefit from the federation at all. That is the client's point of view; from the orchestrator's point of view, we would also like to know each client's contribution: does this client contribute anything to the federation, would the model be any worse without their data? Other issues come with the ownership of the model: who actually owns it, the community or the global service provider? This opens up very interesting questions, especially if you work on economics. In general, when we look at the three vertices of computer science, economics, and statistics, we already have whole disciplines at the pairwise intersections: economics plus computer science is algorithmic game theory, statistics plus computer science is essentially the basics of machine learning, and statistics plus economics is econometrics. But the area in the middle, where all three meet, is where we see federated or collaborative learning once all of these issues are addressed. With this, I hope I have at least made you appreciate that federated learning presents a new realm of unique, complex challenges. To overcome them, we must devise system-aware, efficient optimization techniques, and the key areas to focus on are optimization theory, networking, and scheduling techniques, in the hope that these will help streamline data processing, improve performance, and therefore enhance the overall effectiveness of federated learning models.

All right, that was the first part. The second part that I want to focus on is decomposable models. Let me start with the motivation. Our motivation comes from very popular linear algebra techniques that are widely used in machine learning: principal component analysis and singular value decomposition. Why are these so popular? Because they give you dimensionality reduction, noise reduction, feature extraction, and they can significantly improve your computational efficiency. The question we started with is whether
we can make a PCA or SVD version of neural networks. To answer that question, let me first look at how the SVD could be represented as a neural network. If you look at the decomposition here, you can represent this mapping as a neural network with one hidden layer, without any activation and without any bias. The nice thing you can notice is that if you prune within this hidden layer, you get not the full but the reduced SVD, where we keep only, say, the first two singular vectors, and so on. So this at least gives us a good example that matrix factorization can be represented within neural networks. Maybe the most interesting part is whether we can actually learn it, and it turns out we can. We can define a general matrix decomposition as the following optimization problem; if you want to turn it into the SVD, the SVD essentially solves this problem for each rank k individually, where U_{:k} denotes the first k columns of the matrix U. The nice thing is that you can put it all together as a summation across all the ranks, and a summation can be represented as an expectation. Once you see an expectation over many terms, you already know what we do in machine learning: you simply apply SGD. And that leads to our construction of what we call ordered dropout. Essentially, what we show is that if you define your problem this way and apply SGD, not over data points but over submodels, then you can actually learn the SVD with a standard machine learning training loop. Also, if you are interested in optimization, you can still view this as an overparameterized problem, which is nice, because the gradient at the optimum is zero for each of the sub-problems. How does this look in practice? If we were to learn, say, the SVD, we would apply this ordered dropout technique in the hidden layer: we have some distribution over the width, in each step we sample a width, and that determines the number of neurons we keep. We call it ordered dropout, not just dropout, because it preserves the natural order of the neurons. The nice thing is that with this construction you can recover the SVD as a special case: if the mapping between the data and the output is linear, and you sample your data from the uniform ball, then the standard training loop with this network and ordered dropout recovers the SVD. That is what I have here: for any of these sub-matrices, the rank-k matrix converges to the best rank-k approximation of the matrix A within a single training loop, and it works with sampling, so we do not need to evaluate with respect to every single sub-matrix; we can just sample one at a time.
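Reconstructed in my own notation from the description above, the joint objective over all truncation ranks, and its rewriting as an expectation amenable to SGD, would look roughly like this:

```latex
% U_{:k}, V_{:k} denote the first k columns; K is the full inner dimension.
\min_{U,V}\; \frac{1}{K}\sum_{k=1}^{K} \big\| A - U_{:k} V_{:k}^{\top} \big\|_F^2
\;=\;
\min_{U,V}\; \mathbb{E}_{k \sim \mathrm{Unif}\{1,\dots,K\}}
\big\| A - U_{:k} V_{:k}^{\top} \big\|_F^2
```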
You can also recover PCA in a very similar scenario; the only difference is that we now sample uniformly at random from the data set, while the mapping between the features and the labels stays the same. In the example we have here, the data come from a three-dimensional subspace, and you can see that the network quickly discovers that: it provides you with the three principal components and zeros out the rest. In the general case, which I will not discuss much, this linear network does something in between: essentially PCA on the data set transformed by the mapping A.

How do we generalize this to neural networks? That boils down to the two works I am going to present. The first is titled "FjORD: Fair and Accurate Federated Learning under Heterogeneous Targets", and it is the work where we introduced this notion of ordered dropout. First, let me acknowledge my collaborators here: Stefanos, Mario, Stelios, Ilias, and Nick. To remind you of the problem we are trying to solve: we are looking at heterogeneous devices, so we want to address the issue of having different tiers of devices in our training, while avoiding the standard construction. If you look at a federated training loop in general, the widely accepted norm is that all the local models, everything you deploy, must have exactly the same architecture as the global model. To achieve that, you can either drop low-tier devices or limit the global model size in order to accommodate all your clients. The drawbacks of these approaches: first, when you drop low-tier devices, you might have very limited participation, and since federated learning trains not just on heterogeneous devices but on very heterogeneous data, by dropping low-tier devices you might lose a very important part of your data set and introduce a lot of bias; second, by limiting the global model size, you might degrade performance for the high-tier devices. Our goals in this work are fairness in participation, meaning every single client can participate, and competitive performance, meaning all devices should get as good performance as possible given their local system constraints. How do we achieve that? First, we simply drop the assumption that every client needs to run or store the same global model. For the lower-tier devices, we deploy thinner models, and I will show you how we do it later; the width we deploy depends dynamically on the local system and network constraints, which might be memory, computational capabilities, load, battery level, or limited bandwidth. We achieve this exactly through the ordered dropout I just discussed: given a general network like the one on the left, we apply the ordered dropout technique in the following way. We define a set of relative submodel widths, numbers between 0 and 1, we define a distribution over those widths, and
in each step we sample from that distribution and, based on the sampled width, we apply the width limitation to every layer except the input and output layers, whose structure we keep unchanged. To give you a brief comparison with standard random dropout: for ordered dropout, the motivation is not only regularization; the motivation is that this technique pushes as much knowledge as possible towards the left of the network, whereas for random dropout the motivation is to prevent co-dependence of the neurons. Another difference is that our inference is exact, whereas random dropout does not have exact inference: you approximate the ensemble with a simple average of weights. To show you one example: consider the network here, and say we sample 0.4; that means we keep 40% of each hidden layer, which here yields two neurons per hidden layer. Once we deploy the trained model, we can accommodate the width of the network based on the tier of the device we are deploying to.

How do we apply this in a federated learning training loop? The construction is the following. We have a set of devices, we take the architecture, the set of widths we want to train, and the distribution over widths we want to sample from. As an example, for CIFAR-10 on ResNet we might have five widths to train, and you can see that this leads to different numbers of MACs and parameters per network across the different widths. We split the devices into several tiers, and each tier is assigned a p-max value, which is the maximum width those devices can work with while respecting their constraints. Then, in a single communication round of training, we proceed as usual: we select the devices to participate, subject to the availability constraints I discussed; to each device we send the submodel corresponding to its p-max, the maximum width it can work with; the device performs local steps, but with ordered dropout, so that it trains across all the widths it is capable of training, which is exactly what gives us the decomposable network I described before; it then communicates its model update back, and the server aggregates in a non-uniform manner, in the sense that for a given set of widths we aggregate over exactly the clients that updated those widths. For inference, you just deploy based on the device capabilities. One of the nice things about having a model trained this way, a model that is actually decomposable, is that even after deployment, say on a higher-tier device, we can scale the model dynamically during inference: if the device has a decreased battery level or an increased load, then, to stay within the constraints, we can simply decrease the size of the model, because we trained it in this decomposable manner.
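As a sketch of the two mechanics just described, ordered dropout on a hidden layer and the non-uniform server aggregation, here is a minimal NumPy illustration. Shapes and names are hypothetical; the real FjORD implementation operates on full networks, not single layers.

```python
import numpy as np

WIDTHS = [0.2, 0.4, 0.6, 0.8, 1.0]      # relative sub-model widths

def ordered_dropout(W, b, p):
    """Keep the FIRST ceil(p * n) neurons of a hidden layer (rows of W,
    entries of b) -- preserving neuron order, unlike random dropout."""
    n = int(np.ceil(p * W.shape[0]))
    return W[:n, :], b[:n]

def aggregate(updates):
    """Server side: each coordinate is averaged only over the clients whose
    sub-model width (their p_max) actually covered and updated it."""
    dim = max(len(u) for u in updates)
    total, count = np.zeros(dim), np.zeros(dim)
    for u in updates:
        total[: len(u)] += u
        count[: len(u)] += 1
    return total / np.maximum(count, 1)

# a low-tier client (p_max = 0.4) and a high-tier client (p_max = 1.0)
agg = aggregate([np.ones(2), 2 * np.ones(5)])
print(agg)   # first 2 coords averaged over both clients, the rest over one
```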
OK, let me show you some of the experiments. (That is strange; it seems the switch from Mac to Windows does not like some of the figures.) Here, on the right, I will focus on the CIFAR results. We used the same training setup as for this model here, and we trained it with ordered dropout. The red squares are models optimized and trained from scratch, so the red points correspond to five separate trainings, while the orange one is ours, trained in a single loop using the hyperparameters of the bigger model. You can see that the performance is more or less on par, and we believe that with tuning, if regularization kicks in and you tune the parameters a bit, you could actually get even better performance than the models trained from scratch. Another thing somebody might ask is: do we really learn a meaningful decomposition of the model? One way to check is to take our model trained with ordered dropout and compare it to a model trained only at full width, from which we obtain, via random dropout, submodels with the same number of parameters as our ordered dropout submodels, and look at their performance, just to double-check that this is not some implicit property of standard training, whereby any submodel of that width would have the same performance. And here you see a very large drop: even at 40% of the network width, the random-dropout submodel corresponds pretty much to random guessing, while we still achieve more than 90%. (Apologies, the Mac-to-Windows switch again does not like some figures.) The next thing I want to mention, a nice property of ordered dropout, is that you can increase the granularity of the widths you train on: when we go from a uniform grid of size 5 to a uniform grid of size 10, you can see that the performance at the intersecting widths essentially coincides. And then, once we deploy this in federated learning, we see the largest increase in performance. The reason is that the only available baseline at the time of this work was to accommodate clients that cannot run the given model by applying random dropout when training on those clients, matching each client's constraint that way. With that baseline, as you try to train larger and larger models you can actually see performance decrease, while for us you see a steady increase, meaning that the more compute power and parameters you can afford, the better the model you get. This also double-checks that the scalability transfers to federated learning.

What I am going to discuss quickly next is a second work that is also based on ordered dropout, where the motivation and usage of ordered dropout is slightly different. Again, to acknowledge my collaborators: Stefanos, Shashank, and Hongyi. The main challenge we are trying to overcome here is
how to train large deep models, or rather how to make the training of small deep models equivalent to training large deep models. If you look at the standard problems with training large deep models, particularly those with millions or billions of parameters, there is energy consumption, resource demand, and data requirements. Why do we train large models at all? Because they perform so well. The reason for that is still an active area of research: it might be due to implicit regularization, meaning the model is biased towards simple solutions, and due to the smoothness of the loss landscape, which makes larger models actually easier to optimize than smaller ones. Our goal is to keep the benefit of training larger models while actually training smaller ones: to design models that maintain high performance while reducing the size and computational requirements. We do this through ordered dropout, trying to discover low-dimensional structure via an efficient decomposition; by low-dimensional structure I mean a low rank of the weights, based on prior work that observed this is one good notion of low dimensionality.

How do we do it? We have already seen that ordered dropout is connected to decompositions. We take the original mapping and transform it into a factorized mapping, then try to decompose it and remove the zeros; that is how we get the low-rank approximation. And the way we make this low-rank approximation nice and trainable is by designing it with ordered dropout. Take our original network: within each layer, we replace the layer with a factorized layer whose inner dimension is the minimum of the two outer dimensions, and we deploy ordered dropout in this factorized layer; the ordered dropout here has no bias and no activation. For training, our sampling is a bit different: we sample one layer at a time, say the first one, and we sample a rank; that defines the network we evaluate in each step. The reason is to learn an efficient factorization of each layer that we can later prune, a factorization that also accounts for the data. These are the main components of the method we propose here, called Maestro. To walk you through the algorithm: this first part here is just ordered dropout, the sampling and obtaining of the pruned network. Then, something I did not discuss: in order to actually get low ranks, we have to enforce them somehow, and one good way to enforce low rank is a lasso penalty; because we have a natural ordering, we can use group lasso. The last ingredient, for good adaptive pruning, pruning as we train, is the condition right here: once the network discovers that it does not need certain ranks, it simply drops them.
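To illustrate the ingredients just listed (a factorized layer, rank sampling, a group-lasso penalty over ranks, and adaptive rank pruning), here is a small sketch. It is my simplified reconstruction with a plain, non-hierarchical group lasso, not the exact Maestro implementation.

```python
import numpy as np

class FactorizedLinear:
    """Replaces W (m x n) by U @ V.T with inner dimension r = min(m, n).
    A training step samples a rank k and uses only the first k components."""
    def __init__(self, m, n, rng):
        r = min(m, n)
        self.U = rng.standard_normal((m, r)) / np.sqrt(r)
        self.V = rng.standard_normal((n, r)) / np.sqrt(r)

    def forward(self, x, k):
        # ordered dropout over RANKS: evaluate with the first k rank-1 terms
        return x @ self.V[:, :k] @ self.U[:, :k].T

    def rank_norms(self):
        # one group per rank-1 component (a column pair of U and V)
        return np.sqrt((self.U ** 2).sum(0) + (self.V ** 2).sum(0))

    def group_lasso(self):
        # penalty added to the loss; encourages whole ranks to vanish
        return self.rank_norms().sum()

    def prune(self, tol=1e-3):
        # adaptive pruning: drop ranks whose group norm fell below tol
        keep = self.rank_norms() > tol
        self.U, self.V = self.U[:, keep], self.V[:, keep]

layer = FactorizedLinear(8, 4, np.random.default_rng(0))
y = layer.forward(np.ones((2, 4)), k=2)   # a sampled-rank forward pass
```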
Now let me quickly walk you through the experiments. Essentially, what this table tells you is that, compared to any other method that does low-rank approximation with a plain SVD, rather than a decomposition tailored to the data and the network, we do much better, and we can produce much sparser networks as well. What I also want to show, in the middle here, are the ranks per layer for different group-lasso penalties. What you can see is that the network may be learning not only a decomposition within each layer but also some global decomposition, because by increasing the group-lasso penalty you always get a subset of the original network. Another sanity check: since you get a decomposition, you can do post-training pruning on the already-decomposed network, and when we compare against SVD, which is the naive linear approximation, we do much better. Another interesting observation is that we do not really need to search for a good pruning technique: we can just look at the ranks the other networks found, try to replicate that, and it actually works relatively well.

I have just two more slides to summarize. We discussed two of the key challenges in federated learning, one being efficient training and inference on heterogeneous devices. We looked at ordered dropout as a technique that enables the decomposition of networks, and we introduced two techniques that exploit ordered dropout: FjORD and Maestro. Some other interesting applications of ordered dropout that we are looking at right now are automatic rank selection for LoRA, model alignment across architectures, and network consolidation. With this, let me conclude, and thank you for your attention.

Thanks a lot, Sam, for this amazing talk. Colleagues, do we have questions?

Thank you for the great presentation. My question is about cybersecurity and privacy in federated learning. What is the future of federated learning: is it, for example, integration with AI and quantum cryptography, or something else? What else should be done to make it more efficient, more private, and more secure?

Thank you, that is a very good question. The main thing, when you ask about privacy, is that we can make it as private as we want; what you actually want to achieve is the best accuracy-privacy, or performance-privacy, trade-off. That is one of the main challenges, because you can always just not communicate at all: I keep my data private and never communicate, which gives perfect privacy but essentially zero utility. What people are looking at right now is, first of all, that it actually turns out to be highly non-trivial to train well in a federated network even without any privacy constraints, as I presented; then, when you add formal privacy guarantees on top of that, they introduce an extra layer of challenge, and there is a lot of ongoing work on how to find the optimum between utility and privacy. That is the main thing. And you can deploy it essentially anywhere where you collect private data, or where you cannot collect private data but want
to actually train on the private data.

Hi, thank you again for a great talk. I have a question: could we expect at some point some unified framework for doing this, like Hugging Face, combining all of this, so that from the user perspective you just pick a model and expect something similar for federated learning? Because you mentioned a lot of frameworks, and they are kind of independent of each other. So what is the reason, is it still work in progress?

Yes, pretty much. The two main frameworks I mentioned are both also startups, and one of their main goals is to provide a unified, one-stop shop for federated learning.

Thank you so much for a very interesting talk. My question is about the following: as I understand, there are several approaches that help to make the training process more efficient, and you showed some graphs on ResNet, as far as I remember, where we have approximately no drop in accuracy, approximately no drop in score. My question is: is this good score the result of some combination of these approaches, or do you apply just one, for example dropout, or the matrix approximation of the layer?

Yes, so essentially, for the results where you do not lose any accuracy, what we show here, the one you pointed to, is that first of all, if you do things well, you can even improve a tiny bit; that is where regularization kicks in, the standard thing you would expect from a decomposition, namely that you remove noise. And then we show that if you really care about pushing the number of parameters smaller and smaller, you can still do that by increasing the sparsity penalty.

OK, so do I understand correctly that this table shows the approach with regularization? Yes, regularization plus pruning, right.

Dear colleagues, I think we are a bit running out of time, because the next sessions should already start, so I suggest that if you have any more questions for Sam, you approach him during the coffee break. Let us thank Sam again.

OK, so thank you very much. I am going to present joint work; most of the work has been done by Özge Sevgili, a PhD student at the University of Hamburg, together with some collaborators also from Hamburg and the Institute of Technology. The work is about the task of ultra-fine entity typing. What is entity typing, and ultra-fine entity typing? Consider this example: you have the mention "Olympic National Park". Many of you know about the entity linking task: this is when you link a mention to a knowledge graph or knowledge base, say Wikidata or Wikipedia. You assume that in Wikipedia there is a specific page about Olympic National Park, and then it is quite useful to link the two, because you can harvest all the information about this national park from Wikidata: attributes, mentions, a description, and so on.
Now, the reality is that even the biggest knowledge graphs, due to the inherent power-law distribution and scarcity of the data, cannot cover everything. I remember this case: there is the BabelNet knowledge graph, and if you look up the word "python", there are about 50 senses of it, including two roller-coaster parks, one in Germany and one somewhere in Florida in the United States. That gives you an idea that, no matter how hard you try, there will always be this gap, always a long tail of entities that either nobody cared to enter into the knowledge graph or that never made it in. That is why entity typing comes to the rescue: you label a mention with a certain hypernym, an is-a relation, and you can still get some idea of what the mention is even if it is not inside the knowledge graph. Within this area there are different granularities: you can say this is a park, or this is a location, or maybe you can say this is an Olympic park. What granularity to choose is a genuinely hard question, and we deal with the case of pretty high granularity, with tens of thousands of senses or labels. The task is relatively easy when you take, say, 50 or 100 different types, because then you can easily collect a lot of data, but that information might not be specific enough to be useful for different applications. The problem with ultra-fine typing, where you deal with a large vocabulary, is that you again hit the problem of data scarcity: you simply do not have that amount of data, and people try to cope with this problem in different ways. The approach I present today, maybe not the most successful one but, I hope, still an interesting one, is an unsupervised approach: we do not use manually annotated data; instead we use distributional semantics, a bottom-up approach, so to speak.

How else do people approach this problem? People try to use distant supervision for entity typing: they think, where can I get the data for free? You can take entity linking data sets, look up the hypernym of each linked entity in the knowledge base, and perhaps generate several hypernyms; this is a common way to automatically generate such data sets. People also use Hearst patterns, a rule-based approach to extract these is-a relations from text; say I utter a sentence like "cars such as Mercedes, BMW and Audi are expensive and luxurious". And there are also approaches using unsupervised techniques, which brings us back to what we are trying to do. We are trying to leverage unsupervisedly induced word senses from the JoBimText framework, which is based on distributional semantics and contains not only distributional representations of words but also distributional representations of word senses labeled with hypernyms. What we actually do in this work is see how useful these hypernym labels are for the task of entity typing: we try to disambiguate the context with respect to these induced senses. How does it actually work? At the core of this approach is a repository of unsupervisedly induced senses coming from the JoBimText framework; you can look at this paper
and this paper for the background work; this is not what is proposed in this work but rather the infrastructure we are using. What it provides is, for every word, a set of induced senses. In this case, take the word "Rennes": it might be a city in France, and in this first cluster you see Lille, Montpellier, and other cities alongside Rennes; but there is also the football club, and that football club sense has a different hypernym. You see these labels: "club" is the label for one sense and "city" is the hypernym label for the other. The magic of this is that none of it is done using human labor. How is it actually obtained? First you obtain, for each word, a list of related words, then you perform clustering and group these words into sense clusters; but at this point you still have no hypernyms. The hypernyms are obtained as follows: every word is assigned a list of automatically induced hypernyms, say using these patterns, and then the counts are aggregated so that the common hypernyms for a cluster pop up at the top. Of course, a single term will have a lot of hypernyms, but the common ones, being shared across the cluster, rise to the top, and that is the trick for getting relatively clean hypernyms: "city" or "club" will be common hypernyms for many of the words in these distributional clusters, because they are all cities, even if there are some noisy words among them. This is what we are leveraging.

Now, how does the method work? As input you always have a mention in context that you need to assign a hypernym to, so you need to disambiguate it against the sense repository. The rest is very simple: you vectorize the context using the input representation, you vectorize the mention, since it contains the word "Rennes", and then you vectorize the sense clusters with the same vectorizer, so that similarity computations can be done; the rest is just picking the most relevant cluster and picking hypernyms from it as labels. Of course, a few additional steps apply. If you go to real entity typing data sets, you see not just a single word but mentions that can be really long and elaborate, and even if you obtain the sense inventory bottom-up from a text corpus, there may be no sense representations for such multi-word expressions. That is why a lot of additional preprocessing is done: headwords are extracted, different keywords from the mentions may be collected, and sense candidates are obtained not necessarily from the exact mention; certain other steps, like singularization of the hypernyms and the mentions, are also done. But essentially, as soon as all these linguistic normalizations are done, you vectorize the context, compare the context vector with the candidate sense prototypes, and pick the most appropriate hypernym label.
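As a sketch of the core disambiguation step described above, picking the induced sense cluster closest to the context and emitting its top hypernyms, here is a minimal illustration; the embedding dictionary and the cluster layout are hypothetical stand-ins for the JoBimText resources.

```python
import numpy as np

def avg_vec(words, emb):
    """Average the available word vectors (emb: hypothetical dict word -> vec)."""
    vecs = [emb[w] for w in words if w in emb]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def type_mention(context_words, senses, emb, top_n=3):
    """Pick the sense cluster whose prototype (average of its cluster words)
    is most similar to the context vector; return its top hypernym labels."""
    ctx = avg_vec(context_words, emb)
    best = max(senses, key=lambda s: cosine(ctx, avg_vec(s["cluster"], emb)))
    ranked = sorted(best["hypernyms"].items(), key=lambda kv: -kv[1])
    return [h for h, _ in ranked[:top_n]]

# toy usage with made-up vectors and two induced senses of "rennes"
emb = {"rennes": np.array([1.0, 0.0]), "lille": np.array([0.9, 0.1]),
       "match": np.array([0.1, 1.0])}
senses = [{"cluster": ["lille"], "hypernyms": {"city": 20, "place": 5}},
          {"cluster": ["match"], "hypernyms": {"club": 15, "team": 7}}]
print(type_mention(["rennes", "lille"], senses, emb, top_n=1))  # ['city']
```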
Experiment-wise, we follow the setup of Choi et al. and compare against their approach and some other baselines. The first baselines are simply picking the first cluster for a given word, or picking a random cluster. Picking the first cluster, which always predicts something, is what is called the most frequent sense baseline in every word sense disambiguation task. Why is it important? Because of the distribution again: most words are used in their dominant sense, and the largest, or in this case first, cluster will carry the dominant sense, so this is always considered a strong baseline. The random-cluster baseline just picks a random sense. Choi et al. is an approach that relies on encoding with a bidirectional LSTM and a CNN and trains a multi-task objective; some other approaches based on masked language models and NLI were also considered. There are quite a few moving parts in this method: you can select headwords differently, you can singularize the mentioned words differently, and, as in entity linking, the way you generate candidates is perhaps more important than the kind of neural network you use. Many studies show this, and this study is similar: whether you do entity linking or this kind of entity typing, how you select the mention, how you match it, and how you generate candidates is very important. Some parameters concern JoBimText itself: JoBimText has a fixed granularity of clustering, which can be finer or more coarse-grained, so for a given word you might have three senses or ten senses, and the clusters may also be more or less noisy. The last parameter we take into account, an important one, is the number of predictions: you can take only the first hypernym, or more, say five or ten. Returning to the earlier example: the first, second, and third labels seem relevant, but as you go down this automatically created list, at some point you hit very noisy hypernyms that correspond to very generic or irrelevant senses.

All right, here is the table with the results. First of all, we see that the first-cluster and random-cluster baselines are indeed relatively strong, but the method itself outperforms them, so it does perform a certain disambiguation consistently. However, the results from the literature, based on those other approaches, are pretty strong. In the end, the contribution this work managed to make is that, in combination with the approach of Choi et al., the method yields a certain improvement. By itself, the method shows that the clusters are pretty noisy, so using it alone yields quite noisy results, but it provides additional information, and if you combine it with the method of Choi et al., especially taking only a few predictions, you get a gain. You can see here that precision drops quite significantly as you take the first hypernym, the first three, the first five, the first seven: precision drops, but recall of course increases. What also worked much better was dropping pronouns. Why does it not work well with pronouns? The reason is very simple: think of a pronoun like "it". Even if you have several senses of the word "it" or "she" or "he", the hypernyms of those senses will be completely meaningless, and if the mention in context is "it", there is no way to correctly generate candidates for it. That is why, in the setup without pronouns, the boost for the ultra-fine setup can be obtained, as you see here. This is summarized in this picture: if
you consider pronouns, the results stay more or less the same, but the additional information coming from the distributional labels is useful in the setup without pronouns. So in this case we get a slight improvement of the score. Of course, the method makes certain errors; you can see here what was generated, and these are the top two predictions. For instance, the method generated something like "violation, difficulty" for this mention, while the true label was "crime" or something similar; again, you see the task is pretty challenging for real data sets, since mentions can be very long. In this case the method seems to generate quite plausible results, but sometimes predictions that might be somehow relevant are simply not in the gold standard.

So, the summary of this talk: we explored how information from JoBimText, from unsupervisedly induced word senses, can be used for the task of entity typing. It seems that if you do not consider pronouns, results can be improved when combined with the method of Choi et al., which means the word senses contain helpful, complementary information. However, word senses induced purely from text are pretty noisy, and you need to handle them with extreme care; in particular, you need to be very careful about how you generate candidates. That is pretty much it; if you have any questions I will answer them, and Özge may also be on the Zoom link, so Özge, in case you want to say something... Yes, I am also here.

Thanks for the talk. Just one very simple question: you worked with English data, and I am not familiar with this JoBimText framework, but how easy would it be to extend it to other languages? Does it exist for other languages?

Yes, it actually exists for multiple languages: I am sure it has support for German, for Russian, for Italian; I am not sure how long the list is, but I think maybe 5-10 languages are supported. You can look at the website: they have a nice web demo where you can just enter a word and look up all of these word senses. This is actually a snapshot of that demo: you enter a term, select an example and a language, and you see something like this, the automatically induced sense clusters and their labels. Thank you very much.

Good morning, everyone. My name is Maria Maslova, and today I am going to present a project titled RuCAM: Comparative Argumentative Machine for the Russian Language, together with my colleagues Irina Nikishina, Stefano Rebrokov, as well as Chris and Sebastian. I am going to speak about quite an important topic: the problem of choice. I am sure all of you have faced the necessity to choose between iOS and Android, between holiday places, car models, and so on. One more topical issue is the choice between cats and dogs: who is for cats in this auditorium, raise your hands; and for dogs? I see. At the end of the presentation we will see how our system responds to this question. It is quite logical to create a system that helps to solve this problem, the problem of choice, with the support of reliable arguments. However, the task is quite complicated, as it lies at the intersection of question answering and argument mining. One of the most well-known and prominent systems in the field is CAM, which can answer a user's comparative input with the support of arguments extracted from a large text corpus. However, CAM is English-oriented, and there was no analogue for the Russian language. So now we present RuCAM, a system aimed at
comparing two objects from the general domain in Russian with argumentative explanations. Compared to its predecessor, RuCAM has the following differences: it works with comparative questions in natural language, it has a component for object and aspect identification from comparative questions, and it uses an Elasticsearch index of the Open Super-large Crawled Aggregated coRpus, abbreviated OSCAR. We do not only develop a similar system and pipeline from an engineering perspective; we also try to pose and answer the following research questions: what are the main peculiarities of CAM that need to be taken into account when adapting it to other languages, and, a more specific one, what are the main challenges for adapting CAM specifically to the Russian language? To start answering these questions, let's look at the system design. The process can be split into two steps: question analysis and argument retrieval. The first step is about identifying the interrogative and comparative nature of a sentence, and it also includes the object and aspect identification process; the second step consists of the search for relevant arguments for the extracted objects, their classification, and ranking. Let's consider each step in detail. The processing of a request starts with identifying the question type, whether it is comparative or not. It can be done in different ways, including a rule-based approach; here we stick to the idea of special patterns in comparative questions, which include comparative forms and explicit mentions of comparison, similarity, difference, etc. To implement some machine-learning approaches, we first compile a dataset from the sources shown in the table, and here is an example taken from this dataset. Then we use RuBERT and a fine-tuned BERT from another study; as you can see in the table, the latter shows the best results, but comparative questions are quite a specific kind of question that can be identified with good quality even using rule-based methods.
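As an illustration of this rule-based idea, here is a minimal sketch; the actual RuCAM patterns are for Russian and are surely more elaborate, so the patterns below are hypothetical English stand-ins:

```python
import re

# A minimal sketch: flag a question as comparative if it contains a
# comparative form or an explicit comparison word. Illustrative patterns only.
COMPARATIVE_PATTERNS = [
    r"\b(better|worse|faster|slower|cheaper)\b",   # comparative adjective forms
    r"\bmore \w+ than\b",                          # analytic comparatives
    r"\b(difference|similar|compared?)\b",         # explicit comparison words
    r"\b(vs\.?|versus)\b",                         # "X vs Y" style questions
]

def is_comparative_question(text: str) -> bool:
    lowered = text.lower()
    return any(re.search(p, lowered) for p in COMPARATIVE_PATTERNS)

print(is_comparative_question("Which is better, cats or dogs?"))   # True
print(is_comparative_question("When was Python released?"))        # False
```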
After identifying the question as comparative, we need to extract the objects and, optionally, the aspects to provide them to the argument retrieval stage. At this step we also implement several approaches, including a rule-based one. It is founded on the idea that all requests have a certain structure, namely two objects and a connective of comparative nature between them. We consider four cases: two nouns, two verbs, the combination of a noun and an adjective, and the combination of a noun and two subordinate adjectives; we also expect a connective from a list of conjunctions and other words expressing comparison between the two objects. In order to create a dataset for this task, we took a thousand sentences from the previous step that had been labeled as comparative and manually annotated them; annotators with a background in computational linguistics were asked to label the first and the second object and, optionally, the aspect and the common object. A common object is a specific structure with a noun subordinating two adjectives: for example, in the phrase "green or black tea", "tea" is the common object. The level of annotation agreement is shown here; when creating the final dataset for model fine-tuning, we used the annotation version supported by the majority of annotators, and here is one more example. We used fine-tuned transformer encoders and a few-shot approach with generative transformers to solve the object and aspect extraction task. The table presents the per-model results: we see that generative models perform on par with or even slightly better than baselines, but significantly worse than transformer encoders; still, they may perform much better after proper fine-tuning, as we showed them only five examples. Regarding the zero scores for the common object and aspect labels, we see two problems: the first is the inconsistency of annotation, and the second is the complex semantic nature of these labels. In order to retrieve documents in favor of one or the other object, we use the Open Super-large Crawled Aggregated coRpus, OSCAR; we use OSCAR instead of Common Crawl, as it is claimed to be its filtered version. We store and index this data with Elasticsearch. When indexing documents, we decided to create two indexes: the first one for storing document information, like the number of sentences, the web source and so on, and the second one for storing the sentences themselves. To process sentences, we first do Snowball stemming, not lemmatization, because of time constraints, and then apply wildcards to be able to find all word forms in the Boolean JSON query, requiring that the objects must appear in matching documents. We consider this step to be the most challenging in the whole CAM pipeline, as Russian has highly fusional morphology, which makes it much more difficult to retrieve sentences than in English, because query words may occur in any form. Here you can look at the Elasticsearch output for our objects, just as an example.
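A sketch of the kind of Boolean query this indexing scheme suggests; the field name and the stems below are hypothetical, not taken from the paper:

```python
# A sketch of the boolean query described above: both objects are stemmed and
# wrapped in wildcards so that any inflected word form matches. The field name
# "text" is a hypothetical index layout.
def build_query(stem_a: str, stem_b: str) -> dict:
    return {
        "bool": {
            "must": [
                {"wildcard": {"text": f"{stem_a}*"}},
                {"wildcard": {"text": f"{stem_b}*"}},
            ]
        }
    }

# e.g. Snowball stems for "cats" / "dogs" (English stand-ins for Russian stems)
print(build_query("cat", "dog"))
```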
After the candidate sentences with possible arguments are found, it is necessary to understand whether a sentence argues in favor of the first or the second object. Again, we have a rule-based approach that relies on a list of keywords, adjectives and adverbs with the meaning of superiority or inferiority of the first object over the second; it also takes into account negation cases, when the meaning of the sentence is reversed. We collected a dataset from 140 object pairs and annotated it using the Yandex Toloka crowdsourcing system. To do this, we selected the same or similar pairs from the same domains as in the English research, like programming languages, car manufacturers, food, drinks and so on, and made queries to Elasticsearch to extract all sentences matching the query. Then we created a system of three tags: the "better" tag means that the first item wins, the "worse" tag means that the first item loses, and the tag "none" means that there is no comparison between the objects we are interested in. Unfortunately, the annotated dataset is highly imbalanced: 75% of sentences are "none", for example, and only 9% belong to the "worse" tag. In this study we also implement several transformer encoders and few-shot approaches with generative transformers. The results for comparative sentence classification are inconsistent and relatively low for all the models, due to the class imbalance problem. It is interesting that the rule-based approach produces quite a decent result on "better" sentences; it even outperforms BERT-large on "worse" sentences and generative transformers on "none" sentences. The process of sentence ranking is identical to the one in CAM: we score comparative sentences by combining the classifier confidence and the Elasticsearch score. When displaying the arguments for a certain object, we sum up not only "better" arguments where the current object is the first item, but also "worse" arguments where the object is the second one in the sentence; for instance, both sentences "cats are better than dogs" and "dogs are worse than cats" are used in favor of cats when comparing them with dogs. The main outcome of our research is the final system, where we integrate all the parts described above. The evaluation of the system is currently a work in progress: we plan to evaluate RuCAM analogously to the CAM evaluation pipeline, by asking whether users perform faster when searching for something of a comparative nature with RuCAM compared with keyword search, and we can also ask some users just to play with the system to collect feedback. That's it about the pipeline, and it's time to answer the research questions in general. When transferring CAM to other languages, you should take the following peculiarities into account: the difference in the notion of comparative sentences in different languages; differences in the syntax and morphology of languages when implementing rule-based approaches; and the existence of relevant datasets and pre-trained language models for the different subtasks, as well as large text corpora containing comparative sentences for search in the target language. Nevertheless, as has been shown for Russian, it might go quite smoothly if at least some of the required tools are available. What do we have now? We have RuCAM, the first instrument which helps to answer general-domain comparative questions in Russian. Inspired by the CAM system, we created a similar pipeline, adding new steps for comparative question identification, object and aspect identification, and sentence classification. We also present several new datasets in Russian that might be further used for fine-tuning language models for these subtasks. From the performed experiments we can see that rule-based approaches show decent results on all subtasks of comparative question answering, as do few-shot generative transformers, and these subtasks need to be further investigated. And finally, let's look at our comparison from the beginning of the talk: according to our system, cats win, and quite confidently so. That's interesting, and here is an example of some top-ranked sentences extracted from Elasticsearch. As future directions, we plan to incorporate a summarization system that would be able to produce a coherent answer from the two lists of arguments for each object; it will also allow us to compare the results of various instruction-tuned models for Russian and ChatGPT with the RuCAM pipeline. Thank you for your attention. Any questions? Thank you for your work. Can you elaborate a bit more on how you plan to evaluate the system outputs? Your experiments were based on different components, whether the comparative-question classifier works well, whether sentence classification works well; but ultimately a user has a certain information need and is presented with these outputs, whether a cat is better or not. How would you judge whether the system satisfies this information need well or not? Well, for users there will be a sort of frontend to use, to input queries and see whether they work slower or faster... what else should I say? Okay, yeah, that's a human study. Do you think there is a possibility to do it automatically, in a reproducible way, so that tomorrow somebody develops another system and it can be compared as well, or is it just not obvious? As I understand, for now the main way of evaluation is human-based, but maybe we should think more about ways of automatic evaluation; that's a sort of future work, I suppose. Okay, and maybe... okay, I will pass. Thank you for the talk. Related to what Alexander asked: have you tried to ask questions that compare something incomparable, and how does the system deal with such things?
I mean, it's partially due to lack of data, partially due to some strange query: what is the default behavior, what is the intended behavior in your system, if I compare cats and Audi or something like this? Well, of course, if you train on this dataset, cats will always be rated better. Yes, in fact I suppose there will be almost no output from the corpus if we pose such a question. Yeah, clearly, but what is the default behavior in this case, what should the system do? For now it is all done per request: if we identify that the question is comparative, then the two objects go to the Elasticsearch system, and it retrieves some sentences. In my opinion, there will be almost no sentences comparing cats and BMW; sometimes we have "cat" as a machine, and then this is a disambiguation problem, but mostly it's future work. So there might be some incomparable objects, and this is quite a limitation for now, but in future work we are planning to apply some taxonomic structures, to look for hypernyms and check how closely the objects are related, or whether they are in different parts of the graph, so they cannot be compared. Yeah, that could be done with some more fine-tuning or some kind of unsupervised baselines using a taxonomy such as WordNet, in my opinion. Thank you for the talk. Have you considered using more classes? It seems to me that sometimes the answer to a comparative question is that things are equally good or equally bad; have you thought of such classes? We have never thought about that, but I think that as a final result we should receive a sort of score for the first object and the second object, and that output requires the classes "better", of course, and "none", and nothing more. Okay, thank you. Let's thank the speaker again; now it's time for the third talk.

Okay, hello everyone. I'm Maxim Savkin, and I would like to present our paper called "Tuning-Free Discriminative Nearest Neighbor Few-Shot Intent Detection via Consecutive Knowledge Transfer". Let's start with an introduction to the task. Intent classification is the task of identifying the user intent given an utterance; naturally, it appears in dialogue systems, and it comes along with the task of out-of-scope detection. Here on the slide, "tell me a joke" is an example which doesn't belong to any of the predefined in-scope intents, so it belongs to the out-of-scope class. I would like to emphasize the importance of out-of-scope detection, as it is crucial for generating an appropriate response. We solve both of these problems simultaneously. The motivation behind this work is that most of the existing methods for intent classification rely on expensive fine-tuning and have high training requirements, especially state-of-the-art models, and also most of them are focused on in-scope classification, completely missing out-of-scope detection. In our approach, on the other hand, we try to create a model which can work as a service, so it doesn't require any task-specific fine-tuning: it takes a few-shot dataset and a set of unlabeled utterances as input and produces a set of intent labels. We inherit the discriminative nearest neighbor architecture: it utilizes standard k-nearest neighbors and replaces the distance function with a deep cross-encoder pair model. This model takes an input utterance and a training utterance and tries to predict the probability of these utterances belonging to the same intent class, so it is some sort of a similarity function, but based on a deep cross-encoder model.
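A minimal sketch of this nearest-neighbor scheme under a stand-in similarity function; the real model uses a pre-trained cross-encoder, so the Jaccard word overlap below is only a placeholder, and the threshold value is made up:

```python
# A minimal sketch of the tuning-free nearest-neighbor scheme described above.
# `similarity(u, v)` stands for the cross-encoder that scores the probability
# of two utterances sharing an intent; here it is a toy stand-in.
def similarity(u: str, v: str) -> float:
    a, b = set(u.lower().split()), set(v.lower().split())
    return len(a & b) / len(a | b)  # Jaccard overlap as a placeholder

def classify(query, support, threshold=0.3):
    """support: list of (utterance, intent) few-shot examples."""
    best_score, best_intent = max(
        (similarity(query, utt), intent) for utt, intent in support
    )
    # If even the nearest neighbor looks dissimilar, predict out-of-scope.
    return best_intent if best_score >= threshold else "out_of_scope"

support = [("transfer money to savings", "transfer"),
           ("what is my balance", "balance")]
print(classify("move money to my savings account", support))  # transfer
print(classify("tell me a joke", support))                    # out_of_scope
```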
In the original paper, the strong capabilities of this similarity function were achieved by fine-tuning it on pairs of examples from the target dataset; however, we suggest completely skipping the fine-tuning step and focusing instead on creating a strong pre-trained similarity function which can differentiate between unseen intents. So further I'll focus on pre-training this similarity function. We consider several binary classification tasks. The first one is natural language inference, which is popular for pre-training strong binary discriminators. The second one is paraphrase detection, which suits similarity prediction a bit better. And the final one is consecutive pre-training, where the model first trains on a large natural language inference dataset, so that it can learn some utterance relations, and then tunes on a smaller paraphrase dataset. To better match similarity prediction, for natural language inference we also merge the last two classes, neutral and contradiction, into non-entailment, so it becomes a binary classification task. One problem with paraphrase detection is that it lacks large high-quality datasets, so to mitigate this issue we tried taking a small high-quality dataset and augmenting it with non-paraphrases. For generating non-paraphrases we use a sort of clustering. You can notice that paraphrasing is an equivalence relation, so a paraphrase dataset can be divided into equivalence classes: utterances from the same class have the same meaning and can be considered paraphrases, while utterances from different classes, which possibly have different meanings, can be considered non-paraphrases. All those missing connections you see on the slide become our newly generated non-paraphrases, which we use to improve the results on the paraphrase data.
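A sketch of this augmentation, assuming toy equivalence classes; the class names and utterances below are made up:

```python
from itertools import combinations

# A sketch of the augmentation described above: utterances inside one
# equivalence class are paraphrases; pairs drawn across classes become
# newly generated non-paraphrases.
classes = {
    "greet": ["hi there", "hello"],
    "bye": ["goodbye", "see you later"],
}

paraphrases = [
    (a, b) for utts in classes.values() for a, b in combinations(utts, 2)
]
non_paraphrases = [
    (a, b)
    for (c1, u1), (c2, u2) in combinations(classes.items(), 2)
    for a in u1 for b in u2
]
print(len(paraphrases), len(non_paraphrases))  # 2 positive, 4 negative pairs
```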
As for the metrics for the final classification model, we use in-scope accuracy and out-of-scope precision and recall, which are defined as standard precision and recall where the positive class is out-of-scope and the negative class is the combination of all in-scope classes. We randomly sample 10 few-shot datasets from the original datasets and report the average and standard deviation for all the metrics. We use the large CLINC150 dataset, which contains 10 domains and a wide variety of intents, and the BANKING77 dataset, which contains only one domain but 77 fine-grained intents. So let's move on to the results of pre-training. As you can see on the left plot, the natural language inference task, despite being so popular, achieved the worst results, due to its directional nature, and the best results we were able to obtain so far are with consecutive pre-training, where the model first trains on a large natural language inference dataset and then tunes on a paraphrase dataset with the newly augmented non-paraphrases; this is the third column. I would also like to note that the augmentation really helped to increase both in-scope accuracy and out-of-scope recall by introducing new non-paraphrases. So let's compare our model against other tuning-free methods. The first one is TF-IDF kNN classification, and the second one is an embedding method, which is actually just kNN based on a bi-encoder RoBERTa model that was pre-trained on a natural language inference task only: pre-trained, no fine-tuning at all. Here you can see that on the CLINC dataset and on the banking subset of CLINC we achieve the best results so far, by a large margin, and our approach is also much more stable to threshold selection, as it has a much larger area under the curves of in-scope accuracy and out-of-scope recall. We also thought it would be important to compare our model with some fine-tuned methods. DNNC is the state-of-the-art model for out-of-scope detection, and here you can see that, as expected, fine-tuned methods are better in initial in-scope accuracy; however, for the standard RoBERTa model you can see a huge drop in accuracy, which means that it has a lot of low-confidence predictions, while our model has a larger area under the curve of in-scope accuracy, so it is more stable to the selection of the threshold. These results are not included in our paper, but we decided it would also be interesting to see how our model stacks up against ChatGPT. The whole CLINC dataset didn't fit in the prompt, so we used only the banking subset of CLINC, and with standard out-of-scope examples you can see that ChatGPT with a zero-shot prompt achieves nearly ideal results; we suppose that it's because it memorized this dataset quite well. So we decided to replace all text labels with indexes, and we also replaced the standard out-of-domain out-of-scope examples with harder in-domain out-of-scope examples. As you can see, our model still attains relatively high recall and accuracy, while ChatGPT really struggles with these out-of-scope examples; only in the usual setup is ChatGPT able to produce relatively good results. Summarizing our paper, I would like to say that we've developed a model that doesn't require any task-specific fine-tuning, so it can be applied to any dataset for intent classification; it supports out-of-scope detection, has the best tuning-free performance on the CLINC dataset, and is robust to threshold selection. Thank you. Thank you. Can you elaborate on the difference between the in-domain and out-of-domain setups that you mentioned, the one just before the conclusion? You mentioned that you removed some identifiers and converted them to numerical ones; what does that actually mean? I mean that ChatGPT has probably memorized the dataset quite well, so we decided to reduce the effect of this memorization and replaced all text labels with just indexes, so it couldn't rely on memorization through the prompt. Well, would it be fair to ChatGPT, in the sense that maybe your system memorized these indices? No, it doesn't take labels at all: it just takes input utterances and compares them between themselves, and doesn't take a label as input, so it's totally fair, I think. Do we have more questions in the audience? Maybe we have some time for one Zoom question, if we have one. No, we don't have questions in Zoom. Okay, now it's time for the fourth talk.

Hello, my name is Vasily, and we are going to talk about whether it is possible or not to find the number of topics in a natural language processing dataset. A couple of words as an introduction: what is topic modeling about? A topic model receives as input a huge text collection, and as output it produces topics, topics as probability distributions over words, so we can see what a text is about, and we can locate the places in a text where each topic is covered. The problem of topic modeling can be viewed as a matrix decomposition problem: a topic model receives as input a matrix of word-in-document frequencies, and it decomposes this matrix into the matrix product of two matrices: first, the matrix of word probabilities in topics, the matrix Φ, and second, the matrix of probabilities of topics in documents, the matrix Θ.
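In the notation just introduced, this factorization can be written as follows; this is the standard formulation, with T denoting the number of topics:

```latex
% F is the word-document frequency matrix, \Phi holds word probabilities per
% topic, \Theta holds topic probabilities per document, T is the topic count.
\[
  F \approx \Phi \, \Theta, \qquad
  p(w \mid d) \;\approx\; \sum_{t=1}^{T} \phi_{wt} \,\theta_{td}
\]
```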
Well, it is not clear how one should select this hyperparameter, the number of topics: is it crucial or not, and how do we find it? It seems that in some text collections, at least, the number of topics can be well defined beforehand. For example, if we take some articles from Wikipedia, they are labeled and split into sections, so if we take some articles from, let's say, the art section, some from the biology section, some from history, we get a dataset, and we can say, obviously, that there are three topics in it. But in real life it may be more complex, because these topics are labeled by humans; this is just something that helps to simplify the categorization process. In real life there may well be many more topics in this text collection: these bigger topics may be split into smaller ones, and furthermore, topics may combine and produce new topics. So is there a number of topics in a text collection or not? Can we find it or not? We are trying to find an answer to this question. And what are we going to do? We are going to train a topic model on a text collection with a varying number of topics, from low to high, track some topic model quality measures, and look at the plots of the quality measure against the number of topics; if we see some local minimum or maximum or plateau, it may be a sign that the number of topics corresponding to this interesting point is an optimal one. This is the picture which we expect to see. Before going further, I think it is important to say a couple of words about similar projects: we are not the first to try to track a lot of quality measures while training topic models, and we are not even the first to try to find the number of topics using these quality measures, but we believe that our study is one of the most extensive ones. So what quality measures are we going to use? First, perplexity, maybe the most common measure while training topic models: the lower, the better. The second block is diversity measures: they compare topics with each other, computing distances between pairs of topics, because if a topic model produces topics which are all similar, that is bad. The next block is clustering measures, because topic modeling can be viewed as a soft clustering problem: words are split into topics, which are soft clusters, so we can adopt several measures from cluster analysis and use them as measures of topic model quality. And the last one on this slide is a block of stability measures. Topic models are unstable: if we train a topic model on the same text collection with different random initializations, we can get different results, so we compare the topics of topic models obtained with different random initializations. This is not all the measures that we use. The next block is information-theoretic measures; we use several of them, but the idea is the same: they compute, roughly, the difference between model complexity and model likelihood. The bigger the likelihood, the better, but the bigger the complexity, the worse, so these metrics try to find a balance between the two model characteristics. The next one is entropy: there are works whose authors propose to use this metric, drawing an analogy between topic modeling and a complex system where we have several possible states (topics) and particles (words) that can occupy them; the optimal number of topics according to this measure is the number of topics which gives an equilibrium state of this system. And the last block of measures are the top-token measures: we compute coherence and lift scores. That's it about the measures.
Well, the methodology of our experiment is roughly as follows: for each dataset we train a topic model with a varying number of topics, from a minimum to a maximum, compute the topic model quality measures, and look at the plots; the minimum and maximum number of topics that we vary depends on the dataset. The models which we use are the following: PLSA, the simplest model, which has only one hyperparameter, the number of topics; LDA, the most well-known topic model; and a couple more: a decorrelated topic model, which is trained to produce topics that are distinct from each other, and a sparse topic model, which splits its topics into two groups: background topics, which are smooth and essentially about nothing, and specific topics, which are sparse and exact. Sparse means that the largest probability mass is spread over only a small group of topic words. So much for the models, and here are the datasets. We use several datasets in English and several in Russian, and for each dataset we know, at least approximately, the expected number of topics; this is our ground truth. I should say a couple of words about the last dataset: this is our own dataset, composed by our research group, and it consists of good articles from the Russian Wikipedia. Here is the result table with three numeric columns; each column is a score which we assign to each quality measure. What do they mean? The first column is the Jaccard metric: it tells how consistent with each other the predictions of the same topic model trained with different random seeds are; the lower, the better. In this table it is called Jaccard because it is computed the following way: we take the predictions of topic models with different random seeds and make two sets, the first set being the union of predictions and the second the intersection of predictions, and we compute the Jaccard distance between these two sets.
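A sketch of how such a score could be computed; the predictions below are toy (document, topic) pairs, and this is an interpretation of the description above, not the authors' exact code:

```python
# A sketch of the stability score described above: predictions from two runs
# of the same model with different seeds are compared by taking the Jaccard
# distance between the union and the intersection of their predictions.
def jaccard_distance(a: set, b: set) -> float:
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0

def stability(preds_seed1: set, preds_seed2: set) -> float:
    union = preds_seed1 | preds_seed2
    intersection = preds_seed1 & preds_seed2
    # Lower is better: 0 means the two runs agreed perfectly.
    return jaccard_distance(union, intersection)

run1 = {("doc1", "topic_a"), ("doc2", "topic_b")}
run2 = {("doc1", "topic_a"), ("doc2", "topic_c")}
print(stability(run1, run2))  # 0.67: one shared prediction out of three
```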
The next column is informativity: it tells how readable by a human the plot of the quality measure against the number of topics is, whether there is a local minimum, maximum, or plateau, or whether it just goes randomly up and down without any possibility of making a prediction out of it. The last column is called "expected": it tells whether the quality measure succeeded in finding the exact number of topics, whether it corresponds to the ground truth from the table with the datasets. Well, what can we see from here? The best values in the table are colored blue, but even these best values are obviously far from good, so it may be an indicator that there is no such notion as a natural number of topics in a dataset. There are several illustrations for some other conclusions. First, we found that the optimal number of topics depends on the topic model used: for example, on this plot we can see three curves, each corresponding to a sparse model with a different sparsity hyperparameter, so one is more sparse than another, and as we can see, these models, the same sparse model with different parameters, give different results for the optimal number of topics, the local minimum. What is more, these broad lines are averaged over random seeds, and each random seed also produces a slightly different number of topics, which is a result in itself. Another finding is the following: different quality measures produce different numbers of topics; this is the general case from which we started. However, sometimes, on this plot exactly, we could see that different quality measures give the same result: for example, here is one topic model and several quality measures, and they point at roughly the same number of topics, seven. But this is an exception, not a rule. And that is probably it. As a conclusion, we found out that the number of topics is probably not a natural characteristic of a dataset; it is just another hyperparameter of a model, and it also depends on the quality measure used to find it. Perplexity and coherence, maybe a bit surprisingly, failed to give any decent results, while information-theoretic criteria and Rényi entropy achieved the best results. As a final remark, we want to say that finding the optimal number of topics is probably not such an important task; what matters is finding a way of training a topic model which has all topics interpretable and good, whatever topic number you assign at the beginning. Well, that's it, thank you for your attention. I assume it's fine listening to me from the mic here. I have one small question: could you please summarize what we should do if we need to select the optimal number of topics? Basically, run as many models as possible and then select, did I understand correctly? We think that the best way is to know beforehand, at least roughly, what number of topics you have in your collection. Second, if you don't know how many topics you have, 20 or 200, then it is best to just run experiments in order to find a topic model which best describes your collection. Another way is to train many topic models, collect good interpretable topics, put them aside, and by looking at how many topics you collected you can see how many topics you have. So something like this: you start with some number of topics and start experimenting, making your topic model better or collecting topics. Do we have more questions? Yeah, thank you. Did you consider some classification-based experiments, where you apply the obtained representations to measure quality on a downstream task? No, we concentrated on just intrinsic quality measures, without trying to assess topic models by experts or secondary tasks, because if we tried to select the number of topics based on a secondary task, it would obviously just find the number of topics which gives us the best results on that task; we wanted to find out whether there is some natural number of topics which could be found by intrinsic quality measures. Yeah, I think the optimum might be different for different classification tasks, or for different applications different granularities might be needed, so that might be at the heart of this issue, right? You have a hierarchy of granularities: maybe for retrieval you need coarser granularities; for, I don't know, authorship identification you need very specific things. So there might be a general idea that everything in computer science depends on the application, and there is just no universal representation, at least in terms of topics, which always works well; like in clustering, you might have different views on the same data. Yes, yes, if you are trying to solve some secondary task, it is best to search for the number of topics based on this task. Do we have more questions? Yes, we do. No, no, Andrei, we have one question from the audience and then yours. Thank you for the insightful talk, I have a couple of small technical questions. The datasets were Wikipedia good articles, and, as I believe, in all those datasets you have the true number of classes, the topics, right? So how did you obtain the
topics, from the rubrics, right, from these categories probably, but what was the label set? Well, yes, it was taken from Wikipedia: we knew the categories of each article, so that's how we found the number of topics. I mean, what were the topics, some categories from Wikipedia, some large categories? Yes. There are articles which are called "good", which are checked, big, and thorough, and there are several categories which these good articles belong to. So there was some manual post-processing for the label set, right? Yes, yes. Great, and a second question, also technical: what motivated the choice of the models concerned, PLSA, LDA, and different flavors of ARTM? I mean, the kingdom of topic models is rather large, so why these? Well, our main idea was to take several topic models in order to exclude biases towards a particular topic model; we just wanted to take more than one topic model. So we took PLSA and LDA as the best-known and simplest approaches, and several other solutions just to make the set more diverse. I mean, wouldn't neural topic models contribute something else, or was it not really that important for the study? No, we didn't consider neural topic modeling because, as far as I understand, there is also a hyperparameter T for those models. We excluded the models which themselves try to find this number of topics as a result, for example the hierarchical Dirichlet process, because they introduce additional hyperparameters which need to be optimized, and they are also not universal, in the sense that they are assessed differently from the majority of topic models, which have just two matrices, Φ and Θ. Yeah, of course, but some neural topic models have T as a parameter, like ETM, for example; but anyway, thanks for the answer. Thanks for the question. Let's thank the speaker again, and now we have the last talk of the session.

Hello everybody, welcome to this talk. This work was done in collaboration with Kazan Federal University, and it is a part of a project on studying text complexity at different levels: this part is related to text complexity at the sentence level, and the other parts of this big project done at KFU are related to academic text complexity and also to lexical complexity. The context of this project is that we want to do the prediction at the sentence level and at other levels, as I already discussed. Sentence complexity is one of the well-studied tasks already, but in Russian it is not that well studied, and there are limitations to the classical measures, so we tried not only the classical approach with features, but also deep neural networks, and measured their performance on this task. One thing to note is that other languages, like Italian, English, and others, already have datasets for sentence complexity; in Russian there was no such dataset. A brief discussion of the work people did before us: they collected data in different languages, trained classical approaches on lexical and syntactic features, as well as deep learning methods like pre-trained language models, and collected a lot of different datasets in this domain, including English and Italian. The first part of our project was collecting data in order to run the evaluation. For the dataset we followed the methodology of the English and Italian dataset collection: we used the Toloka crowdsourcing platform and asked Toloka annotators to annotate
sentences according to seven levels of complexity. We also wanted to experiment with different features, so we sampled the dataset from the SynTagRus corpus, which has syntactic annotations and other interesting things that we can make use of later. The sampling of the sentences was related to the frequency of the lexical items in the sentence: we tried to sample sentences of average frequency, so as not to sample too complex or too rare words and so on, as our colleagues did before us. Each bin contained 200 sentences, so overall it is about 1,100 sentences. Here is a sample of the annotation interface: people were just asked to pick one of these scores; there were 10 assessments per sentence, and here is an example of a sentence. We collected data from people who are native speakers; there were no other restrictions. There is a slight imbalance in the data: here you can see the distribution of the scores in the dataset, and it is biased towards more complex sentences, which is kind of contradictory to what we have in the two other datasets, English and Italian, but it is still an interesting observation. Regarding the assessment of agreement, we have this distribution of complexities: the x-axis here is the sentence length in tokens, and the y-axis is the distribution of the average score, and of course there is clearly a correlation between the two parameters. We also measured the average number of people who agreed about the score of a sentence: out of ten assessors, 4.3 on average agreed about the complexity score. Of course, sentence length is a crucial parameter, important for assessing the complexity or readability of a sentence, but it is not the only one. Previously we analyzed what features can be important, and based on the features most related to the target we built a simple linear regression; this work is a step further, where we try to push the performance, to get more out of this data, and in the second part of the talk there will be some modeling with classical and deep learning approaches. Here you can see the discrepancy between the Italian, English, and Russian data: in Italian and English you have a relatively small number of complex sentences (complexity on the y-axis, sentence length again on the x-axis); those two datasets have relatively fewer instances with long sentences. The same can be seen here in the distribution of the complexity score: pink is Russian, this one is English, and this one is Italian, so the average complexity also differs across languages. It can be due to sampling, of course, but maybe it is also related to some other linguistic properties, such as the average length of a word, the average length of a sentence, and maybe also the annotators. Here you can see the distribution of these properties for the Russian, English, and Italian datasets, and here the frequency of lemmas and the sentence length, so it is more or less similar. A discrepancy appears here, but because we took the logarithmic frequency, after normalization they look the same. So these three datasets are comparable, and you can run the experiments.
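A sketch of the frequency-based binning described earlier in the data collection; the frequency dictionary and the bin count here are hypothetical:

```python
import math
from collections import defaultdict

# A sketch of the sampling idea described above: sentences are scored by the
# average log-frequency of their lemmas and grouped into bins, so that the
# sample avoids sentences dominated by very rare or very frequent words.
lemma_freq = {"cat": 120_000, "sit": 80_000, "mat": 5_000, "ontology": 300}

def mean_log_freq(lemmas):
    return sum(math.log10(lemma_freq.get(l, 1)) for l in lemmas) / len(lemmas)

def bin_sentences(sentences, n_bins=5):
    scored = [(mean_log_freq(s.split()), s) for s in sentences]
    lo = min(f for f, _ in scored)
    hi = max(f for f, _ in scored)
    bins = defaultdict(list)
    for f, s in scored:
        idx = min(int((f - lo) / (hi - lo + 1e-9) * n_bins), n_bins - 1)
        bins[idx].append(s)
    return dict(bins)

print(bin_sentences(["cat sit mat", "ontology cat", "cat sit"]))
```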
We ran several models on this data: simple approaches based on linear regression and gradient-boosted decision trees over classical features like a TF-IDF matrix, then a modern approach based on fine-tuning a BERT model, and also a graph-based neural network. The initial idea was to select three features from the feature set and build a linear model with these three parameters; it gives not very nice results: the R² score is quite low, and the MSE and MAE are also poor. The next idea, BERT, is quite obvious; I don't want to dwell on it: just fine-tuning pre-trained ruBERT, Italian BERT, and an English BERT-base model. Regarding the GNN, we make use of a model that takes the syntactic dependency tree, augmented with additional edges, and uses the features of the nodes in the dependency tree provided by fastText. It is a convolutional model: it takes the features and applies graph-based convolutions in order to find the representation of each node, then we do pooling over the whole tree, and a final linear layer decides what the complexity of the sentence is. The results of the fine-tuned BERTs were, as expected, quite decent; these are error metrics, so lower numbers are good numbers, thanks to the large number of epochs. When we compare them to the GNN, SVM, and other models, we can see that BERT-based models are much better for all languages, and linear regression sometimes just doesn't provide any reasonable result, maybe because of the number of features, who knows; and this GNN model is actually not that bad in comparison, for some languages at least. The dataset is available, and we are going to continue these studies in cross-lingual and multilingual settings, maybe building a model that can be applied to several languages. And there is an interesting direction of research with the other group working on lexical complexity: they want to measure the complexity of a word in context, and then, using this, you can measure the complexity of the context itself, not just simple sentences with some words, but sentences with the same word, and analyze the complexity in context. That's probably it.
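A minimal sketch of a graph convolution of the kind described for the GNN above, in plain PyTorch; the dimensions, the single message-passing round, and all names are assumptions, not the authors' exact architecture:

```python
import torch

# A minimal sketch: node features (e.g. fastText vectors) are averaged over
# neighbors in the dependency tree, passed through a linear layer, then
# mean-pooled into a sentence representation that a final linear layer maps
# to a complexity score.
class TreeGCN(torch.nn.Module):
    def __init__(self, dim_in=300, dim_hidden=64):
        super().__init__()
        self.conv = torch.nn.Linear(dim_in, dim_hidden)
        self.out = torch.nn.Linear(dim_hidden, 1)

    def forward(self, x, adj):
        # adj: (n, n) adjacency with self-loops; row-normalize, then aggregate.
        adj = adj / adj.sum(dim=1, keepdim=True)
        h = torch.relu(self.conv(adj @ x))   # one round of message passing
        return self.out(h.mean(dim=0))       # pool over all tree nodes

n_tokens = 4
x = torch.randn(n_tokens, 300)               # stand-in for fastText features
adj = torch.eye(n_tokens)
adj[0, 1] = adj[1, 0] = 1.0                  # one dependency edge as a demo
print(TreeGCN()(x, adj).item())              # predicted complexity score
```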
I think I have some time for questions. Please go ahead. I have one small question: why do the English and Italian datasets use a seven-point scale? Why seven instead of five, for example? For me, five would be the simplest choice if you are not satisfied with binary classification; why did they decide on seven? That's a good question; actually, I don't remember them explaining it well in the paper. It's a Likert-type scale, which is usually from one to five, and in this case it was seven; maybe more fine-grained is better, I don't really know. Thank you for the great talk and for the dataset, finally. The question I wanted to ask: when I put myself in the shoes of the annotator, yet again about the scale, it's sort of hard to choose, however well you explain each button. So I wonder if there are any gamified approaches, or procedures where people annotate the complexity of the text implicitly, so that the dataset can be derived from the procedure; is there anything of this sort? Something that comes to my mind is eye tracking: you measure something indirectly, like how long it takes and how many fixations you make when you read the text, and there are such measures. I think at KFU there is an investigation of this kind, of how students in school read a text and how they perceive it, using eye-tracking devices. But for this crowdsourcing, I think the only thing to do is to increase the number of annotators and somehow control the output quality, for example by measuring how much time people spend annotating each sentence: it should be no less than several seconds, maybe. That's the general approach. Thanks. Okay, we have one more question in Zoom. Yes, please. Yes, thanks for the talk. My question is also about the dataset, and also about putting myself in the shoes of the annotators. The obvious strategy for an annotator is, of course, to just label long sentences as difficult and short sentences as not difficult; did the guidelines for the annotators include any specific instructions about taking sentence length into account, or not taking it into account? Yes, good question. There was a short guideline; we provided examples of hard and not hard sentences, and we completely relied on the intuition of the annotator in this case. I think any guideline for this kind of assessment, gathering judgments of complexity or difficulty, which is a purely perceptual thing, cannot really be formulated or defined objectively, in my opinion. So we relied on the linguistic intuition of the assessors, minimized the guidelines as much as we could, and then, if you have a large enough number of assessors per sentence, on average you will get a good score for the sentence. So the guideline was very short, and it didn't mention sentence length in any way? No, no, it was not bound to any parameters whatsoever; we just didn't want to push the assessors or bias them towards some attribute. Okay, thanks. Okay... oh, two questions, okay. Yeah, thank you for the talk, I have one small question: have you considered any relation of text complexity to linguistic acceptability? If there is some connection between these two notions, maybe text complexity could be assessed with the perplexity of models and other methods used for acceptability. No, we didn't measure this exact relationship between these two things. What I was thinking about was the connection with grammaticality, or with some syntactic features; that's why we're looking into syntax: you can compute some syntax-based parameters from the sentence. Grammaticality, or some notion of sentence quality, can be related to complexity, of course, but for acceptability, well, I don't think we have a lot of data in Russian; maybe I just don't know about it. There is the RuCoLA dataset. So it's maybe possible to do this; it's a good idea. Just a very short comment related to Andrei's question: do you think it makes sense, when annotators are given different sentences, to provide sentences of the same length, so that they are never biased by this very strong cue? Maybe a model can learn better if you provide data like that. Yes, that's a good point, actually; it needs a different annotation scheme: you give two sentences of the same length and ask the annotator to answer which one is more complex.
Maybe then they would pay attention to more non-trivial features. We tried a similar thing in previous work: there was a classifier trained on pairs of sentences of the same length. But for the annotators here we collected only the scores per sentence, which could be biased towards the length of the sentence. To avoid this, yes, we didn't do it, but the idea is quite obvious, and maybe we will try it; right now the goal of the project is to develop a model that can just sample complex sentences or simple sentences. Thank you. Thank you.

Hello everyone, my name is Alexey Andutov, I'm from Lomonosov Moscow State University, and together with my colleague Natalia Loukachevitch we present our research on the topic of document-level relation extraction in Russian. What are we going to talk about today? First of all, why the task of information extraction and its subtasks are relevant; then a short introduction to the relation extraction task and the difference between sentence-level and document-level relation extraction; I will show the problem of nested entities in this task; and we will see what models are traditionally used for document-level relation extraction. One of the most important tasks in NLP is information extraction, and one of its important subtasks is relation extraction. This task has a broad range of applications, from creating and updating knowledge bases like Wikipedia and WordNet to structuring documents. Why is this work important? First, we address relation extraction with nested entities, without ignoring these aspects and without information loss; next, we focus on document-level relation extraction, which is a complex problem, crucial for understanding an entire document, and to date no studies focusing on the Russian language have been published. What does the task of relation extraction look like? Consider this example: given the sentence "The previous Cannes Film Festival was held from May 19 to 26" and the entities, the festival and the date, we should predict the type of relation, "point in time" in this case. A relation can also be formulated as a triplet: subject, object, and relation type. But in this work we consider document-level relation extraction, so let's look at an example about Konstantin Rubinov. We can see two entities, the music band "Verzdonsk and Baronova" and Konstantin Rubinov; at the same time, we see that Konstantin Rubinov has several mentions, like "Kuzilio", which is Rubinov's nickname, and we should recognize the relation type, "founded by", and also the evidence, the supporting sentences that help to understand the relation type in this case. What is the difference between sentence level and document level? In the first case you have a single sentence and usually only two entities, but in document-level relation extraction you have an entire document, entities can be mentioned in various forms, and you should also predict the evidence. We use the NEREL dataset, which contains annotations of entities and relations between them; there are about 30 different types of entities, like person or place, and about 50 different types of relations, like workplace, alternative name, or works-as, and, importantly, some of these relations occur across sentences in the text. Okay, how can this task of relation extraction be solved? The baseline approach treats it as a classification task for a pair of entities in text: for example, given a sentence, you mark two entities and apply some classifier. But with more advanced approaches we have some problems: for example, when we mask our entities, we have a problem with
overlapping, nested entities like "Moscow State University", where one entity is nested in another; such masking is used in some models like SpanBERT. One approach addresses this problem via tagging: in my previous work I showed that this problem can be solved by several approaches, first of all decomposition into subtasks, each solved by a separate model, and the main approach is joint extraction, where you use a single model to extract everything from your text; in short, our results were that using a single model provides better quality, as it allows incorporating all the knowledge into one model. We took this into account and considered approaches that address relation extraction in a general sense. So what approaches did we use to solve the task at the document level? We started from the baseline BiLSTM that was described in the DocRED paper: the text, with features like GloVe vectors, entity type vectors, and the coordinates of entities in the text, is passed through a BiLSTM, then we aggregate the entity mentions and classify with a multi-class classification layer. But in this task it is important to consider all entities jointly during prediction, and that leads to a nice idea: in the DocuNet-style approach, after encoding our text, we build a relation matrix and use a U-Net architecture from computer vision approaches, and finally we can pass this matrix on and classify the output. In addition, we previously mentioned the concept of evidence, the supporting sentences that help to recognize the target relation in our task; in the extended NEREL dataset such labeling is also provided. The DREEAM-style approach allows us to use this additional markup: beyond predicting the relation, it also models the importance of each sentence in the text, and finally we can keep only the most important sentences and make predictions based on this reduced document; this is called the fusion approach. So we see that the best results came from the DREEAM-style and fusion approaches using a BERT-large encoder. Let me also mention the main problems we encountered when working with a model on the Russian dataset, on NEREL. First of all, in this task you must process an entire document, and we used window-based text processing with overlapping windows. The second problem is the quadratic complexity in the number of entities: the number of candidate relations in a document is the square of the number of entities, and a lot of relations in the text are labeled "no relation"; importantly, in cases when you are using some large models, it can help to remove random negative relations up to a set ratio within the document. Finally, our conclusions: we report the first metrics for document-level relation extraction in Russian, specifically on the NEREL dataset, and we also reproduce the metrics of the state-of-the-art models for English, I mean the DocRED benchmark; we explored the issue of nested entities and implemented some enhancements to process longer texts; and we saw that balancing negative relations helps optimize the training process in the task of document-level relation extraction. Okay, that's all.
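A sketch of this negative-balancing step; the label name and the ratio are hypothetical:

```python
import random

# A sketch of the balancing described above: since most entity pairs in a
# document are labeled "no_relation", random negatives are dropped until they
# stand in a fixed ratio to the positive pairs.
def downsample_negatives(pairs, ratio=3.0, seed=0):
    """pairs: list of (head, tail, label) entity-pair candidates."""
    positives = [p for p in pairs if p[2] != "no_relation"]
    negatives = [p for p in pairs if p[2] == "no_relation"]
    rng = random.Random(seed)
    keep = min(len(negatives), int(ratio * len(positives)))
    return positives + rng.sample(negatives, keep)

pairs = [("e1", "e2", "founded_by")] + [
    (f"e{i}", f"e{j}", "no_relation") for i in range(5) for j in range(5) if i != j
]
print(len(pairs), "->", len(downsample_negatives(pairs)))  # 21 -> 4
```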
Any questions? Your experiments were done only on the NEREL dataset; how would you estimate how generic this model is? If you just go and apply it to new text, let's say from newswire or internet websites, for harvesting a very large database of relations about persons, would it be accurate, or would the domain shift be very severe? Yeah, I'm sure that it's a general approach for relation extraction. When building these models, we first of all checked all the models that have the best results on the English benchmarks like DocRED, and we used an approach that can be applied to another dataset and another language, like Russian, so I think it can be used for any domain or area; but in some cases you can face problems, for example if you need to optimize your model to process text in real time, and so on. Okay, thank you very much. A second question? Okay, let's thank the speaker again, and now it's time for the second talk of the session.

Hello everyone, can you hear me? So, hello once again, my name is Maria, and today my colleague Alona, who is here via Zoom, and I are going to present our work on the way to controllable text summarization. However, today I am more of a presenter, because the main research was performed by Alona, and she is the main contributor, so all credit goes to her; I was more of a tutor and advisor. She unfortunately couldn't come today, but she is here with us, and after the talk she will be ready to answer your questions together with me. So, this is about the application of a new promising approach, namely the HydraSum approach, to controllable text summarization in Russian. As for the structure of the talk: first we will talk about the motivation, then briefly discuss automatic text summarization approaches, then I'll give a brief overview of the original method, and then we will switch to our research, talk about the data and the experiments, and of course discuss the results. Basically, the concept of controllable text generation brings an additional layer of flexibility and customization to the summarization process: this controllability allows users to specify particular attributes of the desired text, such as length or style, for example. This customization enhances the user experience by providing summaries that align more closely with the information needs and individual preferences; moreover, it allows users to specify the level of compression, to make it easier to summarize the desired text. As for the objectives of this study, the main one was to investigate whether the multi-decoder architecture utilizing a transformer-based model, called HydraSum, is applicable to the Russian language, because it has shown great results for English, and whether HydraSum can produce more stylistically diverse or higher-quality summaries than the classic approach of fine-tuning a language model. This research lies in the field of natural language processing, with a focus on automatic text summarization. Traditional methods of text summarization are mainly divided into two big groups, namely extractive and abstractive ones; they provide no controllability, producing texts which are not stylistically diverse. The HydraSum method, on the contrary, introduces some control. Briefly speaking, HydraSum is a mixture-of-experts architecture with multiple decoders, which is based on a pre-trained language model; as the base model, the authors of the original paper chose Facebook's BART-large, but they claim that it can be applied to any transformer model. In the HydraSum architecture the base model is extended to consist of multiple decoders: the authors experiment with two- and three-decoder architectures, where each decoder captures different stylistic features of the input text. Each decoder has a total number of decoder blocks where the parameters of the bottom layers are shared among the decoders; this is done to minimize the number of additional parameters introduced into the model architecture. The top layers of the different decoders are trained independently, so each decoder can specialize and learn distinct representations to suit its specific task. An important part of HydraSum is the gating mechanism: basically speaking, this gating mechanism is a weighted sum of the k decoders' outputs, and it dynamically determines how much each decoder's output contributes to the overall result, enabling flexibility in decision making based on the weighted contributions. After utilizing the gating mechanism, the outputs from the shared layers are fed into a feed-forward layer followed by a softmax activation function, which outputs the overall mixed token probability; this process assigns weights to the outputs, determining their relative importance.
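A minimal sketch of such a gate over two decoders' next-token distributions; the distributions below are random stand-ins, not real decoder outputs:

```python
import torch

# A minimal sketch of the gating mechanism described above: the next-token
# distributions of two decoders are mixed with a gate weight g, giving
# p(token) = g * p0(token) + (1 - g) * p1(token).
vocab_size = 10
p0 = torch.softmax(torch.randn(vocab_size), dim=-1)  # "abstractive" decoder
p1 = torch.softmax(torch.randn(vocab_size), dim=-1)  # "extractive" decoder

def mix(p0, p1, g):
    return g * p0 + (1 - g) * p1  # stays a valid distribution for 0 <= g <= 1

for g in (0.0, 0.5, 1.0):        # manually specified gate values at inference
    print(g, mix(p0, p1, g).sum().item())  # sums to 1.0 in each case
```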
The authors propose three inference strategies: sampling from the individual decoders, where one decoder is more extractive and the other more abstractive; a mixture of decoders using the learned gating mechanism; and a mixture of decoders using a manually specified gating mechanism. To adapt the HydraSum architecture to the Russian language, we chose the classic summarization dataset known as the Gazeta dataset; recently it has been one of the most popular datasets for summarization tasks in Russian. The dataset consists of news articles and their summaries from the Gazeta news website, the titles of the articles, dates, URLs, and additional information; we also introduced two additional binary columns, namely the gate column and the sand column. As the base model we took the mBART model, a multilingual language model pre-trained on massive corpora, and besides mBART we also trained three baselines, namely standard fine-tuning of the transformer models ruT5-base and ruGPT-3 small, and fine-tuning of mBART itself, because of course we wanted to compare the performance of the plainly fine-tuned mBART with mBART incorporated into the HydraSum architecture. To compare performance, we evaluated using the classical metrics, namely ROUGE scores, measuring the quality of the generated text with respect to reference summaries. Apart from ROUGE scores, we also measured metrics relevant to the generated summaries, such as abstractiveness, specificity, length, and readability. In two words: abstractiveness is measured with the help of two additional tools, namely coverage, which counts the proportion of words present in both the input text and the summary, and density, which counts the average length of the longest continuous extract copied from the input text; this way of evaluating generated summaries was suggested in the paper "Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies". To measure the specificity of summaries, the Speciteller tool was used, and the length of summaries is measured by two additional metrics: absolute length and the compression rate.
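A simplified sketch of these two abstractiveness measures; the exact greedy fragment matching of the Newsroom paper is more involved, so this is only an approximation:

```python
# Simplified versions: coverage = fraction of summary tokens that also occur
# in the article; density is approximated here by the longest contiguous run
# of summary tokens found verbatim in the article.
def coverage(article_tokens, summary_tokens):
    article_set = set(article_tokens)
    return sum(t in article_set for t in summary_tokens) / len(summary_tokens)

def longest_copied_fragment(article_tokens, summary_tokens):
    best = 0
    article_text = " ".join(article_tokens)
    for i in range(len(summary_tokens)):
        for j in range(i + best + 1, len(summary_tokens) + 1):
            if " ".join(summary_tokens[i:j]) in article_text:
                best = j - i
    return best

article = "the cat sat on the mat all day".split()
summary = "the cat sat quietly".split()
print(coverage(article, summary))                  # 0.75
print(longest_copied_fragment(article, summary))   # 3 ("the cat sat")
```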
Now on to the results. We see a significant difference between the performance of the two individual decoders: the first decoder, which is called decoder 0 here, provides longer summaries and has lower coverage and density than the second decoder, called decoder 1 here; decoder 1 is more extractive than decoder 0 and shows bigger coverage. The most extractive summaries were produced by ruT5 and mBART, and the most abstractive appear to be the reference ones, showing low coverage and density, which is predictable because these are the true answers. The mixture of decoders produced more abstractive summaries than each decoder individually. In terms of specificity, all generated summaries have results which are quite close to each other; this can be explained by the fact that all models were fine-tuned on the same dataset and therefore share the same vocabulary. However, the individual decoders of the HydraSum architecture have also shown different results on the specificity metric: decoder 0 generated summaries with the lowest specificity score among all summaries, whereas the highest results on this metric were shown by the mixture of decoders and the ruT5-based model.

To sum up, in this work we studied the application of the HydraSum method to the Russian language. We found that the first decoder is more abstractive, the most extractive summaries were provided by ruT5, the mixture of decoders provided more abstractive summaries than the other models, and all models showed close results on the specificity metric. The time is up and we are practically finished. In our experiments the HydraSum approach proved to be quite promising for the Russian language, and as part of future research we plan to train this model with more decoders and on a bigger, more diverse dataset to try to capture more stylistic features; moreover, it is important to try manually specifying the gating mechanism at the inference stage. So now my colleague Alona and I are ready to answer your questions. Alona, can you hear us, are you here?

This is Alona, hello.

We have questions.

Thank you for the talk. My question is about these metrics that you use for evaluating the decoders' behavior. A naive question: is it possible to somehow adjust the behavior of the model, or fine-tune the model, towards more abstractive or more specific output, or something like this? As I understand from the last table, you just measured these metrics, but is it possible to change the output so it corresponds to the required criteria? Is it clear? Alona, can you answer?

Yeah, in fact, yes, it's like the future work which we are going to do: when you manually specify the gating mechanism, you can do it yourself, so you can assign the weights and thus you can change the output to be a more abstractive or more specific one.

By adjusting the weights?

By assigning the weights yourself, yeah, so you can do it manually.

Okay, thank you. My question is about the mixture of the two decoders. As I have seen from the results table, the results are lower, so can you please elaborate on how this mixture of decoders was done, maybe I missed it, and why do you think the results are lower? Thanks.

Just a second. Yeah, so the mixture of decoders, it's when the model decides which output to generate, so it kind of samples from both decoders simultaneously. Why the result was lower, well, I may add: in the mixture of decoders we basically mix the probabilities from the two decoders, thus we obtain the weighted probability from the two decoders, and here, as I
remember, we used just the average of the probabilities from the two decoders, and thus maybe we should have experimented, and this is our future plan, with different decoder weights, so that, for example, the first decoder or the second decoder is more important in the result.

Yeah, thanks for the talk. Just a quick question about this mixture of decoders: what about its computational requirements? I guess it's more compute-intensive than just using one decoder, so how much of a problem is it?

Yeah, so I guess when you are using a mixture of decoders, when you sample from two decoders instead of one, it means that, yeah, I guess it's more compute-intensive.

So is it like twice as expensive, or is the dependency not linear? Just how much of a problem is it, will it be like twice as expensive?

Well, it is expensive, it took a long time, much longer than training with one decoder, but it's not really a problem, because I did it in my Colab notebooks, so yeah, I spent some resources on it, but it was okay to do that.

Well, I guess it depends on the size of the training data or the inference data, but my question is essentially: is the dependency linear? Is it just the case that when you sample from two decoders it takes twice the amount of time or compute resources compared to sampling from one decoder, or is it more complicated?

Because, as we mentioned in the beginning, they share the bottom layers, that's why, of course, inference from the two decoders is not twice as expensive as sampling from one decoder. I'm not ready to state the exact dependency, it's more complicated than that, but it's less than twice, because they have shared parameters exactly for this purpose: to save some computational resources during the fine-tuning and inference stages.

Hello everyone, my name is Anna, and my work is dedicated to machine translation for the Russian-Khakas language pair. My goal is to present the results that we were able to achieve and also to try to explain how we did it, so that you can maybe repeat it on another language pair of your choice. The Khakas language is spoken in Russia by about 40,000 people and has a very limited amount of digitized data, so it is considered low-resource. Actually, there exist 60,000 sentence pairs in Russian and Khakas, and they are from the TIL corpus of Turkic languages. You can see the results of training the baseline model exclusively on the Russian-Khakas data; as you can see in the picture, everything that scores less than 10 is hard to make sense of, and everything above 50 is considered high-quality, good translation. The basic approach to improving the results of the model is transfer learning, which means you initialize the weights randomly, pre-train the model on a resource-rich language pair, then take these weights, initialize another model with them, and fine-tune it on the low-resource language pair. So one of the biggest questions was which language to choose for pre-training. We wanted it to be Turkic, because we thought that maybe it would help the model to train well, and we also wanted it to be in Cyrillic script, because we wanted to use a shared vocabulary between the parent model and the child model. Here you can see the languages that meet these requirements and the sizes of the available corpora. As you can see, Kazakh is the largest one, but the
problem with these languages is that the data is mainly web-scraped and not of very good quality in terms of translation; it also tends to lean towards certain domains, like news or government documents. So we decided to stick with the Chuvash language, mainly because its corpus is manually aligned and checked, and actually it is the second largest, so for quantitative reasons as well. The preprocessing of the data didn't involve much for the parent data, because it was of good quality and we just did some additional shuffling. The child data from the TIL corpus contains some trash symbols and random numbers; the Russian sentences were of good quality, but the Khakas sentences for some reason completely lacked punctuation and also had some mistakes. Luckily, we had another corpus that consisted of 30,000 pairs of good quality, and they were the same sentences as in TIL, so we replaced the translations with the ones we were able to get from the Electronic Corpus of the Khakas language, and thus we improved the quality of half of the dataset.

The next step was tokenization, and this was done by byte pair encoding with dropout. For those who are not familiar with the technique, I will briefly explain. It goes like this: you split the sentences into characters and then set the number of merge operations, so you first glue together the letters that appear together most often in the text. On the left side you can see traditional byte pair encoding, and the dropout technique is actually the same, but you just skip some merge operations; for example, you can see in picture B that on the left and in the middle "re" is glued together, and on the right the "re" merge is skipped and it starts with "at". This kind of augments the data, because when you apply the dropout several times to the source data and get different segmentations of the sentences, you can then assign the same translation to all of them, and you thereby increase the amount of data you can train on.
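As a toy illustration of BPE-dropout (a hypothetical merge table and a simplified merge loop, not a production tokenizer):

```python
import random

# Toy BPE-dropout sketch: apply learned merges in priority order, but skip
# each applicable merge with probability p, yielding different segmentations
# of the same word from run to run.
def bpe_dropout_encode(word, merges, p=0.1):
    tokens = list(word)
    for left, right in merges:                        # merges: ranked list of pairs
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right and random.random() >= p:
                tokens[i:i + 2] = [left + right]      # apply the merge, re-check same position
            else:
                i += 1
    return tokens

# Hypothetical merge table for the word "related".
merges = [("r", "e"), ("a", "t"), ("re", "l"), ("e", "d"), ("rel", "at"), ("relat", "ed")]
for _ in range(3):
    print(bpe_dropout_encode("related", merges, p=0.3))
    # e.g. ['related'], ['relat', 'e', 'd'], ['re', 'l', 'at', 'ed'], ...
```

With p = 0 this reduces to ordinary BPE; with p > 0 the same word gets segmented differently across epochs, which is exactly the augmentation effect described above.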
Here is some short information about the setup I used: I used the transformer model in its classic form from the original article, I optimized for the BLEU metric, and I used a shared vocabulary between the Chuvash and Khakas data. The experiments I did were: increasing the share of the Khakas data in the vocabulary, from being the same amount as Chuvash to 1.5 times larger; another experiment involved including Chuvash in the byte pair encoding; and the third one was adjusting the maximum sequence length parameter, because it turned out that 99% of the parent data was about 100 tokens long and the child data 75 tokens long, so we played with this parameter a little. It turned out that all of these experiments combined showed the best results in the metrics comparison, with quite a big improvement over the baseline. I also computed the chrF metric, which compares character n-grams rather than word n-grams and is sometimes considered better, and here I compared my results to other works on low-resource languages, though this is maybe not very representative, because there are a lot of factors that affect the result of the model, especially across different languages; still, it gives a sense of how it generally works. What is more representative, though, are examples of the sentences that the model gave us. You can see in the top example that the meaning of the resulting translation is essentially the same as the reference translation, and in the example below you can see that sometimes the model still makes mistakes; here it missed the word "to study", basically.

As for future work, it may include adjusting the number of byte-pair-encoding merge operations, because one hypothesis is that the more morphological the tokenization, the better the model may train; another traditional direction is to expand the corpus, and also to try another Turkic, or non-Turkic, language for the parent model. So this is all, I think, and I am ready to answer your questions.

Okay, do we have a question? We have three questions.

Did you consider pre-training on some other languages to enable better transfer learning capabilities? Let's say you pre-trained on some other languages in an unsupervised way, which might boost performance on the corpus.

Yes, of course, this is future work; we will try different languages and maybe try the NLLB model, so that we can compare the results.

Do you have any idea which languages are more or less beneficial? Because some might hurt, some might improve.

Yes, actually, I read some articles about this, and the main idea, I think, is to be similar in morphological structure. For example, when you translate from English to Russian, it will be the same word for all the cases and things like that, and it is difficult for the model to catch those differences. So the more complicated the language for the parent model, I think, the better for the child model, since Khakas is very morphologically rich.

But not necessarily the same linguistic family?

No, actually, there are studies that show that the size of the corpora makes the biggest difference; they even pre-trained on Finnish and fine-tuned on Turkish, something like that, and they still managed to achieve good results.

Some colleagues had similar experiments, and that was counter-intuitive to me, but they pre-trained on a completely different language with a similar morphological structure and it actually gave a boost. Thank you.

I have a small question: are you familiar with the Khakas language, or does one of your colleagues know it? How did you evaluate the results?

I am not familiar with the Khakas language, but I know some Tatar, and it is quite similar, so I could actually evaluate it visually somehow; but for the sake of science, my translations were evaluated by native speakers whom I asked to do it.

Thank you. Thank you for the talk, I have a couple of, I think, technical questions. Could you please, maybe I've missed that, just elaborate a bit more: what kind of model did you use?

Just the transformer architecture, with 6 encoder and 6 decoder layers and 8 attention heads.

I see, okay. And for the BLEU evaluation, what kind of tokenization did you use?

It was the same tokenization as for training, the BPE one.

Well, that's interesting, because maybe, for the sake of getting a different view on the quality, you should try some raw tokenization, or, I'm not sure about the existence of a morphological analyzer for the Khakas language, but something like stemming or lemmatization and raw tokens, because this can give a different perspective on the evaluation. It's like the same issue that was widely discussed back when character-level machine translation was popular; it's a bit of a different thing, so definitely worth comparing, I think.

Yes, I think we can do that.

Thank you for the talk. I have not a question but rather a suggestion: you said that one of the corpora was in another script, in Latin probably; maybe you could use some transliteration scheme and augment the data that way.

Yes,
as I said, we wanted to stay in Cyrillic, because we used the shared vocabulary, but the transliteration idea did cross our minds. I actually didn't find a good transliteration tool, because, for example, for the Turkish language it is very difficult, it has many rules, and transliterating it to Cyrillic is a complicated separate task, so I didn't find a tool and didn't pursue this idea.

Yeah, as far as I know, for some languages there exist specific transliteration schemes, so if someone has developed one, it could be used.

I will have to look at it, thank you.

Okay, more questions? Let me check with Zoom. Okay, we don't have questions on Zoom, so thank you, and the next talk will be online.

Yes, that's true. Nice to meet you, I am Zaitsever, and I will be talking today about how difficult it is to make an adversarial attack on machine translation models. Most of the job, most of the experiments and the writing, was done by my two co-authors, Pavel Burnushov and Veta Kostunov; I participated in this project as well. So let's start with what we are doing and why. Basically, most of the models we see today in NLP are neural networks, and we often see rather strange examples of how they behave if you know the right way to attack them, how to make them do what you want, including things, for example, prohibited by the model authors. So it seems like a good goal to try to examine how we can find weak points of a model. Can you hear me? Can you see the slides? Let's wait a little bit. Okay.

So, what is the problem? We want to find vulnerable points of the model, and we can try to find them with adversarial attacks. Basically, adversarial attacks are a way to find these vulnerabilities: we try to find, in an efficient way, how we can break the model, how we can alter the output of the model by a small enough change of the input. Here is an example of an attack at the bottom of the slide. We have a model that maps the input, "flat, misguided comedy", to a label; in this case you can see it is definitely a negative review. We try to modify this input with as small a change as possible: in this example we change "flat" to a word whose basic meaning is almost the same, and in this case the model outputs the positive label, meaning that we succeeded in breaking the model. The label is not correct, because it is definitely still not a positive review, but the model thinks it is positive. So the question is how we should design such attacks for common classification models in NLP, or, for example, for translation models.

Okay, so maybe the main problem in this area is that we don't have differentiability: the search space in NLP is discrete, and we somehow have to move through this space looking for adversarial changes of the input. Basically, what we can try to do, as in HotFlip for example, is calculate the gradient of the loss with respect to a particular token in the input; with this first-order approximation of what is going on, we can then find what to change, and how, to change the loss the most, and this is a pretty successful strategy.
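In the usual HotFlip-style notation (our reconstruction of the idea, not a formula quoted from the talk), the first-order estimate of how the loss changes when token $x_i$ is replaced by $w'$ is

$$\ell(x_{i \to w'}) - \ell(x) \;\approx\; \big(e_{w'} - e_{x_i}\big)^{\top} \nabla_{e_{x_i}} \ell(x),$$

where $e_w$ denotes the embedding of token $w$; the attack picks the substitution $\hat{w} = \arg\max_{w'} \big(e_{w'} - e_{x_i}\big)^{\top} \nabla_{e_{x_i}} \ell(x)$ that maximizes this estimated loss increase.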
Also, we can try to use generative models; this was proposed in our previous paper. It is based on the idea that we should modify the loss function when designing the adversarial attack. Basically, we have an input sequence x1, x2, x3, and so on up to xd, where d is the length of the sequence, and we want to train a generative model, or modify one, so that its output is some sequence x', that is, x1', x2', and so on. What we want from this sequence we can express directly in the loss function. Say C is a classifier: we want the label of the output to be different from the label of the initial sequence, and this we can encode with the second term here, where capital C is the score of the model. Also, we can say that we should not go far away: with the deep Levenshtein distance, or whatever other distance, like BLEU similarity, we want the similarity of x' and x to be high, in terms of how many tokens we change or in terms of some semantic metric. In this way we design the loss function, we can backpropagate it through the softmax trick into the generator, and it generates adversarial sequences. This also turns the problem from a discrete optimization problem into a continuous one, so we can try to adjust it.

This idea is followed by another one that lets us do practically the same thing, but for machine translation models. Basically, we try to find a change in the embedding space that leads to a small change of the input but a significant change in the output: we look at the scalar product of the gradient of the adversarial loss and our embedding, we try to find an embedding that is good, and using this embedding the model decodes an adversarial sequence back, which, we hope, changes the output significantly. We can proceed in different ways; for example, we can take a loss function focused on BLEU: BLEU for the initial sequence should stay high, so we should not change x significantly, but BLEU for the pair of the original translation y and the adversarial translation y' should be significantly different. We can organize the flow of gradients through all parts of this model and train it to generate good candidates for an adversarial attack.
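The exact loss from the slides is not in the transcript; one plausible shape, consistent with the description above (with $C_y(x')$ the classifier's score for the original label $y$, $d_{\mathrm{DL}}$ the deep Levenshtein distance, $\hat{y}(x')$ the translation of $x'$, and $\lambda$ a trade-off weight), would be

$$\mathcal{L}_{\text{clf}}(x') \;=\; C_y(x') + \lambda\, d_{\mathrm{DL}}(x, x'), \qquad \mathcal{L}_{\text{MT}}(x') \;=\; -\,\mathrm{BLEU}(x, x') + \mathrm{BLEU}\big(y, \hat{y}(x')\big),$$

both to be minimized: push the classifier away from the original label (or the translation away from the original translation) while keeping the input change small.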
So this seems like a very natural approach, but does it work? Let's look at the results for the classification problem: can we change the model input with our generative model and obtain good results? And the answer is yes, we can. What we are looking at in this table: we have four problems, and we have the accuracy on these problems before the attack; the accuracy scores are pretty high, for example, for AG it is 0.95. But what happens after different attacks? There are several sorts of attacks here, and the last two rows are different variants of our attack. We see that if we use our top-performing attack with the deep Levenshtein distance, for some problems we get almost 0.5 accuracy, which means that we can almost completely break the model there, and we significantly decrease the quality of the model for the AG and FTE problems, and so on. What else should we check? That our change of the meaning is pretty small, that we don't change the semantics of the sentence much. We can try to verify this by looking at the scores in the right part of the table, the score of a discriminator. What is the discriminator? A classifier: we generate a sample of adversarial sentences and give them label 1, for adversarial sequences, and label 0 for natural sentences from our sample, and we train this classifier. We see that in many cases this classifier is not very good against our attacks: it means that the discriminator, even after training, cannot reliably distinguish adversarial examples from normal examples, and that means that our generative model, our generative attack, works pretty well if we are solving the classification problem.

But let's get to the main goal of the paper: let's check whether we can do the same with an adversarial attack on machine translation models. Basically, what we do is run the attack with different values of the hyperparameter that corresponds to the power of the attack, how much we corrupt the output and the input, because these are definitely correlated. Next, we try to assess the quality of the attack: we look at some similarity score between the initial sentence x and the changed sentence x' after the attack, and between the original translation and the translation after the attack. We want the first similarity to be large and the second similarity to be small for a successful attack, so basically we need to plot both of them. Let's do this and look at what's going on. We try different sorts of attacks; each point here is a different hyperparameter setting, and different colors correspond to different sorts of attacks. Basically, we want to be in the corner where we have high similarity between the initial sequence x and x' after the attack, but low similarity between the translation y and the translation y' after the attack. As you can see, for all metrics, although we want to be in that corner, we get only a small improvement: for all the classical methods we lie on, or only slightly below, the diagonal, which means that the change in the input and the change in the output are pretty much the same.

But an additional experiment shows that sometimes we can still be pretty successful: we added another type of attack, called character-based attacks, where we swap characters, and we also combined these ingredients, and we can see a new group of points here, again corresponding to different hyperparameters. Generally we see that we change the input less than we change the output, which means that in this case we can attack machine translation models. We think this effect comes from the fact that we need to find weak spots of deep translation models, and this is not an easy task: we need to focus on something that lies outside of the training sample of such models, and character swaps let us get there and improve over the baselines. So basically, it means that if you want to change a classification label, we can still find some change in the input that flips the label with a very small perturbation; for machine translation models the situation is different: the typical approaches don't work here, and what we can do is go down to the character level, and in that case we are pretty successful. So we can say that machine translation models are pretty robust, like many other sequence-to-sequence models, we think. That's all for my talk, do you have any questions?

Any questions? We don't have questions in the audience... we have one question.

If I understood you right: do you check that the sentence that you change, for example the review that you showed in the beginning, do you check that its actual meaning stayed negative while the label changed?
Yes. I can try to show some examples from the paper, and you can see that typically the change is pretty small; if we showed this to a human, they would still identify it as essentially the same text. For the first example, one may say that the word changes to something that's not very meaningful; a human could say it's a misprint, but would still understand the sentence. Yet after translation we see that the translation is quite different and the meaning is not the same, even for the successful machine translation examples.

So you check that the meaning stays the same while the label changes?

Actually, if we are talking about the first paper, we even did a human evaluation, and in most cases the meaning is pretty much the same.

Okay, thank you.

You're welcome.

Okay, let's thank the speaker again. Okay, thank you. Good day everyone, let's start. So, the prevalence of online shopping has made it the foremost method for purchasing various goods, and online customer reviews play a crucial role in providing valuable insights into customers' interests and knowledge of the product. But how can we discover these insights automatically and formulate them? Our paper, "User Review Summarization in Russian", researches this question. So what is user review summarization? Well, the logical extension of text summarization is multi-document summarization, where we summarize multiple documents; and from multi-document summarization stems opinion, or user review, summarization, which uses the specifics of how human opinions are presented in various sources on the internet. Here on the slide you can see an example of user review summarization: the input consists of reviews covering different features of the entity, shown in different colors, and the summarization model should analyze the given reviews and produce a summary covering all the reviewers' opinions.

Researchers in the field explore both supervised and unsupervised settings; but while the supervised setting is widely used due to its effectiveness, it requires gold summaries for training, which can be extremely difficult and resource-intensive to produce. Moreover, the majority of existing studies focus solely on English data and neglect non-English languages, because those lack publicly accessible resources; this hampers the opportunity to broaden our understanding of the cross-linguistic properties of opinion summarization, and we try to change that by collecting a dataset. In recent years researchers have suggested a lot of methods for opinion summarization: some of them suggest using automatic data aggregation, while others use weak supervision in the form of seed words for identifying the major topics; some articles suggest using information about such topics, or aspects, to guide summary generation. During our work on the project we experimented with the best-performing contemporary abstractive and extractive models from the articles on the screen; we employed different methods which use aspect information to create a summary. As our work focuses mainly on the data and training, and not on the architectures, let us overview only the main ideas of these methods. The first method that we will review is PlanSum: it extracts sentiment and aspect distributions from the data and then fuses these distributions with token embeddings; the results of the fusion are then fed into an LSTM decoder with attention, which composes the actual summary. Another method, AceSum, does not have a colorful scheme, but mainly it uses a multiple-instance-learning model from the previous works of the
authors to induce aspect controllers, and then feeds them into the original transformer model; the controllers are actually calculated using weak supervision in the form of manually collected seed words. Another method is the Quantized Transformer, which uses a transformer encoder and decoder to quantize the sentences of the reviews, to find the average opinions as well as the valuable aspects, and formulates the summary using the existing sentences. And the last work which we review in this project is the Semantic Autoencoder, which continues the work of the Quantized Transformer, but creates a distribution over aspects, ranks them, and chooses the top elements as the basis of the summary.

So, let's talk about the data. The standard datasets for opinion summarization in English are the Rotten Tomatoes, Yelp, Amazon, OpoSum and Space datasets, each presenting data in a different domain. Unfortunately, to our knowledge, there are no publicly available datasets for user review summarization in Russian, and the majority of Russian-aligned services do not allow the use of their data. Therefore the data was collected from the open internet source Tripadvisor.ru with the help of web scraping, resembling the collection of the Space dataset: we collected around 1 million reviews of hotels from 11 cities and structured the data in a convenient way. The slide also shows the statistics of the collected dataset and some of the data distributions; as you can see, the statistics are fairly similar across the train, test and validation splits. The labeling process for the validation and test splits of our dataset involved manual annotation by us, the authors, similar to the Space dataset; we created 28 summaries per split. We wanted to compare the performance of models trained on the collected Russian data and on the existing English datasets, which is why we followed the annotation procedure of the Space dataset. The set of aspects was taken from the AceSum paper, translated to Russian, and fixed for manual summary construction. Every hotel was labeled based on 50 reviews, stratified by rating, to ensure that both low- and high-rating reviews are present in the data. We manually filtered the concepts and phrases corresponding to specific aspects and chose the most repetitive of them as the basis of a summary. We note that human-written summaries might impose some restrictions on the behavior of the models, and we leave this for further research.

We chose the models with the highest metric values on the Space and Amazon datasets which utilize aspect information, for a correct comparison on the Russian-language dataset. In our experiments we trained abstractive and extractive methods and compared their performance; some methods were not taken into consideration, because they either utilize self-supervised learning, as opposed to the unsupervised learning in these methods, or propose novel algorithms which do not consider sentiment and aspect information. For our experiments we employed the ROUGE metric, which is the standard metric in this area.
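As a sketch of a ROUGE evaluation (the talk does not say which implementation the authors used; Google's rouge-score package is one common option, and its default tokenizer is English-oriented, so for Russian text you would need to pass a tokenizer that handles Cyrillic):

```python
# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])

reference = "the hotel is clean and the staff are polite"   # gold summary (placeholder)
candidate = "clean hotel with polite staff near the metro"  # model output (placeholder)

for name, score in scorer.score(reference, candidate).items():
    print(name, round(score.fmeasure, 3))  # unigram, bigram and LCS F-scores
```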
We also experimented with changing the preprocessing and seed-word extraction steps, but as this did not lead to any improvements, we won't dwell on it. Here you can see the results of the evaluation of the different models. All the models were trained on the collected data, but the models with the FT postfix were additionally fine-tuned on a small part of the translated Space dataset. While the evaluation on the collected dataset shows a clear dominance of fine-tuned AceSum, the evaluation on the Space dataset shows several well-performing models: among them fine-tuned AceSum, which shows the highest bigram ROUGE metric, and variations of PlanSum, which show higher unigram and longest-common-subsequence metrics.

Having conducted a manual analysis of the produced summaries, which can also be found in the paper, we found that abstractive summaries contain more specifics from the reviews, while extractive summaries contain more general information; on the other hand, the Quantized Transformer and the Semantic Autoencoder, as extractive summarizers, included personal information in their summaries, and PlanSum, which produces more human-readable texts, catches only indirect relations to aspects, while AceSum describes the fixed aspects more precisely but with limited lexical variety.

So, in this work we explored the application of modern solutions to user review summarization for the Russian language. We managed to collect around 1 million reviews of hotels and annotated parts of the data for evaluation; we compared models from unsupervised and weakly supervised settings; the best-performing models among the approaches were adapted for the Russian language and trained and fine-tuned to summarize opinions in the hotel domain. Our conclusion is that abstractive approaches outperform extractive approaches on the collected Russian dataset, in contrast to the findings for English data presented in the recent article proposing the Semantic Autoencoder model. Specifically, the best-performing model on our data is AceSum, and on the Space dataset the PlanSum variations. Further research may focus on the stylistic limitations of human-written summaries for better model performance, on coherence and readability analysis of the summaries and ways to improve them, and on researching the self-supervised methods, which were excluded from the comparison even though some of them perform well on the standard datasets. So, thank you very much for your attention; if you have any questions, you're welcome.

Thank you for the talk. I think about three slides ago you provided, yeah, this one, quite a cool summary of the specifics of the models and of their performance on the reviews you collected, and I wonder: is it possible to measure all of these effects? Well, some of these specifics clearly are measurable. Were you going to do that?

First of all, these are the findings for the work we've done, so they may differ from more general conclusions. And secondly, it was measured manually, as I mentioned before, because the test and validation splits are actually not that big, so we managed to assess it manually.

I see. Could you please remind me what was the size of the part of the dataset that you based the conclusions on?

Okay, so we followed the annotation of the Space dataset, which achieves good results; they had 25 hotel summaries. For every hotel we have a review set of around 100 or 200 reviews, and for each hotel we create a summary, so they had 25 summaries for the validation split and for the test split, 50 summaries overall, and we had 28 summaries for each of the two splits, so 56 summaries in total.

I see, thank you.

You're welcome.

Thank you very much for your talk. I was wondering: do you consider only hotel reviews now, or do you plan to do more on other topics, or is it just hotels for now?

So, as you can see, the standard datasets cover different domains, but we chose the Space
dataset because it stated most clearly the procedure of collecting and annotating the reviews, and it was the easiest domain to start from. But of course it is a bright prospect of the research to try different domains, which was actually done for English data in different articles. So yes, the answer is yes.

Thank you.

Okay, more questions? We don't have questions in the audience; Zoom... okay, I guess we don't have questions on Zoom. Well, actually we do have questions.

Okay, yes, just a quick question, thanks for the talk. As far as I understand, you annotated the reviews, well, you wrote the summaries for the reviews, based on a set of aspects taken from one of the prior works, and of course this choice of aspects influenced the resulting summaries quite a lot. So I do think that if you chose another set of aspects, your findings would be quite different.

Regarding this, we actually explored this avenue: we tried different seed-word extraction methods in order to compose a different set of aspects, different from the aspects from AceSum, and we tried all the different word embeddings, all the different methods described in different papers, but it didn't produce any better results according to the summarization metrics, which is why we stuck to the AceSum set of aspects. But if anyone could find a better way to extract aspects, that could be another avenue of research.

I guess that was not exactly what my question was about, because you evaluate your models on summaries which are written by humans taking into account some set of aspects, so your gold standard would be different if the aspects were different, right?

Yes, of course, the production of summaries was based on the given set of aspects; the set of aspects was taken from the AceSum method, and the authors provided it because they collected it manually from a large body of human-written reviews and extracted the most valuable aspects. But if we had other aspects, we could also experiment with that.

My question was about your expectations: for example, your current set of aspects includes things like cleanliness of the apartment, I don't know, food quality, etc., but if you chose some completely different set of aspects, say the ease of finding the hotel, do you think that your ranking, your results, would still be the same? What is your expectation?
I'm not sure; such experiments were not done, but I understand it is quite difficult, because the model generates the summary based on what it was trained on. So I guess it could find different aspects, and it would give an even lower ROUGE score but a better summary, if that makes sense; so probably ROUGE is not the best metric for that purpose, but it's the one we used, because we had the fixed set of aspects.

Okay, thanks.

Thank you.

We should announce: please don't forget that we will announce the best paper during the closing session, so please make sure to attend it so as to learn which is the best paper in the NLP track. Okay, so let me first introduce our speaker; thank you for coming. Artem was supposed to be here, but unfortunately, for personal reasons, had to cancel his trip at the last minute. Nevertheless, it's a great pleasure for me to present another very interesting speaker, joining us from the Emirates today. I have known Artem since 2019, and I should say that he has always amazed me with his energy and his dedication to NLP research, and, over the years, to the topic he's going to present today. I've seen how Artem grew into a leader in this field: he has studied a lot of active learning and uncertainty estimation techniques, published extensively at the very top conferences, and continues to do so, so his track record in NLP is really impressive. Today we will learn about this very important topic: uncertainty quantification for NLP, and more specifically for LLMs. So, without further ado, Artem, please go ahead.

Thank you very much for the kind words. I really apologize that I cannot be there in person, because of health issues, but never mind, let's dive into a very interesting topic. I currently work on the safety of NLP models, so let's go ahead. We'll have an introduction and some uncertainty estimation background; the talk will be focused on uncertainty estimation, and especially on uncertainty estimation for language models. I will also present our latest effort on summarizing and systematizing these diverse works on uncertainty estimation for language models in one package, in one Python library, and finally we will conclude with some suggestions.

So, first of all, we know that language models are becoming very good at multiple tasks: basically, regardless of which task you pick, you'll find something like this on Papers with Code, showing that we approach or exceed the human baseline. You can see that most of the tasks, or maybe most of the middle-complexity tasks, might be considered solved. But plain performance, such as accuracy, is not the only thing that we want to pursue; there is also safety, and safety is related to two other aspects: bias and reliability. So let's start with bias. What we usually have is data with some problems, maybe some garbage, and maybe some biases, spurious correlations between the target variable and some feature that shouldn't really be taken into account; we train a model and of course get a biased model that picks up these particular dependencies and uses them to make predictions, which is of course not good. We can think about many biases: for example, if you classify the mood of a person by a picture and you pick up the color of their eyes as a feature, that's one example of a bias you shouldn't take into account, because of course the color of the eyes makes no sense for predicting mood, but due to your data the model can pick it up as a viable feature and use it in the future.
So what we want in debiasing and fairness is: we have garbage data with some biases, and we still want to eventually get an unbiased, good model. How can we do this? One important case of bias in NLP which is usually considered is social biases and stereotypes, depending on gender, race, age, and here's one example where we try to determine the occupation of a person, whether a person is a surgeon or a nurse. We can find that sometimes classifiers look at the gender of the person and just put all women into the nurse category and all men into the surgeon category, which is completely unfair and should of course be avoided in practice, because if we do this we will perpetuate the biases in the data and our social stereotypes. How can we deal with this? There are two techniques that we can apply to our models, techniques that reduce the effect of bias in the original training data. First, we can reweight instances in the training data, so that the signal from the training loss does not carry the spurious correlation between the protected attribute and the target variable. Second, we can try to unlearn, or forget, some attributes in our feature representations: here, for example, we have an additional loss component that helps the model forget about gender, race and other protected features, so that they cannot be predicted from the feature representations, and eventually the model becomes less biased and more balanced.

All right, that's how we deal with bias; now let's go to reliability. So what is reliability, and why should we care? First of all, let's look at an example. Here is a banking application where a person asks for their balance, and this banking application works perfectly fine when the questions are related to the topic, but when the person asks something unrelated, say, from sports, it tries to answer the request the way it was trained to, and it fails, because it answers something that shouldn't be answered. That is an example of an out-of-distribution question, where a person asks something from a domain which is not part of the training domain of the model; instead, the model should probably say: sorry, I cannot answer your question, because it's not my type of concern, not my domain of knowledge. So that's one example of out-of-distribution input. Another example, a little bit more safety-critical, is diagnosis in medicine, where we have, for example, some symptoms, or maybe the anamnesis of a patient, and the model has to make a diagnosis that might affect the treatment of the patient in the future. Of course, there might be out-of-domain questions, which the model should try to avoid by asking a human for help, but there can also be situations where we have two very close diseases, like SARS and COVID, and it's very difficult to distinguish between them given the particular amount of information we have; then we should probably defer to a physician, a real person, or another more complex system to make the decision, instead of making a diagnosis that could harm the patient. So, those are two examples; let's go ahead. For reliability we always have to remember that model capabilities are very limited: the model was trained on a limited amount of data, there is always a training dataset and it is limited, and going beyond this dataset always carries the risk of making the model's answers completely unreliable.
There is also the situation when our task has ambiguities, like with SARS and COVID, where it's very difficult to distinguish the two diseases given the lack of information, the lack of features. So this second thing, ambiguity in the task, is another concern: we have to spot these areas of ambiguity and also try to abstain from making any decisions in them. What we need to do to achieve reliability is to develop mechanisms that take these limitations into account when the model is deployed; reliability is basically the capability of our system to apply such mechanisms properly, to detect ambiguity and many other things, like maybe adversarial attacks. How to handle reliability, how to make your model reliable? This slide is a little bit of marketing, because basically we're promoting some of our work here, published at ACL in 2022 and 2023 and at other conferences; there are some solutions to the reliability problem in terms of classification and out-of-distribution detection, and the last one will be published at ACL this year; it sits at the intersection between debiasing and uncertainty. But of course there are many more works on uncertainty for NLP, and we will discuss them later; this was just a small slide about our work.

Okay, so what new concerns do we have today? Today we are in the generative AI era, and in the generative AI era models can generate something inappropriate, something that shouldn't be generated. In this particular case with ChatGPT, it gives an incorrect answer: we ask how many letters there are in the word "nineteen", and the answer is incorrect; then we try to confuse the model even more, and it becomes even more confused. Yeah, this happens all the time with language models. I checked this example before this lecture, and well, the first issue was fixed, but the second one is not: you can still confuse the model into thinking that there is any number of letters in the word "nineteen". Another example, and I apologize in advance to Alex Panchenko for this, is a question about his biography: tell me about Alex Panchenko. The model hallucinates some facts about his biography and very confidently says that he's a professor; I mean, he probably will be in the near future, but right now he's not a full professor in Hamburg. Still, the model is pretty convincing in stating these false facts about Alex, and that happens all the time with these kinds of models.

So what concerns do we have in the area of generative AI? Models can generate unacceptable output, and first we should consider hallucinations, the false facts that they generate. We also should consider toxicity, because, you know, the data the model is trained on can have some leaks of toxic texts, and we should avoid that. We also should care about truthfulness, because sometimes the model cannot do the task, and it should be quite explicit about that: I cannot do this; it should be truthful. And one more thing which is commonly put into consideration is personality: the model should not pretend to be somebody else; it should be explicit that it is a model, an assistant, and not some human professor giving you the answer. How do we deal with these problems? In general, there are a few ways that people currently apply to all these language models, ChatGPT and others. There is reinforcement learning, of course, where the model is trained on a big number of scores from assessors, and the scores are essentially
aimed at reducing bias, toxicity and many other issues with the model. There are other approaches to this problem, like contrastive learning; you can also just fine-tune on good answers that people craft themselves; you can also do some filtering of the input data or the output results with a stop-list and similar things. One of the colleagues who was visiting us said that a stop-list is the way to go with language models, a very good thing that works much better than anything else. There is fact-checking, another stream of work where people look at the output of the model and do some fact-checking against databases. And finally, the one which is, I think, the most interesting for me: uncertainty estimation. Can we apply uncertainty estimation to language models?

First of all, let's go through the uncertainty estimation background, I hope pretty fast. Let's consider one small example: say we have a classifier that is supposed to distinguish cats and dogs, and it works perfectly; when it has a perfect picture of a cat or a perfect picture of a dog, it predicts what it was intended to. But what if we get something weird, which cannot really be classified as either the cat or the dog class? Here the model should probably output something around 0.5, something like 0.49, to demonstrate that this input is something weird, something it is very uncertain about, and that maybe we should call a human or somebody else to deal with this image. Unfortunately, the probability scores obtained from softmax classifiers are not very good for this. The reason is that the softmax probability can be used as an uncertainty score, but it gives you uncertainty like this: in most of the input space you are completely certain, here and here, and you are only uncertain in this small decision region between the two classes. That's what we get from a softmax classifier; but our cat-and-dog image can appear anywhere, it might appear somewhere like here, and in those areas the model will be completely certain and predict a particular class. So overconfidence is the problem of softmax probabilities for uncertainty estimation. What we really want to have is something like the right picture, where everywhere we don't have any evidence, where we don't have any data, we are uncertain, and we are also uncertain in the region between the two classes. That's what we want to have. So that's why we cannot simply use softmax probabilities; well, we can, but sometimes they don't do what we want.

Okay, so what is uncertainty exactly? Well, there is no unified way of specifying uncertainty scores: it can be anything that works well for our task, like out-of-distribution detection or selective classification; it might be some sort of distance, probabilities, entropy of course, or an estimate of the error. But in Bayesian theory there is a particular way of defining uncertainty, and this is basically the entropy of a probability distribution. So we have training data, and we can say that uncertainty is the entropy of the predictive distribution; this distribution can be parameterized with a neural network, and if we are Bayesian, we would like to put some sort of prior on our parameters, so that they also have a distribution. Then we can rewrite our predictive distribution in this way: we have this formula, which we can put into the predictive entropy formula and get something like this.
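Written out, the standard formulas being described here (our transcription of the slide in the usual notation, with $\mathcal{D}$ the training data and $\theta$ the model parameters) are

$$p(y \mid x, \mathcal{D}) \;=\; \int p(y \mid x, \theta)\, p(\theta \mid \mathcal{D})\, d\theta, \qquad \mathcal{H}\big[p(y \mid x, \mathcal{D})\big] \;=\; -\sum_{c} p(y{=}c \mid x, \mathcal{D}) \log p(y{=}c \mid x, \mathcal{D}),$$

that is, the predictive distribution averages the network's output over the posterior on the parameters, and the total uncertainty is the entropy of that averaged distribution.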
Of course, we could try to estimate everything properly in a Bayesian way, using approaches from Bayesian theory, we could do variational inference, etc., but usually we don't have a practical way to do this: Bayesian neural networks are very hard to train, and they have drawbacks like storing additional weights and other things.

So let's go to the types of uncertainty and look at our classifier again. Here we have a situation where we have a lot of data, and we have a classifier that tries to distinguish two classes, but in this small area between the classes, the classes overlap. This is an example of high aleatoric uncertainty, where we have noise in the data and we cannot easily distinguish the two classes simply by drawing a line; in this situation we say that we have high aleatoric uncertainty, which is related to noise and ambiguity in the data. Another situation is where we lack data: we have some freedom to draw the linear classifier in various ways, and each of them will perfectly distinguish the two classes, but a situation like the point here is ambiguous for us, because this might be the training set, but in the real world a point can be located here, and such a point could be attributed to either of the classes. In this case we say that we have high epistemic uncertainty, which is related to lack of knowledge: we don't have enough information, enough data, to pin down the classification boundary.

All right. So, by definition, the total predictive uncertainty is the sum of epistemic and aleatoric uncertainty, and from the previous formulas we can derive expressions for both. First, epistemic uncertainty: as I said, it reflects a lack of information, so here we define epistemic uncertainty as the information that we gain about the model parameters after we see the target variable y for the object x, that is, how our knowledge about the model parameters changes after we get information about this particular instance. If we rewrite this formula a little, we eventually get an expression where we subtract from the predictive uncertainty a term which is the expectation of the entropy for each particular realization of the model parameters. Okay, so what is aleatoric uncertainty then? Well, by definition, predictive uncertainty is the sum of epistemic and aleatoric uncertainty, so we can simply derive the formula for aleatoric uncertainty: it is the expectation of the entropy of the predictive distribution for each realization of the model parameters. So again, in Bayesian theory we have these perfect formulas, but in reality we cannot estimate them directly, it's quite difficult. There are several ways to do that in Bayesian modeling, of course, Bayes by backprop, variational inference, etc., but usually people don't do this, because it's more complicated, carries more overhead, and Bayesian models are very hard to train right now. Again, look at this formula: predictive uncertainty is the sum of epistemic and aleatoric uncertainty; epistemic uncertainty is the mutual information between the model parameters and the target variable, and aleatoric uncertainty is the expectation of the entropy over realizations of the model parameters.
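In standard form, this is the usual decomposition the speaker is describing:

$$\underbrace{\mathcal{H}\Big[\mathbb{E}_{p(\theta \mid \mathcal{D})}\, p(y \mid x, \theta)\Big]}_{\text{total (predictive) uncertainty}} \;=\; \underbrace{\mathcal{I}\big[y;\, \theta \mid x, \mathcal{D}\big]}_{\text{epistemic (mutual information)}} \;+\; \underbrace{\mathbb{E}_{p(\theta \mid \mathcal{D})}\, \mathcal{H}\big[p(y \mid x, \theta)\big]}_{\text{aleatoric (expected entropy)}}$$

so epistemic uncertainty is obtained in practice as total uncertainty minus the expected entropy of the individual parameter realizations.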
Let's look at how these uncertainties behave on the charts: we have a sort of two-moons dataset. Aleatoric uncertainty is basically this line between the two classes, the decision boundary; epistemic uncertainty is basically the whole area around the training dataset; and when we sum these two uncertainties together we get the predictive uncertainty, so we have uncertain areas both in the region without training data and between the two classes. Where do we apply the different types of uncertainty? Aleatoric uncertainty can usually be used for cleaning data: we can spot noise in datasets. Epistemic uncertainty can be used to detect outliers; of course, it can be used for out-of-distribution detection, and there is also the possibility to use it as a selection criterion in active learning. Total uncertainty is used for selective classification, and that is perhaps the most critical thing for safety-critical applications, where we need to abstain from making a decision in some cases, for example, in medical applications.

All right, let's have a very broad overview of methods for uncertainty estimation; we'll get to particular realizations of these methods when we discuss specific methods for language models. In general, a very strong approach to uncertainty estimation is using an ensemble of models and looking at the diversity of the predictions of the ensemble elements: if the diversity is high, the uncertainty is high. A cheaper thing you can do is Monte Carlo dropout: Monte Carlo dropout is a technique where you make not one prediction but multiple predictions, keeping your dropout layers enabled, so each prediction becomes a little bit different from the others. You can think of it as a cheaper version of an ensemble: you don't need to train multiple models; you have one model, but you apply different dropout masks and get slightly different predictions, and again we look at the diversity of the predictions; if the diversity is high, the uncertainty is high. There are also density-based methods, where we try to approximate the training distribution with some model based on the latent feature representations: we approximate the training data distribution and analyze whether our instance belongs to this distribution or not. And finally, there are techniques related to training loss regularization: you add a regularizer to the loss, which helps you calibrate your model a little; sometimes it is also helpful for selective classification.
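A minimal sketch of Monte Carlo dropout for a classifier, tying it back to the decomposition above (`model` is any torch classifier with dropout layers; this is an illustration, not a library API):

```python
import torch

def mc_dropout_predict(model, x, n_samples=20):
    # Keep dropout active at inference by switching to train mode.
    # Caveat: this also switches BatchNorm to batch statistics; in practice
    # one would enable only the dropout modules.
    model.train()
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_samples)])
    model.eval()
    mean = probs.mean(dim=0)                                              # predictive distribution
    total = -(mean * mean.clamp_min(1e-12).log()).sum(-1)                 # total uncertainty (entropy)
    aleatoric = -(probs * probs.clamp_min(1e-12).log()).sum(-1).mean(0)   # expected entropy
    epistemic = total - aleatoric                                         # mutual information
    return mean, total, aleatoric, epistemic
```

High disagreement between the sampled predictions shows up as a large epistemic term, which is exactly the "diversity of predictions" signal described above.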
The simplest techniques are very similar to those for standard classification models. We can look, for example, at the maximum sequence probability: we estimate the probability of the sequence by multiplying the probabilities of each decision in the sequence, and subtract it from one to get an uncertainty score. Another very common option is to average the log-probabilities of the tokens; sometimes this is called perplexity, sometimes length-normalized log-probability, depending on whether you exponentiate or not. You can also simply calculate the entropy for each particular token and average it. You can think of many other ways of aggregating the individual token predictions, but these are the most common ones in the literature. Okay, now slightly more interesting stuff: point-wise mutual information. This is a way of addressing the fact that the model sometimes outputs generic content, and maybe the uncertainty of this generic content is not that crucial: if the model outputs something very generic, maybe we shouldn't be that concerned about it. So there is an approach that corrects for this with an additional term: we run the model with the query and without the query, and look at the probability of the sequence without any particular input, that is, how generic it is. A slightly more elaborate way of doing this is conditional point-wise mutual information. The idea is that we fall back to simple perplexity most of the time, but when we are very uncertain, when the entropy is high enough, we also add the term with the generic-sequence probability. It is a little faster than plain point-wise mutual information and also works a little better. Okay, ensembling. I will not spend a lot of time on ensembling. Unfortunately, compared to classification, ensembling in generation is not that good. In classification it is one of the most reliable approaches: if an ensemble doesn't give you any improvement, probably nothing will. But in sequence generation, according to our experiments, ensembles do not perform that well. For ensembles we can do the same things as before: we can look at the average probability over the predictions of the ensemble members, and there is also a stronger technique suggested by Malinin and Gales, reverse mutual information, where we look at the log-probabilities of each individual ensemble member. We could also use Monte Carlo dropout here; it works a little worse in sequence generation, and note that you probably need to keep the same dropout masks across the generation of a sequence to keep your scores reasonable.
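To make the simple information-based scores from the beginning of this part concrete, here is a toy sketch operating on per-token probabilities; the function and variable names are my own illustration:

```python
import numpy as np

def sequence_scores(token_probs, token_dists=None):
    """Uncertainty scores for one generated sequence.

    token_probs: probabilities assigned to each generated token.
    token_dists: optional (seq_len, vocab) array of full next-token
    distributions, needed only for the mean token entropy.
    """
    eps = 1e-12
    logp = np.log(np.asarray(token_probs) + eps)
    scores = {
        # One minus the product of token probabilities (maximum sequence probability).
        "msp": 1.0 - float(np.exp(logp.sum())),
        # Length-normalized negative log-likelihood (the log of perplexity).
        "mean_nll": float(-logp.mean()),
    }
    if token_dists is not None:
        ent = -np.sum(token_dists * np.log(token_dists + eps), axis=1)
        scores["mean_entropy"] = float(ent.mean())  # averaged token entropy
    return scores

print(sequence_scores([0.8, 0.6, 0.9]))
```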
All right, density-based methods. These are quite strong methods, especially for out-of-distribution detection, and here are two papers that proposed them for OOD detection in sequence generation tasks: our work, published at ACL 2023, and a very similar concurrent work published at ICLR 2023. The idea of using density-based techniques in sequence generation is basically the same as in classification: we use the so-called Mahalanobis distance as the uncertainty estimation score, and you compute it like this. The idea is that you estimate the conditional probability of your particular instance belonging to some class in your training data. For example, here we are looking at the probability of x belonging to the yellow class, and we say that this probability is a normal distribution with a centroid, the parameter mu, at the center of the class, and a covariance matrix sigma, which might be computed from that particular class or from the whole dataset; these are just different ways of computing the Mahalanobis distance. If you do the derivations correctly, you eventually get this distance score: the uncertainty is the distance between the centroid of the class and the data point for which you are estimating uncertainty. The higher this distance, the more uncertain you are, because your point lies far away from the training data. Of course, in sequence generation tasks you don't have classes, so you use one single class containing all the training data: you compute its centroid, you compute the covariance matrix, and then you estimate the distance between the centroid and your instance, and that's it. So what is h here? h is the representation of our instance, and in sequence-to-sequence models we have two options, the encoder and the decoder; we tested both and they work pretty well. You can take an embedding of the input query, and you can do the same with the output: take the embeddings of all output tokens and average them to get an embedding of the output sequence. Then we compute the covariance matrix and the centroid on our training data and determine whether the instance is out of distribution or not. Pretty simple. There are two modifications of this. One is the Mahalanobis distance plus PCA plus the minimum covariance determinant: PCA helps us reduce the dimensionality of the representations and reduce the effect of outliers, and the minimum covariance determinant helps filter out noise when we compute the covariance matrix. The other option is to calculate the distance for our particular point and subtract the distance to a global centroid of a background collection. In this case we take a big background collection, say the C4 dataset, compute its centroid, and look at how close our instance is to it. The idea is that we don't want to be highly uncertain for very common queries and very common outputs, so we flag an instance as uncertain only when it is far both from the training data and from the pre-training data. Okay, let me reiterate about density-based methods: they are very good for OOD detection, and they are a specific type of uncertainty estimation technique. Especially with sequence-to-sequence models they show pretty good results on machine translation, summarization, and question answering tasks; you can find the experiments in our paper
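A small sketch of the Mahalanobis-distance score just described, under the single-class formulation used for sequence generation; the synthetic data is purely illustrative:

```python
import numpy as np

def mahalanobis_score(h, train_embeddings):
    """Mahalanobis distance of an instance embedding to the training data.

    In sequence generation there is a single 'class': the centroid and
    covariance are estimated over all training embeddings (e.g. averaged
    encoder or decoder hidden states).
    """
    mu = train_embeddings.mean(axis=0)
    cov = np.cov(train_embeddings, rowvar=False)
    # Pseudo-inverse for numerical stability with near-singular covariances.
    cov_inv = np.linalg.pinv(cov)
    diff = h - mu
    return float(diff @ cov_inv @ diff)  # higher = farther from training data

rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 8))
print(mahalanobis_score(train[0], train))          # in-distribution: small
print(mahalanobis_score(np.full(8, 6.0), train))   # far away: large
```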
and in the concurrent work of our colleagues. All right, semantic entropy. This is more interesting stuff. You probably know that a model can generate multiple sequences that are similar in meaning but very different in their surface realizations. If you ask who the president of the United States is, the model can answer "George Bush" or "George Bush is the president of the United States", and so on; these are essentially the same thing, and we want to take these similarities into account. Semantic entropy does exactly that: it samples several predictions from the language model, say A1 to AM, clusters them into meaning clusters, and then estimates the entropy over these meanings rather than over the particular sequences. Each meaning can contain multiple generations; in the president example, they can all be put into one single cluster. To calculate the probability of each meaning, the authors simply sum the probabilities of the individual sequences in it, and then compute the standard entropy formula on top of the meaning probabilities. This works pretty well, but surprisingly, black-box methods work even better. For semantic entropy we need access to the logits and output probabilities of the language model, but usually models like ChatGPT, GPT-4 and many other APIs don't provide access to either the embeddings or the logits of the model, so all you can do is analyze its outputs. So what you do here is again sample multiple outputs from the language model, compute pairwise similarities between these responses, and then compute some uncertainty estimate on top of that. First, the authors of this paper tried multiple pairwise similarity scores. One of them is simply Jaccard similarity, where they compare the bags of words of two answers; surprisingly, it worked. Another, more elaborate option is to use an off-the-shelf BERT-like model for NLI entailment: the idea is to look at two outputs and determine with the model whether one is an entailment of the other. If each output entails the other, you say they are essentially the same thing, so they are similar. This is the second, more effective way of computing similarity. Now let's get to the uncertainty scores. Again, a surprisingly simple score works: the number of semantic sets. You compute the pairwise similarities, cluster the outputs into semantic clusters, and count the clusters: if the number of clusters is high, you are uncertain; if it is low, you are certain. That's basically it, and it works, but there are better versions. The second version is to do spectral clustering on top of these similarities: instead of building a hard adjacency matrix, you build what I would call a soft adjacency matrix, with the similarity scores of each output sequence to every other, and then you can do spectral clustering and look at the number of spectral clusters; essentially that is what this method does.
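Returning to semantic entropy for a moment, here is a toy sketch of the computation described above; it assumes the meaning-cluster assignments are already given, whereas in the original method they come from an NLI model:

```python
import numpy as np

def semantic_entropy(seq_probs, cluster_ids):
    """Entropy over meanings rather than surface strings.

    seq_probs: probabilities of the M sampled generations.
    cluster_ids: meaning-cluster index for each generation.
    """
    seq_probs = np.asarray(seq_probs, dtype=float)
    cluster_probs = np.array(
        [seq_probs[np.asarray(cluster_ids) == c].sum()
         for c in sorted(set(cluster_ids))]
    )
    cluster_probs /= cluster_probs.sum()  # renormalize over the samples
    return float(-np.sum(cluster_probs * np.log(cluster_probs)))

# Two paraphrases of one answer plus one different answer:
print(semantic_entropy([0.4, 0.35, 0.25], cluster_ids=[0, 0, 1]))
```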
Another, even simpler method, which in our experience is actually a bit more effective, is the average pairwise distance. Again, we construct this similarity matrix, sum the pairwise similarities between the output sequences, and average them to get the uncertainty score: if the average pairwise similarity is low, we are uncertain; if it is high, we are certain. These two things work really quite well, and moreover, I want to emphasize again that in this case you don't need any access to the model internals: you can apply this method to ChatGPT or anything that has an API. Finally, there is the method called P(True), from the Anthropic paper, which basically says: why not ask the model directly? And that is just what they did. They took multiple-choice question answering, provided the question together with the model's answer, and then asked the model whether the answer is true or not. Surprisingly, as you can see here, the model's answer to this second question can be used as a proxy metric for the uncertainty of its original answer: for the wrong answers the model is usually uncertain, with scores below 0.5, closer to 0.4, while for the right answers it is pretty confident, as you can see from the distribution. So this is an interesting way of looking at uncertainty, just asking the model itself, and probably an interesting direction for future work. But I want to emphasize that although it works in this particular case, in the work of these researchers, in our experiments and in the experiments of other people it didn't work that well, for example in machine translation, and sometimes it also doesn't really work in question answering. All right, let's do a short summary of what works and what doesn't. For out-of-distribution detection, I would just go with density-based methods, specifically robust density estimation. For selective classification, I would look at the black-box methods: the average pairwise distance with lexical similarity. It is also probably a good idea to combine density-based methods with perplexity; one ICLR paper showed that this helps, and we also have experiments where it helps. In text classification, the P(True) method sometimes doesn't work, and ensembles also sometimes don't work.
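A self-contained sketch of the black-box average-pairwise-similarity score from this summary, using the bag-of-words Jaccard similarity mentioned earlier; real implementations can swap in an NLI-based similarity:

```python
import itertools

def jaccard(a, b):
    """Bag-of-words Jaccard similarity between two answers."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def avg_pairwise_uncertainty(samples):
    """One minus the mean pairwise similarity over sampled responses.

    High when the samples disagree with each other; needs only the text
    outputs, so it works with any API-only model.
    """
    pairs = list(itertools.combinations(samples, 2))
    sim = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - sim

consistent = ["paris is the capital", "the capital is paris", "paris"]
scattered = ["paris", "london definitely", "maybe rome"]
print(avg_pairwise_uncertainty(consistent))  # low
print(avg_pairwise_uncertainty(scattered))   # high
```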
Now let's get to the final part of this talk, our framework LM-Polygraph; it basically helps you to know what LLMs do not tell you. LM-Polygraph is a Python library that accumulates state-of-the-art uncertainty estimation techniques. It supports state-of-the-art GPT-like models, it has wrappers for APIs, and you can add uncertainty estimation to your language model with just a few lines of code. It also provides a benchmark so you can evaluate uncertainty estimation techniques, possibly on your own data, and so on. We also hope to provide a live demo soon, maybe at EMNLP, maybe at AAAI. All right, some demonstration examples with LM-Polygraph from our demo. Here, for example, we ask the model to translate into a non-existent language ("translate into Wasabi"), and you can see that the model is completely uncertain: our uncertainty estimation metric shows zero confidence. For French it's pretty easy, and as you can see, the model is completely certain about its output. Another example concerns the knowledge of the model. As we can see, the model doesn't know songs by Russian singers, but it knows the songs of the Beatles pretty well. In the Russian case it tries to make something up: asked about Irina Allegrova, it tries to predict something similar, but essentially it fails, and now we can detect that it fails. Another example is asking simple and complex questions. If we ask a complex question, such as how we can cure a dinosaur of pneumonia, then although the model provides a list of suggestions about how to cure pneumonia in a dinosaur, we can see that it is completely uncertain; but when we ask the same thing for a human, it shows pretty decent confidence. The same with how to perform a kidney surgery: surprisingly, the model gives you a pretty good plan for performing a kidney surgery, with high confidence, but if we ask how to perform a kidney surgery with only one arm, the model is completely uncertain. Of course, doing a kidney surgery with one arm is not a great idea anyway, but you see that the uncertainty estimation scores work here: they show that this is an unreliable answer. Finally, some words about our team. I want to acknowledge our great team for developing this library. Maxim Panov, who is at the conference, is also part of this initiative, and there are many authors from different organizations; these are the main organizations behind this work. So, in conclusion, some takeaways. There are things we should consider beyond just accuracy and performance metrics, namely safety, fairness and reliability, and uncertainty estimation is a crucial component of machine learning systems, including language models. For out-of-distribution detection, consider density-based methods like robust density estimation; for selective generation, try black-box methods, because they work very well; and maybe also try the combination of density-based methods and perplexity. Overall: don't blindly trust LLMs, and try LM-Polygraph to reveal what LLMs do not tell you. I also want to note that we have a very strong team at MBZUAI working on safety and fairness, led by Professor Tim Baldwin, and I am honored to be one of the colleagues in his group. Regarding fact-checking, we also have a very strong team under Professor Preslav Nakov, and I am working both in that direction and in uncertainty estimation; and I would note that Maxim Panov is also one of the colleagues working on uncertainty estimation right now. All right, that's it, thank you very much for your patience. If you have any questions, I'll be happy to answer. Here are my contacts again, and here is our GitHub link; please give us feedback. It is still a beta, maybe alpha, version, but let's see how it goes. Artem, thank you very much for this most insightful talk.
First of all, I would like to ask if there are any questions from the audience; if yes, you can come here or go to the mic. Thank you, Artem, for the very nice talk. My question is maybe just clarifying something I may not have understood: have you tried to analyze how the uncertainty of the model correlates with the error rate, with the wrong decisions made by the model? Or is that a different thing? Well, actually this is the task of selective classification, where we want all uncertain instances to be incorrect and all certain instances to be correct. We want to sort our dataset in this way, so that the very uncertain instances contain more errors and the instances with correct answers are confident. Then, if we abstain from making a prediction for, say, 10% of the instances and give them to a human expert or another system, we get a better outcome. So yes, we did this: we have a couple of papers on text classification, and we also have a paper on selective generation where, for example, we solve question answering with multiple-choice questions, and we analyze which answers are certain and which are not. Okay, thank you. And a short, just funny question: have you tried this LM-Polygraph on real humans, on what they generate? Is it possible to analyze what humans generate? Well, unfortunately, we cannot ask a human to write the answer to the same question multiple times. It's a polygraph for language models, not for humans: it's about how we can analyze the distribution of predictions of a language model, not of a human. Fortunately, for humans we have the ordinary polygraph. Artem, I would not agree with you, because if you have ever taken sociological tests, you know, like in companies, this is what they do: they ask you the same question again and again, so they essentially sample, and see whether you answer consistently. So actually that works. Maybe, but I haven't tried comparing how frequently humans are certain versus how frequently an LLM is certain. Yes, they have, say, a hundred questions, but it is more or less the same question repeated in different paraphrases, and then they basically check whether you answer the same way, so that's actually not far from how they do it. Okay, more questions from the audience? Okay, Artem, maybe let me ask you this question. The fact that ensembling doesn't work while black-box methods work so well has been empirically discovered, but do you have some generic explanation or insight into why this is so for real LLMs, and why it works so differently from other machine learning setups? Well, maybe we just haven't designed the ensembles very well. There are several ways to design ensembles: we can train them on different datasets, we can apply multiple seeds. But essentially, if the models are pre-trained, they end up very, very similar, and that's the problem. Maybe if we look at ensembles of multiple different models, or apply some techniques for building ensembles like in LLM-Blender, for example, that would help us a little. So I think we need to add more diversity to the ensemble in this case; I think that is the answer, but maybe there are some other caveats in ensembling. So am I understanding your answer correctly that the variation produced by a
typical decoder mechanism is smaller than the variation in ensembles of classifiers, say, run through Monte Carlo dropout or something like this? Well, in simple text classification models you can build an ensemble just by using different random seeds, and the models will be different enough. In language modeling, you pre-train models on a big dataset, and if you then fine-tune them on a small dataset, they will probably not be that different from each other. That's the problem: the more data there is in pre-training, the more similar the ensemble members will be, and the more similar the answers they will give you. I think this is one possible answer to this question. Thank you. Any more follow-up questions from the audience? Maybe let me ask one last but short question. You presented this approach where you simply ask the model whether its answer is correct; that's interesting, but have people tried some kind of chain-of-thought elaboration of this idea? Say, you answer multiple times, sample several responses, or repeat the question about certainty or correctness in different ways, not just once, this way also obtaining some kind of sampling, but in a dialogue style? Yes, I think the reflexive power of language models is very strong too, and I saw a work where chain of thought was used to improve the quality of the answer; of course we know that chain of thought improves answer quality. And the setup where the model assesses its answer by itself, decides the answer is not good, and then corrects it, has also been used in some works. I think there was a paper just a few days ago; when I saw the idea I thought, okay, it's impressive how fast people get to these ideas. Unfortunately I don't remember the title exactly, but the idea was essentially similar: they queried the model multiple times, had the model assess the output, and then tried to correct it multiple times. Really great; we live in an interesting age, and if you're interested in this research, get in touch with Artem, and your research might benefit as well; as you see, there is a lot of room for improvement. Okay, we are very much out of time now, unfortunately. Artem, thank you very much for your insightful talk. Now we switch gears, and our next speaker is Dr.
Mohammed Malik, a postdoctoral fellow at the Higher School of Economics in Moscow, at the Faculty of Computer Science, and a former assistant professor in Islamabad, Pakistan. He has extensive experience with both teaching and research, with contributions spanning almost 20 years, and today he will be speaking about threatening content and target identification in low-resource languages; that's another NLP talk for today. Hamid, without further ado, please. Okay, thank you so much, good afternoon everybody, and thank you to the organizing committee for providing me this opportunity to deliver a talk on this social media mining topic and to share my findings with you. Let me introduce the title of the talk: threatening content and target identification in low-resource languages using NLP techniques. This is the outline of my talk. First I want to introduce some terminology related to this domain, then the challenges, and then the problem definition related to my contribution in this domain for the Urdu language. I will discuss a case study with the results, and then I will conclude my talk by summarizing the findings and the future prospects. Okay, coming to the first point: what is hate speech? In the literature, several definitions of hate speech exist, because researchers define hate speech according to their own understanding, knowledge, vocabulary and perspective. Here are a few hate speech definitions, with references added; you can see that these definitions vary slightly according to the researchers' own perspectives. But let me give you the definition on which the majority of researchers have a consensus: hate speech is toxic speech, an attack on a person's individuality, likely to result in violence when targeted against groups based on specific grounds such as ethnicity, religion, race, place of birth, personal background, language, residence, caste, community, and so on. So where there is hate speech, there is some target, because hate speech always targets someone; that is the basic conclusion we can draw from this definition. The next point I want to describe is the various forms hate speech can take, how it can be delivered and how it can be formulated. A few common terms are presented here: cyberbullying, flaming, profanity, abusive language, discrimination and toxic comments. These are a few forms of hate speech; let me share the definitions of these forms, so we can distinguish between them and hate speech in general. For example, consider abusive language: the term refers to language that seeks to diminish or humiliate a person or group, and hate speech is a type of abusive language. So abusive language can be seen as the parent category of hate speech: hate speech is always abusive language, but abusive language is not always hate speech. Similarly, toxic language: as far as its definition is concerned, toxic language conveys content that is disrespectful, abusive, unpleasant and harmful.
But not all toxic comments contain hate speech: toxic comments can be general, without targeting anyone. We can understand this concept of toxic language by noting that some people simply have a habit of using a toxic style of language; they use words that do not directly target anyone, and that is toxic language without hate speech. When we talk about hate speech, there must be a target. Okay, the next point is the famous twelve-language landscape presented here, shared by the Washington Post in 2022. The landscape describes the proportion of speakers of each language. Here we can see the Chinese language, with 1.39 billion speakers worldwide, including all dialects of Chinese across all their scripts; some languages have more than one script, and similarly the Urdu language has more than one script, as do Arabic and Hindi. As far as Urdu and Hindi are concerned, we can see that globally 588 million people use Hindi or Urdu for their communication. For Arabic the statistics are shown as well; the proportions of speakers of Bengali and Russian are almost equal, and we can see that Italian has the lowest number of speakers in this landscape. Hindi, Urdu, Chinese, Arabic, Bengali, Russian, Italian, Portuguese, German, Japanese: these are the low-resource languages here. Coming to the next point: what are the challenges with these low-resource languages when designing an identification or detection system? The first, very basic challenge is the lack of annotated datasets: we have to crawl the information, clean it, and then go through an annotation process. The next challenge is that for some low-resource languages, such as Urdu and Arabic, and even Russian and Bengali, which I know because I work on these languages, some essential resources and accurate text-processing toolkits are not available, or are missing compared to high-resource languages such as English. The third challenge is that some languages use multiple scripts. For example, Urdu uses two scripts: people write either in the Arabic style, also called the Nastaliq style, or in the Roman style, Roman Urdu. Similarly, for the Arabic language there are also two styles, while for English there is only one style, the Roman style. On social media, users often use multiple scripts: if I talk about the Urdu language, users often use both scripts, the Arabic style and the Roman style, at the same time when sharing their opinions. That is the problem of code-mixing, another challenge with low-resource languages. And the last challenge I will list here (these are not all the challenges, but a few to highlight the issues with low-resource languages) is that not every one of the latest, already available language models supports low-resource languages; that is another challenge.
Coming to the problem definition, the task for which I want to discuss the proposed methodology, the results and the findings: the problem definition is given on the left side of the slide. At the top we have the Urdu language, then hate speech in general, and that general hate speech can be categorized into threatening content, violence incitation, and other categories. Today I am discussing the type of general hate speech that is threatening content identification in the Nastaliq script, the Arabic style. On the right side, the hierarchical classification of the problem is shown: a tweet or social media comment is categorized into two labels, threatening or non-threatening, and the threatening contents are then further considered for target identification, that is, whether an individual or a group is being victimized in the threatening post or tweet. The difference between the individual and group class labels is that when a single person is addressed, the instance belongs to the individual class, and when more than one person is targeted, we categorize it into the group class. My contribution here handles only the Urdu language, not Russian, although I am also working on the Russian language, another low-resource language. For Urdu I have these contributions. First, a contribution from 2022 on offensive content identification in Urdu: one paper was already published and the next paper is under review. Then hate speech and targeted community detection: today I am describing the framework for the general target, individual or group, whereas in the targeted community detection work we target the community, whether it is a religious community, a political community, the media, the judiciary, or the army, the uniformed personnel, that is, who is being targeted in the hate speech. Then threatening content and target identification, which is today's topic, and I also have a contribution on a multilingual model for threatening text identification in English and Urdu. So today I want to share the design of the methodology for the Urdu language concerning threatening content and target identification. Here the question arises of why there is a need to design such an identification system for Urdu. Urdu is the national language of Pakistan, and if we consider the Asian region, around 170 million people speak and express their opinions and views in Urdu on social media. If we consider the global perspective, there are around 300 million Urdu speakers, and Urdu is spoken not only in the southern part of Asia but also in the USA, the UK and Canada. That is why there is a need to design an identification framework for threatening content and target identification in the Urdu language. On the right side I have added the alphabet: these are the 39 letters used in Urdu to build words and sentences and to express opinions and concepts. Urdu is more similar to Arabic and Persian than to any other language.
What were the objectives addressed while designing the framework for the Urdu language? The first was to design an automated identification system for Twitter data that accurately classifies tweets as threatening versus non-threatening. The second: for the threatening tweets, to design an effective framework to identify the type of target, whether an individual or a group is being targeted. The third objective was a preliminary requirement we set when starting the design: the proposed framework should be based on automated feature generation rather than hand-crafted features. Now, the proposed methodology; a block diagram is shown here. The tweets were collected from Pakistani Twitter accounts. Pakistani Twitter users were considered because Urdu is the national language of Pakistan, so we had the chance to build a very big corpus. After that, preprocessing techniques were applied (I will describe these steps individually on the next slides; here I just want to give an overview), and then feature extraction. Different types of feature extraction techniques were applied: word n-grams, character n-grams, semantic techniques, word embedding techniques, fastText, topic models, latent semantic analysis, and language models. After that came the machine learning part and the fine-tuning process with Urdu RoBERTa (there is a typo on the slide here). At the first level, the contents were classified into threatening versus non-threatening; at the second level, the threatening contents were considered for target identification into individual or group. Before applying the actual proposed methodology, we needed an annotated dataset; that is the first challenge I described when discussing low-resource languages, since annotated datasets are always a problem there. So we crawled the relevant tweets, processed them, and then annotated the dataset according to the designed annotation guidelines. Before starting to crawl, since in NLP we want to crawl data related to a specific topic and research area, we needed to design a lexicon. Without this seed-word lexicon we could not crawl the relevant data: if we crawl everything, how can we identify which data is related to our domain and topic and which is not? We need to design this lexicon manually, by examining the type of content for which we are going to crawl the corpus. So we designed a lexicon of 250 keywords, so that we could easily crawl the relevant tweets from Urdu Twitter accounts. Here I have added examples of a few keywords: you can see the Urdu keywords and their translations, unigrams, bigrams, trigrams, and also some 4-gram keywords. After designing this lexicon, anyone who is going to crawl the data is in a position to collect the relevant posts or comments from whatever social media platform they are crawling.
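A hypothetical sketch of the seed-lexicon filtering step; the English keywords here are placeholders, since the real lexicon contains roughly 250 Urdu entries:

```python
# Keep only posts that contain at least one lexicon entry. Entries may be
# unigrams, bigrams, trigrams or 4-grams, so we match them as substrings of
# the whitespace-normalized text.
def build_filter(lexicon):
    entries = [e.strip().lower() for e in lexicon]
    def is_relevant(post: str) -> bool:
        text = " ".join(post.lower().split())
        return any(entry in text for entry in entries)
    return is_relevant

# Illustrative English placeholders, not actual lexicon entries.
is_relevant = build_filter(["threat", "will destroy", "you will pay"])
posts = ["nice weather today", "you will pay for this"]
print([p for p in posts if is_relevant(p)])  # -> ['you will pay for this']
```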
The next point is the time range for which we should crawl the data, because it depends on the situation. For threatening content and target identification, we considered a time period of 24 months, ending in August 2022, because in this period the political situation in Pakistan was very unstable: people were very aggressive, sometimes worried, and expressed their opinions on Twitter and Facebook, so in this time period we had the opportunity to collect relevant content and design a better annotated dataset. By applying this lexicon and this time frame, we crawled data of both types, threatening and non-threatening. After that we applied a cleaning process, because the data should be clean before it is given to the annotators: if the data has inconsistencies, the annotators will have problems while annotating. A few cleaning steps are listed here: removal of empty tweets, duplicate tweets, and tweets containing words from other languages. For example, if a tweet contains an Arabic word, or sometimes a Hindi or Bengali word, it is not feasible to translate those words into Urdu manually, so we removed those tweets. After that we had a clean dataset, and we designed annotation guidelines so that the annotators could annotate the dataset easily. For this annotation we hired the services of Pakistani annotators, because Urdu is the national language in Pakistan and people there have an advantage compared to other countries. The basic criteria for choosing the annotators were that they should be native Urdu speakers, have at least a master's-level education, and have prior experience annotating Urdu data, because the annotation is not only at one level: first they have to categorize a tweet as threatening or non-threatening, and at the second level the threatening contents are labeled as targeting an individual or a group. After annotation, we compiled the data and computed the inter-annotator agreement, which is above 80 percent, exactly 83 percent. Here I have added a few samples from the annotated dataset. The first and second are threatening tweets, with the second-level label showing whether an individual or a group is targeted; in this column the Urdu text is given, and in the column beside it the translated version. If you look at the third and fourth tweets, it is obvious and clear that for non-threatening tweets we do not need to consider the second level, because we are interested only in who is being threatened when a tweet is threatening; that is why this level of annotation is not applied here. But notice the fourth tweet: it is not threatening, but it is abusive or toxic content, because the person is referred to with a dog's tail. It is not exactly threatening, but the person is being abused, insulted, disrespected. As I described earlier, the cleaning process was applied before annotation; then, once we had the final dataset, we applied preprocessing.
Here the main difficulty was the stop-word list, because Urdu stop words are very difficult and very different from English ones: for one stop word there can be many written variants. For example, the word "ka" is a stop word, and it can also appear as "ke"; there can be multiple versions of this one stop word. That was the main obstacle in stop-word identification and removal. Some stop-word lists were already available, because researchers had done some work, but we compiled a big lexicon and shared it. At the same time as the stop-word removal (this was for the feature-engineering models, not for the transformer models), we also transformed the emojis and emoticons present in the tweets, so that the context stays intact: if we simply remove emojis and emoticons from the tweets, the context can be broken, so we chose to translate these emojis into the corresponding text. The other preprocessing techniques are standard, and people are familiar with them: the other irrelevant information is removed. Here is a demonstration of the preprocessing techniques. First, punctuation removal on the Urdu text, shown before and after, with the translated tweet also added. For stop words it is really interesting: you can see how the sentence reads, and how we pronounce it, after stop-word removal. The replacement of emojis is also demonstrated: in this tweet the person is angry, so we replace the emoji with the corresponding text, and after that the hashtags and other irrelevant information are removed. On this slide I also added some sample stop words, with their corresponding translations, so you can get an idea of what kind of stop words are used in Urdu. Here a word count and a cloud representation are presented, showing the most dominant keywords used to threaten somebody.
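A minimal sketch of the preprocessing pipeline just demonstrated; the stop-word variants and the emoji map are tiny placeholders for the real Urdu resources:

```python
import re

# Placeholders: the real resources are an Urdu stop-word lexicon covering the
# spelling variants of each stop word, and an emoji-to-Urdu-text mapping.
STOP_WORDS = {"ka", "ke", "ki"}          # variants of one Urdu stop word
EMOJI_TO_TEXT = {"\U0001F621": "angry", "\U0001F600": "happy"}

def preprocess(tweet: str) -> str:
    # Replace emojis with their textual equivalents to preserve context.
    for emoji, text in EMOJI_TO_TEXT.items():
        tweet = tweet.replace(emoji, f" {text} ")
    # Remove hashtags, mentions and punctuation.
    tweet = re.sub(r"[#@]\w+", " ", tweet)
    tweet = re.sub(r"[^\w\s]", " ", tweet)
    # Remove stop words (for the n-gram feature pipelines only; the
    # transformer models consume the raw cleaned text).
    tokens = [t for t in tweet.lower().split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(preprocess("he will pay #threat \U0001F621 ka"))  # -> 'he will pay angry'
```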
Now let me introduce the proposed methodology for the feature engineering and machine learning part of the framework. We searched for the latest pre-trained language models available for Urdu, and after an exhaustive search we found only two transformer models pre-trained on Urdu: one is Urdu RoBERTa, and the other is multilingual BERT. We applied both models for feature engineering, but I report only the Urdu RoBERTa model here, because its performance is very effective and promising. So we have a transformer model pre-trained on a big corpus, and if we want to use it for our specific task in Urdu, even though it is already pre-trained on Urdu, we need to fine-tune it carefully with appropriate hyperparameters. Fine-tuning is not an easy task: if we fine-tune a pre-trained model blindly, it can even lose its prior knowledge. We face two issues when fine-tuning any transformer model: one is catastrophic forgetting, and the other is overfitting. Catastrophic forgetting means that the model has already learned knowledge during pre-training, and since we unfreeze all the layers of the transformer to learn the new knowledge, if we do not handle this properly, the model can lose its previous learning. The other problem is overfitting, which is a general deep learning issue: choosing the number of epochs for training a model is a balancing act, since too few epochs result in underfitting and too many result in overfitting. We dealt with both problems, catastrophic forgetting and overfitting, appropriately. This is the list of hyperparameters we used for fine-tuning the Urdu RoBERTa model; you can see the parameters and their ranges. We applied a grid search to find the optimal values of the parameters, so that we could get the best performance from fine-tuning. The overfitting problem was handled via validation: when we split the data, we split it into three parts, training, validation and test. The technique mentioned here is very common, and the majority of researchers apply it for fine-tuning: 80% of the data is used for training purposes and 20% for testing, and of the 80%, 90% is actually used for training and 10% for validation. We used the validation part of the dataset to analyze the validation loss: we applied the trained model to the validation set, monitored the validation loss, and observed when it stopped decreasing and started to increase continuously. We concluded that five to six epochs are enough for this problem; I will show in the next slides how we reached this conclusion. The fine-tuning process is standard: we apply tokenization and then train the transformer model with a classification layer. As for catastrophic forgetting, since we unfreeze the layers of the transformer while fine-tuning, we have to carefully control the learning rate so that the model keeps its previous learning while acquiring the new knowledge related to threatening content and target identification. We tried several learning rates, given here: at 3e-4 and above, fine-tuning Urdu RoBERTa leads to convergence failure, and we obtained the best performance with a learning rate of 2e-5, which helps to handle the problem of catastrophic forgetting.
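Under the assumptions above, the fine-tuning setup could look roughly like this with the Hugging Face Trainer; the checkpoint name is a guess at a public Urdu RoBERTa model, not necessarily the one used in the paper:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "urduhack/roberta-urdu-small"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2)  # threatening vs non-threatening

args = TrainingArguments(
    output_dir="urdu-threat-clf",
    learning_rate=2e-5,             # 3e-4 and above led to convergence failure
    per_device_train_batch_size=8,  # best run: sequence length 64, batch 8
    num_train_epochs=5,             # validation loss grows after ~3 epochs
    weight_decay=0.01,
)

# train_ds / val_ds would be tokenized datasets built from the 80/20 split,
# with a further 90/10 train/validation split inside the training part;
# validation loss is monitored per epoch to catch overfitting.
# trainer = Trainer(model=model, args=args, train_dataset=train_ds,
#                   eval_dataset=val_ds)
# trainer.train()
```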
The next point is which baselines and comparable models we used, so that we can fairly assess how our proposed model performs. Only one study existed in the prior work on the same problem of threatening content and target identification, but the problem with that approach was its annotated dataset: their dataset was not actually a threatening-content dataset but an offensive-content dataset, which they used for the threatening identification problem. As I already described, we designed a new dataset annotated at two levels, threatening versus non-threatening and target identification, and we reproduced their benchmark results on our dataset to compare fairly with our proposed methodology. We also designed new comparable models, so that we could compare our proposed methodology against enough alternatives: the latent semantic analysis and bag-of-words feature engineering approaches considered in the benchmark, plus word n-grams, character n-grams, and fastText embeddings, combined with state-of-the-art machine learning models, because these models have already demonstrated significant performance on related NLP tasks. And these are the performance measures used to evaluate the classifiers. Coming to the next point: here I have added the training and validation results obtained by fine-tuning Urdu RoBERTa for the threatening versus non-threatening task. You can see the two sequence lengths, 64 and 128, with three batch sizes, 8, 16 and 32, and results over five epochs: the validation loss, training loss, validation accuracy, and training time. The training loss decreases continuously, meaning the model keeps learning during training, but when we apply the trained model to the validation part, we see that for the first three epochs the model behaves appropriately, and after that the validation loss starts increasing. This is the common behavior across all combinations: with sequence length 64 and batch sizes 8, 16 and 32, and with sequence length 128 and the same batch sizes, the training loss keeps decreasing on every epoch, but the validation loss starts increasing after roughly the third epoch and keeps increasing up to the fifth. So we concluded that five epochs are enough, even four, before testing the model on the test part of the dataset, because beyond that the validation loss keeps increasing: if we trained for more epochs the model would overfit, and while it might fit the validation data, it would then perform badly on the test part. So we applied only five epochs; this was for the threatening versus non-threatening task, where we first trained, then validated, and then tested the fine-tuned RoBERTa model on the test part of the dataset. Let me now describe the results obtained from the baseline and the comparable models. Here you can see the five machine learning algorithms we applied, among them logistic regression, SVM, k-nearest neighbors and naive Bayes, applied to each type of features: word unigrams, bigrams, trigrams and their combination, then character n-grams (I added only the results that crossed a certain threshold), and you can also see the fastText, bag-of-words and latent semantic analysis performance. Logistic regression outperformed the other applied machine learning models and gave the best performance; that is one of the findings related to the baseline and comparable models.
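A compact sketch of the strongest classical baseline reported here, character n-gram TF-IDF features with logistic regression; the toy corpus is mine, while the real experiments use the annotated Urdu tweets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(5, 5)),  # char 5-grams
    LogisticRegression(max_iter=1000),
)
texts = ["we will destroy you", "have a nice day",
         "you will pay for this", "see you tomorrow"]
labels = [1, 0, 1, 0]  # 1 = threatening, 0 = non-threatening
clf.fit(texts, labels)
print(clf.predict(["they will pay"]))  # -> [1] on this toy data
```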
The other finding: as far as threatening versus non-threatening classification is concerned, among the baseline and comparable models we got the best performance with character 5-grams. The performance here is reported in accuracy, and our dataset is balanced; in addition, we also computed other metrics, precision and macro F1. Here the proposed framework for threatening versus non-threatening is compared with the baselines. You can see the performance of the various classification models, and in the proposed section we got the best performance with sequence length 64 and batch size 8: 87.5% accuracy, or 87.8% if we consider the macro F1, whereas with character 5-grams the performance is 85.83%. So the proposed framework, based not on hand-crafted features but on an automated feature generation methodology, outperformed the comparable models and even the baseline. The next part is the target identification for the threatening tweets: which target is being addressed, an individual or a group. Here I again added the training and validation results of fine-tuning the RoBERTa model, this time for target identification: the training loss, validation loss, validation accuracy and the number of epochs. If we analyze the validation loss closely, for the threatening versus non-threatening task the validation loss started increasing from the third epoch, and here, too, when we apply the trained model to the validation part of the dataset, the validation loss starts increasing from the third epoch up to the fifth. That is the point about overfitting: if we tried more epochs, say ten, the model would definitely overfit, because the validation loss is not decreasing. Then, the comparison of the baseline and comparable models for the target identification task with the proposed methodology: here we can see that, in contrast to the threatening versus non-threatening classification task, for target identification the sequence length of 128 with batch size 8 performed best, providing 83.20% macro F1 and outperforming the alternatives. (Sorry about that, we just have a couple of minutes left, if you can try to wrap up. I am just going to conclude, it's just three slides.) Okay, so, concluding the talk: we designed a significant threatening-content and target identification framework using contextual semantic embeddings, built from a pre-trained transformer model and fine-tuned to handle the ambiguity and complexity of the Urdu language. The proposed framework demonstrated benchmark performance in comparison with the comparable models and the baseline, and, on top of that, it is based on an automated feature generation technique rather than hand-crafted features, and the transformer model can capture the actual context of the language being used to threaten someone. The findings could be helpful for law enforcement organizations in identifying this kind of unwanted material, that is, threatening content and its target, in Urdu. As for future prospects, in my view there are three points. First, if we tackle the interpretability of models trained and tested on low-resource languages, we will face issues, because each low-resource language has a different way of creating
context to describe an opinion. The next issue is that the definitions of the different forms of hate speech overlap, so we need an appropriate categorization of these various types of hate speech for low-resource languages; as long as the definitions overlap, a classification system cannot be fully effective or efficient. The third future prospect and challenge is that, as I described earlier, people often use multiple scripts for a single language to express their opinions, so there is the problem of code-mixing, and designing an efficient code-mixed content identification system is not an easy task. I am not talking about simple code-mixed content identification; I am talking about an efficient one, and that is a challenging task. That's all from my side. Any questions from the audience? Mohammad, thank you very much for a nice presentation on a topic that is of course very important for modern NLP. First, I would just like to ask if there are any questions from the audience. Let me then just start. The formulation of your task is classification: you cast the toxicity and hate speech problem as classification. But what about alternative formulations? You mentioned that hate speech always has a target; what about detecting precisely what the target is, and what exactly the insults or other attributes of the particular hate speech are? Or what about generation of hate speech using LLMs, and preventing it? Could you comment on these alternative directions of work? I used the hand-crafted features and also the language models; basically, the problem is not hate speech in general but a type of hate speech, threatening content, and I applied both types of models, hand-crafted feature generation techniques and language models. Yes, that's what I mean, but the question is not about how you solve the problem: we need classifiers for hate speech and toxicity, but maybe these machine learning systems should also act differently, say as tagging systems over every token, so technically not as classifiers but as sequence taggers, or as systems rewriting toxic speech into something non-toxic, which would be a kind of machine translation or sequence-to-sequence task. What do you think? My topic was just classification; I was not considering the sequence-to-sequence setting. That's why I used the annotated dataset, so that the classifier has the opportunity to learn the exact context of the language, and it is a specifically language-bound classifier, for Urdu in the Arabic style. We cannot say that this particular classifier can be applied as-is to other low-resource languages: for other low-resource languages we have to consider the language context, what preprocessing and other resources are available. And it is a monolingual approach, not a multilingual one; although I have also designed a multilingual approach, for low-resource languages monolingual approaches are usually designed, so it is specific to the particular language and script. Okay, I'm sorry to interrupt and cut this short, but we are really out of time because of the schedule shift, so I suggest you just show your contacts, and those who are interested can contact you about this work and ask questions directly. Okay, thank you very much.
And we proceed to the next session, chaired by Andrei Kruz. Okay, thank you so much, and thank you to the organizing committee and Dr. Dimitri, who provided me this opportunity.

Okay, so this is our fully remote session, and it's also streamed on YouTube, so it's not limited to those present in Zoom. The first talk is about automatic aspect extraction from scientific texts, by Elena Bruches and Tatiana Batura, and as far as I understand, Anna will be presenting. Is that true? Can you unmute yourself? Yes, can you hear me? Thanks, yes. So the floor is yours; can you try to share your screen?

Hello everyone, my name is Anna, and I'm going to present our paper, which is called "Automatic Aspect Extraction from Scientific Texts". As the number of published research papers increases, there is a growing need for tools that automatically extract information from them. For example, we might need to extract such information as the task of the research, the authors' contribution, the used methods and the conclusion of the study. We suggest calling these main points of a paper its aspects. However, even though Russian is among the languages most commonly used in science, there are only a few aspect extraction tools for Russian, and most of them focus on particular domains such as medicine or computer science. To address this, in our research we aim to create a cross-domain dataset of Russian-language scientific texts and to propose a tool for automatic aspect extraction from Russian scientific texts of any domain.

Let's start with the data. It contains 200 abstracts of papers from scientific domains, namely psychology, physics, medicine, mathematics, computer science, linguistics, journalism, pedagogy, law and history. In these texts we identified four types of aspects: task, contribution, method and conclusion. Here is an example of an annotated text. As you can see, aspects do not cover the whole text, and aspects can be nested; for example, in this case task and method are nested inside the contribution aspect. Overall we identified 836 aspects, almost half of them being the contribution aspect. This might be due to the fact that abstracts are written to give an idea of the authors' contribution. However, in some domains the conclusion aspect prevails, for example medicine, with papers describing results of clinical studies, or history, with papers describing results of archaeological expeditions. As for the task aspect, we discovered that some domains, especially in the humanities, do not talk about tasks as such but rather about some problematic issues or research objects, so it was decided to attribute these to the task aspect as well. As for the method aspect, it is most usually mentioned in papers on the natural and exact sciences. The average length of an aspect is 12 tokens, but it strongly depends on the aspect type: task and method are rather short and are expressed in short terms or phrases, whereas conclusion and contribution are rather long and are expressed in full sentences or clauses.

Let's move on to the algorithm. For this task we fine-tuned a BERT model on a multi-class, multi-label token classification task. For each token we select up to the two most probable aspects whose probabilities are higher than a threshold, and if none of them are higher than the threshold, the token is not assigned to any aspect at all. After that, neighboring tokens assigned to one aspect are united into spans.
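A minimal sketch of the decoding step just described: per-token sigmoid probabilities, at most the two labels above a threshold, then merging same-label runs of neighboring tokens into spans. The aspect names come from the talk; the tensor shapes and the 0.5 threshold are assumptions.

```python
import torch

ASPECTS = ["Task", "Contribution", "Method", "Conclusion"]
THRESHOLD = 0.5  # assumed cut-off

def decode(logits: torch.Tensor):
    """logits: [num_tokens, num_aspects] from a BERT token-classification head."""
    probs = torch.sigmoid(logits)                     # independent per-label scores
    token_labels = []
    for p in probs:
        top = torch.topk(p, k=2)                      # up to the two most probable aspects
        token_labels.append([ASPECTS[i] for v, i in zip(top.values, top.indices)
                             if v > THRESHOLD])       # may be empty: token gets no aspect
    # Merge neighboring tokens that share a label into (start, end, label) spans;
    # a token may sit in two spans at once, which is how nesting is represented.
    spans, open_spans = [], {}
    for t, labels in enumerate(token_labels + [[]]):  # empty sentinel flushes open spans
        for label in list(open_spans):
            if label not in labels:
                spans.append((open_spans.pop(label), t, label))
        for label in labels:
            open_spans.setdefault(label, t)
    return spans

print(decode(torch.randn(8, len(ASPECTS))))
```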
To these spans we apply some restrictions to enhance aspect boundary detection; these restrictions usually remove unnecessary words from, or add missing ones to, the extracted aspects. Finally, aspects expressed as nominal phrases are put into the nominative case. Here is what we get as results: on the left is an example of automatic aspect extraction, and you can compare it to the manual annotation on the right. In this case the model performed quite well, but not perfectly; for example, the extracted conclusion is rather incomplete, but it still expresses the main point. To find the best model we conducted a number of experiments, which included using different models, freezing weights and putting extra layers on top, but the best results were shown by multilingual BERT fine-tuned on our data with just a linear layer for classification. The fact that a multilingual model outperformed monolingual specialized models is quite surprising, but to explain it some new experiments are needed, and we plan to conduct them in the future. For now, these are the metrics for the best model. The best extracted aspect is contribution, as it is the most frequent aspect in the dataset, and the worst extracted aspect is task, which might be due to its nature. Apart from metrics for individual tokens, we used the exact match ratio, which is lower than the other metrics, so we still have some problems with aspect boundary detection. Finally, we conducted cross-domain experiments to see how our model performs on unseen domains: for each experiment we used ten domains to train the model and one to test it, and the obtained results showed that our model is able to generalize to new domains. So, as a result, in our study we created a cross-domain dataset of Russian-language scientific texts with manual aspect annotation and proposed a tool for automatic aspect extraction from Russian scientific texts of any domain. The code and the dataset are available at this link. Thank you for your attention.

Yes, thanks a lot. Since I hear some applause from Yerevan, I'm not sure, maybe there are some questions from the audience on site before we get to questions from the audience online? At least I do have a question. You can proceed. Okay, right, so again thanks for the talk. My question is mostly related to the dataset that you created, especially the aspects that you chose to use. I'm looking at the dataset right now in your GitHub repository, and it's not quite clear what the difference is between contributions and conclusions, because it's obvious that, not sometimes but often, the conclusion of a paper contains the contributions, right? And I'm looking at some examples where I myself would not be entirely sure whether something is a contribution or a conclusion. So how did you choose between these two labels when annotating?

We mostly identified as contributions something that the authors have done, when they write that they have proposed something, researched something, and so on, and we identified conclusions where they write about the actual conclusion that was reached during their research. Take one of the examples from the linguistics subset: one sentence is labeled as contribution, and then comes a sentence labeled as conclusion, so "the authors have shown something" is labeled as conclusion, and then another sort of description of what the authors have done is labeled as contribution. For the conclusion aspect we also used some marker words and so on.
And we mostly identified as conclusion the clause which follows the main sentence; I mean, in such cases it is more obvious that this is a conclusion. When we are talking about something like the first sentence you mentioned, the way I see it, in that sentence they write that they propose to consider the problem in this way.

You report the average inter-annotator agreement in the paper, but do you maybe remember what the agreement was for these two aspects, I mean contribution and conclusion?

We did not measure agreement for pairs of aspects; it was just measured for the whole dataset.

But I guess you have the numbers for each specific aspect, because you report the average inter-annotator agreement over all four aspects, so if you averaged it, it means that you have four estimates, I mean four values.

Well, yeah, I guess, but I just don't think that I paid attention to the intermediate results; it was just averaged. And I think that maybe I should have paid attention, because there might be some interesting discoveries about which pairs of aspects are most often confused.

Okay, I see, thanks. I believe the dataset that you released is definitely going to be very useful. Anyway, any questions from the on-site audience? I guess you can just come to the microphone and start speaking, because I don't see you. Well, if you come to the microphone it will be better. No? Okay, there are no questions from the audience for this one, so let's thank the speaker again. Thank you, thanks a lot.

So the next talk is supposed to be "Prompt-Tuning for Targeted Sentiment Analysis in Russian" by Yuliana Salomatina and Natalia Lukashevich.

Okay, I'm going to present our paper, which is called "Prompt-Tuning for Targeted Sentiment Analysis in Russian". First of all, we need to define what targeted sentiment analysis is and how it differs from general sentiment analysis. In fact, it is often important to take into account the relationships between the participants of a situation, for example "X offended Y" and so on, and this is what targeted sentiment analysis is about: we have a target, and we have some attitude that is being expressed towards it. There are very few studies on this topic using Russian-language material. It is important to mention that targeted sentiment analysis is particularly relevant for news discourse, and news texts are more difficult to analyze in terms of sentiment than, say, reviews, not only because of the target aspect but also because of the predominance of neutral polarities: journalists always try to be as neutral as possible, and that's why some sentiments are expressed implicitly, which means there are no expressive sentiment words, just some underlying meanings, some facts, that can be read as sentiments. Earlier this year the RuSentNE competition was organized, and the task was targeted sentiment analysis on Russian data. The current study applies prompt-based learning to this task. For the backbone model we used ruRoBERTa-large, and the experiments were based on the question-answering approach, which means that the task was formulated as a question in natural language, fed into the model, and the expected output was the class label: positive, negative or neutral. It is important to mention that what worked best was fine-tuning and prompt-tuning combined.
Let me briefly overview the methods that we implemented in our experiments. First of all, we are all very familiar with the fine-tuning approach. The prompting approach was suggested more recently; the idea behind it is that when we fine-tune the model for the downstream task, we can formulate a prompt in such a manner that the downstream task becomes very similar to the pre-training task, and this can boost the model's performance. A problem with this kind of tuning is that it is hard to choose the prompt manually, because changing a single token in the prompt can significantly affect the result. That is why we can tune the prompt just as we tune the whole model, or, if we deal with a very large model like GPT-3, we can tune the prompt instead of fine-tuning the model. Later, many modifications of this approach were suggested. For example, in prompt-tuning with rules we can mask not only the class label but also some other tokens that help explain the task to the model better, and then aggregate the predictions, composed in a conjunctive normal form, to get the final class label. Another approach is knowledgeable prompt-tuning; it deals with the verbalizer. A verbalizer is basically a list of class names which are mapped to the class labels, and these names are predicted by the language modeling head. Knowledgeable prompt-tuning means that we predict not only the class names but also some words related to them, which are extracted from external linguistic resources. In this study we implemented manual prompts and the prompt-tuning approaches that I just described, and also the approach of mixed templates, which means that some tokens in the prompt are fixed and others are tuned.

As for the data, it was pre-labeled with named entities, then labeled with sentiments and with relationships between the named entities, and for the competition it was preprocessed: for example, only sentences with non-contradictory cases were used. "X is loved by Y but hated by Z" is a contradictory case, and such cases were excluded. Also, we dealt only with the sentence level, not the document level. For evaluation in this research we conducted three-fold cross-validation and then tested our model on the RuSentNE split provided by the organizers of the competition. The baseline model for the competition was used without any modifications, and the results are presented on the slide. Apart from the plain F1 score, an F1 score averaged across only the positive and negative classes was used, since these two classes are of particular interest for this task. I would also like to acknowledge the authors of the OpenPrompt paper, since they released a very helpful tool for implementing different prompt-tuning strategies out of the box, which makes it easy to apply any prompt-tuning approach to a downstream task just by loading a pretrained model from Hugging Face and formulating the template of the prompt.
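A minimal sketch of how such a question-answering prompt with a manual verbalizer can be assembled in OpenPrompt, following the toolkit's documented interface. The English prompt text, the bert-base checkpoint and the label words are placeholders standing in for the Russian setup used in the talk.

```python
from openprompt import PromptDataLoader, PromptForClassification
from openprompt.data_utils import InputExample
from openprompt.plms import load_plm
from openprompt.prompts import ManualTemplate, ManualVerbalizer

plm, tokenizer, model_config, WrapperClass = load_plm("bert", "bert-base-cased")

# "How do they feel about X?" with a masked slot for the class word.
template = ManualTemplate(
    tokenizer=tokenizer,
    text='{"placeholder":"text_a"} How do they feel about {"placeholder":"text_b"}? {"mask"}',
)
verbalizer = ManualVerbalizer(
    tokenizer,
    classes=["negative", "neutral", "positive"],
    label_words={"negative": ["bad"], "neutral": ["normal"], "positive": ["good"]},
)
model = PromptForClassification(plm=plm, template=template, verbalizer=verbalizer)

data = [InputExample(guid=0, text_a="The mayor was applauded by the residents.",
                     text_b="the mayor", label=2)]
loader = PromptDataLoader(dataset=data, template=template, tokenizer=tokenizer,
                          tokenizer_wrapper_class=WrapperClass, batch_size=1)
for batch in loader:
    logits = model(batch)   # fine-tune with cross-entropy over these class logits
```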
The first series of experiments is manual prompt-based. Here we tested two variables: the prompt type and the way we handle class imbalance, since, as I already mentioned, news texts contain many neutral polarities and not many positive or negative ones. We used a prompt that, given the target word, asks roughly "How do they feel about X?". For handling class imbalance, we first tried not doing anything special; then we tried calculating class weights for the loss function; and we also tried augmenting the data via back-translation and via replacing some tokens with contextually close ones. The mixed template is presented on the slide: "soft" marks the tunable tokens, while "text" marks the fixed tokens, which stayed the same during the whole training stage, together with the verbalizer class names. Then we implemented prompt-tuning with rules, and here we also tested how the initialization of the prompt affects the result: in the first template we tried to focus on the fact that there is a participant of the situation mentioned in the context, from which the model can derive the attitude, and in the second template we emphasized the fact that the sentiment can be expressed both implicitly and explicitly. The last approach is knowledgeable prompt-tuning, for which we utilized the RuSentiLex lexicon. In the first strategy we collected words for the verbalizers in such a way that the negative and positive classes could overlap, since some words can have different meanings, and hence different sentiments, in different contexts; in the second strategy we made sure that there is no overlap and that the positive class is always prioritized, since in the previous experiments we saw that the results for the positive class are always worse than for the negative one. It means, for example, that words like "shame" or "murder", which have a negative connotation, go to the negative class, and so on.

The results are presented here. We can see, first of all, that no method of handling class imbalance showed good results, and some of them, like augmentation, are very time-consuming and do not give any performance boost, so they are not really suitable for the task. The prompt-tuning approach works much better, and the model that showed the best result during our experiments is knowledgeable prompt-tuning with the first verbalizer strategy. We selected this best model and evaluated it on the data split from the competition, and the results are comparable to the third place. However, the models at the top of the leaderboard leveraged ensemble methods, and those methods are extremely computationally intensive, because they are ensembles of transformers. So the prompting and prompt-tuning approach works really well not only in terms of quality but also in terms of computational cost. To sum up, in the current study we researched the task of targeted sentiment analysis in Russian; we tested different strategies for both hard and soft prompts; we saw that prompt-tuning, along with fine-tuning, surpasses vanilla fine-tuning and manual prompting; and the best model was the one based on knowledgeable prompt-tuning. That's it, thank you for your attention; if you have any questions, please feel free to ask them.

Thanks a lot, Yuliana. Are there any questions in Zoom? For a change, let's start with the online audience: you can raise your hand or just unmute your microphone, and, as I said, if there are any questions from the on-site audience, come up to the mic. Meanwhile, maybe a very silly question from me: since you use prompting anyway, why did you decide to use BERT-like encoder models rather than generative decoder or encoder-decoder models? And don't you think that using generative models would improve the performance of your approach?

Thank you for your question.
Yeah, I was considering that, but I first started with the ruBERT model, because this was before the whole ChatGPT story, and BERT models were more suitable for classification tasks and also better matched our computational capabilities. Then the experiments showed that ruRoBERTa gave even better results. Then ChatGPT appeared, and I also ran some experiments with ChatGPT which were not very successful; why they were not successful is a topic for another conversation. But yes, I think it is possible that decoder-only models, when prompt-tuned, can be very suitable for this task. As far as I know, though, none of the participants of the competition used decoder models; maybe they also conducted some experiments. Thank you, I think there is room for improvement and for further experiments on that.

Yeah, thanks. And what about encoder-decoder models like T5, etc.? You didn't try them either? No, I didn't try them either. Okay, thanks. Okay, thank you. I guess we are way out of time, so we should probably move on with the next talk, but thanks again, Yuliana, that was very interesting. And the next talk is "Comparing Sentiment and Semantic Features for Forecasting Market Trends"; I guess Andrei Zaichenka is here in Zoom.

So again, greetings everyone. I will introduce to you our paper comparing sentiment and semantic features for forecasting market trends. In recent years, several studies have applied deep learning and NLP techniques to financial data, including news articles, social media posts and quarterly financial reports, in order to predict price movements. Despite its significance, most researchers rely on sentiment analysis as the primary additional feature, leaving little exploration of the potential of the semantics and context hidden in a text. In this paper we aim to fill this gap by testing the hypothesis that semantic features are important for stock price prediction. Our approach uses sentence embeddings extracted from Twitter data to capture extra information and contextual relationships with financial market trends, and we then compare our approach to traditional sentiment-based solutions to evaluate its performance.

In the introduction to the paper we made a claim about the existence of contextual information inside the text that can be utilized and retained using the embedding approach. To demonstrate the validity of this claim, we created a vector representation of the text using a state-of-the-art sentence-transformer model, mapping each sentence to a 384-dimensional dense vector space. After creating the embeddings, we proceeded with vector clustering and used the BERTopic topic modeling technique to create a hierarchy. Andrei, is it supposed to be that you still show the title slide only? Oh, no, no, no. Yes, you can see on the slide the resulting output of the topic clustering; in this case these are the results for the Google tweets. It produced a list of 20 topics, denoted by the distinct circles, which were later grouped into four main large clusters of topics. Next slide. Now we connect these topics with market data and observe multiple time periods with high and low user activity. During low-activity periods we can see proportionally equal spikes of the clusters' topics reacting to volatility changes, and they are caused simply by the increase in numbers.
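A hedged sketch of that embedding-and-clustering step, assuming the sentence-transformers MiniLM model (which produces 384-dimensional vectors, matching the talk) and BERTopic; the tweet list is a placeholder, and a real run needs thousands of documents.

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Placeholder corpus: a real run needs thousands of tweets per company.
tweets = ["$GOOG beats earnings expectations", "Google Cloud outage reported", "..."]

encoder = SentenceTransformer("all-MiniLM-L6-v2")      # 384-dim sentence vectors
embeddings = encoder.encode(tweets)

topic_model = BERTopic(nr_topics=20)                   # reduce to ~20 topics, as in the talk
topics, probs = topic_model.fit_transform(tweets, embeddings)
print(topic_model.get_topic_info())                    # topic sizes and keyword labels
# topic_model.hierarchical_topics(tweets) then groups topics into larger clusters
```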
On the other hand, high user activity leads to great diversity in topic reactions: all of them have their peaks and lows, meaning that the larger sample helps us distinguish topic trends, resulting in a correlation that is at least twice as large, as you can see in the tables below the graphs. This further speaks in favor of our ongoing hypothesis that a great amount of extra information is hidden inside the text semantics and can be used to predict stock market volatility. Next slide.

Now we proceed to an overview of the overall scheme of the conducted experiment. It consists of four main steps. First is data retrieval, where we collected a five-year dataset published on Kaggle containing tweets regarding the Tesla, Apple, Amazon, Google and Microsoft companies. Then comes data preprocessing: we deduplicated some tweets in order to clean the data. Then covariates aggregation: depending on the experiment, we either used a binary sentiment score or created the embedding vector and used it as a feature; for the target variable we chose the close price. The last step is model prediction: after the covariates are fed into either the TFT or linear models, we train them and make a three- or five-step-ahead historical prediction on the validation dataset, depending on the experiment. Next slide.

In order to prevent overfitting and produce better results, we introduced a custom loss function, DMSE. It extends one of the most popular loss functions for regression tasks, the mean squared error, with a directional component. This was done because plain MSE only focuses on the difference between the true and predicted price, while in stock price prediction the direction of the price movement is actually a more important factor than the value itself; that's why the custom loss was introduced. Next slide.

Depending on the experiment, three main groups of covariates were used, containing market, sentiment and embedding features; further on they are denoted by abbreviations where HLOV stands for the market features, S denotes the sentiment score, and E is for the embedding vectors. Next slide.

Two models and one baseline approach were used in the experiments to predict the closing stock price: the temporal fusion transformer, NLinear, and a naive seasonal approach. TFT is an encoder-decoder transformer model that makes use of multi-head attention, GRNs and LSTMs, while NLinear is a simple linear-layer approach introduced in 2022 that outperformed multiple transformer models on time-series benchmark datasets; that's why we used it as a kind of benchmark in the experiment. Next slide.

During an elaborate exploration of the sentiment score we obtained results confirming that social network sentiment is a great indicator for stock price movement prediction. In the figures on the left we show the stock price against the sentiment score calculated in two ways: first as the ratio between negative and positive tweets, and second as the fraction of negative tweets in the total amount. Visually we can observe a great deal of resemblance between the price and sentiment variables; in this case Apple, shown above, has a lower correlation than Amazon, below: for Apple the correlation was around 20%, while Amazon has 40%. When we compare stock volatility with the sentiment score, in the figures in the middle, we observe the same phenomenon: the correlation is higher for Amazon, by 52%. For Apple we can clearly see the time lag between the public reaction and stock volatility: volatility precedes the public sentiment shift.
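The exact form of the DMSE loss described above is not quoted in the talk; one plausible reading, mean squared error plus a penalty on steps where the predicted movement points the wrong way, might look as follows, with the weight alpha being a pure assumption.

```python
import torch

def dmse(pred: torch.Tensor, true: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """pred, true: [batch, horizon] predicted and actual close-price sequences."""
    mse = torch.mean((pred - true) ** 2)
    d_pred = pred[:, 1:] - pred[:, :-1]        # predicted day-to-day movement
    d_true = true[:, 1:] - true[:, :-1]        # actual day-to-day movement
    wrong_way = (torch.sign(d_pred) != torch.sign(d_true)).float()
    penalty = torch.mean(wrong_way * (d_pred - d_true) ** 2)
    return mse + alpha * penalty               # alpha trades value accuracy vs. direction

print(dmse(torch.randn(4, 5), torch.randn(4, 5)))
```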
But for Amazon the situation is slightly different: we can see that the sentiment changes are synchronized with, and in some cases even precede, the volatility shift, making sentiment a better predictor of price movement there. Using a scatter plot we can further observe a linear dependence between sentiment and volatility for both companies, although for Amazon the dependence is more prominent. We therefore claim that there is a clear statistical relationship between the observed values. Next slide.

Here we present a table of the resulting error scores, the symmetric mean absolute percentage error (SMAPE), for all companies. Overall, the TFT model performed better than NLinear for all companies. The best-performing model for Apple is TFT with embedding vectors as a feature: comparing the closest configurations, we observe at least a 30% decrease in SMAPE. It is also important to point out that for Apple the sentiment score actually didn't help to improve accuracy for either of the models. With Amazon, on the contrary, we observe different behavior given the input features: both NLinear and TFT received a performance boost from the sentiment score, while the embedding vector actually yielded higher error values. Models with embedding vectors showed the best accuracy in only two cases out of five, for Apple and Microsoft, which were the companies with the lowest correlation between volatility and sentiment, as we noted in the previous experiments; for Amazon, Google and Tesla the sentiment score outperformed our embedding-vector approach. Next slide.

The above results could not lead us to a complete conclusion, so we experimented further with a smaller prediction window of three days. Another variation was a different sentence embedding algorithm, Microsoft's MPNet, which has twice the number of dimensions compared with the approach mentioned previously; however, the accuracy of the closing-price prediction dropped significantly for our model. In the table we observe that for some metrics the embedding approach still shows better performance, but the difference is insignificant, and the sentiment approach obtained better results for four out of five metrics, as in the case of Apple. Another company with nearly identical results for both feature sets was Google, where the differences are negligible, only 0.04% in MAPE. For Amazon and Tesla, sentiment again proved to be the better feature, scoring higher on all of the metrics. Both of these additional experiments further demonstrated that sentiment, as a baseline solution, still performs better than the proposed embedding-vector approach. Next slide.

The results of this study provide further evidence in support of sentiment analysis as an effective tool for predicting price movements in financial markets: the binary sentiment polarity extraction approach outperformed sentence embeddings in terms of accuracy and training time. In some cases the embedding approach proved useful on the five-day prediction window, outperforming the sentiment baseline solutions; this suggests that the choice between binary sentiment polarity extraction and sentence embeddings as the preferred approach may depend on the specific task and the prediction horizon, as well as on the effectiveness of sentiment as a predictor in the given context. In the majority of the conducted experiments the sentiment approach outperforms the embedding-vector method. This might seem counterintuitive, because embeddings appear to encompass more valuable contextual information; however, sentiment tends to represent information in a more concise way, bringing less noise into the prediction model.
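For reference, one standard formulation of the SMAPE score used in these tables; the talk does not spell out the exact variant, so this is an assumption.

```python
import numpy as np

def smape(true: np.ndarray, pred: np.ndarray) -> float:
    """Symmetric mean absolute percentage error, in percent."""
    denom = (np.abs(true) + np.abs(pred)) / 2.0
    return float(np.mean(np.abs(pred - true) / denom) * 100.0)

print(smape(np.array([100.0, 102.0]), np.array([101.0, 100.0])))
```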
Nevertheless, the embedding approach still has the advantage that it does not require an additional model for sentiment extraction, nor the consequent quality verification of that procedure; sentence embeddings can produce results similar to sentiment extraction while retaining more of the semantic and contextual information contained in the text. On the other hand, the model training time with sentence embeddings is significantly longer. These findings suggest that sentence embeddings could be considered a robust solution after further work, due to their performance being similar to sentiment extraction. Thank you for your attention; if you have any questions, feel free to ask.

Yes, thanks a lot. We have time for probably one question. Any questions in the audience, either online or offline? Please just speak up.

I have a question, thanks for the talk. I'm not familiar with the stock price prediction field: what do related works usually do, I mean, what additional features do they use? I guess the sentiment polarity score should be somehow well known for stock price prediction.

Yes. Most of the current approaches use sentiment polarity extraction from multiple resources like social networks, and some of them use financial reports in order to predict long-term price movements, but the usage of the text itself is not a well-researched topic. And even if we talk about sentiment polarity extraction, most such models perform rather badly when tested on real data: when we try to apply these models to real stock market movements, they perform well only on historical data. It is actually a very under-researched topic, because price movements are rather random and it is very hard to find robust solutions for predicting prices. Okay, thank you.

Thanks. I believe we have to move on to the next and last talk; thank you again. And the next, and actually the last, talk of this session, in fact the last talk of the NLP track of AIST this year, is about whether large language models learn at the inference stage, by Leon Kulikoff, Ilya Makarov and Radislav Neyshev. I see Radislav is here in Zoom; can you say something, just to make sure... Yes, we can hear you. Sure, definitely. So here we go, you should be up and running now. Yes, please go on, 15 minutes.

Thank you, Andrei. Hello everyone, and thank you for hosting me here today; unfortunately I was unable to join in person, but I hope next time I will. We have a lot of discussions on recent advances in NLP, especially in generative models, including transformer-based architectures, and with my colleagues we tried to analyze why they actually perform better in some cases when we provide some additional information during the inference stage, the so-called learning-and-reasoning effect at inference. The main goal was not to propose some new approach to make things better, but to understand, explain and share with other people the pipeline of how to make it work, and to analyze the main reasons why this happens. So, just in case, the main question is: how do these models, especially transformer-based models like the GPT or BLOOM families, "learn" (in quotes, of course) during the inference stage? They incorporate some additional information which wasn't present during the training stage, with no changes either to the architecture or to the model parameters.
In addition, we tried to cover these two questions, because they are quite widely known but not that strictly defined: first, what is the learning-and-reasoning effect, and second, do they, and by "they" I mean large language models, actually learn and reason during the inference stage, or is it simply intelligent-looking behavior? There are, of course, a lot of papers on this topic; I have only brought up those that seemed most relevant for this particular research, but the original paper accompanying this talk cites more than 15 different sources, and they are all important. The first is the paper that introduced chain-of-thought prompting in large language models, in the same setting we use; the second is the original paper on GPT-3; and the last one, quite old compared with the others, is a paper on few-shot learning in machine translation from back in 2018, from which we take a few relevant ideas.

Let me first formulate the hypothesis, which we will try to support with several experimental and literature results. It is the following: large language models, by which we assume all models that have a big enough parameter space (I will define "big enough" a little later), create some inner language space that contains not only the language itself, its grammar, semantics and so on, but also some patterns and rules which are implicitly, not explicitly, embedded in this space. So LLMs, instead of learning something new during the inference stage, simply adjust their "state of mind" to particular, already learned behavior trajectories. The main idea of this hypothesis, once again, is that the majority of the information is learned and embedded into this space during the training stage, while during the inference stage the examples, the reasoning chains of thought and so on, only provide instructions that calibrate the behavior of the model.

I will be brief on the problem overview, because I understand that everybody at this conference is well aware of the setting, but just in case: we assume we are given a sequence of tokens from a finite dictionary, which contains a description and maybe one or several examples of the desired behavior on the problem at hand, and we assume that these behavior examples help solve the problem, be it classification, regression, text generation, whatever.

First of all, let's look at the data size and the model size for which we actually observe this learning-and-reasoning effect, because if you take the GPT implemented by Andrej Karpathy in 300 lines of code, you will definitely not see any such effect, and that's fine. According to the data we checked in several sources (the full list, once again, is available in the paper itself), the training corpus should contain at least 300 billion tokens, because otherwise it is too small, and the models should have billions, not millions, of parameters: approximately 33 to 50 billion, although even 7 billion seems to be enough for certain types of tasks, for example if we are speaking only about machine translation. And speaking about the paper on unsupervised machine translation and how it is related to the current talk: translation systems show us that
if we have two different models trained for language modeling on two different languages, we can align their internal spaces using only a small amount of labeled data. What I mean is the following: we can have one language model, or just a word embedding model like word2vec or fastText, for the first language, and another one for the second; on the slides I drew them as two clouds of points. The idea is this: we assume that all languages are grounded in the same real world out there, and we know that the words "sun" and "sky" are much more closely aligned, they usually come together, than the words "sun" and, I don't know, "hedgehog" or something else. So when we provide some examples of words that directly translate into one another, for example the word "gato" in Spanish and "cat" in English, we simply adjust these clouds of points against each other, and then we can get close to the state of the art with almost no data aligned between the two spaces. The assumption is the following: these models have already created some internal spaces, and we simply align them using several examples labeled across the first and second spaces; once the spaces are aligned, translation can be performed rather well.

Okay, that was the idea from machine translation, and later we tried to carry this idea over to chain-of-thought reasoning, because when we provide some chain-of-thought reasoning to the model, or provide just an example to make the model work better, we see that the model indeed starts working better, and the reason might be the same: we might simply be adjusting the internal state, the internal language-and-reasoning space of the model. Because the model is overparameterized, containing billions of parameters, this might be what causes the better behavior. There is an important note here, which follows from our observations and experiments and somewhat supports our results: when we provide examples of how to act on the upcoming problems, like classification, text generation in a specific form, or arithmetic problems, even examples with wrong reasoning help. That is, we provide an example indicating that the model should follow some path of reasoning, but the example itself is incorrect, with arithmetic errors or broken logic, and even such examples help the model perform better and achieve better results. But if we change the order of the reasoning, for example shuffle the sentences within these examples, the whole sequence breaks, the model's quality drops, and its behavior is not as expected.

Okay. To provide some experimental support for our claims, and not only refer to external research papers, we ran an experimental setup with five different prompting scenarios. The first one was absolutely no prompt: we simply ask the model to answer our question, as in zero-shot learning, for example to compute something, say how many computers there are. We also had four additional prompting scenarios. The first is standard prompting: we give a simple demonstration, as in one-shot learning, for example "Question: there are nine computers in the server room; five more computers were installed each day..." and so on, "and the answer is 29"; after that we give the model another question and ask it to continue the sequence and generate the answer, as we usually do with the language modeling approach.
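A toy sketch of the space-alignment idea from the machine-translation digression above: learn an orthogonal map between two embedding clouds from a small seed dictionary (orthogonal Procrustes) and translate by nearest neighbour. The random vectors stand in for word2vec/fastText embeddings, and the 50-dimensional space is an illustrative reduction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 1000                            # toy dimension; real embeddings are ~300-d
src = rng.normal(size=(n, d))              # "Spanish" cloud (e.g. "gato", ...)
W_true = np.linalg.qr(rng.normal(size=(d, d)))[0]
tgt = src @ W_true                         # "English" cloud (e.g. "cat", ...)

seed = np.arange(400)                      # a few hundred labelled translation pairs
U, _, Vt = np.linalg.svd(src[seed].T @ tgt[seed])
W = U @ Vt                                 # best orthogonal map for the seed pairs

query = src[777] @ W                       # map an unseen source word across
sims = tgt @ query / (np.linalg.norm(tgt, axis=1) * np.linalg.norm(query))
print(int(np.argmax(sims)) == 777)         # True: the spaces are aligned
```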
The second scenario is chain-of-thought prompting: we provide a question and then a chain-of-thought sequence, so we supply not only the answer but the reasoning chain as well. The third scenario is invalid reasoning: the same as chain-of-thought, but the chain is actually incorrect while still keeping the correct sequence of steps, so we still have the original nine computers and the per-day installations, but with arithmetic errors in the steps and in the answer. And finally, irrelevant prompting: the demonstration question is about something, say computers or vegetables, while its answer is just random, coming from another example. So we had five scenarios, and we used a couple of open- and closed-source models, because we wanted to check the behavior both on state-of-the-art models like the GPT family and on open-source models. Since this work was mostly performed during the spring, we don't have the most recent models like the second LLaMA and so on; among open-source models we mostly focused on BLOOM, using models from half a billion up to 176 billion parameters, including BLOOM, MT0-XXL and Quancao, and of course GPT-4, as it is the current state of the art.

I will show the results table, since it is much easier to see there, but the main picture is fairly simple. Models of small size, including BLOOM up to seven billion parameters, could not extract any useful improvement from chain-of-thought, either correct or incorrect, so they are not present in the table, and MT0 was also rather inefficient when we provided chain-of-thought reasoning, while BLOOM-176B, Quancao and GPT-4, and also the GPT-3.5 and GPT-3 models like davinci-002 and davinci-003, showed useful improvements. The first two columns correspond to the open-source models, while the last three correspond to the OpenAI models, which are closed, and we report the percentage of correct answers in the five scenarios: no demo, that is, no prompting at all, then standard prompting, chain-of-thought, invalid chain-of-thought and irrelevant chain-of-thought. We can see that BLOOM's behavior on arithmetic reasoning did not improve at all with any of the examples except standard prompting and chain-of-thought, which improved it a little. For Quancao the result seemed a little surprising: without any demo it performed much better than with standard prompting, but chain-of-thought and invalid chain-of-thought improved the score, which once again supports the hypothesis that the order of the prompting matters more than its correctness. Speaking about the OpenAI models, we did not use them with no prompting at all, only with standard prompting and the chain-of-thought variants, and we can see that invalid chain-of-thought prompting doesn't break their behavior and sometimes even improves it for some reason; maybe that is a consequence of the tested dataset not being big enough, although for all the other models this size was quite sufficient, as the results were rather stable. Even when we provide irrelevant prompting, it may improve, or at least not degrade, the GPT-like models, speaking of OpenAI. Last but not least, GPT-4 provides great results out of the box, and with prompting we can either achieve the same behavior or even break it, despite using helpful prompting such as standard prompting or chain-of-thought with correct examples.
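To make the five scenarios described above concrete, here is a small illustration using the talk's server-room example; the wording of the demonstrations is paraphrased, not taken from the paper.

```python
# Five prompting scenarios: no prompt, standard, chain-of-thought,
# invalid chain-of-thought (same steps, wrong arithmetic), irrelevant.
QUESTION = ("There were 9 computers in the server room. 5 more computers were "
            "installed each day, from Monday to Thursday. How many computers "
            "are now in the server room?")  # expected answer: 29

demo_q = "Q: Leah had 32 chocolates and her sister had 42. How many in total?\nA:"
scenarios = {
    "no_prompt":        f"Q: {QUESTION}\nA:",
    "standard":         f"{demo_q} 74.\n\nQ: {QUESTION}\nA:",
    "chain_of_thought": f"{demo_q} 32 + 42 = 74. The answer is 74.\n\nQ: {QUESTION}\nA:",
    "invalid_cot":      f"{demo_q} 32 + 42 = 84. The answer is 84.\n\nQ: {QUESTION}\nA:",
    "irrelevant":       f"{demo_q} The answer is 15 vegetables.\n\nQ: {QUESTION}\nA:",
}
for name, prompt in scenarios.items():
    print(f"--- {name} ---\n{prompt}\n")
```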
The same was observed with question answering and reasoning on the bamboo bill test: we can see exactly the same behavior, where chain-of-thought either improves everything a lot or doesn't change it for GPT-4; invalid prompting in this case can break it a little, but the results stay really close to correct prompting, while irrelevant prompting breaks it a lot.

So the conclusions might be the following; I decided to break them into the couple of questions we formulated at the beginning. First of all, what is the learning-and-reasoning effect? According to the research we surveyed, we assume that the learning-and-reasoning effect is more about finding similar reasoning patterns in the latent space the model created during the training stage; the examples, whether chain-of-thought or just few-shot demonstrations, show the model how to calibrate itself, how to find the appropriate projection of its own parameter space for the problem at hand, so as to solve it better. So it looks like the application of pre-existing knowledge rather than the acquisition of anything new. And speaking about the idea of large language models learning during the inference stage: I cannot say that they do not learn during the inference stage, but so far we have not found any clear evidence that they actually learn something new there; rather, we can assume that they exploit their already existing knowledge. That, once again, is supported by the results we observe when we provide incorrect prompting while preserving the desired sequence: when the model simply follows the path we provide, it achieves better results even if the examples are wrong. So that's about it; if you have any specific questions, please go ahead.

Just from my side, sorry, you're welcome: this area is rapidly evolving, and I understand that new results might make parts of this a little incorrect literally any day, because we got into the really tricky area of trying to explain why all these models work. So I would be glad to hear any questions, including ones that question the correctness of this paper, because I want the discussion to be useful for all of us.

Thank you, thanks a lot. This sort of concludes the AIST NLP session, but maybe we still have time for one quick question. I guess people on site and online are anticipating the closing of the conference, but still, any questions? Okay, just a brief one from me. I'm in fact a bit confused about your final claim, or statement, that large language models don't learn anything in few-shot scenarios. Isn't it sort of trivial? I mean, does anyone actually claim that models learn anything in a few-shot scenario? Of course they don't, because the weights are not updated.

Okay, maybe I can refine this formulation a little. We're not speaking about learning in the classical paradigm, where we update the model weights or add another adapter; we merely meant that the models do not acquire any new knowledge. If they are unable to solve some problems, if they were not trained to solve them during the training stage, they won't acquire this ability even if we provide them with useful prompts and examples. So we can only make the model solve problems it has already seen during the training stage, maybe in a slightly different scenario, but we cannot generalize it to unseen regions of the feature space or the problem space.
Well, but we do know that we can. I think this claim sounds to me a bit self-supporting, and it's a little bit obvious, right?

Yeah, okay, then it might be a little bit obvious. The main reason we performed this research was the curiosity to find out whether they do or do not, because when we started, a little less than a year ago, there were several examples like feeding the Iris dataset to ChatGPT, which was able to perform the classification at the inference stage, given some examples, improving the quality of the answers a lot. So we tried to make a somewhat bigger overview, covering a couple of different papers and different approaches, to support or refute the claim that it is not learning during the inference stage. So yeah, we're not trying to provide any novel result in the sense of an astonishing result; we're trying to prove, a little more formally, that no, they're not learning yet.

Okay, thanks a lot. I guess it's time to thank the speaker again and close the session. Thank you, Radislav. And now I step down to leave the stage for the closing session.

Thank you, Andrei, for chairing. So, this is the 11th edition of the AIST conference. First of all, we're very glad that many people made it offline and that many nice and really high-quality talks were presented during these two days. The first nomination, the award in the natural language processing area, goes to Anton Alekseev, Sergei Nikolenko and Gulnara Kabaeva for their work on the Kyrgyz language. With this, I would like to ask Anton to come and make a short one-minute presentation highlighting their work.

Well, first of all, thank you for the honor. I guess now I really have to continue the research on this topic, to make the dataset even more fine-grained and more justified. The whole purpose of this work was to create a multi-label topic classification dataset for the Kyrgyz language, which is essentially the first dataset for an applied NLP task in Kyrgyz. The overall idea behind such a task is that there is an urgent need for a dataset to find out whether multilingual models work for the Kyrgyz language, and, as we've shown, they do to a certain extent, outperforming the very basic baselines by a large margin. The work is going to be continued, now for sure, and more interesting works, I hope, are to come, because over this year, or maybe a year and a half, a large community of volunteers working on Kyrgyz has developed. So this is it, and I'm pretty sure there will be a special time for that, but may I add a personal remark: I would like to thank the organizers who made this conference happen again, and of course our fabulous hosts, who managed to do everything perfectly despite the trying times. Thank you very much.

The next section is computer vision, and the award goes to Razan Didoa, Andrey Galichin, Pavel Astashev, Dmitry Dylov and Oleg Rogov for their work on deep-learning-based lung pathology localization and classification in X-ray images. I'm not sure whether Oleg is here for the first time; he arrived just this morning and has already got an award, so please.

This is probably because yesterday I was giving a speech regarding artificial general intelligence, I believe. I think here all the credit goes mostly to Razan Didoa, who is now a PhD student at the Tensor Networks Lab at Skoltech. We all know that attention is all you need, but sometimes you only look once, so we decided to combine these approaches, and
eventually we found an architectural approach to address a very important medical task, chest X-ray trauma detection in hospitals. We eventually got past the preclinical trials, and we developed an approach that combines state-of-the-art techniques in object detection, gradient attention mechanisms and shifted-window blocks. Well, I think that, as Andrei Kolmogorov once said, really new things lie between the trivial and the incomprehensible. Thank you.

Thank you very much. The next award, in social network analysis, goes to Sergei Sidorov, Sergei Mironov and Alexey Grigoriev for their work on limiting distributions of the friendship index in scale-free networks. Please say a couple of words about your work.

Thank you, a big surprise for me; thanks to the organizers. The friendship index has been studied a lot in the social sciences, but it hasn't been given enough attention in network analysis, so here we did some extensive research on the friendship index, and it is a continuation of our earlier work. We've studied a lot of its distributions: how it is distributed, what its limits are. Actually, I think it is now time to put the friendship index away and move on, because it's not a great measure, it's not an all-solving one, but I hope you liked my talk and all this work. Thank you very much.

The next award, in machine learning, goes to Vladimir Berikov for his work on ensemble clustering with heterogeneous transfer learning. Vladimir, please say a couple of words.

Thank you very much, it's very surprising for me. In this work, the idea is to use some additional information which can give insight into the analysis of the target data. The algorithm is based on finding useful meta-features, while the data from the two domains are quite different: the features differ between the domains, so we should find some structural properties of the data and transfer knowledge from one domain to the other. Thank you very much.

Last but not least, the final award goes to Dmitry Ignatov for his theoretical work on the maximum antichain of partitions and related counting inequalities, and I really hope that Dmitry will now decrypt what his contribution is.

Thank you. It is a bit unusual, being a committee member and receiving this kind of award, but I always wanted to contribute something worthy to the theoretical section, and it was a pleasure for me that the committee evaluated my work on its scientific merits. I was also happy to apply data mining and machine learning techniques to problems that were posed by renowned mathematicians like Gian-Carlo Rota, well known in combinatorics, Ron Graham, Harper, Canfield and Kleitman. Here I just added one small brick to our knowledge on the number of maximal antichains and antichains of partitions, in particular reducing the uncertainty in some asymptotic coefficients. So thank you.

And let me now say, once again, great thanks to our hosts. We have all probably heard about Armenian hospitality, but now we have certainly all experienced the best of it, and this event would not have happened had Habet and Amalia not done so much over this half a year or more. First of all, thanks for proposing the opportunity to host the conference and for providing all the resources and all the support for our every need; so, Habet and Amalia, let's thank them, and the Computer Science and Engineering Department of the university. Thanks a lot, and of course thanks to all our supporters from Skoltech, AIRI and the Higher School of Economics, those
who basically contributed their people's time or other resources to the conference. With this, I'm glad, and a bit sad, to conclude this edition of AIST. I hope to see you next time at the conference; stay tuned for when and where it will be, and be sure that we are working to make it happen next time.