Hello, everyone, and welcome to the 11th International Conference on Analysis of Images, Social Networks and Texts. I am Habet Madoyan, the Data Science Program Chair of the College of Science and Engineering at the American University of Armenia. I want to welcome you all to our conference; thanks for coming. Speaking about AUA, about our university, we have a number of opportunities here: opportunities for people who would like to teach and for those who would like to do research. If you are a speaker during the conference, please expect that some of our students will bombard you with questions about whether you would be willing to supervise their capstone projects and theses. So please be ready for that, and please don't say no to them whenever possible. I will be in room 323W; it's on the third floor, to the right from the elevators. If you have any questions about what kinds of collaborations you can have with the university, or in general about the educational landscape in Armenia, please stop by. If you don't find me there, you have my email here and my Telegram handle, so please add me and send me a request, and we can have a chat whenever you want. Now I would like to hand the microphone to Dmitry, another member of the organizing committee, who will talk about the AIST conference in general, and then we will go to our first keynote speaker. Thank you.

Thank you, Habet. Hello, everyone, one more time. I am here on behalf of the steering committee and of those who started it all more than ten years ago. Alexander is here, but his flight was moved three times; that's why I am replacing him. It is my duty to deeply thank all the local organizers for hosting AIST here at AUA, in Yerevan, Armenia. We are also thankful to all the program committee members. Let me also share some key facts about the conference, about its past and maybe its future. According to the Australian CORE conference ranking, it is listed as a regional conference, or maybe national, I don't remember exactly, you can check, but we believe it will become regional; we have applied for that. For the previous year's edition, in 2021, we had more than 100 participants. Traditionally, we have five or six tracks. The main track, judging by the number of submissions, is NLP, but the computer vision and data analysis tracks are also large. We have two smaller tracks, since those communities are a bit smaller: social network analysis, and theoretical machine learning and optimization. As for the selection procedure, we adopted a double-blind review system several years ago and we keep using it. This year we have more than 100 PC members from 22 countries, and the chairs of the sessions and tracks are internationally recognized experts in their areas. This year we had submissions from 15 countries, and the proceedings are usually published as revised selected papers in LNCS, the Lecture Notes in Computer Science series, and its satellite series CCIS, by Springer Nature. You can see the summary of where they are indexed, and some volumes from the past with the nice logo on them. Let's briefly discuss the acceptance rate of the conference. We received 106 technical submissions. What does "technical" mean? Some submissions can be withdrawn, for example, or desk-rejected; that's why this figure is not the final one that will appear in the proceedings.
We also have 13 posters this year, and 76 submissions remained after all desk rejects and withdrawals, which is quite the usual story, even for large conferences. The program committee decided to select 24 papers for the main volume, which results in an acceptance rate of 32%, and 22 papers were selected for the supplementary volume. It means that these papers are also good enough, but they are perhaps more like research proposals and have some room for future improvement. Let's have a look at the statistics by track, which the EasyChair system prepared for us. You can see the names of the tracks on the left, the number of submissions per track, and the effective number of accepted papers in the second and third columns. You can also see the relative acceptance rate and the number of track chairs and program committee members in each track. The committee of track chairs usually consists of two chairs per track; for some larger tracks there might be three co-chairs, but this year our brave natural language processing co-chairs managed to select the papers between the two of them; before, there were three, I believe. All of them are well-known researchers in the area, as was said. As for the organizing committee, I believe this is by far not the full list of all the people involved, but you can see the names: Irina Nikishina, Maksim Panov, Habet Madoyan, Amalia Humbertsumyan, Evgeny Tsimbalov, me, and Aleksandr Panchenko. Aleksandr decided to include some of the influential papers, in terms of citations, here. The one by Mikhail Korobov probably has the highest number of citations; it is about a morphological analyzer and generator for the Russian and Ukrainian languages. The most cited papers also include the paper on WebVectors, and I believe its first author, Andrey Kutuzov, will join us online this time. Another influential paper is on BigARTM, the topic modeling tool developed by the team led by Konstantin Vorontsov, who is quite well known in the Russian machine learning community. Here you can also see some photos from the previous edition of AIST; it took place nearby, I would say, in Georgia in 2021. We would like to acknowledge our partners and supporters: the AI Research Institute (AIRI), Skoltech, where I also work, and HSE University, in addition to the local host. So, I think it is a good time to start. Let's start. I hope it will be a pleasant event for all who are here and online. Thank you.

I would like to invite our first keynote speaker, Dr. Narine Sarvazyan, who is the William Fraser Endowed Chair Professor at the American University of Armenia. Please welcome her.

Hello, everyone. Welcome to AUA. We all know very well that Armenia is going through one of the darkest periods of its history, with tens of thousands of people losing their ancestral homes, and it is very hard for us to be cheerful hosts, but we are trying. I think Desmond Tutu once said that hope is the ability to see light despite all the darkness. So I think it is only appropriate that today we are going to talk about light, and specifically about color. That is what we are going to talk about today, and with this, let me get started. My talk will consist of four parts. We are going to talk about the limitations of our color vision, then about the basic physics foundations of this technology, then we will get more familiar with its medical applications, and finally a little bit about what my team has done in this direction in the past.
Color has been used for medical diagnostics for thousands of years. It was something physicians looked at: the color of your eyes, the color of your skin, the color of your urine. There are plenty of very cute illustrations like this in medieval books, where the disease was diagnosed by the color of different fluids or parts of the body. However, we need to realize that, as useful as it was, that information was limited to a very small spectral range, what we call the visible light range. If you look at the electromagnetic spectrum, visible light is only from 400 to 700 nanometers, and we have only a few types of receptors, which we call cones, actually only three of them, that are sensitive to certain wavelengths of light. So we are going to briefly go over the main limitations of human color vision, and then I will explain how this new technology allows us to overcome them.

As I said, one of the major limitations is that we start with very few spectral bands, so to speak: the receptors we have in our eyes, which we call cones. Many animals that are much more primitive than us, for example the mantis shrimp, have many more receptors in their eyes. But because we couple our few receptors with this enormous human brain, our human vision is actually able to recognize up to six million different shades of color. So we are talking about the combination of the initial input, which is the number of spectral channels, and how you analyze it. But in our case, that initial spectral input is very limited.

The next major limitation is subjectivity. When you go to the store and you pick a type of wood or a hair dye or anything you want, we use very subjective descriptions, which is fine when you are a buyer. But go to a dermatologist and tell them that your skin went from "Stolen Kisses" to "Hot Pants" or something like that; it is a very subjective way of describing what we have. When you go from one physician to another, or when you track the degree of redness over the course of treatment or over time, it is all extremely subjective. In addition to the inability to compare between different people and different time points, our brain's perception of color depends very much on what surrounds the object. For example, if you look at Ararat in the morning or in the evening, it might look very different to you, but in fact, if you remove the background, the color of many objects is identical; it is the surroundings that affect your perception of the color. So it is not objective. And in addition to that, we all have a different genetic composition of those color receptors. When I go with my husband to the store to pick a sweater, very often I say, oh, it's a nice green sweater, and he says, no, it's brown. So we all know that our perception of color is far from objective.

The third limitation, as I mentioned, is the limited spectral range. We are sensitive to the range from 400 to 700 nanometers, while insects or reptiles can actually see in the infrared and ultraviolet ranges. So when an insect approaches a flower, the one you just see as yellow, that little creature sees it in many more shades of color than we do. And lastly, our eyes actually need a lot of light in order to distinguish color. We all know the famous phrase that all cats are grey in the dark.
So in order for color to be perceived, you need quite a bit of light. And there are two different modalities in which light can come from an object. In the case of reflectance, we shine light on something and look at whatever comes back; it has a lot of intensity, so it is easy to see. But many objects, specifically biological ones, have another property, which is called fluorescence. This is when the light hits the object and elicits a response from the molecules in it, which emit light at longer wavelengths; that is called fluorescence. That light can typically only be seen when all the other light in the room is blocked out, and it is very hard to see with the human eye. So if these are the limitations I talked about, the technology we are going to cover, hyperspectral imaging, is able to address all of them.

Okay, so what is hyperspectral imaging? To be honest, I think from a linguistic point of view it would be more appropriate to call it spectral imaging, but somebody named it hyperspectral and the name stuck, so now we are stuck with hyperspectral imaging. It basically relates to acquiring and analyzing light in the visible, near-infrared, and ultraviolet parts of the spectrum. Spectra can be of different kinds, you have Raman spectra and others, but hyperspectral imaging is basically analyzing light in the visible range plus a bit of the ultraviolet and infrared.

The way it works is that, for an object, you acquire the information in a three-dimensional way: you have your spatial coordinates X and Y, and you add a third dimension, which is your spectral dimension, or lambda. Then, from each pixel of that three-dimensional dataset, you extract the spectral profile, basically the intensity of the signal along the lambda axis. Then you let the machine sort those pixels, based on whatever task you give it: find me two groups of pixels that are the farthest from each other, or find me all possible combinations. And then you pseudo-color the pixels that are closest to a specific spectral profile, and you get what is called a composite HSI image. When there are only a few spectral bands and they are fairly far apart, it is called multispectral imaging; when, as a result of that extraction, you get a more or less continuous spectrum, it is called hyperspectral imaging.

Okay, so how can we acquire this set of information? It depends on the type of scanning you do, and all of them have certain advantages and disadvantages. When you want very high spectral resolution, you do point scanning: you go pixel by pixel and acquire the full spectral information at each point. You can also do line scanning, which is more appropriate when your object is moving beneath the sensor, or when the camera is mounted on a plane or a drone and flies over a certain area; it goes along and acquires the spectral information from the area you want to scan. Most of what we are going to talk about today was done through what is called a wavelength scan: you have a regular camera above the object and you change a set of filters in front of it, so every time you snap an image, you do it at a specific wavelength, and you fill your cube plane by plane.
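To make the pixel-sorting step just described more concrete, here is a minimal illustrative sketch of turning a hyperspectral cube (x, y, lambda) into a composite label image by matching each pixel's spectral profile against reference spectra. The cube and reference spectra below are synthetic stand-ins, not data or code from the speaker's pipeline.

```python
# Minimal sketch: classify each pixel of a hyperspectral cube by its closest
# reference spectral profile (synthetic data; illustrative only).
import numpy as np

def composite_hsi(cube: np.ndarray, references: np.ndarray) -> np.ndarray:
    """cube: (H, W, L) intensities along the spectral (lambda) axis.
    references: (C, L) reference spectral profiles (e.g. lesion vs. healthy tissue).
    Returns an (H, W) label map: index of the closest reference for every pixel."""
    H, W, L = cube.shape
    pixels = cube.reshape(-1, L)
    # Normalize each spectrum so the comparison is about spectral shape, not brightness.
    pixels = pixels / (np.linalg.norm(pixels, axis=1, keepdims=True) + 1e-9)
    refs = references / (np.linalg.norm(references, axis=1, keepdims=True) + 1e-9)
    # Euclidean distance from every pixel spectrum to every reference spectrum.
    distances = np.linalg.norm(pixels[:, None, :] - refs[None, :, :], axis=2)
    return distances.argmin(axis=1).reshape(H, W)

# Toy example: a 4x4 image with 10 spectral bands and two reference profiles.
rng = np.random.default_rng(0)
cube = rng.random((4, 4, 10))
references = rng.random((2, 10))
print(composite_hsi(cube, references))  # (4, 4) map of 0s and 1s
```

In a real composite HSI image, each label would then be assigned a pseudo-color, which is essentially what the unmixed images shown later in the talk display.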
Over the past few years there has been huge development in the photonics field, where people came up with a smart way of splitting the image coming from the object into many areas on a large sensor, so you can get all the spectral information at once. This is basically the same type of information here; I have just highlighted the advantages of these approaches. With point scanning you get very high spatial and spectral resolution but very slow acquisition. Line scanning is in the middle, but again it is used for moving targets. The wavelength scan gives high spatial resolution, but because you need to sequentially change the filter in front of the camera, it is relatively slow. The snapshot approach can be very fast and can be used for video HSI, but because you are dividing your sensor chip into multiple squares, the spatial resolution is not that high. Spectral cameras have a wide range of prices, but generally from about two thousand to two hundred thousand dollars.

Okay, so what are the advantages? Especially when applied to medicine, this is a non-invasive approach: you don't need to introduce any contrast dye, you don't need to touch the subject, you just take an image. There is no radiation. You reveal small differences in color which the eye might not be able to see. The resolution is basically half the wavelength, so we are talking about a fraction of a micron; compared to X-ray, MRI, or anything else, that is a very high spatial resolution. And, most importantly, you can objectively quantify a color change or the difference between the colors of different objects.

What are the limitations? Well, the main limitation is that, just like our eyesight, we can only get information from the surface. It is not like X-ray, where you can go through and see your bones; the light will penetrate maybe half a millimeter or a millimeter into the tissue, but it will not go deeper. To be effective, you need to know exactly which spectral ranges matter for your specific application, that is, where you actually need to acquire this information. You also need to spend a pretty significant amount of time on post-processing, and whatever algorithms you use will affect what you get as the final image, so there is a certain subjectivity based on the algorithms or the user who processes the data.

So, in a very simple way, we are combining the advantages of the insect, or whatever lower creature's eye, which has multiple channels but a very small brain, with our eye, which has very few spectral channels but a very large brain. We use the machine to get these multiple channels, and then we use computing power to analyze the signals coming from these spectral channels.

Okay, let's move to the medical applications of hyperspectral imaging, because this is where the future lies for many medical applications. But this technology does not come from the medical field; it comes from military applications, from astronomy, from materials science. And now it is very widely used in many areas, not necessarily medical. In agriculture, many farmers in Europe or the United States use hyperspectral cameras flying over their fields, and then they can see where there is a need for watering or some kind of disease. We actually have a company here in Armenia which analyzes images obtained in the U.S.: the images are transferred here, the team here analyzes them, and the results are sent back so that U.S.
farmers can better visualize their fields. Hyperspectral imaging is very widely used in recycling, because plastics that look very similar to the eye, kind of transparent white, fluoresce very differently under ultraviolet light, so it is very easy to use it to sort plastic. You can inspect circuit boards, you can detect counterfeits, for example in different currencies. It is used a lot in art forgery detection, because the spectrum of a dye that was used in the 13th century, as similar as it might look to your human eyes, will be different from that of a contemporary paint; so using hyperspectral image analysis to find those differences really helps to identify forged art. It is widely used for food inspection: when you see tomatoes or apples going along a conveyor belt and you want to pick out the ones which are not fully ripe or have some damage, you have hyperspectral cameras above the conveyor belt which allow things to be sorted.

As for how a hyperspectral camera can be mounted: it can be mounted on a drone or any kind of aerial vehicle; again, this is an example of how it is used in agriculture. You can mount it on a microscope and then look at slides or at live cells and unmix the fluorescent labels there. Or you can just mount it with a regular camera objective and look at macroscopic surfaces: your arm, your mouth, whatever you want.

So again, this technology is already widely used in many fields, but in medicine it is only starting. You can see here the increase in the number of publications on PubMed, which may look like a big number, but compared to the overall number of articles it is actually small. So this is an emerging field, and one of the reasons I am happy that you will learn more about it is that many of the techniques you use for different applications are directly applicable here, and the technology is only bringing more and more medical applications where your expertise can be very useful.

The first handheld hyperspectral imaging devices for clinical use are just starting to appear. I didn't want to bring more gruesome pictures of real necrotic or diabetic feet, but obviously, if you have such a camera, you can look at a skin condition and see deterioration or improvement in the perfusion index and so on, something which can easily be seen on the surface of the skin. It can also be used during intraoperative mapping: when oxygen binds to the hemoglobin molecule, the spectrum shifts a little. When it shifts significantly, you get the difference between your venous and arterial blood (one is bluer, the other redder), but much smaller shifts can also be identified. This is an example where a hyperspectral camera was used to see that the surgeon was about to cut, or put a tie, over here, when in fact this is not the exact area where it should be placed; you can see that the area which needs to be dissected is actually below. Visually you cannot really distinguish it, but hyperspectral imaging can be very helpful. Another area where it is going to grow is organ transplantation, where less than one percent of the people who need organs actually receive them.
So for any organ which is excised from another person who cannot survive and transported to its final destination, it is extremely critical to know the condition of the organ. Otherwise you are going to take a person with a poorly functioning liver and put in a liver which seems to be okay, but actually this person will be dead in a couple of days. The thing is that while these organs are transported, their color, despite all efforts, starts changing because of the oxygen level and so on, and so again there is a very straightforward way to analyze this with hyperspectral imaging, through the change in color. You can see the changes not just at the level of the whole organ, but in specific areas, some of which you might dissect to avoid future necrosis.

There are a few research papers on intraoperative HSI during brain surgery. This is an example from one of the first papers. Again, these are preclinical or first-in-human clinical studies, still awaiting larger clinical trials, where they try to resect a neuroblastoma. It is a combination of input from the hyperspectral camera, and most of the paper is then devoted to the way they use neural networks to extract the maximum amount of information. As the paper concludes, the accuracy was 80%, which outperforms the state-of-the-art approaches; 80% does not sound that good to me, but I guess in this field that number is actually very good. There is now also a term called optical biopsy: a catheter or some kind of endoscope can go close to the tissue, acquire the spectra, analyze them, and you can correlate them with similar changes occurring in cancer patients with a similar pathology.

One of the most fascinating directions for me is this: a few years ago it was discovered that you can attach a hyperspectral camera to a fundus camera (the one used to look into your retina when you go to check your eyesight), and the eye is an extension of your brain. The proteins which get deposited in the brain, amyloid and tau proteins, which lead to decreasing mental capacity, Alzheimer's, and other dementias, apparently also get deposited in the retina. So simply by acquiring this information from the retinal surface, a prediction can be made that a person is starting to develop early signs of Alzheimer's, and everyone in the field believes that if you can start treatment early, you can delay the process. This is a very exciting development, because it is non-invasive and you can see the early signs of disease, but there are only one or two groups in Europe exploring it, and I think it definitely needs to be studied further.

When I mentioned that hyperspectral imaging is limited to the surface, and I guess you have already seen this, it does not mean only the surface of the skin. If you open the body in surgery, or if you can attach something like an endoscope to the hyperspectral camera, you can go inside the body and look at the surfaces there. So it is not just what is on the surface of the body, but also what is inside.

Another major development in the medical field will take time, because pathologists are probably the most conservative medical professionals and it is hard to convince them that this is the way of the future, but eventually it will happen.
You probably know that to diagnose certain diseases, a small piece of tissue is taken, stained with a specific dye, sent to histology, and then a very experienced histologist looks at the slide and says, well, there is a little bit of these cells, a little bit of this color, so I think this is this or that type of cancer or some other disease. Again, this information is, A, subjective, and B, very much dependent on the expertise of that particular pathologist. Now, using hyperspectral imaging, there are machines where you can just feed in thousands of slides like this, and they will scan them and automatically identify differences in color. Individual spectral components can then be quantified, so you can say exactly that you have a decrease or increase in the number of certain cells over a period of time, or you can compare with other pathologists who gave you a similar or different diagnosis.

Okay, so it all looks wonderful when you look at the final slides of any paper, but in my last part I want to tell you a little bit about the work we have done in my lab at George Washington University. It just so happened, because in the past I did research in the cardiac field, that the target we chose was probably one of the hardest of all: we chose the inner surfaces of the heart to diagnose using this technology. But first, let me see whether I can play some of these videos.

We are going to talk about the treatment of the most common cardiac arrhythmia, which is atrial fibrillation. It is not immediately fatal: when your ventricles start fibrillating, you drop dead because the brain does not receive any blood, but when your atria fibrillate, it does not impact you immediately. What happens, though, is that in the pockets of the atria you have blood clots accumulating, so your likelihood of getting a stroke increases five times. So when people have atrial fibrillation, first it is treated with drugs, but ultimately the best way to treat it is to go inside the heart and ablate the tissue, and there are different ways of doing that; one of the most common is radiofrequency ablation. I just want to play some videos, so it will perhaps be more interesting.

Okay, so this is your heart, and these are the atria. You can see those abnormal sources of electrical activity, which look like little stars, start to wander around randomly, and then you don't have regular pumping, blood cannot flow in a systematic way, and you form blood clots. So, the way to treat it: the physician goes into your vein, which can be in your groin or in your arm, and the catheter is inserted and goes into your heart. Because there is an X-ray machine above you, they can see the exact location where the catheter goes. There are actually two catheters: one is called the mapping catheter, which records the electrical activity from the surface of the atria (this is your mapping), and based on these mapping signals, another machine can derive the spot where the abnormal activity actually originates. Then the ablation catheter goes in and basically isolates the areas the troublemakers are coming from. In the end, the surgeon sees that pattern on the screen, and the machine reconstructs where the signals come from; you can see this red area is where the signal comes from, and that is what they need to ablate.
They don't want to ablate the entire heart or the entire surface, because then you would have a scar, and it would be hard to even get blood in because the heart becomes very stiff. So you want very targeted ablation that just removes the abnormal sources. In the end, you can see the tip of the ablation catheter going in and touching this area, and the machine records exactly where the catheter has already been, so they don't go back and ablate the same spot.

If you go into that surgical room, you really feel like you are in a spaceship, because there are like five different monitors showing all these beautiful things, a person lying there, and it all looks very sci-fi. But they still don't see the actual damage to the tissue; they only see indirect information that the catheter was there and that there is a decrease in electrical activity. That can happen because you have edema, or you are in the wrong area, and so there is still a 30% recurrence rate for atrial fibrillation ablation: the person goes home, then the problem reappears and they need to come back. So our lab decided to use hyperspectral imaging to help improve this particular surgical treatment.

First, we needed to figure out whether it is better to acquire the light in reflectance mode or fluorescence mode. Second, post-acquisition, we needed to figure out how to do the processing quickly, because again, we are talking about a beating heart, so things need to be done almost in real time. This is just raw data from our hyperspectral cameras: this is the surface as you see it under ultraviolet light, and this is under visible light. You don't see any lesions by eye, but under ultraviolet light you see three of them much better than in reflectance. The right side is the outcome of the hyperspectral imaging and the processing of this stack of data.

Biological data are extremely noisy; that is one of the problems you need to realize. This is an example: even after you normalize the signal, you can see the variability. These are 14 different individual animals with different lesions; under white-light illumination there is a lot of noise, so to speak, though you still see some trends, but in the ultraviolet range, when we illuminate with UV and take the fluorescence, it is actually a bit more consistent. So the first step for us was to figure out which illumination to use, and then we proceeded to test this technology, starting with small animals, then larger animals, and finally human tissue.

This is an example of how excised atria from a pig heart look: this is how it looks under white light, and on the right you can see how it looks under UV light. Again, the lesions can sort of be seen, but very poorly, and when you use the hyperspectral cube, this is the difference in spectra; you can see the differences are very small, only a few percent. Nevertheless, unmixing leads to a very clear pattern. Then we moved to human tissue, which is harder because it has a thicker layer of collagen on top of the atria, so it is much harder to see any lesions. We were still able to do it. This is the surface of an excised human heart; we are not murderers, we had an agreement with the transplant center in Washington, DC, so when they have a heart not suitable for transplantation, they call the lab and say, okay, come and pick it up. That is where this heart comes from.
So anyway, this is the surface of the left atrium, and you see how much collagen it has. If you strip it away and stain the tissue with the dye which identifies the ablated areas, they are there, but from the surface you don't see them. Nevertheless, hyperspectral imaging worked reasonably well (this is obviously the best-case scenario), and overall it was good enough for us to say, okay, let's proceed with something resembling a clinical device.

In addition, to our surprise, we saw the following. As I told you, with this technology we really don't see deep, because it is only the surface. But when we did the correlation between the depth of the lesion and the degree of spectral change, we saw that we actually reveal how deep the lesions are, and that goes against my physics background. After many nights of thinking about how this is possible, I realized that it is an indirect correlation: if you have a piece of steak and you apply something hot to it, and it becomes only a little bit whiter, you didn't heat it deep enough; if you apply more heat, the lesion will be deeper and the surface will get whiter. So the degree of spectral change on the surface is an indirect indication of how deep the lesion will be. The technology was surprisingly good at this too, which is important for clinicians, because they want to do transmural ablation.

We then proceeded to partner with a company and created the first hyperspectral intracardiac percutaneous catheter, and this just shows the catheter entering the heart. Now let's go through a few of the problems we encountered. If you have blood in the tissue, it is optically dense; it absorbs everything. So the question was: with perfused tissue, will we still see a difference in spectra? It is one thing to have it on the bench, another to have it in a living individual or animal. As you can see here, we can perfectly identify the different areas when there is blood inside. This is just an example of a fresh lesion and scar tissue, which again can be easily distinguished, and these are the vessels which come from the scar and feed the border area, which was a very interesting finding. So blood inside the tissue does not really interfere. But if you have blood right in front of your sensor, it is going to block everything; it is like a dark wall. So we needed to insert a balloon. We have a balloon here filled with saline: when you insert the catheter, the balloon gets inflated with saline and displaces the blood in front of it. We also needed to do things quickly, because again, it is a beating heart. We did a lot of analysis on the data, trying to minimize the number of channels we actually need, and we went from the initial 151 channels down to 3; the outcome was obviously noisier, but we were still able to unmix and see the lesions. We also realized that some of the balloon materials which are used very widely in clinical practice are themselves very fluorescent, so we had to search for and characterize different materials so that the balloon would not block our signal. I am telling you this just to say that in any application you are going to encounter difficulties which you need to solve. So, I am approaching the end of my talk.
I want to acknowledge the people in my GW lab who did these experiments on hyperspectral imaging of atrial tissue. I also want to mention that here at AUA I want to continue in this promising direction, and one of the developments we want to implement has never been done before, which is very exciting to me: we are not only going to acquire the spectral information along the emission axis, we are also going to scan on the excitation side. So we are not going to have a three-dimensional dataset like before; we are going to have a four-dimensional dataset, which is now possible with a combination of tunable light sources and a snapshot camera. You tune the light with which you illuminate, and then you acquire the full spectral output of the light which comes back from the object. We have already shown that it is a much more sensitive approach, but the work has only started, and we have faculty here at AUA with whom we hope to collaborate, hopefully also adding some new advanced image processing algorithms to this field.

So this was just a very quick overview of what this field is going to experience very soon. In red I have marked the parts where you all have more expertise than I do, because I am an experimental physiologist. Again, this is an emerging field fueled by advances in optics and advances in machine learning, and you are welcome to contact me to see how you can or want to collaborate, or just to read more about hyperspectral imaging and find somebody in your area who works on this, because this is going to be something big. And in the next, I don't know, 15 years, when you go to the physician's office, I am sure there will be hyperspectral imaging somewhere, pointed at your own body. With that, I am ready to take any of your questions.

Thank you, Dr. Sarvazyan. Do we have any questions? Yes.

Thank you for the great talk. I am not familiar with the field, so my question is kind of general. What does the multidisciplinary process look like? You have researchers on the machine learning side and doctors. Is there any annotation process? How do you usually describe your results to doctors and experts in medicine, and what is their response? Could you please talk about that?

It is very enriching for everyone; sometimes it can be fun, and sometimes it can be very gruesome. I remember we had a conversation with a vascular surgeon, and we were talking about applying our cameras to, say, a diabetic foot, or cases where they have to amputate the leg. And he said, oh, no problem, I can bring ten amputated legs to your lab and you can just measure them. Ten amputated legs lying around! So I think their perception of what is easy and our perception of what is doable or easy are very different. You know, my base education was in physics, but then I moved closer and closer to physiology. Over the past ten years I have made this journey from being just a lab researcher to collaborating with people from the company who make the new devices, and with the clinicians who need to test those devices. And I don't even want to talk about the lawyers and patents and all that; it is yet another whole field you need to learn. But ultimately, if you want to bring something you do in your lab into actual practical use, you need to do that.
So it takes a lot of time, and it takes, I don't know, several months at least, to explain to them what you can do and what they have, or what they can offer you as far as the patient population goes, and how they want to see the data. For many things where we said, okay, that's easy, let's just do it, they would say: well, I only have like two patients a year; there is no way we are going to get enough data. So that conversation is very important. Thank you.

Thank you very much for a nice talk. I have maybe a similar question to Elena's, about the possibility of interpretation, because of what you explained in the first part of the talk, this difference between human perception of color and the hyperspectral colors. As I understand it, you, or some of the researchers, use CNNs, convolutional neural networks, and they solve some tasks with a certain precision. But do you know about research or work on the interpretation of this kind of information? Because it is beyond human perception; maybe doctors feel it somehow. How is research going in this direction?

I think it is a very important question. There is always the question of ground truth, right? At this point we don't have better tools than going to, let's say, a set of pathologists who can say: this is the condition. Or, in the case of hyperspectral imaging of lesions, for example, staining with TTC, which is a dye that identifies necrotic tissue, where you can clearly see it becomes a very different color; you just take a regular image and compare the result of HSI unmixing with that image, which again is the ground truth. So at this point we don't have a better tool than what was known before and identified by a physician as the damage. We need to go from there, and then the next step is to see whether hyperspectral imaging gives you something more. Let's say, for example, there is an area where the physician or surgeon puts the knife and says, we need to cut here, but hyperspectral imaging says, no, half of the leg still has normal perfusion, stop, you don't need to cut that much. But we cannot make that step immediately, right? We can start with this and say, okay, let's have a population of patients: some of them we treat the old way, and others we treat the new way based on HSI, and we see what the outcomes are. If the HSI outcomes are better, that is justification for a larger clinical trial, and that is how it moves. You're welcome.

Thank you for the interesting talk; I will try to be brief. I have a more technical question, actually. When there is a hyperspectral image of the human heart, are all these filters, all the different wavelengths, acquired simultaneously, or filter by filter, as in satellites?

In the experiments I was showing you, we used the old-style hyperspectral camera, where you have a plain black-and-white (grayscale) sensor, and in front of it you have a liquid crystal tunable filter, which you sequentially dial, so to speak. That is why we needed to go from the 151 initial channels to just three, to make it faster, because we work with a beating heart. Now we have just acquired, for $22,000, a snapshot hyperspectral camera which can acquire all 20 channels in the millisecond range, so now you can do it much faster. The spectral resolution will be lower, but you can do it much faster.
As I was saying, these techniques are coming from the photonics field, and we will have cameras which allow you to do this pretty efficiently at basically video rate.

Yes, so my question was more about the following: if you obtain these snapshots at different wavelengths at different times, they may change a bit, because the heart is moving. Are there some processing algorithms to merge this into something static?

Yes. What is done, not only in this field but in any cardiac imaging, is called gating. In parallel you have an ECG acquisition, and you gate your acquisition to the diastolic period of the heart, when it is not moving. There is not enough light from a single diastolic period, so you sum it up over, let's say, 20 of them; you add these frames and enhance the intensity of each pixel, and that is enough for you to unmix. In the heart it is easier because it is a regularly beating organ, so you can gate to the ECG. But again, we chose the hardest object to study: there is blood in front of the surface, it is beating, you need to get into the heart, so the catheter must be extremely small, and it needs to bend, which is also a problem for any fiber optics, which needs enough bendability to go in. But eventually we hope it will be solved.

Okay, so I'll start. My name is Evgeny Orlov, and I am going to present our paper, which is titled "Paraphrasers and Classifiers: Controllable Text Generation for Text Style Transfer." First, let's talk about the motivation behind this research. The task that we are solving is text style transfer, which is an important task for products that use NLP, because it makes these products more user-centered, as it is connected with emotions. In recent years, text style transfer has seen great progress with large pre-trained language models, but they are often too big to fine-tune for downstream tasks. One of the solutions to this problem is to use methods of controllable text generation, and more precisely the post-processing group of these methods, which do not aim at fine-tuning the original language model; instead, they work in a post-processing manner, only during inference. Another consideration is that unsupervised approaches are preferable, because for many text style transfer tasks parallel data are not available, and therefore we should go with unsupervised approaches. So in our paper we adapt an existing CTG (controllable text generation) method for text style transfer, which is called CAIF, and the advantage of our adaptation is that it results in an unsupervised method. We apply ParaCAIF to a text style transfer subtask, detoxification, and we work with two languages, Russian and English.

First, let's talk about controllable text generation in general. We can say that contemporary language models have acquired the ability to generate human-sounding text, but we are still lacking some control over these models, which is needed because of downstream application specifics. We also need to control the models because they were trained on often uncleaned web data and are therefore prone to generating toxic content. There are three broad groups of controllable text generation methods. The first two actually work with the original model and somehow interact with it, either retraining or refactoring it. The third group is post-processing methods, which do not interact with the original model, and that is just what we need.
Here is an example of a post-processing CTG method; it is called GeDi. The main idea is that if the task is, for example, to generate positive text, we train two additional class-conditional language models, and during generation with the main model we combine the signals from the two class-conditional models, which results in generation of the desired class. The method that we are working with, and the one we are adapting, is called CAIF. It is close to GeDi, but the difference is that instead of a generative classifier (the class-conditional language models), it uses a free-form classifier. The idea is that during generation, at every step, we assess the possible continuation tokens with the classifier: we apply the classifier to the candidate sequences at that moment and choose the most appropriate continuation according to our goals. As you could guess, one problem with this method is that it is computationally expensive to apply the classifier to all possible tokens; the vocabulary can be very big. So the authors propose several tricks to tackle this problem. First, they limit the number of tokens that are assessed to some number j, which is set to 100 in the experiments. They also apply an entropy criterion: they suggest applying the classifier only at steps where the entropy is high, that is, at the points where it is important to guide the model.

Text style transfer, the task we apply our method to, is important because it can be used, for example, in writing assistants and chatbots, since it can alter text to your needs. The formulation of the task is that we have to change the style attribute of the original sentence to the target style attribute. There are different subtasks and datasets for this task, including toxicity, and the important thing is that, as I have said, not many text style transfer subtasks have parallel data, so we have to account for that. The particular subtask we are working with is detoxification. It is relatively new but very practical, because the internet has provided space for toxic content. As you could guess, the task is to transfer the original toxic sentence into a neutral sentence. One possible application is that if a user writes toxic content, we could, at that moment, offer them a non-toxic rewrite of the text they have just written. Speaking about the research in this area, work has been done for English; however, parallel data are lacking, and the first parallel corpus for this task was proposed only in the previous year. For Russian, much less work has been done, and last year the first such competition was organized, which was in fact the first in the world, not only for the Russian language.

Now let's talk about possible methods for detoxification. The closest method to what we propose is ParaGeDi. In that work, the authors adapt GeDi to text style transfer, and specifically detoxification. The main idea is that they substitute the original, regular language model with a language model that is capable of paraphrasing. So now let's finally talk about our method. It is called ParaCAIF, by analogy with ParaGeDi. The main idea is the same: we replace the regular language model with a paraphrasing language model. We also generate several candidates, as is common for generation tasks.
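To make the guidance step described above more concrete, here is a minimal sketch of CAIF-style classifier-guided decoding: the attribute classifier re-scores the top-j next-token candidates, and it is only invoked when the next-token entropy is high. The models here (an English GPT-2 and an off-the-shelf sentiment classifier) and the exact re-weighting formula are illustrative assumptions, not the paraphraser, toxicity classifier, or code used in the paper.

```python
# Minimal sketch of one classifier-guided decoding step in the spirit of CAIF.
import torch
from transformers import (AutoModelForCausalLM, AutoModelForSequenceClassification,
                          AutoTokenizer)

lm_name = "gpt2"  # stand-in for the paraphrasing language model
clf_name = "distilbert-base-uncased-finetuned-sst-2-english"  # stand-in attribute classifier

lm_tok = AutoTokenizer.from_pretrained(lm_name)
lm = AutoModelForCausalLM.from_pretrained(lm_name).eval()
clf_tok = AutoTokenizer.from_pretrained(clf_name)
clf = AutoModelForSequenceClassification.from_pretrained(clf_name).eval()

@torch.no_grad()
def guided_step(prefix_ids, top_j=100, alpha=-5.0, entropy_threshold=1.5, target_class=1):
    """Return re-weighted next-token log-probabilities for the given prefix."""
    log_probs = torch.log_softmax(lm(prefix_ids).logits[0, -1], dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum()
    # Entropy criterion: skip the (expensive) classifier when the LM is already confident.
    if entropy < entropy_threshold:
        return log_probs
    top_lp, top_ids = log_probs.topk(top_j)
    # Score every candidate continuation (prefix + one token) with the classifier.
    texts = [lm_tok.decode(torch.cat([prefix_ids[0], tid.unsqueeze(0)])) for tid in top_ids]
    enc = clf_tok(texts, return_tensors="pt", padding=True, truncation=True)
    clf_lp = torch.log_softmax(clf(**enc).logits, dim=-1)[:, target_class]
    reweighted = top_lp + alpha * clf_lp  # alpha < 0 steers generation away from target_class
    out = torch.full_like(log_probs, float("-inf"))
    out[top_ids] = torch.log_softmax(reweighted, dim=-1)
    return out

prefix = lm_tok("This movie was", return_tensors="pt").input_ids
next_token = torch.distributions.Categorical(logits=guided_step(prefix)).sample()
print(lm_tok.decode([int(next_token)]))
```

The same step is then repeated token by token; in ParaCAIF the underlying model is a paraphraser conditioned on the source sentence rather than a plain left-to-right language model.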
At the final step, we sort the candidates according to style transfer accuracy and semantic similarity. Here is the algorithm for the sorting; I will not spend time explaining it, but I can explain it if there are questions. The main idea is that we try to balance style transfer accuracy and semantic similarity.

In our work, we experiment with both Russian and English detoxification. For Russian, we take the data from the detoxification competition; it is a parallel dataset, and we employ the evaluation setup of this competition: the paraphrases are assessed on three metrics, style transfer accuracy, content similarity, and language fluency. As for the models used in the Russian setup, our method requires two models: the generative one, and the classifier that guides the generative model. For the classifier, we use a RuBERT-tiny model that we train on the train subset of the competition data. For the paraphraser, we explore a line of generative paraphrasers proposed for the Russian language, including a GPT-based model and T5-based models. For English, we use the test data from the paper that proposed ParaGeDi, and we employ roughly the same evaluation setup as for Russian. For the models, we use a RoBERTa classifier that was trained on one million examples (it comes from the ParaGeDi paper), and for the paraphraser we use the T5 baseline from the same paper.

Now let's proceed to the results. Here you can see the table with the Russian results. This column is a joint metric that combines all three metrics used for evaluation. We can say that all the ParaCAIF models nearly doubled the joint score. We also see that the T5 models are better at preserving the content, which is quite logical given the encoder-decoder architecture of the model. But we can also note that the performance of the ParaCAIF models remains lower than the supervised baseline, mainly because of insufficient content preservation and fluency of the output. However, in style transfer accuracy, one of the ParaCAIF models outperforms the supervised T5 baseline that was trained by the organizers. Just a side note: none of the competitors surpassed the supervised baseline proposed by the organizers.

Here you can see examples of detoxification in Russian. I forgot to say that there is an alpha parameter which controls the style strength: the lower the parameter, the stronger the style transfer. Here we display examples with alpha equal to minus five and minus one. With minus five, the style transfer is stronger, and we can see that all severely toxic words are cleared out by the model, so you cannot see any severely toxic words in the outputs. But with alpha equal to minus one, some toxicity remains, and some severe words remain in the outputs.

Next, we perform a kind of ablation study over our model: we compare plain paraphrasing with paraphrasing plus the added re-ranking, or sorting, of candidates. We can see that just adding the re-ranking makes the joint score higher.
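As an illustration of the sorting step mentioned above, here is a minimal sketch of re-ranking candidate paraphrases so as to balance style transfer accuracy (non-toxicity) against semantic similarity to the source. The threshold rule, model name, and toxicity scorer below are assumptions for illustration, not the exact algorithm from the paper.

```python
# Minimal sketch: prefer sufficiently non-toxic candidates, then maximize similarity.
from sentence_transformers import SentenceTransformer, util

def rerank(source, candidates, toxicity_of, sim_model, toxicity_threshold=0.5):
    """toxicity_of: callable returning P(toxic) for a string (any trained classifier)."""
    src_emb = sim_model.encode(source, convert_to_tensor=True)
    scored = []
    for cand in candidates:
        tox = toxicity_of(cand)
        sim = float(util.cos_sim(src_emb, sim_model.encode(cand, convert_to_tensor=True)))
        scored.append((cand, tox, sim))
    acceptable = [s for s in scored if s[1] < toxicity_threshold]
    pool = acceptable if acceptable else scored  # fall back to all candidates if none pass
    # Within the pool, maximize similarity to the source; break ties by lower toxicity.
    return max(pool, key=lambda s: (s[2], -s[1]))[0]

# Toy usage with a multilingual sentence encoder and a placeholder toxicity scorer.
sim_model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
fake_toxicity = lambda text: 0.9 if "stupid" in text.lower() else 0.1
print(rerank("you are a stupid person",
             ["you are not a nice person", "you are a stupid individual"],
             fake_toxicity, sim_model))
```

In the ablation above, it is exactly this kind of candidate re-ranking that lifts the joint score over plain paraphrasing.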
But it is not quite that simple. If we take a deeper look and perform a more fine-grained comparison of ParaCAIF and plain paraphrasing, sampling 10 candidates for every sample in the test set and aggregating the metrics over them, we can see that the overall toxicity of the plain paraphraser is much, much higher. This means that using a plain paraphraser alone is not enough to detoxify sentences. Moreover, we look at something we refer to as relative toxicity. In these graphs, the x-axis shows the source toxicity of the test samples and the y-axis shows the toxicity of the resulting paraphrases. If we fit a regression line on these graphs, we can see that the slope coefficient (you probably cannot see it here, but trust me) is lower than that of the plain paraphraser, which means that the ParaCAIF model copes better with the task of detoxifying the sentences. Moreover, the intercept coefficient of the regression line is lower, which means that the overall toxicity of the samples produced by the ParaCAIF model is lower.

Now let's take a look at the alpha parameter. Here, unexpectedly, we see a rise in style transfer accuracy with the rise of the alpha parameter. I will remind you that the higher the alpha, the weaker the style transfer, so this finding needs more thorough investigation. However, with alpha equal to minus one, we see a predictable and explainable drop in style transfer accuracy, and it goes even lower with no CAIF sampling, that is, plain paraphrasing. Next, we also compare the results by the entropy threshold, and here we also see unexpected behavior, because style transfer accuracy rises with the entropy threshold. This is, however, good news, because the higher the entropy threshold, the more rarely we apply the classifier, so it is more efficient in terms of computation. For example, the peak of accuracy is achieved at an entropy threshold of 1.5, and this sampling was 1.4 times faster than with an entropy threshold equal to zero.

Let's look at the results for the English language. Here we can see that the ParaCAIF models are also much less toxic than the plain paraphraser. We can also note that the ParaCAIF model outperforms ParaGeDi in terms of style transfer accuracy, and it outperforms the second-best baseline from the ParaGeDi paper in terms of the joint score.

To conclude, we have adapted an existing CTG method for text style transfer and illustrated its applicability by applying it to a subtask of text style transfer, detoxification, in two languages. We also note that, for Russian, this is the first application known to us of a CTG method to Russian detoxification. ParaCAIF significantly reduces the toxicity of the generated paraphrases; however, it remains inferior to supervised approaches, mainly because of insufficient content preservation and fluency of the outputs. On the other hand, ParaCAIF is more widely applicable because it remains an unsupervised approach: we do not need any parallel data to train on. We can train the classifier that guides the model on any classification data for the desired styles: we just need examples of the source style and target style, and that's all.
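For reference, the relative-toxicity comparison mentioned above boils down to fitting a line of output toxicity against source toxicity for each system and comparing slopes and intercepts. A small sketch with made-up numbers:

```python
# Illustrative sketch of the "relative toxicity" regression (numbers are invented).
import numpy as np

def toxicity_regression(source_tox, output_tox):
    """Least-squares line: output_tox ~= slope * source_tox + intercept."""
    slope, intercept = np.polyfit(source_tox, output_tox, deg=1)
    return slope, intercept

src = np.array([0.9, 0.7, 0.95, 0.6, 0.8])        # toxicity of source sentences
plain = np.array([0.6, 0.5, 0.7, 0.4, 0.55])      # plain paraphraser outputs (hypothetical)
paracaif = np.array([0.2, 0.15, 0.3, 0.1, 0.2])   # ParaCAIF outputs (hypothetical)

for name, out in [("plain paraphraser", plain), ("ParaCAIF", paracaif)]:
    k, b = toxicity_regression(src, out)
    print(f"{name}: slope={k:.2f}, intercept={b:.2f}")  # lower slope and intercept = better detoxification
```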
For future work, it would be important to assess ParaCAIF with human evaluation, because previous text style transfer research has shown that human evaluation cannot be fully replaced by the automatic evaluation we performed. Secondly, it could be beneficial to add support for beam search in ParaCAIF, because to date it works only with sampling. That would be beneficial because we have seen that just looking for the least toxic example among the plain paraphraser's outputs can already be quite good, so assessing longer candidates with beam search could be more promising, and also more attractive in terms of computational complexity compared to applying the classifier at each generation step. Lastly, it could be interesting to add support for two classifiers to the CAIF model. That would benefit CAIF itself, because we could, for example, control for two styles at the same time, and it would also benefit the ParaCAIF model, because we could assess content preservation during generation rather than after it. So that is all I wanted to say. Thank you.

Thanks for a very interesting talk. First of all, I would like to ask: let's say you want to do this detoxification in Armenian. You don't have a parallel corpus, but you have a parallel corpus for, say, Russian or English. Could you adapt your approach to this cross-lingual or multilingual setting? Would it be hard to do?

First, I can say that of course it would be better to have a corpus in the target language. However, multilingual models have shown the ability to work with new languages in few-shot or zero-shot settings. We could also, for example, look at automatic translation of the corpus.

Okay, so we are actually going to organize a follow-up shared task on multilingual detoxification at CLEF next year, so in case you would like to test some of these ideas, you are welcome. Thank you. Thank you very much. Let's thank the speaker again.

The topic of our work is controllable story generation based on perplexity minimization. Natural language generation is a field of computational linguistics that deals with the construction of computer systems which can generate understandable texts in English or other languages. Natural language generation technology has a wide range of applications, including dialogue and question answering systems, story generation, product description generation, and others. Making text generation controllable is an important fundamental issue in natural language generation. Controllable text generation, or CTG, is the task of generating natural language text that meets certain control constraints set by humans, such as topic, sentiment, keywords, and so on. There are two types of CTG, soft and hard. The aim of soft CTG is to provide the desired sentiment or topic of the generated text. Hard CTG requires ensuring that the text contains explicit constraints, for example, certain keywords. In this work we solve the hard CTG problem. The table on the slide shows an example of such generation: the first row gives the storyline, consisting of plot phrases, and the second row gives the generated text containing the plot phrases in the order they appear in the storyline. The problem statement is formulated as follows.
Given a vocabulary V and a prompt sequence X consisting of the prompt tokens, the goal of controllable text generation is to generate a target text Y with respect to a control element C by maximizing the conditional probability P(Y | X, C). The control element C can be a sentiment, a keyword, and so on. Controllable text generation methods can be classified into four categories: fine-tuning, prompt engineering, retraining or refactoring, and post-processing. Our method belongs to the post-processing category, which has the following advantages: there is no need to create a training corpus and no need to perform a training procedure, which is difficult, expensive, and time-consuming. The goals of our work are: developing a plug-and-play CTG method that allows generating stories in accordance with a user-specified sequence of guide phrases that make up the plot of the story; conducting experiments on controllable generation of stories in Russian using the ruGPT-3 Large, ruAlpaca, and Saiga models on a text corpus containing stories with extracted storylines; and evaluating the quality of the generated texts using automatic and human-centric evaluation methods. The idea of our method is as follows. First, we generate several random short token sequences leading from the prompt towards the guide phrase. Then we estimate the probability of the guide phrase following each generated subsequence. Finally, we choose the most probable subsequence. We will describe the principle of our method using the example presented on the slide. The blue color indicates the token generated at some generation step i. The orange color indicates the guide phrase to which we want to provide a coherent transition. At every step of the generation process we generate several random token sequences of some fixed length, for example three tokens long; examples of such sequences are marked in red on the slide. Then we evaluate the probability of the guide phrase following each sequence and select the sequence with the highest probability. At the next step we repeat the process of generating and selecting token sequences. The method can be applied to any autoregressive language model for which the probability of a token sequence is decomposed using the chain rule. The task of generation is to decode sequences of tokens from the distribution P, and an important component of the generation process is the decoding algorithm, for example top-k sampling or nucleus sampling. We consider a token sequence x, where x with indices from 1 to i minus 1 is the prompt, x with indices from i to i plus k is the connecting sequence, and t is the guide phrase. Theoretically, it is possible to find the connecting sequence with indices from i to i plus k using an exhaustive search over the tokens of the model vocabulary. However, such a search depends exponentially on the length of the connecting sequence and is not applicable in practice. Therefore, in order to reduce the number of variants, we propose a heuristic technique for generating and evaluating connecting sequences. First, R different token sequences of length k plus 1 are generated as continuations of the prompt using some decoding strategy. Next, for each subsequence x with indices from i to i plus k of the R sequences, the probability of the guide phrase t following it is computed as the product of the probabilities of the guide phrase tokens. Then, at the current generation step, the subsequence with the maximum probability is selected.
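To make this selection step concrete, here is a minimal sketch of a single step, assuming a Hugging Face causal language model; the model identifier, the number of candidates R, the sequence length k+1, and the top-k value are illustrative assumptions rather than the authors' exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model identifier; any autoregressive LM with a chain-rule factorization works.
name = "sberbank-ai/rugpt3large_based_on_gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

@torch.no_grad()
def select_connecting_sequence(prefix_ids, guide_ids, r=8, k_plus_1=3, top_k=50):
    """Sample r candidate continuations of length k+1 and keep the one after which
    the guide phrase is most probable (product of its token probabilities)."""
    candidates = model.generate(
        prefix_ids, do_sample=True, top_k=top_k,
        max_new_tokens=k_plus_1, num_return_sequences=r,
        pad_token_id=tok.eos_token_id,
    )  # shape: (r, prefix_len + k_plus_1)
    best_ids, best_logprob = None, float("-inf")
    for cand in candidates:
        full = torch.cat([cand, guide_ids[0]]).unsqueeze(0)
        logprobs = model(full).logits.log_softmax(-1)
        start = cand.shape[0]
        # log P(guide_j | prefix, candidate, guide_<j): logits at position p predict token p+1
        guide_logprob = sum(
            logprobs[0, start - 1 + j, guide_ids[0, j]] for j in range(guide_ids.shape[1])
        )
        if guide_logprob > best_logprob:
            best_ids, best_logprob = cand, guide_logprob
    return best_ids  # becomes the new prefix for the next selection step

prefix_ids = tok("Жил-был дракон.", return_tensors="pt").input_ids
guide_ids = tok(" дракон держит принцессу в пещере", return_tensors="pt").input_ids
prefix = select_connecting_sequence(prefix_ids, guide_ids)
```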
The generation of subsequences of length k plus 1 is then repeated. We want to fulfil the condition of the explicit presence of the guide phrase in the text, so if, after generation is completed, the guide phrase has not appeared in the text, we insert it by force; its position is determined by the maximum probability over the entire generation. After the phrase is inserted, the generation continues towards the next guide phrase. To conduct the experiments, a text corpus of fairy tales in Russian with extracted storylines was formed. The corpus is made up of fairy tales posted on the Mukadzechi Roo site with a length of no more than 5000 characters. In total, the training corpus contains 562 fairy tales. In each fairy tale, plot phrases were extracted, these being the phrases that capture the main events of the storyline. To do this, keywords and phrases were first selected in each fairy tale using the YAKE and RuTermExtract methods. A plot phrase was defined as a syntactically related four-element set, where V is a verb, O are objects related to the verb, and M is a modifier, a prepositional object, or an indirect object. The objects and the modifier were selected from the set of extracted keywords, and the verbs were determined from the parse tree. For example, in the phrase 'dragon holds princess in a cave', 'holds' is the verb, 'dragon' and 'princess' are objects, and 'cave' is the modifier. The minimal number of phrases in a plot is one; the maximum is the base-two logarithm of n, where n is the number of sentences in the text. The table on the slide shows the distribution of the number of plot phrases in the training and test corpora; the number of plot phrases varies from one to eight. We used 25 storylines from the test corpus and generated two samples per storyline; the storylines contained from one to seven plot phrases. In the experiments we used the ruGPT-3 Large, ruAlpaca, and Saiga models. The quality of the generated texts was evaluated using automatic and human-centric evaluation methods. Four measures were used for automatic evaluation: perplexity, repetition, Self-BLEU-5, and word inclusion coverage. Perplexity is calculated as the exponential of the average negative log-probability per token under a language model; a separate ruGPT-3 Medium model was used to compute it. The repetition score is the proportion of repeated four-grams in the text. Self-BLEU-5 evaluates the diversity of a given set of texts and is defined as the average overlap between all generated texts. Word inclusion coverage shows the percentage of plot words included in the generated text. Three measures were used for human-centric evaluation: coherence, relevance, and interestingness. Coherence shows whether the story is consistent in terms of causal relationships in the context. Relevance shows whether the story corresponds to the plot, that is, whether the events in the story unfold in accordance with the storyline. Interestingness shows how much the user likes the story, whether it is interesting. The proposed method was compared with three methods of controllable text generation: constrained beam search, few-shot learning, and prompt engineering; the prompts for these methods are shown on the slide. The table on the slide shows the statistical characteristics of the generated texts. The few-shot method with the ruGPT-3 model generated fairy tales on average three times shorter than the other three methods. It should be noted that when longer tales were generated, the first tale was often cut off and a new tale began.
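As a rough illustration of two of the automatic measures described above, here is a minimal sketch (illustrative, not the authors' code) of the repetition score and the word inclusion coverage; Self-BLEU and perplexity additionally require a set of reference texts and a separate language model, respectively.

```python
import re

def ngrams(tokens, n=4):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def repetition_score(text, n=4):
    """Proportion of repeated n-grams (4-grams by default) in the text."""
    toks = re.findall(r"\w+", text.lower())
    grams = ngrams(toks, n)
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def word_inclusion_coverage(plot_phrases, text):
    """Percentage of plot-phrase words that appear in the generated text."""
    text_words = set(re.findall(r"\w+", text.lower()))
    plot_words = [w for phrase in plot_phrases for w in re.findall(r"\w+", phrase.lower())]
    if not plot_words:
        return 100.0
    hits = sum(w in text_words for w in plot_words)
    return 100.0 * hits / len(plot_words)

story = "The dragon holds the princess in a cave. A knight rides to the cave."
print(repetition_score(story))
print(word_inclusion_coverage(["dragon holds princess in a cave"], story))
```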
The prompt engineering method with the Saiga model similarly generated fairy tales about two times shorter on average. The Saiga model was trained as a chatbot, so when we asked it to compose a tale, it generated short but complete tales that corresponded well to the given plot. Automatic and human-centric quality scores are presented in the table. The values of word inclusion coverage show that our method ensures that more than 93% of the words from the storyline events appear in the text, so the texts generated by our method meet the requirement of matching the storyline best. Analysis of the perplexity values shows that our method has almost the largest perplexity. A lower perplexity value makes the generated text look more natural, so the increased perplexity indicates that the control process is unnatural for the model: it causes the model to be more surprised by the tokens observed in the text. The Self-BLEU values show that our method with the Saiga model produced the most diverse texts among all methods. To compute the human-centric measures, the generated texts were evaluated by three annotators for coherence, relevance, and interestingness; the assessment was carried out on a five-point Likert scale. According to the annotators, the proposed method allowed us to generate texts that were the most relevant to the storyline. Our method performed best with the relatively small ruGPT-3 model, receiving the highest scores on all three human evaluation measures among the compared methods, although ruGPT-3 itself generated less coherent and interesting texts than ruAlpaca and Saiga. The table on the slide shows an example of a fairy tale generated by our method using the ruGPT-3 model. The storyline consists of four plot phrases, and all four appear in the generated text. The guide phrases end up at the positions with the lowest perplexity value, which seems quite logical. The experiments show that our method nudges the model to steer the content of the text towards the plot phrases. Several examples of the generated fairy tales are shown on the slides. We obtained the following results in our work. We developed a method that allows generating stories in accordance with a user-specified sequence of guide phrases that make up the plot of the story. We formed a text corpus containing stories with extracted storylines. We conducted experiments on controllable fairy-tale generation in Russian and computed various automatic and human-centric quality measures of the generated texts. The proposed method performs best with the ruGPT-3 model, receiving the highest human scores among the compared methods. For the larger models, it can be used as a complement to other methods to increase the relevance of texts to a given storyline. Thanks for your attention. Thanks for the nice presentation and perfect timing; we have time for one or two questions. Any questions here? Thank you for the great talk. I actually have a couple of small questions about evaluation. I wonder, did you have any measure of whether the model generated something irrelevant, something extra? I couldn't get it from the 14th slide. Maybe some very irrelevant text, or would you allow that for the model, because it naturally invents something? Repeat, please. I wonder whether it matters for the task at all if the model invents something really irrelevant to the plot, some plot twists that are just impossible, or something like that. Our texts are relevant to the storylines.
I also have another question about a different metric on the next slide. On this slide, how exactly did you measure coherence, that is, the causal relationships? Were the annotators assigning scores? We used a five-point Likert scale from one to five, where one is bad and five is good. Okay, thanks. Thank you for the question. Let's thank the speaker again for the good talk. Hello, everyone. My name is Polina, and I will present a paper about using taxonomic information for hyponym prediction with large language models. I will start with the definition of a taxonomy. A taxonomy is a particular case of a knowledge graph: a tree-structured lexical database based on is-a relations. Every node in the taxonomy is a set of words with similar meanings. Also, for every node in the taxonomy, all its child nodes are its hyponyms and all its parent nodes are its hypernyms. Taxonomies are used in a wide range of natural language processing tasks, and there is a need to constantly update existing taxonomies, since language changes rapidly. However, manual extension of taxonomies is infeasible: it requires a lot of human labor and deep knowledge of specific domains. There is a large number of approaches to automate this process; however, most of them are based on measuring the distance between non-contextualized embeddings, which leads to two problems. The first is that we need direct access to a large text corpus or a large set of pre-computed embeddings to capture really rare words. The second, even more important, is that static embeddings do not allow us to resolve homonymy: we cannot see the difference between identical words with different meanings. Both of these problems can be addressed with large language models. There are several studies exploring BERT's acquisition of is-a relations, and all of them show that BERT is able to predict hypernyms and hyponyms at a quite decent level. As for the approaches presented in these studies, the first is prompting, where BERT is expected to predict hypernyms or hyponyms in place of a mask token. The second additionally provides BERT with information about the taxonomy structure by projecting graph embeddings into BERT's embedding space. However, there is no such research for decoder-based models. In the current study, we propose to formulate the task of taxonomy enrichment as a task of conditional generation and apply decoder-based models to predict the child nodes of a target node. Inspired by the very high performance of decoder-based models in zero-shot text generation, we also aim to formulate a textual input that provides the model with information about the taxonomy structure. Taxonomies also store additional parameters beyond the graph structure, such as definitions or sense numbers. In the first part of our research we try to find the best form of input data that provides the model with information for the taxonomy enrichment task, and in the second part we fine-tune decoder-based language models and try to predict hyponyms. Usually, when we speak about conditional generation, we use natural-language prompts or direct instructions.
However, a natural language input isn't very suitable for our task, since we need to predict hyponyms for a very large number of different terms, so it is practically impossible to formulate a truly universal prompt. For example, even for the simplest possible natural prompt, like 'X is a ...', the mere choice of article strongly affects the expected outputs. And when we use more extended prompts that explicitly mark is-a relations, we can get inappropriate statements: a prompt of the form 'my favorite X is the diabetic X' makes no sense when X is a disease. The second problem with natural prompts is that we are still not able to resolve homonymy: we would need to specify additional context to define which meaning is intended. In order to overcome these hindrances, we propose to create an artificial input. The main idea behind it is to linearize the graph structure and present the hierarchy in the input in a flat form: we mention the grandparent, parent, and target nodes in order and expect the model to pick up the pattern and then predict the child nodes of the target node. Another advantage of this approach is that we can construct such an input automatically for any target node. There are three main node features included in the WordNet data that we use in our experiments. The first is the definitions of the terms in each node. The second is the lemmas, which are the synonyms making up the node. The third is the sense numbers that mark the order of a particular sense of the word: for example, 'bat' as a wooden club and 'bat' as an animal have different sense numbers. Based on these parameters, we create eight formats of artificial input, and you can see some of them on the slide. The first one is the shortest: it only mentions the parent of the target node. The basic one also mentions the grandparent nodes, and the most extended one contains all the information we can get from the taxonomy. We then use these eight artificial input formats to fine-tune GPT-2 and T5-base models. To evaluate our experiments, we use two datasets for each language. One of them is bigger and consists of 1,000 randomly selected pre-terminal WordNet nodes. This dataset is well suited to evaluating the taxonomy enrichment task, since it resembles the real data of this task. However, it does not allow us to evaluate how well decoder-based models have acquired hyponymy relations, because it contains very rare and specific terms which might not be represented well enough in the training data. To overcome this problem, we also created two manual datasets consisting of very easy and frequent terms, for example 'beverage' or 'cheese'. As for how we compute our metrics: for each sample from the test data we generate 50 sequences using top-k sampling, then we split the outputs by commas and sort the resulting terms by frequency. We believe this approach is more robust and reliable than greedy search. Here you can see the contents of the English and Russian manual datasets. We tried to find matching terms in both WordNets; however, this is not entirely possible due to the different graph structures, so you can also see some replacements here. Here you can see the results of the selection of the best form of artificial prefix.
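Before looking at those results, here is a minimal sketch of the linearized input and the candidate aggregation described above, using NLTK's WordNet; the exact prefix wording and field names are illustrative assumptions, not the paper's verbatim formats.

```python
from collections import Counter
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def linearized_input(synset, with_definition=False):
    """Flatten the local hierarchy (grandparent -> parent -> target) into one prompt line."""
    parent = synset.hypernyms()[0] if synset.hypernyms() else None
    grandparent = parent.hypernyms()[0] if parent and parent.hypernyms() else None
    parts = []
    if grandparent:
        parts.append("grandparent: " + ", ".join(l.name() for l in grandparent.lemmas()))
    if parent:
        parts.append("parent: " + ", ".join(l.name() for l in parent.lemmas()))
    parts.append("target: " + ", ".join(l.name() for l in synset.lemmas()))
    if with_definition:
        parts.append("definition: " + synset.definition())
    return " | ".join(parts) + " | children:"

def aggregate_candidates(generated_texts, top_n=10):
    """Split each sampled generation on commas and rank candidate hyponyms by frequency."""
    counts = Counter(
        term.strip().lower()
        for text in generated_texts
        for term in text.split(",")
        if term.strip()
    )
    return [term for term, _ in counts.most_common(top_n)]

print(linearized_input(wn.synset("beverage.n.01")))
# In practice, aggregate_candidates would run over 50 sampled continuations of this prompt.
print(aggregate_candidates(["coffee, tea, juice", "tea, milk", "tea, coffee"]))
```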
Surprisingly, we can see that the fullest and most extended input format, in terms of information, shows the lowest results. We connect this to two factors. The first is simply the reduction in the number of correct answers the model can see: as we make the prefix longer, fewer correct examples fit into the input. The second is our assumption that a large amount of unstructured information, such as definitions, makes it very hard for the model to capture the main information. As for the comparison of the models, GPT-2 shows higher recall scores, while T5 leads in precision, and from the precision-at-10 scores we can observe that GPT-2 is more sensitive to the input format than T5. For the next stage of our experiments, we use the default input format with sense numbers for English and the default format for Russian, since the Russian WordNet has no sense numbers, and fine-tune three models for each language: the first is a decoder, the second is an encoder-decoder, and the third is an instruction-tuned decoder of a larger size. Here you can see the results on the easy manual datasets for both languages: for both of them, the instruction-tuned models outperform the others by a large margin. As for the comparison between the smaller models, we get mixed results and cannot say whether a decoder or an encoder-decoder suits our task better, since GPT-2 is better for the Russian data and T5 is better for the English data. Here are the results for the big random dataset. The scores are quite low compared with the previous slide, since this data is really hard both for humans and for models. We can also see that for English, Dolly still outperforms the smaller models, but for Russian, Saiga shows lower scores than the GPT model; we connect this to the fact that the base model of Saiga is LLaMA, so Saiga has seen much less Russian lexical diversity than GPT-2. Here you can see some example predictions for English from GPT-2 and T5-large, and here are the results for the target nodes with the best precision scores. To sum up, we found that decoder-based models show a really high level of acquisition of is-a relationships, and that the most useful information from the taxonomy is the pointer to the higher levels of the hierarchy: mentioning the grandparent node is much more important than the definition for helping the model resolve homonymy and predict correct answers. However, despite the high results on the manual dataset, the low scores on the harder dataset show the need for further investigation in this direction. As for future work, we think that prompt tuning of larger models can perhaps improve our results, and we also expect that our approach can be extended both to other languages and to other taxonomy enrichment tasks, not only hyponym prediction. Thank you for your attention; I'm ready to answer questions. First, thank you for the talk. I wanted to ask what the model outputs look like. Do the models produce very specific words, or words that are off topic, or words similar to the one we are looking for hyponyms of, or just something irrelevant? No, there are no really irrelevant results, but we face the problem that the formulation of terms in the taxonomy is very specific. For example, as humans we expect 'water' as a hyponym of 'beverage'; however, there is no 'water' in the taxonomy, there is only 'drinking water', so we can get lower scores because of this.
So the metric is exact match of the word, right? Yes. Okay, maybe there is some room for exploring the metric, like semantic similarity instead of exact match. That would be nice, and we may also perform some human evaluation. Thank you. Thank you for the talk. The prompting approach works in almost all cases, but not in all of them; can you comment on what might be the reason why it degrades performance for some of the models? We do not really use a prompting approach, since we use an artificial input, so the model can rely only on its pre-training to understand what we want. With the artificial input we see large differences in scores between the models, and I think this is really connected with the amount of information the models have seen during training, so the larger models tend to perform better. So the large models don't need this input or output post-processing? No, they all need it, but since we use fine-tuning, all the models learn the expected format correctly and produce terms separated by commas, so we do not need much post-processing: we only need to split and sort. Any more questions? I have one question. To sum up, you compare two large models and a huge model with six billion parameters; can you say something about how the size of a model actually matters, given that the large models are under one billion parameters and six billion is much bigger? I think the size difference really matters for the harder dataset: on the bigger dataset, the gap in scores between the smaller models and Dolly is about twofold, while on the smaller manual dataset we do not observe such a huge difference between the smaller and larger models. Okay, let's thank the speaker. Hello, everyone. It's a pleasure for me to be here today, and I'm going to present our joint research, 'Static, Dynamic, or Contextualized', on comparing approaches to discovering semantic shifts. Basically, today I'm mostly a presenter, because the main contributor is Veronica Niganova, who is with us today on Zoom; after my talk she will also be glad to answer questions with me. Once again, the research is mostly hers, and I'm standing in because, unfortunately, Veronica was unable to come here in person. As for the plan of this talk: first we will talk about the goal and the motivation of the research, then briefly walk through the related work and the datasets, then discuss the models we explore throughout this work and the experimental setup, then look at the results, understand the applicability of the studied approaches, and discover the words that experienced the most significant semantic shifts. And of course we will talk about the future work and the possible development of the research. In this work we focus on semantic shifts, in other words on changes in word meaning, which are analyzed by studying the contexts in which a particular word is used in different time slices. Studying diachronic semantic shifts is important for both theoretical and practical reasons. On the one hand, it helps researchers in historical linguistics to understand how word meanings evolve across different time periods and provides linguists with data-driven evidence for updating and improving dictionaries. In practical terms, such models can be applied in natural language processing tasks such as information retrieval, sentiment analysis, and machine translation, for example helping to
improve the accuracy and relevance of these systems, especially with historical texts or multilingual data. The main goal of this research is to discover semantic shifts in the selected datasets, which consist of media data, and to compare the performance of different approaches. Namely, we compare three main approaches: static, represented by word2vec; dynamic, where we use dynamic word embeddings; and contextualized, where we use BERT. We apply these models to two tasks: discovering semantic shifts and detecting known shifts. These tasks are quite similar, but they pose the problem from slightly different angles. Now let us briefly go over the most important research in this field. The first work is 'Diachronic word embeddings reveal statistical laws of semantic change', where the authors use three different algorithms, namely PPMI, SVD on top of PPMI, and word2vec with skip-gram negative sampling, and compare the results. In the second paper, the model is trained on all time periods simultaneously, and a joint optimization problem is proposed that combines embedding learning and alignment. In the third paper, the authors apply contextualized models, which provide a separate embedding for each occurrence of a word depending on its context, to the semantic shift problem, namely ELMo and BERT; it is also interesting for us since it is dedicated to the Russian language. The fourth paper provides the dataset and the baseline results for the semantic shift detection task. As I have already mentioned, in our work we use two datasets, a news corpus and a social media corpus. For the news corpus we used already collected data from Lenta.ru covering the period from the year 2000 to 2019. The social media corpus was collected specifically as part of this research from the VKontakte social network: we collected posts from 2007 up to 2019. We use these datasets to identify social, cultural, and political shifts rather than purely linguistic semantic shifts, first because the time period is too short for considerable linguistic changes, and second because the nature of news and social media implies reflecting cultural and political events and processes. As was already mentioned, we used three different approaches. The first is the static one: we use the most classic model, word2vec with skip-gram negative sampling. Then we use dynamic word embeddings based on the PPMI matrix. Finally, for the contextualized approach we use the BERT model, namely the classic ruBERT-base model from Sber AI. We conduct two types of experiments: the first we call the discovering semantic shifts task, and the second is the classification task of detecting known shifts. In the first task we aim to reveal semantic changes from the data; in the second task we are given a labeled dataset, and the task is binary classification: to predict whether a word has experienced a semantic shift or not. For discovering semantic shifts we use the following pipeline. First, we train or fine-tune the embedding models on our data for the different time periods. Then we align the embeddings, mapping them into the 2019 vector space for word2vec, and calculate the cosine similarity for each eligible word between its embeddings from the first and the last year of the period. For BERT we use prototype embeddings, namely we
average the embeddings of all occurrences of a word in the given year. Finally, we obtain the top 20 words with the lowest cosine similarity between these time periods and analyze the revealed semantic shifts in terms of their validity and actuality by finding their other closest neighbors. As for the second task, the binary classification task of detecting known shifts, we take the embedding models from the previous task, retrain them if necessary, and calculate the cosine similarity for each of the words in the classification list. Then we train a random forest classifier, a classic classifier, using the obtained cosine similarities as features, and evaluate the model with quality metrics: F1 as the main metric, and we also compute precision, recall, and accuracy. Now let us proceed to the results, starting with the task of discovering semantic shifts in the news corpus. Here we see the top 20 words with the lowest cosine similarity between the studied years, which means that these words have experienced the most significant semantic shifts. Let us look at the most interesting examples. For the word2vec model there are words like 'naryad' and 'video'. At the beginning of the 21st century 'naryad' was associated with the police, while in 2019 it is used for an outfit, a beautiful dress or something like that. There is also an interesting change in the word 'video': at the beginning of the 21st century it mostly meant a TV commercial, and then it shifted towards a video clip, which reflects the spread of smartphones and the growing popularity of things like YouTube videos and reviews. For the dynamic word embedding model, the word 'polotno' in the year 2000 meant a railway track, and in 2019 it refers to a painting. BERT also captured the same kind of technological change as word2vec, but with a different word: 'kadr' was used in reference to staff and personnel in the year 2000, while in 2019, with the advance of modern technology and mobile phones, it is used to mean a photo. On this slide you can see the visualization: we can see that, for example, the word 'kadr' has shifted a lot, and here are 'polotno' and 'naryad'. As for the social media corpus, namely the analysis of the VKontakte posts, in my opinion it is even more interesting. For word2vec we see the word 'norma' and the word 'ooo'. This is really nice, because at the beginning of the period 'norma' meant something like 'fine, okay, it's normal', while at the end of the period it was used in connection with legal norms. As for 'ooo', at the beginning it was an exclamation, more like an interjection, while in 2019 it usually refers to a company, OOO being the Russian abbreviation for a limited liability company. We can suggest that this change is due to the fact that the VKontakte audience has grown up: at the beginning of the period it consisted mostly of school students, while now VKontakte is used mostly by adults. The same can be noticed for the dynamic word embedding model: the word 'frantsuzsky' (French) was at first used in connection with the school subject, the French lesson, while at the end of the period it is associated with other languages. As for BERT, we can highlight the word 'polyarny': in the year 2007 it was used in the meaning of 'alternative', and in 2019 it is used to refer to 'polar', as in the North Pole.
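A minimal sketch of the word2vec part of the discovery pipeline described above, assuming gensim and SciPy; corpus loading, preprocessing, and the hyperparameters are placeholders rather than the authors' settings.

```python
import numpy as np
from gensim.models import Word2Vec
from scipy.linalg import orthogonal_procrustes

# tokenized_first / tokenized_last: lists of tokenized sentences per time slice (placeholders)
m_old = Word2Vec(tokenized_first, vector_size=300, sg=1, negative=5, min_count=20)
m_new = Word2Vec(tokenized_last, vector_size=300, sg=1, negative=5, min_count=20)

# Words present in both time slices are eligible for comparison.
shared = [w for w in m_old.wv.key_to_index if w in m_new.wv.key_to_index]
A = np.stack([m_old.wv[w] for w in shared])
B = np.stack([m_new.wv[w] for w in shared])

# Align the old space to the new (2019) space with orthogonal Procrustes.
R, _ = orthogonal_procrustes(A, B)
A_aligned = A @ R

# Cosine similarity per word; the lowest values are the candidate semantic shifts.
cos = np.sum(A_aligned * B, axis=1) / (
    np.linalg.norm(A_aligned, axis=1) * np.linalg.norm(B, axis=1)
)
top_shifted = [shared[i] for i in np.argsort(cos)[:20]]
print(top_shifted)

# For the detection task, the per-word cosine similarities from each model can be
# used as features for a random forest classifier over the labeled word list.
```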
Well, once again, on this slide you can see the visualization of the semantic shifts for the words discussed on the previous slide. This table shows the precision scores of all the models for the discovering task, and we can see that the BERT model showed the best performance there. The second table shows the results, F1 score and other metrics, obtained for the classification task. Once again, for the second task we had a binary classification dataset and used our embedding models as features for the random forest classifier; we see that our models show approximately equal performance, with word2vec giving the best result. It should be noted that since our training corpus is smaller than the one used in the baseline research, we cannot compare the results directly with the ones from the original research, but we can note that our F1 scores are close to the baseline scores. So what takeaways can we draw from this research? Basically, word2vec is a comparatively simple but rather effective model, and the major problem with this approach is that we need to align the word embeddings. The dynamic approach solves this problem by optimizing and aligning word embeddings at the same time during training, but it is sensitive to the choice of hyperparameters and is rather memory-consuming. The BERT model provides contextualized word representations, which automatically solves the alignment problem; however, in our research the BERT model showed slightly poorer performance, and one possible reason is that it is not very good at detecting semantic shifts for polysemous words. In this work we developed a new social media corpus, compared different approaches, namely the static, dynamic, and contextualized ones, on the semantic shift discovering and detection tasks, and conducted an interesting analysis of discovered semantic shifts. In the experiments, the tested models revealed political, cultural, social, and technological changes in the Russian language, with the BERT model achieving the best quality of 80% for the news corpus and 60% for the social media corpus. While analyzing the discovered semantic changes in the social media corpus, we suggested that some shifts can be connected with the fact that a large part of the users who wrote the collected posts grew from school students into adults. There are two main directions for future research: we are planning to extend the scope of the studied models, namely to use other contextualized BERT-like models such as ruBERT-large and multilingual models, and to explore other datasets, larger text corpora, and different time slices. Thank you for your attention; now we are ready to answer your questions, and I hope that Veronica is here. Veronica, if you are here, say something. And could we please show Veronica on the screen, is it possible somehow? So my first question was about the classification task; I realized it was a binary task, so did you check whether the sense of the word had shifted through time? Yes, we did. Well, actually, it was a dataset with several words, and we had ground-truth labels of 0 and 1 for whether the word had shifted or not; we had several words with the corresponding years, whether the word had shifted from one year to another, whether there was a change,
and then we used our embeddings to assess this. Okay, thank you. My other question was more general: do you know where these results can be applied, in production or in some applications? Well, as we said in the beginning, there is practical applicability for such models, for example in information retrieval, sentiment analysis, and machine translation, when we give the models additional information about how word meanings change over time, as a helper to improve the accuracy of other models. And does it work, does it help, when you add this information to a model for such a task? Well, I didn't check it personally, but it should help, yes. Okay, thank you. The third question will be from Andrey. Thanks, Maria and Veronica, for a very interesting talk. I have many questions, and I'm not sure we will have time to address all of them. The main one is about the evaluation of the findings in your discovery step. Am I right that you asked some experts in social science or something like that? You mentioned it in the paper. Yes. But how exactly was it done? Did you show them the top 10 or top 20 most changed words and just ask these experts whether these words had really changed; was this the procedure? We gave them the words together with their five closest neighbors in each of the periods, and the decision was made primarily on this information, but broader context was also available if they wanted more information. But you showed the experts only the words from the top of the ranking? As I said, the decision was made on these five closest neighbors from each of the periods, but broader context was also available. No, my question is: did you also show the experts some random words, maybe from the bottom of the ranking? No, we didn't. Don't you think it would be more fair? Because essentially your evaluation method then measures only the precision but not the recall. Yes, we say that we use precision as the metric for this task and understand that it is rather subjective; that's why we added the second task, which is more objective and better suited to assessing the quality of the models. The first task was more about the data, to reveal the semantic shifts from the data and to see what interesting words we could find there, and to make the evaluation more objective we added the second task. Okay, thanks. And the second question is about this classification task: why did you decide to use the dataset from Fomin et al. 2019, when we now have the dataset from the RuShiftEval shared task, which is much more methodologically sound? Okay, thank you for your question. It's because they had practically the same dataset as ours, I mean the annotated dataset: they also used the Lenta.ru source, and that's why we used it. I think there would be more missing words in our data if we used, for example, the RuShiftEval dataset, because our corpus is not really that large, and this way we could best compare our results with the baseline research. Okay, thanks. It's a pity that I'm not at the conference right now, I would like to discuss it with you more. And maybe the last question: the
social media corpus that you released is very interesting, and thanks for releasing it, but what are the legal terms of use for this corpus? Is it legally possible to redistribute it and train models on it? We don't impose any conditions on using it; of course it's free from our side, and since we don't say which user posted which post, I think there are no legal concerns from the VKontakte social network, because we collected it through the free API, so I think this corpus can be used. But you didn't get in touch with VKontakte about it? Because in the past, some datasets were removed from the net because VKontakte got in touch with the creators of the datasets and made them remove the data. So you didn't get in touch with them? No, we didn't, but we didn't violate any rules when we scraped this corpus, so if they contact us, of course we'll have to remove it, but I don't think that should be a problem. Alright, thanks a lot. Now we have the last talk of this section. Hello again. I'm going to present a paper on probing titled 'Less than Necessary or More than Sufficient'. First, let's do an introduction into probing itself. We have all witnessed the success of black-box language models, and this success sparked interest in what is inside these black boxes, so the area of probing emerged. A bright example of this area is the dataset named SentEval, which became one of the first probing datasets for the English language. Probing itself is the task of detecting, so to say, the true language capabilities of a language model: we ask whether the model understands, for example, the notion of number, or linguistic case, and other things. Diverse probing studies exist; for example, there are studies that draw graphs based on experiments which display the language capabilities of the model depending on its layers. For example, with a 12-layer model we measure different domains of language depending on the layer; this picture is from SentEval. There are lots of datasets for probing, but probing datasets are quite difficult to collect, because they contain real language data and because the size of the data matters for computational reasons, as always. In our paper we propose a method called fraction probing, which is used to determine the right size of a probing dataset. It consists of two tests, which we call the data redundancy test and the data sufficiency test. The data redundancy test is used for existing datasets, to find out whether they could be smaller in size, and the data sufficiency test can be used when building new datasets, to find the point at which to stop collecting samples. The method is based on comparing probing graphs and their similarity, both visually and using computational metrics, and it builds on learning curves, about which Pavel will say more. Speaking of the related work of this study, it includes work on probing dataset size, although this area is quite understudied; sample size determination, which is used for statistical experiments and is very common when starting new experiments; learning curves, which originally come from psychology but are used, for example, for training models; and progressive sampling, which is quite close to sample size determination but is used when we are
building the sample and continue to run experiments on it. Now let's talk more about our method. In the picture on the right you can see the original SentEval and the graphs which demonstrate the capabilities of the model. For example, if we take the blue one, this is tree depth, which means that the model has to understand how deep the syntactic tree of the sentence is, and the graph displays the performance of the model depending on the layer. The way we interpret it is that the middle layers are more capable of doing this task than the first and the last ones. In these two pictures you can see what happens if we take a very small fraction of the original SentEval: this is 40% of the original SentEval, and at 40% the graphs are already quite similar to what we obtain when experimenting with the full dataset. The only question is how to compare these graphs directly. By eye we can see that this graph doesn't look quite the same as that one, while this one is quite close. So what do we do? We measure graph similarity. We keep the possibility of comparing the graphs visually, but we also add metrics: we suggest and experiment with three of them, Pearson correlation, Euclidean distance, and Fréchet distance. We apply these metrics, first, to vectors: if we have 10 tasks, we compare 10-dimensional vectors, this column with this column and so on, which lets us compare the ordering of the graphs; and we also compare the graphs themselves directly, for example this curve with this one, which gives a 12-dimensional vector of per-layer scores. We expect Euclidean distance to be better at comparing the absolute numbers of the graphs, while Fréchet distance captures both the shape of the graphs and their absolute positioning, because Fréchet distance is designed for comparing curves. It is usually explained like this: imagine a man walking a dog; they can each walk along any curve they want, they can stop, but they cannot go back, and the Fréchet distance is the minimum leash length that allows such a walk. Now let's proceed to the tricky part: how do we decide whether the graphs are similar or not? For the data redundancy test on an existing dataset, we compare all the previous fractions with the original 100% and draw a so-called learning curve: on the y axis you can see the metric that measures the distance, and moving towards 100% the metric becomes smaller, which means we are getting closer to the original dataset. The question is where to stop, and we use a form of the elbow method: if we stop at the beginning of the plateau, we are at the right place. So that's what we do with existing datasets: we continuously plot the metric for different fractions and find the elbow, the place where the plateau starts. Things are trickier for datasets that are being built at the moment, because we do not have any dataset to compare with, we do not have that imaginary 100%. So we simulate the setup of an existing dataset: we take the data we have collected so far, treat it as our 100%, and compare the previous fractions with it.
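As an illustration of the three curve-similarity metrics described above, here is a minimal sketch; the probing curves are given as per-layer score vectors, the toy numbers are hypothetical, and the discrete Fréchet distance implementation is a standard textbook version rather than the authors' code.

```python
import numpy as np
from scipy.stats import pearsonr

def euclidean(a, b):
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

def pearson_distance(a, b):
    return 1.0 - pearsonr(a, b)[0]   # shape-only similarity, ignores absolute level

def discrete_frechet(a, b):
    """Discrete Fréchet distance between two curves given as per-layer scores.
    Points are (layer_index, score); the result is the minimal 'dog leash' length."""
    p = [(i, float(v)) for i, v in enumerate(a)]
    q = [(j, float(v)) for j, v in enumerate(b)]
    d = lambda x, y: ((x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2) ** 0.5
    ca = np.full((len(p), len(q)), np.inf)
    ca[0, 0] = d(p[0], q[0])
    for i in range(1, len(p)):
        ca[i, 0] = max(ca[i - 1, 0], d(p[i], q[0]))
    for j in range(1, len(q)):
        ca[0, j] = max(ca[0, j - 1], d(p[0], q[j]))
    for i in range(1, len(p)):
        for j in range(1, len(q)):
            ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]), d(p[i], q[j]))
    return float(ca[-1, -1])

# Hypothetical per-layer accuracies of one probing task at 40% and 100% of the data.
curve_40 = [0.55, 0.61, 0.70, 0.74, 0.73, 0.69]
curve_100 = [0.57, 0.63, 0.72, 0.75, 0.74, 0.70]
print(euclidean(curve_40, curve_100), pearson_distance(curve_40, curve_100),
      discrete_frechet(curve_40, curve_100))
```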
The problem here is that we receive new points of the curve continuously, and if we just look at the graph, we cannot say whether we have already reached the plateau or not. So in addition to the raw values of the metric, we also compute its first and second differences: the first difference displays the absolute change of the metric, and the second difference displays the speed of that change. So we propose the method, and we also demonstrate its applicability. We work with SentEval, the most famous probing suite, which consists of 10 tasks of 10,000 samples each. We continuously create fractions of it, perform the data redundancy test on the existing SentEval, and also simulate the data sufficiency test as if we were building the original SentEval. We experiment with BERT and RoBERTa and use logistic regression as the classifier. The results for the data redundancy test you have already seen: at 40% of the original SentEval, the graphs are quite close to what happens at 100%. Here you can see the table of all the results for the BERT model; there are lots of numbers, but it covers both the data redundancy test and the data sufficiency test. Here you can see the three metrics that we compute, and these figures are the fractions of the original data recommended by each metric. You can see from this table that the SentEval data could actually be massively reduced, although the exact numbers differ across tasks. Here are some conclusions from this big table. The visual method shows that each task could be reduced without losing its explanatory power. The tasks differ in how they behave as their size increases: for example, for word content you can see that the absolute numbers of its curve change rapidly across the different fractions, whereas other tasks do not show such behavior; we call this score growth, so there is a group of score-growth tasks and a group of tasks with no score growth. When applying the metrics in a task-wise manner, we conclude that the Fréchet distance shows the lowest mean error, while Pearson correlation does not really look at the absolute numbers of the curves, so if we only want to compare the shape and the absolute numbers are not relevant for us, we can use Pearson correlation. For the layer-wise way of applying the metrics, we see that preserving the ordering of the curves requires more data than preserving their shape. For the data sufficiency test, simulated on SentEval, we see that the discrete differences consistently recommend higher fractions, so in a way they are stricter than the original metrics; however, they are highly correlated with the original metrics, and the results of the data sufficiency test simulated on SentEval resemble the results of the data redundancy test on the actual SentEval. Another observation is that the second difference produces less error than the first one, which can be explained by the fact that it is less strict and looks at the speed of the change itself. I have already shown you this graph; looking at the particular linguistic SentEval tasks, we can divide them into two groups: the first group can be reduced to the minimal fraction of 10% for both BERT and RoBERTa, and the second group requires more data.
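Returning to the first- and second-difference criterion mentioned above, here is a minimal sketch of how such a stopping rule could look; the tolerance value and the exact rule are illustrative assumptions, since the paper determines the plateau visually.

```python
import numpy as np

def recommend_fraction(fractions, distances, tol=0.01):
    """Pick the smallest fraction after which the distance curve stops changing much.

    fractions: increasing data fractions, e.g. [0.1, 0.2, ..., 1.0]
    distances: curve-similarity metric of each fraction vs. the (current) full dataset
    tol:       illustrative threshold on the first difference of the metric
    """
    d1 = np.diff(distances)            # absolute change of the metric
    d2 = np.diff(d1)                   # speed of that change
    for i, step in enumerate(np.abs(d1)):
        if step < tol:                 # the curve has (roughly) reached its plateau
            return fractions[i], d1, d2
    return fractions[-1], d1, d2       # no plateau found yet: keep collecting data

fracs = [0.1, 0.2, 0.3, 0.4, 0.5]
dists = [0.30, 0.18, 0.09, 0.085, 0.083]    # hypothetical distances to the full dataset
print(recommend_fraction(fracs, dists)[0])  # -> 0.3 with these toy numbers
```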
Interestingly, this division into two groups cannot really be explained by the linguistic content, the linguistic sense of the tasks, so this needs more thorough investigation; but one thing we can already note is that the standard classification parameters remain relevant: the word content task you see here is different from all the other tasks because it has a massive number of classes, 1,000, compared to the other tasks, which have two or three classes. We experiment with two models so that we can compare them, and we see that RoBERTa consistently requires more data. There could be different explanations for this fact; we go with the explanation that RoBERTa was developed on the basis of BERT and therefore has more high-quality data encoded in it, and thus needs more probing data to find out what is inside. RoBERTa similarly needs more data to preserve the ordering of the tasks: these numbers are higher for RoBERTa than for BERT. To conclude, we propose a novel method for determining the right size of a probing dataset. It consists of two tests: the data redundancy test, which is applied to existing datasets to find out whether they are actually bigger than they need to be, and the data sufficiency test, which is applied to datasets that are being built at the moment. We experiment with SentEval and apply the method in both setups. For further work, it would be interesting to look more deeply at the learning curves we work with in our method: we call them learning curves, but they are not the actual learning curves usually meant by this notion, since they are created from artificial data, and it would be interesting to find out whether they follow the inverse power law that was shown for the usual learning curves. Another important point for future work is to create a numerical definition of the plateau, which for now we determine by the visual method, and in some cases it is tricky to find out where the plateau is. And another direction for further work is to apply these methods to other existing probing datasets, because our results imply that the existing probing datasets could actually be smaller, or even much smaller, than they are at the moment. Thank you. Thank you. Is your methodology for estimating the sufficient size of datasets generalizable to other tasks, let's say not only probing but other classification tasks? Because I think it would be quite valuable for pretty much every task. Thank you for your question. It is applicable: I could say that it is applicable to any probing task that draws curves, but developing on your question, I can say that the method can be applied to any task that produces vectors. The novelty of the proposed method is that we come up with a means of comparing the shape and the relative positioning of the curves, so, answering your question, the method can be applied to any task that produces vectors of numbers, and that is a pretty generic setting. Did you search for some alternative methods which may exist in the literature for estimating the size of datasets, maybe not specifically for probing, but in general for machine learning? Do such methods exist? Yes, they exist, and it is actually a big classical area, determining the size of a dataset for machine learning, but those methods are usually applied to tasks that produce either one number or
just a classification label, while our task of comparing curves is much more complex in that way. More questions? Okay, let's thank the speaker and go for lunch; we'll come back at 3. Okay, dear colleagues, I think it's time to start the session, please take your seats. Alright, in this afternoon session we start with our second keynote talk. This talk will be given by Hakim Hasid, who is a principal researcher at the Technology Innovation Institute in Abu Dhabi, United Arab Emirates, and also an honorary professor at Macquarie University in Australia. He will be talking about HCI, or some movement towards HCI. You're welcome. Thank you, Maxim, thank you everyone, good afternoon. The afternoon talk is always complicated right after lunch, so we'll try to make it a little bit light, as much as we can. Initially I was planning to talk about, and I will still be talking about, HCI, but it was more of a classical talk where I wanted to touch on the different methods and strategies that we use in HCI. Then I thought that was slightly too classical, so I tried to gear the talk towards something more industry-inspired. We have been meeting with many industry people these last weeks, and I wanted to share the small experience we have had in relation to HCI and generative AI models, or LLMs. I hope this will be useful for the different profiles we have here in the room and may inspire some ideas on topics that could eventually be treated at the fundamental level. Just before I start: I'm coming from TII. TII is a research institute located in Abu Dhabi; it's the research arm of ATRC, the Advanced Technology Research Council. TII is composed of the 10 research centers that you see here, covering materials, autonomous robotics, biotech, space, directed energy; we are in the AI center, which is in the digital science research center. These days we are focusing a lot on LLMs, large language models, but we are not doing only that: we are doing other work related to image processing, fundamental theory with Maxim who is here, work related to the edge, and so on. My presentation is organized this way: some discussion about generative AI and edge machine learning, with two aspects, one related to inference on the edge and the other to learning on the edge, on which we are trying to focus; then three use cases we are working on, just to illustrate this edge AI work; and we will finish by concluding on the future of generative AI. The logic behind my talk today is basically to bring together these big generative models and also open the door to edge AI, hoping to convince some people here that having bigger and bigger models is not necessarily the right path to follow; there are other options out there. As for the generative AI that is making the buzz today, it is not new, and I'm sure everybody in the room is aware of that: the RNNs and the LSTMs that preceded the transformers are probably older than many of the people in this room. We have seen a lot of evolution in recent years after the transformers, but the transformers were not the only reason: the computational power, the physical layer, became much more capable, which allowed the execution and exploitation of these models. Now we see the different models that we keep hearing about: we have LLaMA, GPT-4,
and we have Falcon, which comes from TII, for example. Looking at the past: AI is not a new thing, it has been there for a long time, and I think it goes together with computer science and computing in general. But if I focus on the period starting from the end of the 80s, we started looking into artificial intelligence more closely, hoping that we could build really intelligent systems; at some point, though, the objective was too big for what the systems, the equipment, the physical layer could provide at that time. You may remember the expert systems of that era, when people were promising that those systems would solve all the problems we have at work and in industry; that didn't happen, it fell short, I would say. Then we had machine learning, which came in the mid-90s, where we focused more on statistical analysis and simpler programs. Then we got deep learning; again, I think deep learning came at the right time because the physical layer had become much more capable. And now we are talking about generative AI, which is basically AI that is able to generate content. It is very important to keep in mind that it generates content, which means it cannot do, for example, reasoning, to which I will come back later. You have different players in different domains; we will not go through them. It is the combination of deep neural networks and much higher computational capability. We got into this competition, and we see it almost every day: who will build the biggest model. There is huge competition there, but this competition is, I would say, not healthy at the end of the day: we are trying to build the biggest model, thinking that the bigger we are, the better we should be. So people started looking at this problem in a different way: different scaling laws came up, demonstrating that the quality of the model is not necessarily related to the size of the data that you put inside, but rather to the quality of that data. Just to give you an example: when TII released Falcon with 40 billion parameters, we were doing better than LLaMA, for example, and honestly the architecture we had inside was not that much more sophisticated than LLaMA; the only thing that was done was to take the data and clean it in a better way. We used a smaller portion of the data, but the team cleaned it much better, and the results became much better than what LLaMA had at that time, so we were ranked first on the different leaderboards. Then LLaMA came back with its second version, which was better, and then we came again with Falcon 180B, which is better still. So we don't know where things are going, but there is this new vision saying that maybe we shouldn't keep growing the size; we need to look into other things. This is justified not only by the researchers working in the area but also by industry. This is why I said in the beginning that we had a lot of meetings with different industrial partners and players, who are actually complaining about this size, because you can imagine that if you have a model with 180 billion parameters, even running the inference is costly, so
Those businesses are not ready to spend that amount of money to run models, and we can add that those models are too generic: they are not necessarily specialized in the business problem they want to solve. So the equation is becoming more complicated and we eventually need to look into other options. In this slide we tried to look back and see whether there is a parallel between what happened in the past in the hardware world and what is happening today with LLMs. In the hardware world we started by building big computers, and at the time the logic was the same: if my computer is bigger, it should perform better. Then we found that it does not necessarily follow that logic, and we started building smaller computers, personal computers, and so on, to the point where nowadays we have phones that can compute more than some laptops, we have the Internet of Things, and more. I believe this generative AI will follow a very similar trend, going from "the bigger the model, the better the quality" to "the smaller the model, the quality should still be better, or at least equivalent to what we have". This opens the way to look into the edge. What is the edge? Basically, the edge of the network: the small devices that are out there, which today are used mainly to capture data and display results, rather than doing any computation. So, can a single large model sort out everything? This is the question we are asking more and more nowadays, and I think the simple answer is no. We face a lot of issues when it comes to the practical dimensions. Businesses, as I said, are not ready to spend a lot of money on a general model that does not necessarily serve their objectives. I hear that we can do some fine-tuning, for example; yes, but that is also costly, so people are questioning the use of these models. There are other aspects that justify this "no" answer. You have domain specificity. You have data availability: as of now, or at least as of three days ago, all the LLMs are built on historical data, while businesses are interested in real-time data, in how to process it and integrate it into the LLM to be able to exploit it. You have multimodal tasks: people are not interested only in text; we see more and more multimodal models coming out, but we still have the same issues in terms of generality of the content, issues with fine-tuning, and so on. You have the cost related to LLMs and generative models: building a model is not open to everyone for the moment; there are big players building such models, OpenAI, Google, Meta, TII, which is spending a lot on this, but for small companies, or companies simply not in that business, the cost may be extremely high and too risky. You have the issues of privacy and security: your data goes into the LLM, you don't know what happens with it, and your competition could get access to it without you knowing anything. And customization, which concerns those working, for example, on the web and recommender systems: there is no personalization added by default to the LLM or the generative model.
Then you have scale and complexity: again, LLMs and generative models are specialized in generating content, they learn patterns and give those patterns back; there is no reasoning there, as I said. And then you have the ethical and bias considerations, and the carbon footprint, which many of the actors I mentioned are trying to work on, but a lot of effort is still needed. Just to give you an example: to train Falcon 180B, the amount of computation that was used, correct me if I'm wrong, Maksim, was I think 4,000 GPUs for 6 months. That is huge, and in terms of energy I think it is more than the cars circulating outside for quite some period. So, for generative AI and the different domains, the conclusion from the previous part is that we need to think of more specialized models and rethink the approach we have taken for building these models: instead of building general, big models, maybe the idea is to look into smaller models and exploit smaller devices. For those who are not familiar with how this generative AI is built: part of the process is to get your data, be it text, images, speech, whatever you have; then you do some preparation of the data and you do your training; this is your foundation model. But then, when you want to apply it to a specific domain, say energy, education, or finance, you need to do some fine-tuning. Fine-tuning is the adaptation of the general model to a specialized domain or a specialized task. This part is of course less costly than the pre-training, but it still costs you something: the adaptation itself, and then usually the hosting of the inference. So to get to specialized models you have this fine-tuning, but we can also think of building much smaller, more specialized models from the start. There are several arguments pushing us towards other strategies. Privacy and security concerns: most companies and businesses are raising these issues, and you even have governments setting rules that business data should not leave the country. This is what we have in the UAE, for example: your data as a business should not leave the country and should be processed and stored in the country, so using services like ChatGPT is against the law, and none of the companies is using that kind of thing. This argument will come up more and more, pushing everybody either to build their own generative model or to think of strategies that will help build smaller models in the future. Then there are the high cost constraints: everybody complains about the cost; whenever you mention it, people are not prepared for it. There is also awareness: people don't understand the energy or the computation needed to run these kinds of models, but still the costs are high and everybody complains about them. You also have opportunities opening up at the edge level: the computation power there is getting more and more interesting, so we can do some things at the edge, basically on your computers and your phones, for example; it is not the same order of magnitude as GPUs or clusters of GPUs, but maybe it can be exploited for this. And there is high demand and high expectations on performance: people want less latency, for example.
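Going back to the fine-tuning step in the generative AI process described above, here is a minimal sketch of how a general foundation model can be adapted to domain data with parameter-efficient fine-tuning. The checkpoint name, the LoRA settings, the toy "domain document", and the training loop are placeholders of my own, not the speaker's pipeline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Placeholder base checkpoint; any small causal LM works for this sketch.
base = "tiiuae/falcon-rw-1b"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Keep the base model frozen and train only small LoRA adapters on the
# attention projections, so adaptation touches a tiny fraction of the weights.
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["query_key_value"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# One toy "domain document"; in practice this would be the business's own data.
batch = tokenizer("Quarterly maintenance report for wind turbine T-17 ...",
                  return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
for _ in range(3):                       # a few illustrative steps
    out = model(**batch, labels=batch["input_ids"])
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Only the small adapter weights then need to be stored and served per domain, which is part of why this style of adaptation is cheaper than retraining a full model.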
Good. So this is a very high-level picture of the infrastructure we have today: we have the cloud, and then we have edge devices connected to the cloud. As I said before, the edge devices are currently used mainly for capturing the data and displaying the results you may get. In most of the techniques we have, we use these devices to capture the data, we send it to the cloud where the computation is done, and when we have the result we send it back to the device for display. We could look into that, and I will link it to my conclusion later, but we believe the edge layer is not used that much. We could use that layer in a better way to sort out the privacy issues, the cost issues, and the latency issues, and these are the reasons that pushed us to start thinking about the edge. So what is edge machine learning? Basically, it is a combination of edge computing and machine learning; the objective is to build and execute machine learning models directly on the edge. Of course we have to start from somewhere: we will not start by building an LLM directly on the edge, but what we are trying to do currently is to build at least traditional machine learning or small deep learning models on the edge, and then find ways to go a little further. This edge machine learning can operate on one device or on multiple devices; the constraint is no sharing of the data. We don't want to share the data with the cloud, for example, or with other devices, or at least we should have control over that sharing, and we hope we can offer capabilities similar to what you get from a cloud system. Some questions that motivate the use of the edge: is there a real need to share the data with the cloud, is there a need to let your data leave your device and go somewhere else; how can we allow training to happen directly on the edge, which is a very important issue because of the limited computation you have there; and how can I efficiently execute those models on the edge when it comes to inference? We wrote a paper, a long one, which I will try to summarize in four or five slides, where we tried to understand the different requirements of edge machine learning, divided into three parts: machine learning requirements, edge computing requirements, and some overall requirements related to everything. From the machine learning perspective you have low task latency; high performance, which we always need when it comes to the computation; generalization, which we again need when we fine-tune our models; enhanced privacy and security; and independence from labeled data. From the edge computing side you have efficiency of computation, optimized bandwidth, offline capability, and low communication latency. The offline capability matters because people may have issues with the network and still need to use their services; that is a very important thing and it should be enabled by edge machine learning. And then you have cost and energy, which I would say relate to everything. So we have to play with all these parameters when we think of doing edge machine learning; it is not an unlimited resource like the one we have in the cloud.
There are three parts when it comes to edge machine learning: learning on the edge, inference on the edge, and some work around the preparation of the data directly on the edge. The paper I was referring to is this one, for those who are interested; it was published a few weeks back. It is a long paper, around 50 pages, but it goes through the different aspects of edge machine learning. For inference, I try to summarize things in this slide: there are different ways of bringing a big model to run on a smaller device. The first one, I believe everybody is familiar with, is quantization; we are working a lot on that. We work on the models to reduce the encoding of the data so that the model becomes smaller while we keep its quality; we are able to quantize models from 16 bits down to 4 bits and keep very similar performance. There are people working on weight reduction; we are not handling that part for the moment, but the idea, a sort of pruning strategy, is something we are exploring. There are knowledge distillation and activation approximation; I am going quickly over these because we are not doing much there. On the last two, though: early exit we are exploring a lot, combining it with model compression, and caching comes up when exploiting large language models on the web. Some reflections on inference on the edge: there is a variety of methods with promising results; some methods are able to reduce models 32 times and still deliver very similar quality. But there is a lack of real public implementations: we don't have many things that are stable and can be commercialized, for example, which means a lot of work remains at this level. There is a lack of automation: to do this quantization you need people working on it; it is basically trial and error, you see what is not working and you try again and again until it works. This is also related to the diversity of physical architectures: it depends on your processor and on the physical capacity you have, so you need people behind it to control it. Again, we are investing a lot in this part, trying to bring solutions that automate the quantization, because it is very important and it should bring an answer to all those businesses that do not want to see their data moving outside their company. We don't have a killer application yet, so more work is needed; for those who want to work there, there is really a lot of effort to be spent.
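To make the quantization idea concrete, here is a toy sketch of symmetric post-training weight quantization; this is my own illustration, not TII's tooling. The point is simply to re-encode floating-point weights on a coarse signed-integer grid and dequantize them at inference time.

```python
import torch

def quantize_symmetric(w: torch.Tensor, n_bits: int = 4):
    """Map float weights to signed integers with a single per-tensor scale."""
    qmax = 2 ** (n_bits - 1) - 1                  # e.g. 7 levels each side for 4 bits
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 4096)                       # stand-in for one layer's weights
q, scale = quantize_symmetric(w, n_bits=4)
w_hat = dequantize(q, scale)

print("storage per weight: 4 bits instead of 32")
print("mean abs reconstruction error:", (w - w_hat).abs().mean().item())
```

Real systems use per-channel or per-group scales and calibration data, which is exactly where the trial-and-error and automation problems mentioned above come from.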
Learning on the edge started more recently. There are different approaches, but the objective is to build the model directly on the edge: we don't want the help of the cloud, we want to build directly on the edge, with its limited resources, and we need to work within that. There are different methods, and I list them here; the most used one, I would say, is distributed learning, where we try to distribute the learning across different devices. Everybody is aware of federated learning, I guess, but there are other methods in the same space. We are exploring this, but from a theoretical perspective, in the theory team; when it comes to the application, the concrete deployment of these kinds of things, we hit a lot of physical constraints that still need to be solved, especially when you use heterogeneous devices, for example. That is not an easy task, whereas in theory we tend to abstract all those constraints away and not take them into consideration. So we have federated learning, we have split learning, and we are also exploring transfer learning; I will share the slides for those who are interested, so as not to go into all the technical details. Then we have the summary, the taxonomy we tried to build for edge ML, covering again inference, edge learning, and data preprocessing. There are a lot of methods out there; in terms of research papers there are many, and I think the paper we published has more than 250 references, but when it comes to industry, the platforms that support these kinds of things are very limited, and I think there is a need for investment to make things happen, because this would help bring LLMs to be built, or at least have their inference done, at a lower cost, and so generalize the use of these LLMs. A short summary on the learning part: generalization and adaptation are complicated when you do the learning on the edge, and we still need to set the theoretical foundations for that. On the architecture side we always face the issue of heterogeneous devices, which I believe needs to be taken into consideration when building that theoretical foundation. There is the hybrid approach, which should be exploited in my opinion, hybrid in the sense that I can use the edge but also the cloud, as a collaboration between the two. There is data quality and assurance: any business you talk to will tell you they have issues with this, because we assume a lot that the data is good, but in reality it is not that good, so we need methods that verify the quality of the data; we have a PhD thesis working on that. And then there is standardization, which needs to come to help us with this diversity of devices.
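As a minimal sketch of the federated-learning idea mentioned above (assuming nothing about TII's actual stack): each device trains a local copy of the model on its own data, and only the model weights, never the raw data, are sent back and averaged.

```python
import copy
import torch
import torch.nn as nn

def local_update(global_model, x, y, epochs=1, lr=0.05):
    """Train a private copy of the global model on one device's local data."""
    local = copy.deepcopy(global_model)
    opt = torch.optim.SGD(local.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(local(x), y).backward()
        opt.step()
    return local.state_dict()

def fedavg(state_dicts):
    """Average the weight updates returned by the devices (plain FedAvg)."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg

global_model = nn.Linear(16, 4)                                   # toy classifier
devices = [(torch.randn(32, 16), torch.randint(0, 4, (32,))) for _ in range(3)]

for communication_round in range(5):
    updates = [local_update(global_model, x, y) for x, y in devices]
    global_model.load_state_dict(fedavg(updates))
```

The heterogeneity problems the speaker mentions show up exactly here: real devices differ in compute, data size, and availability, so the plain average has to be replaced with something more careful.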
So I brought three use cases, just to show you quickly some of the work we are doing on the edge side. This is George, who is trying to show us something. What we are trying to do in this application is to show the possibility of building a machine learning model, a neural network, directly on the phone; we illustrate it with activity monitoring. George has a small model running on his phone, and he just showed that it identifies the activities it knows. Then he tries a new activity that is not known by the model. What he does now is record some samples of data for that activity and then rebuild, or rather update, the model directly on the edge. Here he is collecting some samples; now he got tired, so he can run the training. The important thing to keep in mind is that the training happens on the phone: we are not sending any data to the cloud, everything happens on the phone. It takes a little bit of time; the point was not performance, but to show that we have this capability to build and update the model directly on the edge. Now the model has been updated, and he starts the inference to check whether the new activity has really been added to the model, whether the model has been updated, and he shows that it has. Those who work a lot with neural networks know the issue of catastrophic forgetting: as you learn new things you forget the previous ones. We have integrated that into the system as well: we are able to incrementally update the model without forgetting what has already been learned. The second use case is a set of reinforcement learning algorithms we use to help with drone navigation. We combine image processing with reinforcement learning; the idea is to help the drone autonomously explore an area without colliding with the obstacles that are there, and the final objective is of course to run these models directly on the edge, on the device, that is, the drone. The nice thing is that we learn some situations, since it is a reinforcement learning strategy, and then the environment can change, but the model still works properly and the drone still avoids collisions. I will jump quickly to the situation where we added obstacles, it is this one normally: you see the black obstacles are new, they were not in the initial learning environment, but the system is able to recognize them and keep going. The next and last use case is web navigation. Here what we are trying to do is integrate these models into web navigation: we have extensions for the browsers you use, and for any page you visit you can request a summarization; you can also discuss the page, interact with it through questions and answers. This could work with the cloud, but again, what we wanted was to bring these things to run on the edge. So we used the work we do on quantization: we quantized the models and brought them to run on CPUs, and when you install the extension you get an instance of the model on your computer, so whenever you discuss any page on the web, your data stays on the edge and doesn't go outside. Here we asked for the summary and then started asking questions about the content of the page, for example, how much funding is allocated toward low-carbon solutions, and after some time you get an answer explaining it; we asked different questions, and the system takes the data from the page and answers them.
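Going back to the first use case: the incremental update without catastrophic forgetting can be approximated very simply with rehearsal, that is, retraining on the new class mixed with a small replay buffer of earlier samples. This is my own toy sketch of that idea, not the code that runs on the phone.

```python
import torch
import torch.nn as nn

# Toy activity classifier: 64 sensor features, 5 activity classes (the 5th is new).
model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 5))

replay_x = torch.randn(100, 64)                  # small buffer of old-class samples
replay_y = torch.randint(0, 4, (100,))
new_x = torch.randn(20, 64)                      # freshly recorded samples
new_y = torch.full((20,), 4)                     # label of the new activity

x = torch.cat([new_x, replay_x])                 # mix new data with the replay buffer
y = torch.cat([new_y, replay_y])

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for _ in range(50):                              # short on-device fine-tuning
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
```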
Now, these generative models should also be mixed with what we call traditional machine learning models; you can then have hierarchies of models, coordination of models, as we do in web services for example, and build a more complex system that brings solutions for more complex problems and more adapted situations than what we have today. So, the future of generative AI: it should be multimodal, that is an important thing to keep in mind; we do not want only text or only images, we need combinations of different types of data. Specialized models should be at the heart of the ecosystem; again, we don't want huge models that are too generic, or at least not only those. We need to build collaborative strategies between all those models. We need reasoning capabilities, which we do not have nowadays. Security and privacy need to be taken into consideration. And the last one, and I think the most important, is the actionability that needs to be attached to generative models. As of today the models only generate data; they can recommend things for you, things to do, things to watch, whatever, but when it comes to the action, you have to do it yourself, you have to follow up and do the real work. There is a need to integrate some actionability into these models to make them really supportive for the end user and for the business. I think that's all from my side, thank you.

I have a question. Some phones are starting to have accelerators specifically for this kind of computation, for example the Neural Engine in iPhones. The question is how accessible they are and what the perspective is, because as I understand it, they are not really accessible now, the Neural Engine and so on. What do you think about the perspective for the next three years?

That's a good question, thanks, Kirill. I think these things always start with the same pattern: the devices are expensive at first, but as we move forward new technologies come and they become accessible. To be fair, it is a matter of how much funding people have: those who are able to budget 4,000 GPUs for 6 months can certainly access them. On our side, we have started acquiring some of those devices, playing with them, at least for the moment, and seeing what we can do. But I think they will become widespread soon with the new architectures that are coming; you can check NVIDIA and the different companies trying to come up with new devices and new architectures. In my opinion they will become accessible much faster than we think.

Could AI be used in genetic engineering and nanorobotics, and if yes, how?

That's a good question. Well, nanorobotics, not at that level, I believe. We would need to explore it; it is a matter of how much computation we can exploit there, and I don't have a direct answer, I need to explore a little more. My understanding is that we don't have a lot of computation at those levels, so we need to look into it. That would be my answer for the moment, because you are taking me much lower in terms of the edge, right? So we need to think about that.

Thank you, Hakim. Any more questions?

As far as I understand, edge AI is about computing things on the user's devices, what the user asks for. But is there any room for exploring computing everything on a network of devices that are, say, signed in to use the service, something similar to a torrent network? Yes, there could be issues with privacy, but maybe encrypting the data would help with something.

Yes, definitely. The edge alone, I don't think, will solve all the issues. I think there should be a collaborative approach where the edge, the small devices, collaborate together, and they also need to collaborate with the cloud. The message here is not that we should eliminate the cloud from the equation; it is rather that devices have to collaborate with each other, but they still need to collaborate with the cloud.
Because at the end of the day you may have some stuff that needs to be collected at the cloud level and then computed there. But there could be a peer-to-peer element: in terms of protocol, you can have a peer-to-peer approach where different devices collaborate to build something.

Thank you, and thank you for the talk. Any more questions?

Okay, so maybe related to the previous question: what do you think about the agent approach to LLMs, where you don't have this one giant "big boss" LLM but rather a protocol of communication between different LLMs, where some specialized LLMs know how to better answer certain questions and others know how to answer questions in another domain or another language, similar to how human society is a kind of distributed system? Does this model fall into this edge computing paradigm, or is that something different?

Well, I totally agree with that, I am really aligned with it, and that's why I tried to build the figure we have here. You will have the models, and what you call an agent is basically a model; the models have to collaborate between them, and this collaboration can happen one-to-one, for example one model with another, but you could also use some more classical machine learning components, because the generative model will not solve everything. And then you will have other layers that coordinate, that build the more complex logic that needs to be executed.

Thanks. But human language is the most complex system ever invented, right, and it is actually a mechanism to communicate. For LLMs, for artificially created systems, what should this protocol be? Do you have any idea whether it should also be human language, or some artificial language? What would be an efficient protocol, an interface, for this communication?

I would say that's an open question; we need to work a little more on that.

Thank you. Well, it seems we have no more questions. Still, is there someone I have not seen properly? Go ahead.

I wanted to ask: are you concentrating on general tools and technologies, or do you also work with requests from companies about their specific tasks?

That's a good question. I would answer it in terms of the organization we have. On the TII side we are more in R&D, basically building those generic tools, optimizing, working on the edge, for example. But then we have another entity, called VentureOne, that focuses more on interacting with customers, doing the fine-tuning and that kind of thing.

So, let's say, a company may come to you with a request and you will tell them what you can do to optimize their processes. Okay, thank you. I'm making a pause to be sure... I think we are good. So let us thank Hakim again.

Thank you. And now, almost immediately, in about five minutes, we start the next session. I remind you that we have two parallel sessions: the one on theoretical machine learning and optimization will be in a different room; those who have NLP stay here, we start the NLP session in this room, and the session on machine learning and data analysis is in the other room. Let's start with the first talk.

Thanks. Good afternoon, everyone. It is my pleasure to present a work on multi-label topic classification for the Kyrgyz language, our joint work with Sergey Nikolenko of the Steklov Mathematical Institute in St. Petersburg, and other places, and Gulnara Kabaeva of the Kyrgyz State Technical University named after I. Razzakov.
A little bit of introduction and motivation. The Kyrgyz language is one of the languages of the Turkic family, of the Kipchak branch, and several million people can call it their mother tongue, mainly of course in Kyrgyzstan, but also in China, Tajikistan, Pakistan, Uzbekistan, Afghanistan, and Russia. However, despite a certain corpus of research with a computational-linguistics flavor dedicated to Kyrgyz, the number of open language resources for Kyrgyz is rather small, and when trying to solve some applied problem that involves the Kyrgyz language one often runs into obstacles due to the lack of language resources. So one can say the language is definitely a low-resource one. Until recently, and here we can say thanks to Sberbank, there were no general-purpose LLMs for Kyrgyz; however, as for many other languages out of those hundreds, there are two families of models trained on multiple languages on Common Crawl and other large bodies of text, and Kyrgyz is among those hundreds of languages, so one can attempt to use XLM-RoBERTa Base and Large, BERT Base Multilingual Cased, and so on. And, as we've just discussed, although certain solutions may arise in the nearest years, current NLP still leans towards universal models. Still, a reliable evaluation for any language is necessary, and arguably impossible without manually annotated resources, which is why we decided to make a first step and develop the first dataset for an applied task which could be suitable for fine-tuning some LLMs on an applied Kyrgyz task, just to find out whether it is at all possible; the answer to this question is not as evident as one may believe. After that we will publish a benchmark (our competition is still ahead, so we haven't yet) with all the results, all the models, and all the data. We built our own dataset with the kind permission of the editors of the 24.kg agency, a Kyrgyz news agency: we scraped the Kyrgyz-language section of the site, which yielded 23,000 news articles. On that site there were no topical tags for news in Kyrgyz; as you can see at the bottom of the slide, certain topics are present, but only the Russian texts are annotated with them, and they are not quite suitable anyway: they are too general, and some of those so-to-say topics are actually multi-topic projects. So we had to decide on a set of labels. To do that one could try to use some general-purpose scheme, like the IAB taxonomy used in advertising, or the DMOZ taxonomy in the older days, but those are also too broad. We attempted some zero-shot approaches, and unfortunately nothing worked, and in private conversations practitioners told me many times that when you work with topical data from a single source, the label set unfortunately has to be custom. So that is what we focused on. But one can't just invent those labels, so we decided to do so-called exploratory annotation: we sampled 500 texts, translated all the titles into English, obtained sentence embeddings for all of them, and grouped them by hierarchical clustering; then, in manual annotation, for every cluster we attempted to invent a title general enough to cover most of the news in the cluster, and added some extra labels for multi-topic texts. After that quick first pass we did the re-annotation again from the start, because for, say, the second half of the data, some labels that we introduced later were not available at the start.
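A rough sketch of what such an exploratory-annotation step can look like in code; the embedding model, the number of clusters, and the example titles are placeholders of mine, not necessarily what the authors used.

```python
from sentence_transformers import SentenceTransformer
from scipy.cluster.hierarchy import linkage, fcluster

# Placeholder English translations of sampled Kyrgyz news titles.
titles_en = [
    "Parliament approves the new budget",
    "Football team wins the regional cup",
    "Heavy snowfall closes the mountain pass",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")    # any sentence-embedding model
embeddings = embedder.encode(titles_en)

# Hierarchical (Ward) clustering over the title embeddings.
Z = linkage(embeddings, method="ward")
cluster_ids = fcluster(Z, t=2, criterion="maxclust")  # pick a number of clusters to inspect

# Inspect each cluster and manually invent a topic label that covers it.
for cid in sorted(set(cluster_ids)):
    members = [t for t, c in zip(titles_en, cluster_ids) if c == cid]
    print(f"cluster {cid}: {members}")
```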
So, here is an example of a cluster and the proposed annotation: clearly all the texts are about certain fines and crimes, but sometimes with a political flavor, sometimes with an ecological flavor, and so on. Having done that, we decided that the label set was established, and then we did the same thing with two more 500-text batches, which yielded a dataset of 1,500 texts; the first two batches, 500 plus 500, were then used as a training set, and the last one as a test set. Here are the label statistics for the first two batches; one can see that, arguably, the distributions of labels are relatively similar, which can also, arguably, signify that the annotation procedure and the annotation scheme were relatively consistent throughout the process. This yielded 20 topics. As for the experimental setup, it was pretty standard, but with a multi-label twist: we had to do an accurate splitting so that the distributions of labels in the splits would be similar, so we did iterative stratification. For the bag-of-n-grams models we used two-fold cross-validation, because the dataset is rather small; for neural approaches, which are computationally harder, we used a simple train/dev split, but basically the same splitting procedure. For tokenization we used NLTK, and we are lucky to have a morphological analyzer for the Kyrgyz language from the Apertium project, which we used for something one could call stemming, solely because it seems to be more or less standard for Kyrgyz, although one could probably use NLTK or something from a standard package instead. Okay, now for the models. We are about to build a benchmark, so we tried some very basic bag-of-n-grams models with extensive hyperparameter tuning. There were two groups of methods. One group is based on linear models: plain logistic regression, and classifiers trained with stochastic gradient descent, basically linear SVM and logistic regression. The first approach, which we call independent classifiers, is just a set of independent binary classifiers, one for each label; the chain approach is also a standard one, with classifiers trained in a row, where the training data for the next classifier in the sequence uses the predictions of the previous one. The other approaches may not be a great choice for large sparse vectors, but we included them since they are truly multi-label and yet classical enough, of the nearest-neighbors flavor for multi-label classification. And here go the results. For evaluation we used several metrics; probably the most descriptive one is the sample-wise Jaccard index, computed for every pair of predicted label set and gold-standard label set and averaged over all samples. The numbers show that the performance is not that great overall. We also computed exact-match measures, Hamming distance, and F-measures, in the micro-averaged flavor and also a sample-averaged version, and we added the percentage of times when at least one label was guessed correctly. What we see here: the first approach is a simple bag of raw token n-grams with different hyperparameters (these are the best groups of hyperparameters found with grid search for each family), and the results are not great; clearly, as expected, the nearest-neighbors methods perform poorly, and when we move from raw tokens to character n-grams, quite expectedly, we get a boost in performance.
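For readers who want to reproduce this kind of baseline, here is a minimal sketch of the setup described above: TF-IDF character n-grams, independent per-label classifiers versus a classifier chain, and the sample-wise Jaccard index. The texts and labels are toy placeholders, not the actual dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import ClassifierChain
from sklearn.metrics import jaccard_score

# Toy corpus with multi-hot labels over two topics: [politics, sport].
train_texts = ["parliament passed the new law",
               "the team won the football match",
               "minister opened a sports arena"]
Y_train = np.array([[1, 0], [0, 1], [1, 1]])
test_texts = ["deputies debated the draft law"]
Y_test = np.array([[1, 0]])

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

for clf in (OneVsRestClassifier(LogisticRegression(max_iter=1000)),
            ClassifierChain(LogisticRegression(max_iter=1000))):
    clf.fit(X_train, Y_train)
    predicted = clf.predict(X_test)
    print(type(clf).__name__,
          jaccard_score(Y_test, predicted, average="samples"))
```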
If you just compare the numbers here and here: when we move to stem tokens we also get an improvement, also quite expectedly, since the language is a morphologically rich, agglutinative one, and removing almost all affixes gives a boost. An interesting observation is that when we take stems and also convert them to character n-grams, this improves the results further. And that's it for the very basic approaches. But probably the most important thing we were planning to do with this benchmark, apart from publishing it, was fine-tuning some multilingual language model. We miserably failed with fine-tuning multilingual BERT: it just couldn't produce any reasonable results, however hard we tried. But we achieved certain success with RoBERTa Large with BPE tokenization, and as you see it outperforms all the previous approaches by a large margin. It is also important to note that for the bag-of-n-grams and bag-of-stems approaches we did quite an extensive hyperparameter sweep. You may have noticed that this annotation approach, this size of the dataset, and everything else is not without flaws, and we are going to add some new evaluations to the benchmark. First of all, we will try to apply the model that appeared before the revised deadline, the mGPT Kyrgyz model. Also, as the reviewers rightly noted, it would be nice to translate the Kyrgyz texts automatically into English and apply some English-based models like BERT, just fine-tuning them, to see how well that works; this is clearly something that should be done for low-resource languages, to test what can be achieved by the state of the art. We also believe that zero-shot learning and prompt-engineering approaches, which were not quite successful when we tried them, should not be thrown away; maybe we should return to converting articles into something suitable for prompts. But most importantly, I guess, and due to certain recent developments in Bishkek this is easier now, we will annotate more data with the help of multiple native speakers, and we will do it properly, with a fixed instruction and several annotators per sample. Since I have some time, just a short note: this work, at least the annotation, was carried out mostly in 2022, but something changed in 2023: a large data science community evolved, and a large sub-community of volunteers appeared in Bishkek; we have since done something else, some named entity recognition datasets, and more efforts are on the way, such as yet another Kyrgyz corpus. So there is more work that lies ahead, and of course it should be done, with or without our participation. I think that's it, thank you for your attention.

Thanks, a lot of work has been done. One question I'm interested in is whether a transfer learning strategy through multilingual language models is preferable over just pure translation of datasets. Let's say you don't have resources for Kyrgyz but you have a machine translation system: one way is to just blindly translate the datasets and train something, so you train a model for the specific language, like a Kyrgyz mGPT, and you get certain quality. Another way is to take a multilingual model which supports a lot of languages, train it on, let's say, an English dataset, and assume that somewhere in the model there is a separation of knowledge: the model learns to, say, detect positive or negative sentiment, and then this transfers to Kyrgyz.
So, in practice, if you just need to choose, which would in your opinion work better out of the box?

Thanks for the question. Let me just clarify: the first part was about translating something into Kyrgyz? Yes, that's actually a great idea, and that's something colleagues in Kyrgyzstan are doing for other tasks. But I would also like to ask you to clarify the last part of the question.

The last part would be not to translate: assume that translation is prone to errors, and instead train on a clean dataset, say for English, with not a monolingual but a multilingual model which was just pre-trained on Kyrgyz, so it knows that Kyrgyz exists, but is trained on some English or Russian dataset and then starts to solve tasks in Kyrgyz. The question is which strategy is preferable: to translate, with errors, but still train a monolingual model for the target language, or rather go with this knowledge-transfer approach.

Thank you for the question; yet again, it's a good question and I don't have a good answer, unfortunately, only beliefs. I have met the second thing, training on English and sharing knowledge within a multilingual model; we did that with information extraction. But from my experience on other tasks of this sort, mostly with the Russian language, I believe the first option is preferable: translating some standard dataset, if possible, into Kyrgyz and fine-tuning on it should be better, I think, based on my experience on tasks that are slightly less relevant than they could be.

That's also kind of my experience, it's a very strong baseline. I just wanted to thank you for your talk; I have several questions. The first one is just an idea: normally, if you don't have a large dataset for some task, you search for another language that is quite close to this one. Do you know which languages are close to Kyrgyz, and are there models related to them? Maybe in future work you could rely not only on the multilingual models and not only on translation, but on something related, the way Russian, Belarusian, and Ukrainian are related to each other, and probably tune some of those models. What is your experience in this case?

Thank you for the question. Speaking of similar tasks in similar languages: for this particular task, text classification, even without the multi-label twist, what I found, to the best of my knowledge, are some datasets in Turkish and in Kyrgyz written in Arabic script, the Chinese flavor of Kyrgyz, and nothing else. But speaking of translations into similar languages, there is a 2020 work by the Turkic Interlingua community, which is now, I guess, a special group for Turkic languages at ACL; their model could probably be used for the task, but I really don't see any datasets of this sort in other languages. Of course, the processing community does exist; Kyrgyz is rather scarce in terms of resources, and the situation is a bit different with Turkish or Uzbek, but for this task I haven't found anything else.

Okay, thank you, and one more quick question. The dataset you are creating is very specific: it's multi-label and it's in Kyrgyz. You said you are going to reuse it for benchmarks for Kyrgyz or other things; correct me if I'm wrong, but are there any reasons why this specific dataset is multi-label? There are many other tasks you could start with; was there another specific reason, some application maybe, or what was it?
Okay, thanks. Well, I would say there was no specific motivation or inspiration for that, apart from the fact that, in my opinion and vision, topic classification is one of the tasks an average data scientist could meet in their practice, something like sentiment analysis or topic classification; named entity recognition is already something on a different level. So that was the motivation; it is one of the classic tasks of information retrieval, and I just decided it was a good idea.

Thank you very much. Sorry if I missed something in the beginning, but can you elaborate a bit on the annotation quality check: how do you guarantee it, how many annotators do you use, do you use some crowdsourcing, or do you have plans for this?

Yeah, thanks for the question; the rather descriptive phrase in the beginning was "quick and dirty", so that's it. I was actually planning at some point, well, I have certain access to experts who are more proficient in the Kyrgyz language, to native speakers who would check it yet again; I mean, one of the authors is from Kyrgyzstan, but still. So the only more or less sophisticated procedure for establishing or guaranteeing some quality was the one I described for choosing the label set; as for the quality check itself, that was just reading the data multiple times.

Thank you for the work anyway, it's a lot of work. Okay, so the next speaker is remote.

I'll start straight away with a brief introduction to the problem. In summarization there are two approaches: extractive summarization leverages existing text fragments to select a set of highlights, and abstractive summarization improves on the extractive one by employing additional language resources to paraphrase and combine the selected fragments into concise sentences. Now, the main approach to abstractive summarization is sequence-to-sequence, where we have an encoder that extracts the contextual information and a decoder that generates the summary in accordance with this contextual information. This preference is justified, since GPT models that have several times more parameters just fail to achieve the level of performance of specialized encoder-decoder counterparts, even after fine-tuning, as was shown in the second plot, which is from the official OpenAI article about summarization: they showed that T5 performs better in terms of human evaluation than their fine-tuned GPT-3. Now there is evidence that the classic sequence-to-sequence approach is not enough: several works showed that integrating extractive summarization into the training and inference loop improves the quality substantially, especially for full-transformer models, and that is the point of these works. What distinguishes full-transformer models from other architectures is that, besides the encoder-decoder bridge, attention is used for all intermediate embeddings in all layers, meaning that the overall impact of attention is much larger, to the extent that attention patterns are now part of summarization models. So the models that are better aligned with ground-truth extractive labels happen to perform better and converge faster to more optimal results, since they spend less time searching for important sentences and just learn to paraphrase and combine them. Many researchers have therefore argued that it could be beneficial to correct this attention. The first approach is binary masking of the attention mechanism: it works by simply selecting the important parts of the sequence.
The problem is that this is equivalent to token removal: the model will not attend to the masked parts, so the information is not propagated, and, most importantly, the centrality of the context shifts in another direction, meaning the optimal summary would be different. To alleviate the issue, researchers came up with the idea of applying the content-selection masking only to a subset of layers and attention heads: they search for the ones responsible for saliency estimation and apply to them the mask obtained from some content selector, maybe an extractive summarization system, maybe some query from the user. The only existing alternative to that approach is complementing the existing attention mechanism so that it receives more complex guidance signals. Relevance attention, the latest state-of-the-art approach of this kind, uses semantic query-document metrics, applies some simple linear transformations, and aligns them with the cross-attention weights using simple interpolation, to guide the decoder towards query-relevant positions. We hypothesize that there is actually no benefit in tampering with the existing attention mechanism, because it still interferes with the natural information flow. For an alternative solution we looked for inspiration in the image processing area: the text-to-image DALL-E model uses one interesting technique which has no name, they just call it blending. It is based on the idea that the model uses CLIP embeddings, which map both text and images into the same vector space, meaning that if we take two different prompts, two different text sequences, encode them, and just take a weighted average, we obtain some intermediate image embedding, and the result seems to be quite stable. Following the same idea, we derived the biased encoder mixture. It is quite simple: we do not use different inputs, we just use different attention masks. One full attention mask goes through the original encoder with the original input; then we derive a selection mask and process it with either the original encoder or an auxiliary one, which is just our extension. The theory is that if we use an encoder that is more sensitive to masking, more focused on the masked positions, it will provide more amplified signals that better guide the original embedding towards the relevant positions, so the decoder will produce more optimal results. To test the method we derived two masking strategies. The first is based simply on ground-truth extractive label statistics; the second is dynamic, for the case when we plan to use the method in practice: it is based on an extractive summarization system of any kind, since we just need some distribution over sentences that denotes the saliency of a given position, and to obtain the mask we take the top-p of this distribution over sentence positions. We evaluated the methods on four datasets, well, three domains, since one of them is represented with two datasets: news, science, and dialogue. We used the respective state-of-the-art models, and for the methods' hyperparameters we used a simple grid search on the validation part. The quantitative results are quite promising: the biased encoder mixture seems to outperform every attention-modulation method in almost every scenario, and in the best-case scenario it brings an 8% improvement over the original model. In terms of quality there were important questions: how does it perform, does it violate coherency or relevance? It seems that none of the methods violates these constraints.
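Here is a rough reconstruction of the encoder-mixture idea as I understood it from the description above; the base model, the toy saliency mask, and the mixing weight are my assumptions, not the authors' released code.

```python
import torch
from transformers import BartTokenizer, BartForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

tok = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

article = ("The government approved new funding for low-carbon projects. "
           "Officials gave a long press briefing. The funding totals two billion dollars.")
inputs = tok(article, return_tensors="pt")

full_mask = inputs["attention_mask"]
salient_mask = full_mask.clone()
salient_mask[:, 10:20] = 0            # toy: hide tokens a content selector judged non-salient

encoder = model.get_encoder()
with torch.no_grad():
    enc_full = encoder(input_ids=inputs["input_ids"], attention_mask=full_mask)
    enc_salient = encoder(input_ids=inputs["input_ids"], attention_mask=salient_mask)

alpha = 0.3                           # mixing weight, tuned on a validation set
mixed = BaseModelOutput(
    last_hidden_state=(1 - alpha) * enc_full.last_hidden_state
                      + alpha * enc_salient.last_hidden_state)

summary_ids = model.generate(input_ids=inputs["input_ids"],
                             attention_mask=full_mask,
                             encoder_outputs=mixed,
                             max_length=40, num_beams=4)
print(tok.decode(summary_ids[0], skip_special_tokens=True))
```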
Here is a sample from the news dataset: there is a reference, the original model prediction, and then a set of alterations, corrections. Relevance attention, being based on a semantic similarity matrix, just injects named entities into the original generation; the other attention-modulation method does the contrary, it deletes some excessive entities: where there was "China and other nations" now it is just "China", and "the UK" is now replaced with "officials say", which is quite questionable. The biased encoder mixture is the most different of the bunch: it simply forces the model to revise the whole summary, since it creates a new embedding and the model understands that the text is likely to be different; the new summary tells the story about submarine drones that can work independently and travel over thousands of miles, and basically the biased encoder mixture version is more aligned with the reference summary. As for the alteration patterns, they are also different. The original generation can disagree with the reference at any position, meaning that the generated summary can be quite different from the reference. The attention-based methods don't take into account the initial mistakes at the initial positions; however, they scale with the article length, so closer to the ending they introduce changes, and the attention modulation is basically guaranteed to revise the ending. The biased encoder mixture is more radical and more uniform, yet still bimodal; it is of course more aligned with the intended reference distribution, and as we've seen before, it can force the model to completely revise the generated summary. As for the patterns of semantics: the attention-based methods are quite conservative; relevance attention, since it uses a semantic similarity matrix, keeps everything in check with the original generation, the other attention method is braver and has a wider range of semantic changes, and the biased encoder mixture is the bravest, I would say: it can completely diverge from the original meaning, yet it is still better aligned with the intended extractive summary. This concludes my presentation; I would be glad to answer your questions.

Well, you are referring to the Goose datasets, either the multilingual one or the one collected by Goose, but they are quite noisy. The reason why I chose CNN/DailyMail for the ablation, for the case study, is that it is the dataset whose summaries were written by editors, so they are quite aligned; the Goose data used automatically extracted summaries, sometimes filled with automated things, so they are not so reliable for the experiments. And besides, there are no counterparts: I used the BRIO and PEGASUS models, summarization-specialized models, and there are just none for Russian.

Thank you. I have a question, more of a theoretical question. Am I right that the datasets you are using summarize just one article each, not multi-document summarization of multiple articles, right?
Do you think it would be possible to transfer this summarization to multi-article summarization? Would it be applicable or not, and what could be the difficulties?

You are asking about the technique. Well, one of the main approaches to multi-document summarization is to pretend it is just one long document with multiple chapters: using some special markers, special tokens, we can produce special embeddings for these chapters so that it is distinguishable from a single long continuous document. And yes, the biased encoder mixture can be applied to any model that has an encoder, so even if you have multiple encoders (and that is one of the approaches, to use an individual encoder for each part, for each document), you can still apply the same approach for multi-document summarization.

Thank you. Because normally the problem of multi-document summarization is that if you just concatenate everything as one long document, the model tends to look at the beginning and the end of the text, because that is where the most summarized information is supposed to be; that is why, in some experiments I have seen, it doesn't work when you just combine all the sentences into one large text. That is why I was asking how it could work.

Yes, the context window is limited and the attention patterns get lost, so it could be an issue. Now I recall, there was one of the earlier works before transformers where they really did use multiple encoders and interpolated their embeddings, well, not with simple interpolation of course, they used a fully connected layer to combine these embeddings into one and passed them to the decoder, and they said it worked better than just passing the long sequence. But that was with recurrent neural networks; now we have GPTs that claim they can accept contexts of thousands of tokens, so I just don't know, I never tested it.

Thank you. Thank you.

My name is Ekaterina Zalivina, and today I present my work on automatic detection of dialect features of Pskov dialects in the speech of native speakers. The purpose of my research is to create a model for transcribing dialect speech; in this work we focus on Pskov dialects and provide researchers with a tool to detect the dialect features characteristic of these dialects in the speech of informants. Why is this important? As a field researcher I know firsthand that field data is collected manually, so our tool allows researchers to reduce the amount of manual work and concentrate on analyzing linguistic phenomena; and we present experiments using Russian dialect data, which are not so common now in work on automatic speech recognition. What about the steps? First we collect corpus data, then we form a manifest for the automatic speech recognition task and do the manual annotation for feature detection; we fine-tune models for speech recognition, try three approaches for detecting the features, choose the best ones, and create a big pipeline to work with them. What was the data in the corpus? We should mention that two corpora were taken; the data was collected during expeditions to the Tver and Pskov regions, and on the map you can see the locations of the villages where the data was collected. It is worth noting that the Opochetsky villages are located closer to the borders of Latvia and Belarus; although these dialects all belong to the Pskov dialects, existing dialect descriptions note a lot of differences between them.
How do we process the audio data? We take the files from the corpus, EAF or TextGrid files, and export the annotations to TextGrid files so that all files are of the same kind; then we do audio segmentation based on sentence length, clean the text and convert everything to lowercase, and generate a manifest file in the format presented on the slide. You can see a bit of statistics on what we obtained; I should say that these are very low-resource dialects and we work with a small amount of data. As for the text data annotation: for binary classification, each annotation was assigned one if it demonstrated the realization of one or more dialect features, and zero otherwise; for the rule-based and token classification approaches, each token in the annotation was labeled in the IOB2 notation, where B is the beginning of a sequence, I marks the second and subsequent tokens of the sequence, and O is the absence of a dialect feature; the slide shows a sample of such markup. As for fine-tuning, there were four approaches, and we have two corpora: in the first approach we train on the first corpus and then test on the same corpus, and the same goes for the second corpus; in the third and fourth approaches we have two iterations of fine-tuning, so in the third we fine-tune first on the Zapadnodvinsky data and then on the Opochetsky data and test on the Zapadnodvinsky data, and in the fourth we train the same way but test on the Opochetsky data. We use the basic metrics for these tasks: for speech recognition, word error rate and character error rate; for dialect feature detection, precision, recall, F1 score, and accuracy. For speech recognition we selected models pre-trained on standard Russian, using three architectures. Among the common mistakes of the first two models are combining two tokens into one, splitting one token into two parts, and inserting characters; to correct such errors we use a spell checker, Yandex Speller. We see the results on this slide: for the Zapadnodvinsky data the best model is the one where we train on this data and test on it, but for the Opochetsky data the best result comes from the approach where we fine-tune first on the Zapadnodvinsky data and then on the Opochetsky data, which gives better results. For the detection of dialect features we use three setups: binary classification of an entire sentence, binary classification of tokens, and multiclass classification of each token. As for the rule-based approach, we took a list of phonetic, morphological, and syntactic dialect features from a grammatical sketch and wrote a function for each rule; for example, to determine the realization of a vowel in the first pre-stressed syllable we follow this algorithm: we use a dictionary-based library to obtain transcriptions of standard Russian, then we determine the backness and height of the pre-stressed vowel and the presence of a palatalized consonant before it. For morphological and syntactic features we use pymorphy2 and Natasha. We know that this approach successfully handles the identification of dialect features at the level of phonetics and morphology, but variability is not taken into account: all positions in which a dialect feature can be realized are marked with a tag, whereas we know that variability is very common in dialects. As for entire-audio classification, MFCC features were calculated for each audio fragment, and three classifiers were used during the experiments; as a result we see that this method is useful rather as one of the intermediate stages.
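As a side note on the recognition metrics mentioned above, word and character error rates can be computed in a couple of lines; the example strings below are invented, not taken from the corpus.

```python
import jiwer

reference = "мы ходили в лес за грибами"    # gold transcription
hypothesis = "мы ходил в лес за грибам"     # ASR output with two word-level errors

print("WER:", jiwer.wer(reference, hypothesis))   # share of wrongly recognized words
print("CER:", jiwer.cer(reference, hypothesis))   # same idea at the character level
```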
Still, in this case we do see reliance on the audio itself and some consideration of variability. For token classification we fine-tuned XLM-RoBERTa for two sets of tags, binary and multi-class; we see a strong influence of variability on the choice of classifier, but this approach still cannot cope with lexical and syntactic features.

Next, we ran the experiments on the Opochetsky data. The rule-based approach shows higher results than on the Zapadnodvinsky data, which may simply indicate a more consistent realization of the features in the informants' speech. For entire-fragment audio classification, it is important that it is impossible without training on the target data: training only on the Zapadnodvinsky data is not enough, and it is necessary to train on the target dialect too. Finally, for token classification, the models detect very few dialect features without fine-tuning on the target data but give rather good results after fine-tuning on it.

We created a pipeline that combines the best models (a minimal sketch of it is given below). An audio recording in WAV format is accepted as input; we convert the audio to a single channel, then split the recording based on areas of silence, that is, fragments of audio below a specified decibel threshold; we obtain transcriptions with the best model, then a tag for each token, and finally we generate two output formats, TextGrid and EAF, to work with the Praat and ELAN programs. On this slide you can see how it looks in ELAN.

In conclusion: if the goal is a model that recognizes one selected dialect, fine-tuning a model that has already seen a close dialect gives a better result than fine-tuning a model trained only on standard Russian; you can see this in the case of the Opochetsky data. Phonetic, morphological, syntactic and lexical differences between close dialects do not impair the quality of recognition. The last data used for fine-tuning should be the target dialect for which the model is being built, not a close one, otherwise the quality will be lower; we see this on the Zapadnodvinsky data. And a new hypothesis has been put forward for further research: to create a universal dialect speech recognition model, it is necessary to fine-tune the model on the entire sample of dialects at the same time. Thank you, that's all. Any questions?
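Here is a minimal sketch of the pipeline just described: WAV input, mono conversion, silence-based segmentation, ASR, and per-token dialect tags. The model checkpoints are hypothetical placeholders, and the TextGrid/EAF export for Praat and ELAN is omitted.

```python
# Minimal pipeline sketch, assuming a fine-tuned ASR model and a token tagger exist.
from pydub import AudioSegment
from pydub.silence import split_on_silence
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="my-org/wav2vec2-pskov")  # hypothetical
tagger = pipeline("token-classification", model="my-org/xlmr-dialect-tags")    # hypothetical

audio = AudioSegment.from_wav("recording.wav").set_channels(1)                 # mono
chunks = split_on_silence(audio, min_silence_len=500, silence_thresh=-40)      # dBFS threshold

for i, chunk in enumerate(chunks):
    path = f"chunk_{i}.wav"
    chunk.export(path, format="wav")
    text = asr(path)["text"]                  # transcription of the fragment
    tags = tagger(text)                       # IOB2-style tags per token
    print(text, [(t["word"], t["entity"]) for t in tags])
```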
I have a question. I'm not familiar with Pskov dialects, so how would you summarize the difference between them from a linguistic point of view? I don't know what the distinguishing factor is.
They have a lot of similarities, but differences too: different realizations of phonemes, and morphological differences. For example, in the third-person present forms of verbs there is a final consonant in the Zapadnodvinsky data, but in the Opochetsky data there is no final consonant at all. We have a list of features, and for most of them the two dialects differ at least a little; I guess we will still be studying this for a while.
Could you artificially generate one dialect from another using those rules, in order to increase the training data?
It's a good question. I don't know exactly, but I think it is possible; nobody has tried it yet.
Maybe a more generic question: could these linguistic expeditions be modernized with, let's say, modern technologies, mobile phones and LLMs? Say you install a certain application and ask people to interact with some dialect agent, and the data is collected in a distributed and very cheap way. Do people actually start to think about these kinds of practices, or is there an inherent limitation to collecting linguistic data this way?
Right now it is not popular at all to use modern instruments on expeditions. I was on expeditions this year, and we still go there with microphones, record the informants, and then manually annotate their speech. But with this work I believe I have made a step towards modernizing the process, and I believe that next year we will go on an expedition and use this tool.
Thank you. Thank you for the talk, quite condensed; it's good that you arrived at a new hypothesis. My question is about this proposal for a universal dialect speech recognition model: how should it operate, what is your vision of this model, and what is its purpose? Should it predict a label out of, say, 100 dialects, or 10, or something different, and why do people need this model?
I think that now we have dictionaries and other sources where dialect features are compared with each other across dialects, and I believe we can find more if we optimize this process. I should also say that some features have died out, and we should see what still remains in the dialects.
OK, but can this task be solved just by building some atlas, some reference of all the dialects? Why do we need machine learning here?
All these resources were collected 50-70 years ago and should be modernized, and I think this is one possible way to do it.
More questions? Yes, a question regarding the practical application of this interesting instrument. First, thank you for the talk. You have already said that it hasn't yet been used on expeditions, but it could be interesting: first, even if the quality is not 100%, maybe it can be used to guide the scientists, who could then correct the mistakes; and second, it could be interesting to look at whether the model and the scientists make mistakes in the same places, so that the model could help them specifically where they are unsure.
Yes, of course; as a first step the output should be corrected by experts, and then I think the model can reach better results if we train on the entire sample of dialects,
and this is a parallel process, as I see it. As for the second part, could you repeat it?
Yes: it could be interesting to look at whether the model and the scientists make mistakes in the same places or not.
I analyzed only the model's errors, but it would indeed be interesting to look at the annotators' errors too and compare them, so I think it would be good to do that in the future.
Thank you for the talk. I'm curious: what are the most notable differences between the Pskov dialects and standard Russian?
They have unique phonetics, for example what I mentioned about the first pre-stressed syllable; they have unique syntactic structures; they have verb forms that the standard Russian language doesn't have; and they also show non-standard agreement between the auxiliary verb and the main verb. So there are a lot of differences, in constructions and in morphology as well.
Got it, thank you. And do I understand correctly that this dialect is influenced by the Belarusian language?
We have some conjectures about that. Of course the main influence is standard Russian, because of television and of the people born more recently, but in some cases there is Belarusian influence.
Got it, thank you. Thank you.

Hello, everyone, thank you for coming. Today we are talking about compression of large language models based on transformer architectures, where the compression is provided by matrix and tensor decompositions. The field of natural language processing has made significant progress with the development of large language models, but transformer models share a common challenge of ever-expanding scale, which presents an obstacle to model deployment and training, especially for small research groups. So in our work we decided to reduce the size of large language models, for example of selected layers, using tensor and matrix decompositions. We work with BERT and BART, which are transformer-based models, and several works show that many parameters in these models are redundant, so we select several layers and compress them. Let's see how many parameters are contained in the different layers of the model. Every transformer architecture consists of MLP blocks, embedding blocks and attention blocks, and as you can see in this table for BERT and BART (and, although it is out of the scope of this presentation, the same holds for GPT-2 and other decoder-based models), the largest number of parameters is in the MLP block. So we decided to take the MLP block: it consists of two fully connected layers, and we apply several decompositions to it. We chose one matrix decomposition, the singular value decomposition (SVD); a matrix decomposition with Fisher information; one tensor-based decomposition, the tensor-train matrix (TTM) decomposition; and the TTM decomposition with Fisher information. We built these variants and replaced the fully connected layers inside the original architecture with the corresponding factorized representations. As baselines we take the full model downcast to FP16 in PyTorch, that is, with the weights stored not in 32-bit but in 16-bit floating point, and block pruning, where we select several MLP layers or several heads in the attention block, prune them away, and obtain a model of smaller size.
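For the FP16 baseline just mentioned, a minimal sketch (the checkpoint name is only an example); this is naive downcasting, not quantization.

```python
# Downcast a full model to 16-bit floats and compare parameter memory.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
n_params = sum(p.numel() for p in model.parameters())
print("params:", n_params, "fp32 MB:", n_params * 4 / 2**20)

model_fp16 = model.half()                    # same weights, 16-bit storage
print("fp16 MB:", n_params * 2 / 2**20)      # half the memory, same parameter count
```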
So how can we apply singular value decomposition to a fully connected layer? The layer has a weight matrix W of size d_in x d_out, and we can represent this matrix in singular value form, W = U Σ V^T, with two factor matrices U and V^T and a diagonal matrix Σ containing the singular values. To make a truncated, compressed version, we keep the r most significant singular values in Σ, the corresponding columns of U and the corresponding rows of V^T, and obtain the compressed version of the decomposition. So, having the initial linear weight W, we obtain two new weights: W2 = U_r Σ_r^(1/2), the truncated factor U multiplied by the square root of Σ, and W1 = Σ_r^(1/2) V_r^T, the square root of Σ multiplied by the truncated factor V^T, so that W is approximated by W2 W1. The compression ratio has, in the numerator, the product of the input and output dimensions of the initial matrix and, in the denominator, the sum of the sizes of these two new weight matrices: d_in * d_out / (r * (d_in + d_out)).
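A minimal sketch of this truncated-SVD factorization of a single linear layer in PyTorch; the layer sizes and rank are arbitrary examples, and the original work applies this to the MLP blocks of BERT/BART rather than a standalone layer.

```python
# Replace one nn.Linear by two smaller linears so that W ~= W2 @ W1,
# obtained from a rank-r truncated SVD of its weight matrix.
import torch
import torch.nn as nn

def svd_compress(linear: nn.Linear, rank: int) -> nn.Sequential:
    W = linear.weight.data                       # PyTorch stores (d_out, d_in)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    sqrt_S = torch.diag(S[:rank].sqrt())
    W2 = U[:, :rank] @ sqrt_S                    # (d_out, r)
    W1 = sqrt_S @ Vh[:rank, :]                   # (r, d_in)
    first = nn.Linear(linear.in_features, rank, bias=False)
    second = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    first.weight.data = W1
    second.weight.data = W2
    if linear.bias is not None:
        second.bias.data = linear.bias.data.clone()
    return nn.Sequential(first, second)

layer = nn.Linear(768, 3072)                     # e.g. an MLP up-projection
compressed = svd_compress(layer, rank=64)
x = torch.randn(2, 768)
print((layer(x) - compressed(x)).abs().max())    # approximation error of the factorized layer
```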
To understand how we can deal with tensor compression, we need some tensor notation. A tensor is a multidimensional array, a big cube with N axes, where N is the number of dimensions, and such an object can suffer from having a huge number of parameters, so we can apply tensor compression techniques to represent it in a more compact way. In other words, we represent the tensor as a set of factor objects that usually have fewer dimensions than the initial tensor, and because these objects have fewer dimensions we obtain compression compared to the initial tensor. Some notation: a tensor with N dimensions is denoted by a calligraphic letter; a tensor entry denotes one element inside the tensor; there is the usual notation for the core tensors of a decomposition (when we decompose a tensor into a set of low-dimensional objects, those are typically matrices and vectors); and there is the notation for the mode-n matricization, or unfolding, of a tensor. What is the unfolding operation? It creates a matrix from the tensor: we take a tensor with N dimensions (on this slide, a tensor with three dimensions), select one dimension, and unroll the tensor along the selected dimension, like an accordion, obtaining a matrix. This matrix depends on the initial tensor and on the axis along which we unroll it, so a three-dimensional tensor has three different unfoldings, one per dimension. When we want to represent a tensor in a more compressed way, the first step is to choose the format in which we will represent it, and the next step is to fill this set of containers with values in the proper way.

We decided to represent our matrices in the tensor-train matrix format, an extension of the tensor-train format. What is the tensor-train (TT) format? We represent the initial tensor by a set of core tensors; each core has at most three dimensions (the outer cores have two, the inner ones three), and the number of cores equals the number of dimensions of the tensor. Here we have a tensor with three dimensions, so we have three core tensors. To compute one entry of the tensor, say the entry at indexes (2, 3, 1), we select the second slice of the first core, the third slice of the second core and the first slice of the last core, multiply them, and after the multiplication we obtain an object with a single element, a point; in general, T(i1, ..., id) = G1[i1] G2[i2] ... Gd[id], where Gk[ik] is the ik-th slice of the k-th core. The outer dimensions of the cores are the ranks, the dimensions along which neighbouring cores are multiplied, so the right rank of one core equals the left rank of the next, and the first rank of the outermost core equals 1. In general the ranks are different, so we obtain a tensor-train decomposition with different ranks, but for simplicity of the formulas and calculations all ranks are usually set equal to each other and equal to R.

We have selected the format in which we will represent our tensor; now we need to understand which values should go into the core tensors. There are several algorithms for tensor-train compression, and one of the most famous is TT-SVD, proposed by Ivan Oseledets. How it works: we have an initial tensor with d axes and we perform d steps. On every step we unfold the tensor over the selected axis; then we apply SVD to the obtained matrix, getting U, Σ and V^T truncated to the corresponding rank, which will be a rank of the tensor-train decomposition; we obtain the corresponding core tensor by reshaping the factor matrix U; the remaining factors are multiplied together, and this product goes to the next step, where we unfold it instead of the initial tensor. We do this as many times as there are axes in the tensor and obtain as many cores as there are axes.

Unfortunately, we cannot apply this directly to a neural network layer; for neural networks we should use the extension of the tensor-train format, the tensor-train matrix (TTM) format. TTM is very similar to TT: where in TT each core is indexed by a single index, in TTM each core is indexed by a tuple of two indexes, an input index and an output index, so the cores have at most four dimensions instead of three, and the formula for computing an entry now carries two indexes per core. How do we decide on the shapes? We have a matrix of size d_in x d_out; we factorize the dimensions into factors, and these factors become the mode sizes of the cores; then we reshape the matrix into a 2n-dimensional tensor and permute its axes so that the paired input and output indexes are adjacent; and once we have obtained this tensor C by reshaping, we apply the TT-SVD algorithm to it. The compression ratio is again the product of all factors, which in real life is just the product of the input and output dimensions, divided by the sum over the cores of the products of each core's dimensions.
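A minimal sketch of the TT-SVD loop just described, in NumPy. It compresses a plain dense tensor (not a layer), uses one fixed rank for all cores, and is meant only to illustrate the unfold, SVD, reshape, carry-the-rest recursion.

```python
import numpy as np

def tt_svd(tensor: np.ndarray, rank: int):
    """Decompose a d-dimensional array into TT cores of shape (r_prev, n_k, r_next)."""
    dims = tensor.shape
    d = len(dims)
    cores = []
    r_prev = 1
    rest = tensor.reshape(r_prev * dims[0], -1)          # first unfolding
    for k in range(d - 1):
        U, S, Vh = np.linalg.svd(rest, full_matrices=False)
        r = min(rank, S.size)                            # truncate to the TT rank
        cores.append(U[:, :r].reshape(r_prev, dims[k], r))
        rest = np.diag(S[:r]) @ Vh[:r, :]                # carry the remainder forward
        r_prev = r
        if k + 1 < d - 1:
            rest = rest.reshape(r_prev * dims[k + 1], -1)
    cores.append(rest.reshape(r_prev, dims[-1], 1))      # last core
    return cores

def tt_entry(cores, idx):
    """Reconstruct one entry by multiplying one slice per core."""
    out = np.ones((1, 1))
    for core, i in zip(cores, idx):
        out = out @ core[:, i, :]
    return out[0, 0]

T = np.random.rand(4, 5, 6)
cores = tt_svd(T, rank=6)                                # rank large enough: exact reconstruction
print(T[1, 2, 3], tt_entry(cores, (1, 2, 3)))
```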
Returning to our BERT and BART models: we decided to study their behaviour at three compression rates, and for every compression rate we selected the appropriate rank for the truncated SVD decomposition and for the tensor-train matrix decomposition. For example, if we want to remove about 69 million parameters from BERT, we should apply SVD with one particular rank and TTM with another. SVD is quite simple: we get two sequential linear layers instead of one. For TTM we decided to represent the matrix as a set of four cores with the shapes shown (in the TTM algorithm we can also vary the number of cores, so the matrix could be represented as three, four or five cores if desired); the ranks are taken from this table, and we choose the factors j and k so that the products j*k are approximately equal across all the cores in the sequence.

OK. We were working with BERT- and BART-like transformer language models, and when we applied the decompositions our quality of course dropped significantly. So we decided to align the objective used to obtain the decomposition of a given matrix with the task objective our model is actually trained for. To do this, we inject Fisher information into the singular value decomposition algorithm. Fisher information is a way of measuring the amount of information that an observable random variable x carries about an unknown parameter θ of the distribution that models x. In other words, we have a dataset with a number of objects; on a given object the model makes a prediction, we can compute the loss of this prediction, and this loss implicitly carries information about the parameters inside the model. Fisher information is defined here through the partial derivative of the log-probability the model assigns to an object in the dataset with respect to W, where W is the weight of our layer, and we can approximate it by computing the loss on every object, taking the derivative, and taking the expectation over the dataset. So we obtain I_W, a matrix of Fisher information with the same shape as the weight matrix W of the fully connected layer. We multiply W by this Fisher information matrix, run the singular value decomposition on the product, and then multiply the U factor by the inverse of the Fisher information matrix; in this way we inject information about the model's task objective into the SVD algorithm.

What should we do with the tensor-train matrix algorithm? It's quite simple: the TTM algorithm is based on the TT algorithm, and the TT decomposition algorithm is in effect a sequence of SVDs. So if we have a Fisher matrix I_W, we apply to it the same operations we apply to the matrix W to obtain the tensor; then we unfold the tensor built from W and the tensor built from I_W, obtain two matrices, and at every SVD step inside the TT-SVD algorithm we do the same weighting as above. In this way we inject Fisher information into the tensor-train matrix approach as well.
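A minimal sketch of a Fisher-weighted SVD of one weight matrix. This follows a common diagonal row-weighting formulation rather than necessarily the exact weighting used in the talk; the gradient-accumulation loop, model and data are placeholders.

```python
# Fisher information is approximated by the average squared gradient of the loss
# w.r.t. the weight over a sample of data; a diagonal row weighting is folded into
# the SVD so that "important" rows are approximated more accurately.
import torch

def fisher_weighted_svd(W, fisher, rank):
    # fisher: same shape as W, an estimate of E[(dL/dW)^2]; reduce to one weight per row.
    row_w = fisher.sum(dim=1).clamp_min(1e-8).sqrt()                     # (d_out,)
    U, S, Vh = torch.linalg.svd(torch.diag(row_w) @ W, full_matrices=False)
    W2 = torch.diag(1.0 / row_w) @ U[:, :rank] @ torch.diag(S[:rank])    # undo the weighting
    W1 = Vh[:rank, :]
    return W2, W1                                                        # W ~= W2 @ W1

# Accumulating the Fisher estimate, schematically:
# fisher = torch.zeros_like(layer.weight)
# for batch in calibration_data:
#     loss = model(**batch).loss
#     grad, = torch.autograd.grad(loss, layer.weight)
#     fisher += grad ** 2 / len(calibration_data)

W = torch.randn(3072, 768)
fisher = torch.rand_like(W)               # placeholder Fisher estimate
W2, W1 = fisher_weighted_svd(W, fisher, rank=64)
print((W - W2 @ W1).norm() / W.norm())    # relative error of the weighted rank-64 approximation
```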
So we have defined the four decomposition techniques, and then we evaluated them. The first evaluation point was BERT: we evaluate our method on natural language understanding using the GLUE benchmark, which consists of linguistic acceptability, sentiment analysis, paraphrasing, semantic similarity and natural language inference tasks. We first fine-tune the model on each task from the benchmark for one epoch, then compress the fully connected layers using one of the four techniques, then fine-tune the model again for one epoch, and obtain the scores. The scores show that at the high compression rate (this part of the table) the best performance comes from the TTM and Fisher-weighted approaches, but at the medium and low compression rates the best performance comes from SVD and Fisher-weighted SVD. Also, when we inject Fisher information into either the SVD or the TTM algorithm, we usually see an increase in performance, in accuracy or the other scores.

The next evaluation point was a sequence-to-sequence model, a variant of BART, and the first task for BART was paraphrasing. We ran paraphrasing on the ParaDetox dataset: in this dataset we have pairs of sentences in which the first sentence contains rude or toxic wording, and the sentence we want to obtain should sound more polite than the initial phrase. Four metrics are used on this dataset: style transfer accuracy (how well we hit the polite style), similarity (how close the meanings of the two sentences are), fluency of the generated text, and the joint score, which combines these three. On this dataset the best score at essentially every compression rate comes from Fisher-weighted SVD, which gives a large boost compared to the other approaches.

The last evaluation point is also the sequence-to-sequence model BART, which we train to do summarization. We use the XSum dataset, which consists of several hundred thousand BBC articles, each paired with a single sentence summarizing the article. The results on this dataset mirror the GLUE results: at the high compression rate the best score is provided by Fisher-weighted TTM, so adding Fisher information to the decomposition algorithm again gives a boost, and at the medium and low compression rates the best score comes from Fisher-weighted SVD. There is a plot for the different tasks, for the GLUE benchmark and for BART, and on it you can see the main tendency of the whole work: red is SVD, blue is Fisher-weighted SVD, green is TTM, yellow is Fisher-weighted TTM. Usually the best score belongs to Fisher-weighted SVD, but on several tasks at the high compression rate, the right part of each plot, the best score goes to TTM or Fisher-weighted TTM. These are the rest of the GLUE tasks, and this is the GLUE average.

As a result: we took four different techniques for compressing the fully connected layers of BERT and BART models; different compression levels and different techniques can give better or worse results; usually, for BERT on GLUE and BART on XSum, TTM and Fisher-weighted TTM give the best scores at the high compression rate, while for the other compression variants Fisher-weighted SVD gives the best score; and aligning the task objective and the decomposition objective by injecting Fisher information into the decomposition algorithm can significantly improve the performance of the compressed model. Thank you for your attention.

Thank you very much. Thank you; I think that distillation is quite a good technique...
Distillation is good when you train your distilled model towards the desired task. For example, you could say: OK, I want a distilled version of BERT, and I train it for the same task as the initial BERT, say natural language understanding; on natural language understanding, distillation can then give a good score. But when the model was distilled towards one task and you apply it to another type of task, distillation may not give a very good score, whereas our method gives approximately the same result across the whole set of tasks. Yes, and for TTM too.
I don't think so. You mean compressing different layers with different ranks and different shapes? It can make sense; we did some research, which is out of the scope of this presentation, showing that different layers inside BERT, BART or GPT can be more or less compressible under SVD and TTM, and this can be seen from their singular value spectra. Of course it makes sense and could give some boost, but it is quite difficult, so we decided to set one rank for the whole model.
I have a question. We have large language models, and a popular method of compressing them is quantization, and there is evidence, though no systematic studies, that larger models are easier to compress. Have you conducted any ablations across scales? I saw you worked with BART; have you run experiments with BART-large? Maybe it is better for compression; I mean, a model with a larger number of parameters might lose less.
No, I haven't, and I haven't seen a paper with such an experiment. I have only seen a paper from OpenAI, no, sorry, from Meta, about the relationship between the number of parameters in a model and the number of tokens in the dataset it is trained on, with different metrics, showing that this function is logarithmic; but unfortunately I haven't read any paper that directly shows that larger language models are more compressible than smaller ones.
A follow-up question about all these decompositions: there is some much older related work on SVD decomposition showing that after decomposing the model, if you run about one epoch of pre-training, it performs better. Have you done something similar in your ablations?
If we train for... yes, yes.
So it does help?
Yes, absolutely. It depends a lot on the size of the task: people usually evaluate large models on fairly small datasets, and with a small dataset one epoch is really enough to bring the model to a good point.
Another question. First of all, thank you for your talk. I'm also curious about the comparison with quantization, because in the field of large language models quantization is very popular nowadays, and I know that, for example, models trained in 16-bit precision can be compressed to 4 bits almost without losing much quality, so basically we lose something like 1-2% of perplexity for about 25% of the original size. So I'm curious how your approach compares to these advancements.
You are asking about a comparison of quantization with our approach, or whether we can apply our approach together with quantization?
A comparison: which one is better?
I suppose we have this information on the slides. We didn't compare with 4-bit quantization, but we did compare with 16-bit floating point, and we can see that, for example, here our approach gives a better score than the FP16 evaluation, while here, in paraphrasing, things change and FP16 gives a better score than any of our approaches.
I got it. I think one issue here is that this is a naive conversion: all the weights were in float32 and were naively converted to float16, but these new approaches are kind of smart, for example they take outliers and quantize some percentage of the outliers differently, and so on, and they provide really good compression rates without any significant harm to quality, so I think it may be interesting for you to compare with them.
Yes, there are more sophisticated ways of quantization, and I think it would be really interesting to compare with them.
Thank you; this is a really interesting alternative approach.

Hello, dear colleagues. Today I'd like to present the results of our research, which was devoted to the development of an idiom recommendation system. The title of our paper is on the slide; the authors are me, Dmitry Hrnoklev, and my colleague Pavel Prostovsky. Let us begin with the main idea behind our study. As most of us probably know, idioms are an essential part of many languages, and native speakers tend to use them in certain circumstances because idioms enhance the fluency and expressiveness of speech; non-native speakers, however, may struggle to find an appropriate idiom for a given context. I'd also like to mention that automated writing assistants are becoming more and more popular; some of them, already powered by AI, like DeepL and Grammarly, are able not only to correct grammar or spelling mistakes but also to improve style and suggest continuations for the entered text. So here we come to the motivation of our research: we aim to propose a system that can recommend an idiom for a given context. The upper image shows a writing assistant, and below it is a simplified scheme of what the desired automated system should look like and how it should function.

For the purposes of our study we obviously needed data, by which I mean a set of idiomatic expressions and contexts containing these idioms. Based on a literature review we chose the EPIE dataset as the basic corpus of our study, for several reasons: first, it is the second-largest freely available dataset, and second, it contains definitions for the idioms it covers. The table at the top of the slide contains several examples from this dataset. However, the dataset required some preprocessing, which is described in detail in our paper. After that, we split the resulting dataset into train and test subsets in an 80-to-20 ratio with stratification, to make sure that all unique idioms are represented in both sets. I'd also like to mention that the test set is fixed for all configurations I'm going to describe further, so that different configurations can be compared fairly. During the experimental part of our research we fine-tuned one of the considered models, and in order to make it more robust we needed additional data. To get more contexts for our idioms we used the Guardian API, which provides a handy interface for retrieving articles published in the Guardian newspaper. This allowed us to obtain almost 25,000 additional sentences and therefore increase our initial corpus by more than two and a half times. The graph on the right of the slide shows two distributions of the number of different contexts per idiom, before and after the parsing process.
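A rough sketch of how such context mining might look with the Guardian content API; the endpoint, parameter names and the example idiom are written from memory as assumptions and are not the authors' exact collection code, which also involves proper sentence splitting and filtering.

```python
# Rough sketch: pull article bodies mentioning an idiom from the Guardian content API
# and keep sentences that contain the idiom. A real run needs a free API key from the
# Guardian Open Platform; parameter names should be checked against the API docs.
import requests

API_KEY = "YOUR_KEY"
idiom = "food for thought"

resp = requests.get(
    "https://content.guardianapis.com/search",
    params={"q": f'"{idiom}"', "show-fields": "bodyText", "page-size": 50, "api-key": API_KEY},
)
resp.raise_for_status()

sentences = []
for item in resp.json()["response"]["results"]:
    body = item.get("fields", {}).get("bodyText", "")
    for sent in body.split(". "):                 # naive sentence split, for illustration only
        if idiom in sent.lower():
            sentences.append(sent.strip())

print(len(sentences), sentences[:2])
```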
On this slide you can see a scheme representing our proposed approach, which is based mainly on the semantic similarity search paradigm. We obtain embeddings for the input sentence and for all of the collection sentences; input sentences are also called queries in semantic similarity terms, and the collection sentences are called documents. We obtain embeddings with some model for both the input sentences and the collection instances, then rank the collection items by cosine similarity to the query and take the corresponding idioms as recommendations. The main metric in our research was mean reciprocal rank; in case someone is unfamiliar with it, here is the formula: MRR = (1/|Q|) * sum over queries of 1/rank_i, where rank_i is the rank of the first correct answer for query i. Before we move further, one more point related to the test set: obviously, when our approach is used at inference time, we receive sentences without idioms, so we remove the original idioms from the test sentences, as illustrated in this scheme.

Now let's return to the main scheme. We can conclude from it that our algorithm has only two key parameters that can be varied: the collection over which the semantic search is performed and the model used to obtain the embeddings. Using different collections and models, we obtain various configurations of our main approach. Let's discuss the different collections first. In our research we consider four: the first, called "idioms", consists only of the unique idioms themselves from the initial dataset; then "idioms + definitions", which consists of the idioms concatenated with their corresponding definitions; then "sentences", which consists of the sentences from the train set; and finally "sentences++", which is the sentences collection enlarged with examples from the Guardian API. The tables on the slide show some examples from these collections.

Now let's switch to the models. As a baseline we employ a word2vec model from the Gensim library, pre-trained on the Google News dataset, just to establish a reference point against which we can compare the other configurations; in the case of word2vec, sentence embeddings are obtained by averaging the embeddings of the words in the sentence. But the main model in our research was Sentence-BERT, since it is considered a state-of-the-art model for semantic similarity search; the scheme on the right side of the slide shows the Sentence-BERT architecture at inference. We use several Sentence-BERT models straight out of the box from the SentenceTransformers framework, including models based on MiniLM, on DistilRoBERTa and on MPNet; besides these, we also fine-tuned the DistilRoBERTa-based Sentence-BERT model, denoted DistilRoBERTa+, to achieve better results.

Now let's talk about the fine-tuning process of the DistilRoBERTa model. As I said earlier, we parsed additional data, so we join our initial train split from the EPIE dataset with the contexts collected from the Guardian; then we split this new set into a new train set and a validation set in a 90-to-10 ratio, again with stratification; then we create so-called positive and negative pairs.
To create a positive pair for a sentence from the new train set, we merge it with another random sentence from the new train set containing the same idiom, and then remove the idiom from the first sentence. The process for creating negative pairs is identical, except that we match two sentences with different idioms. This process is illustrated in the table at the bottom of the slide. On the right side of the slide you can see the hyperparameters we used for fine-tuning and a graph showing the dynamics of the training accuracy across five epochs; as we can observe, at the fifth epoch the accuracy reaches a plateau, so we stopped training.

Now let's take a look at the final results. This table contains MRR scores for all of our configurations. As we can see, in the green cell, the fine-tuned DistilRoBERTa model achieved the highest result overall, on the "idioms" collection. As I said earlier, we consider the word2vec model with the sentences collection as the baseline configuration, so this is an 80% gain compared to the baseline and a 46% gain over MPNet with the sentences configuration, which had the highest MRR before fine-tuning. We can also conclude that the MRR is higher than 0.5, which means that on average the correct idiom is ranked roughly second.

On this slide you can see examples of simple and difficult idioms for our best configuration. Simple idioms are characterized by a high reciprocal rank averaged over all corresponding sentences from the test set, while the average reciprocal rank for difficult idioms is close to 0. As a possible explanation for this variation in performance, some idioms might be used in wider or more complicated contexts, or it could be related to our evaluation protocol, because we assume that there is only one suitable idiom for each sentence, which might be overly strict. However, these hypotheses are under-researched, and we plan to examine this phenomenon in the future.

The main results of our study: first of all, we automatically expanded the EPIE dataset by more than two and a half times and thereby created essentially a new dataset for the task of idiom recommendation in English. Secondly, we present a model fine-tuned specifically for the task of idiom recommendation. Thirdly, we present an approach based on semantic similarity search and on Sentence-BERT, and we have examined the suitability of several neural models, including word2vec and Sentence-BERT, for the task of idiom recommendation. As for future plans: we intend to further expand the dataset, because we hypothesize that showing the model even more varied contexts at the training stage might result in even higher performance; we would like to add filters to prevent inappropriate recommendations; we would like to analyze the impact of context length on the performance of our approach and experiment with contexts longer than one sentence; and finally, we plan to use a tool to filter out sentences in which idioms are used in the literal sense, because in the initial EPIE dataset some idiomatic expressions were used almost exclusively literally. On this slide you can see a QR code which leads to the code, the models and the extended dataset, which are all freely available; we invite all of you to have a look. That's it, thank you for your attention. I'm ready for any questions.
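A minimal sketch of the retrieval loop described in the talk, using the sentence-transformers package: embed the collection and the query, rank by cosine similarity, and score with MRR. The model name and the toy collection are placeholders, not the authors' exact configuration.

```python
# Recommend idioms by semantic similarity and score the ranking with MRR.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")          # placeholder SBERT model

# "idioms" collection: one entry per unique idiom (could also be definitions or sentences)
collection = ["food for thought", "break the ice", "spill the beans"]
col_emb = model.encode(collection, convert_to_tensor=True, normalize_embeddings=True)

def recommend(query, top_k=3):
    q_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)
    scores = util.cos_sim(q_emb, col_emb)[0]             # cosine similarity to every item
    ranked = scores.argsort(descending=True)[:top_k]
    return [collection[int(i)] for i in ranked]

# Mean reciprocal rank over a toy test set (queries have the idiom removed)
test = [("the lecture gave us plenty of", "food for thought"),
        ("a joke is a good way to at a party", "break the ice")]
rr = []
for sent, gold in test:
    ranking = recommend(sent, top_k=len(collection))
    rr.append(1.0 / (ranking.index(gold) + 1))
print("MRR:", sum(rr) / len(rr))
```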
Maybe we have time for one quick question, because we are slightly out of time.
Good, although I have many; I'll try to be quick. My question is about the test set, because there are multiple ways to split it: sometimes sentences with the same phrase end up in the train set while the phrase is not seen in the test set, and so on. Can you please elaborate a little on how the test set was created? And thank you for your talk.
It was just a random split; we had no duplicates, because we removed them, if I understood your question correctly. So it was just a random split: we have some contexts containing idioms in the train set, and in the test set some contexts without idioms, for which we want to find the correct idiom.
OK, thank you, and one more quick question. Could the higher scores after the fine-tuning step be because the structure is quite evident to the model? Some phrases, like "food for thought", can only be used with a particular preposition that you do not hide, and some phrases use a verb that, grammatically, cannot appear in that position. Have you thought about that? Maybe it is for future work, or maybe there were recommendations that were actually correct but not counted as such by the test set?
As for the first part of the question: yes, it is a matter for further research; we have thought about it but haven't had time to check all the hypotheses. Some of them are described in our paper, and it is hard to put them in a few words, so I can only invite you to read the paper. As for the second part: I mentioned that our evaluation protocol is quite strict, because we do not consider the fact that some contexts may allow more than one appropriate idiom, so there is not only one correct answer. This means our scores are a kind of lower-bound estimate for the approach; it might be even better.
OK, thank you. Let's thank the speaker again; the session is over. See you all tomorrow.