kinds of cooperation you can have with the university, or in general about the university and the educational landscape in Armenia, please stop by. Or if you don't find me there, you have my email here and my Telegram handle, right? So please add me and send me a request so we can chat whenever you want. Now I would like to hand the microphone to Dmitry, another organizer from the committee, who will talk about the AIST conference in general, and then we'll go on to our first keynote speaker. Thank you.

Thank you, Habet. Hello everyone, one more time. I'm here on behalf of the steering committee and of the people who started it all more than 10 years ago. Alexander should be here, but his flight was moved three times; that's why I'm replacing him. It is my duty to deeply thank all the local organizers for hosting AIST here at AUA, in Yerevan, Armenia. We are also thankful to all the program committee members. Let me also share some key facts about the conference, its past, and maybe its future. According to the Australian CORE conference ranking it is a national conference — or maybe regional, I don't remember exactly, you can check — but we believe it will become regional; we have applied for that. The previous edition, in 2021, had more than 100 participants. Traditionally we have five or six tracks. The main track, I believe, is NLP, judging by the number of submissions, but computer vision and data analysis are also large tracks, and we have two smaller tracks, since those communities are a bit smaller: social network analysis, and theoretical machine learning and optimization. As for the selection procedure, we adopted a double-blind review system several years ago and we keep using it. This year we have more than 100 PC members from 22 countries, and the chairs of the sessions and tracks are internationally recognized experts in their areas.
This year we have submissions from 15 countries, and the proceedings are usually published as revised selected papers in LNCS (Lecture Notes in Computer Science) and its satellite series CCIS, by Springer Nature. You can see a summary of where they are indexed, and some volumes from past years with the nice logo on them. Let's briefly discuss the acceptance rate of the conference. We received 106 technical submissions. What does "technical" mean? Some submissions get withdrawn or desk-rejected, which is why this figure is not the final one that will appear in the proceedings. We also have 13 posters this year, and 76 submissions remained after all desk rejects and withdrawals, which is quite a usual story even for large conferences. The program committee decided to select 24 papers for the main volume, which results in an acceptance rate of 32%, and 22 papers for the supplementary volume. This means those papers are also good enough, but they are more like research proposals and have some room for future improvement. Let's have a look at the per-track statistics that EasyChair, the submission system, prepared for us. You can see the names of the tracks on the left and the number of submissions per track; the effective number of accepted papers is shown in the next column, along with the relative acceptance rate and the number of track chairs and program committee members in each track. There are usually two co-chairs per track; for some larger tracks there might be three, but this year our brave natural language processing co-chairs managed to select the papers between the two of them — before, there were three people, I believe. All of them are well-known researchers in their areas.
As for the organizing committee, I believe this is by far not the full list of all the people involved, but you can see the names: Irina Nikishina, Maksim Panov, Habet Madoyan, Amalia Hambartsumyan, and Alexander Panchenko. Alexander decided to include here some of the most influential papers in terms of citations. The one by Mikhail Korobov probably has the highest number of citations; it is about a morphological analyzer and generator for the Russian and Ukrainian languages. The most cited papers also include the paper on WebVectors — and I believe its first author, Andrey Kutuzov, will join us online this time. Another influential paper is on BigARTM, a topic modeling tool developed by the team led by Konstantin Vorontsov, who is quite well known in the Russian machine learning community. Here you can also see some photos from the previous edition of AIST; it took place, I would say, nearby — in Georgia, in 2021. We would like to acknowledge our partners and supporters — the AI Research Institute, Skoltech, where I also work, and HSE University — in addition to the local host. So I think it's a good time to start. Let's start. I hope it will be a pleasant event for everyone here and online. Thank you.

Thank you. I'd like to invite our first keynote speaker, Dr. Narine Sarvazyan, who holds an endowed chair professorship at the American University of Armenia. Please welcome her.

Hello everyone, and welcome to AUA. We all know very well that Armenia is going through one of the darkest periods of its history, with tens of thousands of people losing their ancestral homes. It's very hard for us to be cheerful hosts, but we're trying. I think Desmond Tutu once said that hope is the ability to see light despite all the darkness. So I think it's only appropriate that today we're going to talk about light — specifically, about what there is in light beyond the colors we see. And with this, let me begin.
My talk will consist of four parts. I'm going to talk about the limitations of our color vision; then about the basic physics foundations of this technology, which we'll get more familiar with; then about its medical applications; and a little bit about what my team has done in this direction in the past. So, color has been used for medical diagnostics for thousands of years. It was something physicians looked at — the color of your eyes, of your skin, of your urine — and there are plenty of charming illustrations like this in medieval books, where disease was diagnosed by the color of different fluids or parts of the body. However, we need to realize that, as useful as it was, that information was limited to a very small spectral range, what we call the visible light range. If you look at the electromagnetic spectrum, visible light spans only from 400 to 700 nanometers. And we have only a few types of receptors in our eyes — the cones, actually only three of them — each sensitive to certain wavelengths. So we're going to briefly go over the main limitations of human color vision, and then I'll explain how this new technology allows us to overcome them. As I said, one of the major limitations is that we start with very few spectral bands, so to speak — the few receptors in our eyes, which we call cones. Many animals that are much more primitive than us, for example the mantis shrimp, have many more receptors in their eyes; but because we couple our few receptors with this enormous human brain, human vision is actually able to recognize up to six million different shades of color. So we're talking about the combination of the initial input — the number of spectral channels — and how you analyze it. But in our case that initial spectral input is very limited. The next major limitation is subjectivity.
When you go to the store and pick a type of wood stain or a hair dye or whatever you want, we use these very subjective descriptions, which is fine when you're a buyer; but go to a dermatologist and tell them that your skin went from "stolen kisses" to "hot pants" or something like that. It's a very subjective way of describing what we have. And when you go from one physician to another, or when you track the degree of redness through a course of treatment or over time, it's all extremely subjective. In addition to the inability to compare between different people and different time points, our brain's perception of an object's color depends very much on what surrounds the object. For example, if you look at Ararat in the morning or in the evening, it might look very different to you; but in fact, if you remove the background, the color of many objects is identical — it's just the surroundings that affect your perception of the color. So it's not objective. On top of that, we all have a different genetic composition of those color receptors. When I go to the store with my husband to pick a sweater, very often I say, "Oh, it's a nice green sweater," and he says, "No, it's brown." So we all know that our perception of color is not really objective. The third limitation is the restricted spectral range. We are sensitive only to the range from 400 to 700 nanometers, while insects and reptiles can actually see in the infrared and ultraviolet ranges. So when an insect approaches a flower — the one you just see as yellow — that little creature sees many more shades of color than we do. And lastly, our eyes actually need a lot of light in order to distinguish color. We all know the famous phrase that all cats look gray in the dark: for color to be perceived, you need quite a bit of light. Now, there are two different modalities in which light can come to us from an object.
In the case of reflectance, we shine light on something and look at whatever comes back. It has a lot of intensity, so it's easy to see. But many objects, specifically biological ones, have another property called fluorescence: light hits the subject and elicits a response from the molecules in it, which emit light at longer wavelengths. That fluorescent light can only be seen when all the other light in the room is darkened, and it's very hard to see with the human eye. So, if these are the limitations, the technology we're going to cover — hyperspectral imaging — is actually able to address all of them. Okay, so what is hyperspectral imaging? To be honest, I think from a linguistic point of view it would be more appropriate to call it spectral imaging, but somebody named it hyperspectral and the name stuck. It basically refers to acquiring and analyzing light in the visible, near-infrared, and ultraviolet areas of the spectrum — because spectra can be of different kinds; you have Raman spectra and others — but hyperspectral imaging deals with light in the visible range plus a little bit of the ultraviolet and infrared. The way it works is that, for an object, you acquire the information in a three-dimensional way: you have your spatial coordinates X and Y, and you add a third dimension, which is your spectral dimension, or lambda.
Then, from each pixel of that three-dimensional dataset, you extract the spectral profile — basically the intensity of the signal along the lambda axis — and you let the machine sort these pixels based on whatever task you give it: "find me two groups of pixels that are the farthest from one another," or "find me all possible combinations." Then you pseudo-color the pixels that are closest to each specific spectral profile, and you get what is called a composite HSI image. When there are only a few spectral bands, spaced somewhat apart, it's called multispectral imaging; when the extraction yields a more or less continuous spectrum, it's called hyperspectral imaging. Okay, so how can we acquire this set of information? It depends on the type of scanning you do, and all of the options have certain advantages and disadvantages. When you want very high spectral resolution, you do point scanning: you go pixel by pixel and acquire the full spectral information at each point. You can also do line scanning, which is more appropriate when your object is moving beneath the sensor, or when the camera is mounted on a plane or a drone flying over a certain area — it acquires the spectral information line by line from the area you want to scan. Most of what we're going to talk about today is done through what is called a wavelength scan: you have a regular camera above the object and you change a set of filters in front of it, so every time you snap an image you do it at a specific wavelength, and you fill your cube from the bottom up. And in the past few years there has been huge progress in the field of photonics, where people came up with a smart way of splitting the image coming from the object into multiple areas on a large sensor, so you can get all the spectral information at once — a snapshot camera. It gives you the same type of information.
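The cube-and-unmix pipeline described above — extract a per-pixel spectrum along the lambda axis, compare it to reference profiles, and pseudo-color the result — can be sketched in a few lines of Python. This is a toy illustration on a synthetic cube with two invented reference spectra; real pipelines use calibrated data and far more sophisticated classifiers:

```python
import numpy as np

# Synthetic hyperspectral cube: X x Y spatial pixels, L spectral bands.
X, Y, L = 8, 8, 30
wavelengths = np.linspace(400, 700, L)          # nm, visible range

# Two invented reference spectra ("endmembers"): one peaked in the blue,
# one in the red.
ref_a = np.exp(-((wavelengths - 450) / 40) ** 2)
ref_b = np.exp(-((wavelengths - 650) / 40) ** 2)

# Build a cube whose left half resembles ref_a and right half ref_b,
# plus a little sensor noise.
cube = np.empty((X, Y, L))
cube[:, : Y // 2, :] = ref_a
cube[:, Y // 2 :, :] = ref_b
cube += 0.01 * np.random.default_rng(0).normal(size=cube.shape)

# For each pixel, extract its spectral profile along the lambda axis and
# assign it to whichever reference spectrum it is closest to.
pixels = cube.reshape(-1, L)
dist_a = np.linalg.norm(pixels - ref_a, axis=1)
dist_b = np.linalg.norm(pixels - ref_b, axis=1)
labels = (dist_b < dist_a).astype(int).reshape(X, Y)   # 0 = A, 1 = B

# "Composite HSI image": pseudo-color each class (blue vs. red).
composite = np.where(labels[..., None], [255, 0, 0], [0, 0, 255])
```

Swapping the nearest-reference rule for k-means or a neural classifier changes nothing about the data layout: the cube stays (X, Y, lambda), and classification always happens on the flattened per-pixel spectra.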
Here I've just highlighted the advantages of these approaches. Point scanning gives you very high spectral resolution but very slow acquisition. Line scanning is medium speed, and again is used for moving targets. The wavelength scan gives high spatial resolution, but because you need to sequentially change the filter in front of the camera, it's relatively slow. The snapshot approach can be very fast and can be used for video HSI, but because you're dividing your sensor chip into multiple squares, the spatial resolution is not that high. Spectral cameras span a wide range of prices, roughly from 2 to 200 thousand dollars. Okay, so what are the advantages? Well — especially when applied to medicine — this is a non-invasive approach: you don't need to introduce any contrast dye, you don't need to touch the subject, you just take an image of it. There's no radiation, and you reveal small differences in color which the eye might not be able to see. The resolution is basically half of the wavelength, so we're talking about a fraction of a micron; compared to X-ray, MRI, or anything else, that's a very high spatial resolution. And, most importantly, you can objectively quantify a color change, or the difference between the colors of different objects. What are the limitations? The main limitation is that, just like with our own eyesight, we can only get information from the surface. It's not like X-ray, where you can go through and see your bones: the light will penetrate at most half a millimeter or a millimeter into the tissue, no more. For it to be effective, you also need to know exactly which specific spectral ranges matter for your application. You need to spend significant time on post-processing, and whatever algorithms you use will affect what you get as the final image.
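The "resolution is half of the wavelength" remark is the classical diffraction (Abbe) limit. A quick back-of-envelope check, assuming an idealized numerical aperture of 1.0 (real optics have smaller NA and therefore coarser resolution):

```python
def abbe_limit_nm(wavelength_nm: float, numerical_aperture: float) -> float:
    """Abbe diffraction limit d = lambda / (2 * NA), in nanometers."""
    return wavelength_nm / (2.0 * numerical_aperture)

# For green light at an idealized NA of 1.0, the resolution is half the
# wavelength -- a fraction of a micron, as stated in the talk.
d = abbe_limit_nm(550, 1.0)   # 275 nm
```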
So there is a certain subjectivity that depends on the algorithms, or on the user who processes the data. Okay — so, put very simply, we are combining the advantages of the insect eye, or some other lower creature's eye, which has multiple channels but very little brain, with our eye, which has very few spectral channels but a very large brain: we use a machine to get these multiple channels, and then we use computing power to analyze the signals coming from those spectral channels. Okay, let's move to the medical applications of hyperspectral imaging, because that is where the future lies for many medical fields — though this technology doesn't come from medicine. It actually comes from military applications, from astronomy, from materials science, and it is now very widely used in many areas, not necessarily medical. In agriculture, many farmers in Europe or the United States use hyperspectral cameras flying over their fields, and they can then see where there is a need for watering, or some kind of disease. We actually have a company here in Armenia which analyzes such images obtained in the US: the images get transferred here, the team here analyzes them, and the results are sent back to US farmers to help them better fertilize their fields. Hyperspectral imaging is very widely used in recycling, because plastics that look very similar to the eye — kind of transparent white — fluoresce very differently under ultraviolet light, so it's very easy to use it to sort plastics. You can inspect boards; you can detect counterfeits in, for example, currency. It's used a lot in art forgery detection, because the spectrum of a dye used in the 13th century, however similar it might look to your human eye, will be different from that of a contemporary paint.
Using hyperspectral analysis of these differences really helps to identify forged art. It's widely used for food inspection: when you see tomatoes or apples going along a conveyor belt and you want to pick out the ones that are not fully ripe or have some damage, you have hyperspectral cameras above the conveyor belt which allow things to be sorted. As for mounting, hyperspectral cameras can be put on a drone or any other kind of aerial vehicle — again, this is the example of how it's used in agriculture. You can mount one on a microscope and look at slides or live cells and unmix the fluorescent labels there; or you can simply fit a regular camera objective and look at macroscopic surfaces — your arm, your mouth, whatever you want. So again, this technology is already widely used in many fields, but in medicine it's only starting. You can see here the increase in the number of publications on PubMed, which may look like a big number, but compared to the overall number of articles it's actually small. So this is an emerging field, and one of the reasons I'm happy that you'll learn more about it is that I think many of the techniques you use for other applications are directly applicable here, and the technology is only now being brought to more and more medical applications where your expertise can be very useful. The first handheld hyperspectral imaging devices for clinical use are just starting to appear. I didn't want to bring more gruesome pictures of a real necrotic or diabetic foot, but obviously, if you have such a camera, you can look at the skin condition and see deterioration or improvement in the perfusion index and things like that — something which can easily be seen on the surface of the skin.
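A quantity like the perfusion or oxygenation index just mentioned typically comes from linear spectral unmixing of hemoglobin signatures. Here is a minimal sketch: the two "hemoglobin" spectra below are invented shapes, not tabulated absorption data, but the least-squares unmixing step is the standard one:

```python
import numpy as np

# Toy absorption spectra for oxygenated vs. deoxygenated hemoglobin
# (shapes invented for illustration; real work uses tabulated spectra).
L = 20
band = np.linspace(500, 600, L)
oxy   = np.exp(-((band - 540) / 15) ** 2) + np.exp(-((band - 577) / 15) ** 2)
deoxy = np.exp(-((band - 555) / 20) ** 2)

# A measured pixel spectrum that is 70% oxygenated, 30% deoxygenated.
measured = 0.7 * oxy + 0.3 * deoxy

# Linear unmixing: solve measured ~= A @ [c_oxy, c_deoxy] by least squares.
A = np.column_stack([oxy, deoxy])
coeffs, *_ = np.linalg.lstsq(A, measured, rcond=None)

# Estimated oxygen saturation: oxygenated fraction of total hemoglobin.
saturation = coeffs[0] / coeffs.sum()
```

Running the same solve at every pixel of the cube turns a hyperspectral image into a saturation map — the kind of overlay an intraoperative system would display.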
It can also be used for intraoperative mapping: when oxygen binds to the hemoglobin molecule, the spectrum shifts a little. When it shifts significantly, you know there's a difference between your venous and arterial blood, right? One is bluer and the other is redder. But much smaller shifts can also be identified, and this is an example where a hyperspectral camera was used to see that the surgeon was about to cut, or put a tie, here — when in fact this was not quite the right place, and the area that needed to be dissected was actually below. Visually you cannot really distinguish it, but hyperspectral imaging can be very helpful. Another area where it's going to grow is organ transplantation, where fewer than 1% of the people who need organs actually receive them. When an organ is excised from a person who cannot survive and is transported to its final destination, it's extremely critical to know the condition of that organ; otherwise you may take a patient with a poorly functioning liver and transplant a liver which seems to be okay, and the person will be dead in a couple of days. During transport, the color of the organ, despite all precautions, starts changing because of the oxygen level and so on, and there is a very straightforward way to track this with hyperspectral imaging, through the change in color. You can see the changes not just at the level of the whole organ but in specific areas, some of which could perhaps be dissected to avoid future necrosis. There are also a few research papers on intraoperative HSI during brain tumor surgery; this is an example from one of the first papers.
Again, these are preclinical and first clinical studies — I'm still waiting for larger clinical trials — where they try to dissect a neuroblastoma. This combines input from the hyperspectral camera with neural networks; most of the paper is actually devoted to how they use the networks to extract the maximum amount of information, and, as the paper concludes, the accuracy was 80%, outperforming state-of-the-art approaches. 80% doesn't sound great to me, but I guess in this field that number is actually very good. There is now a term called "optical biopsy": a catheter or some kind of endoscope goes close to the tissue and acquires spectra, which get analyzed, and you can correlate them with similar changes occurring in cancer patients with a similar pathology. One of the most fascinating directions, for me, is this: a few years ago it was discovered that you can attach a hyperspectral camera to a fundus camera — the one used to look into your retina when you go to check your eyesight; the eye is an extension of your brain. The proteins that get deposited in the brain — beta-amyloid and tau — lead to decreasing mental capacity in Alzheimer's and other dementias, and apparently they also get deposited in the retinal area. So, simply by acquiring this information from the retinal surface, a prediction can be made that a person is starting to develop early signs of Alzheimer's — and everyone in the field believes that if you start treatment early, you can delay the process. This is a very exciting development, because again it's non-invasive and you can see the early signs of disease; but only one or two groups in Europe are exploring it now, and I think it definitely needs to be studied further.
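The "optical biopsy" matching described above — compare a freshly acquired spectrum against reference spectra from known pathologies — is often done with a spectral-similarity measure. A minimal sketch using the spectral angle (all reference spectra below are invented; real systems use libraries of measured tissue spectra):

```python
import numpy as np

def spectral_angle(a: np.ndarray, b: np.ndarray) -> float:
    """Angle between two spectra; smaller means more similar shapes.
    Insensitive to overall brightness, only to spectral shape."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

# Invented reference spectra for two tissue states.
band = np.linspace(0.0, 1.0, 25)
ref_healthy = 1.0 + 0.5 * band           # gently rising
ref_lesion  = 1.0 - 0.5 * band           # gently falling

# A newly acquired spectrum: same shape as the lesion reference but twice
# as bright -- the spectral angle ignores the intensity difference.
acquired = 2.0 * ref_lesion

angles = {"healthy": spectral_angle(acquired, ref_healthy),
          "lesion":  spectral_angle(acquired, ref_lesion)}
match = min(angles, key=angles.get)
```

The brightness invariance is the point of choosing an angle over a Euclidean distance here: illumination at the tip of an endoscope is hard to keep constant, but the spectral shape survives.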
When I mentioned — and I guess we have already seen it — that hyperspectral imaging is limited to surfaces, that doesn't mean only the surface of the skin. If you open up during surgery, or attach something like an endoscope to the hyperspectral camera, you can go inside the body and look at the surfaces there; so it's not just what is on the outside of the body but also what is inside. Another major developing medical field — one that will take time, because pathologists are probably the most conservative medical professionals and it's hard to convince them this is the way of the future, but eventually it will happen — is histopathology. You probably know that to diagnose certain diseases, a small piece of tissue is taken, stained with a specific dye, and sent to histology, and then a very experienced histologist looks at the slide and says: well, there's a little bit of these cells, a little bit of this color, so I think this is this or that type of cancer, or some other disease. Again, this information is, A, subjective and, B, very much dependent on the expertise of that particular pathologist. Now, using hyperspectral imaging — and there are machines where you can feed in thousands of slides like this — the slides get scanned, differences in color are identified automatically, and individual spectral components can be quantified, so you can say exactly that you have a decrease or increase in the number of certain cells over a period of time, or compare with other pathologists who gave a similar or different diagnosis. Okay — so it all looks wonderful when you look at the final slides of any paper, but in my last part I want to tell you a little about the work we have done in my lab at George Washington University. And because in the past I did research in the cardiac field, the target we chose was probably one of the hardest of all.
We chose the inner surfaces of the heart, to be able to diagnose them using this technology. But first, let me see — hopefully I will be able to play some of these videos. We're going to talk about the treatment of the most common cardiac arrhythmia, which is atrial fibrillation. It's not immediately fatal: it's when your ventricles start fibrillating that you drop dead, because the brain does not receive any blood. When your atria fibrillate it doesn't impact you immediately; but blood clots accumulate in the pockets of the atria, so your likelihood of stroke increases five-fold. When people have atrial fibrillation, it's first treated with drugs; but ultimately the best way to treat it is to go inside the heart and ablate, using one of several ways of killing the tissue. One of the most common is radiofrequency ablation. I just want to play some videos, so maybe it will be more interesting. Maybe you can help me push this button, because there is — oh, there is a — sorry, I just saw this — oh, okay, yes, we have a mouse, that helps.
Okay, so if this is your heart — these are the atria — you can see those abnormal sources of electrical activity, which look like little stars, starting to wander around randomly, and then you don't have regular pumping: blood cannot flow in a systematic way, and you form blood clots. The way to treat it: the physician goes in through a vein — it can be in your groin or in your arm — and the catheter is inserted and goes into your heart. Because there is an X-ray machine above you, the exact location of the catheter can be seen. There are actually two catheters. One is called the mapping catheter; it records the electrical activity from the surface of the atria — this is your mapping — and based on these mapping signals, the machine can derive the place where the abnormal activity actually originates. Then the ablation catheter goes in and basically isolates the areas the troublemakers are coming from. In the end, the surgeon sees that pattern on the screen, and the machine reconstructs where the signals come from — you can see this red area is where the signal comes from, and that's what they need to ablate. They don't want to ablate the entire heart or the entire surface, because then you'd have a scar, and it would be hard even to get blood in, because the wall would be very stiff; you want a very targeted ablation that removes just those abnormal sources. At the end — again, here you can see the tip of the ablation catheter, which goes and touches this area, and the machine records exactly where the catheter has already been, so they don't go back and ablate the same spot. If you go into that surgical room you really feel like you're in a spaceship: there are like five different monitors showing all these beautiful things, you see a person lying there, and it looks very sci-fi. But they still don't see the actual damage to the tissue; they only see indirect information that the
catheter was there and that there is a decrease in electrical activity — which can also happen because of edema, or because the wrong area was treated. So there is still a 30% recurrence rate for atrial fibrillation ablation: the person goes home, the problem reappears, and they need to come back. So our lab decided to use hyperspectral imaging to help improve this particular surgical treatment. First we needed to figure out whether it's better to acquire the light in reflectance mode or in fluorescence mode; second, post-acquisition, we needed to figure out how to do everything quickly, because again we're talking about a beating heart, so things need to be done almost in real time. This is just raw data from our hyperspectral cameras: this is the surface as you see it under ultraviolet light, and this is under visible light. You don't see any lesions by eye, but under ultraviolet light you see three of them, much better than in reflectance. The right side is the outcome of hyperspectral imaging and processing of this stack of data. Biological data are extremely noisy — that's one of the problems you need to appreciate. This is an example: even after you normalize the signal, you can see the trends — these are 14 different individual animals with different lesions — and under white-light illumination there is a lot of noise, so to speak, though you still see some trends; but in the ultraviolet range, when we illuminate with UV and record fluorescence, it's actually a bit more consistent. So the first step for us was to figure out which illumination to use, and then we proceeded to test the technology, starting with small animals, then larger animals, and finally human tissue. This is an example of how an excised atrium from a pig heart looks: this is how it looks under white light, and on the right you can see how it looks under UV light. Again, the lesions are kind of visible, but very, very poorly; and when you use
the hyperspectral cube — here is the difference in spectra — you can see the differences are very small, only a few percent; nevertheless, unmixing leads to a very clear pattern. Then we moved to the harder tissue: the human atrium has a thicker layer of collagen on top, so it's much harder to see any lesions, but we were still able to. This is the surface of an excised human heart — we're not murderers; we had an agreement with the transplant center in Washington, DC, so when they have a heart not suitable for transplantation, they call the lab and say, okay, come and pick it up. That's where this heart comes from. So this is the surface of the left atrium, and you can see how much collagen it has; if you strip it away, you can see the tissue has been stained with a dye that identifies the ablated areas, there and there. From the surface you don't see them; nevertheless, hyperspectral imaging worked reasonably well. This is obviously a best-case scenario, but overall it was good enough for us to say: okay, let's proceed with something resembling a clinical device. In addition, we saw something that surprised us. With this technology we really don't see deep — it's only the surface — but when we computed the correlation between the depth of a lesion and the degree of spectral change, we found that we can actually reveal how deep the lesions are. That goes against my physics background, but after many nights of thinking about how it's possible, I realized this is an indirect correlation: when you heat a piece of steak, if it becomes only a little whiter, you didn't heat it deep enough; with more heat, the lesion gets deeper and the surface gets whiter. So the degree of spectral change on the surface is an indirect indication of how deep the lesion will be. The technology was surprisingly good at this too, which is important for clinicians, because they want to do transmural ablation.
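The depth-versus-spectral-change relationship just described is an empirical calibration: pair measured lesion depths with surface spectral changes, check the correlation, and fit a simple predictor. A toy sketch with invented numbers (the real study used many animals and far noisier data):

```python
import numpy as np

# Hypothetical paired measurements: lesion depth (mm) and magnitude of the
# surface spectral change (%) -- values invented purely for illustration.
depth_mm        = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
spectral_change = np.array([1.2, 2.1, 2.9, 4.2, 5.1])

# Pearson correlation: how well the surface change tracks depth.
r = np.corrcoef(depth_mm, spectral_change)[0, 1]

# Simple linear calibration, depth ~= a * change + b, via least squares.
a, b = np.polyfit(spectral_change, depth_mm, 1)

# Predict the depth behind an observed 3.0% spectral change.
predicted_depth = a * 3.0 + b
```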
And so we then proceeded, partnering with a company, and created the first hyperspectral intracardiac percutaneous catheter; this just shows the catheter entering the heart. Now let me go through a few of the problems we encountered. If you have blood in the tissue, it's optically dense — it basically absorbs everything. So the question was: with perfused tissue, will we still see a difference in spectra? It's one thing to have it on a bench, another in a living individual or animal. As you can see here, we can perfectly identify the different areas when there is blood inside. This is just an example of a fresh lesion and scar tissue, which again can be easily distinguished, and these are the vessels which actually come from the scar and feed the border area — a very interesting finding. So blood inside the tissue doesn't really interfere; but if you have blood right in front of your sensor, it blocks everything, like a dark wall. So we needed a balloon: we have a balloon here filled with saline, which works like this — when you insert the catheter, the balloon gets inflated with saline and displaces the fluid in front of it. We also needed to do things quickly, because again it's a beating heart; so we did a lot of analysis on the data, trying to minimize the number of channels we actually need, and we went from the initial 151 channels down to 3. Obviously the outcome was noisier, but we were still able to unmix and see the lesions. We also realized that some of the balloon materials used very widely in clinical practice are themselves very fluorescent, so we had to search for and characterize different balloon materials so they would not block our signal. I'm telling you this to say: in any application, you're going to encounter difficulties which you need to solve. So, I'm approaching the end of my talk, and I
want to acknowledge the people in my GW lab who do these experiments on hyperspectral imaging of atrial tissues. I also want to mention that here at AUA I want to continue in this promising direction, and one of the developments we want to implement, which hasn't been done before and which I find very exciting, is that we are not only going to acquire the spectral information along the emission axis, but we are also going to scan on the excitation side. So instead of the 3-dimensional data set we had before, we will have a 4-dimensional data set, which is now possible with a combination of tunable light sources and snapshot cameras: you tune the light with which you illuminate, and then you acquire the full spectral output of the light which comes from the object. We have already shown that this is a much more sensitive approach, but the work has only just started, and we have faculty here at AUA with whom we collaborate, hoping to also bring new advanced image-processing algorithms to this field. So this is just a very quick overview of what this field is going to experience very soon, and in red I marked something where all of you have more expertise than I do, because I am an experimental physiologist. Again, it's an emerging field fueled by advances in optics and in machine learning, and you are welcome to contact me to see how you can or want to collaborate, or just read more about hyperspectral imaging and find somebody in your area who works on this, because this is going to be something big: in the next, I don't know, 15 years, I'm sure there will be hyperspectral imaging somewhere pointed at your own body. With that, I am ready to take any of your questions.

Thank you, Dr. Sarvazian. Do we have any questions?
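As background for the unmixing step mentioned in the talk, here is a minimal sketch of linear spectral unmixing by least squares. The endmember spectra and the three-channel setup below are invented for illustration and are not the speaker's actual calibration:

```python
import numpy as np

# Hypothetical endmember spectra (columns: lesion, scar, normal tissue),
# sampled at only 3 wavelengths, mirroring the reduction from 151 channels to 3.
E = np.array([
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.2],
    [0.1, 0.3, 0.7],
])  # shape (n_channels, n_endmembers)

def unmix(pixel_spectrum, endmembers):
    """Estimate endmember abundances for one pixel by least squares,
    clipping negatives and renormalizing (a simple, common constraint)."""
    abundances, *_ = np.linalg.lstsq(endmembers, pixel_spectrum, rcond=None)
    abundances = np.clip(abundances, 0.0, None)
    total = abundances.sum()
    return abundances / total if total > 0 else abundances

# A synthetic pixel dominated by the first endmember ("lesion"):
pixel = E @ np.array([0.7, 0.2, 0.1])
print(unmix(pixel, E))  # largest abundance in the first component
```

With fewer channels than endmembers the system becomes underdetermined, which is why dropping from 151 to 3 channels makes the unmixed maps noisier.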
Yes. Thank you for the great talk. I'm not familiar with the field, so my question is kind of general: what does the multidisciplinary process look like? You have researchers on the machine learning side and doctors; is there an annotation process? How do you usually describe your results to doctors and experts in medicine, and what is their response? Could you please say something about that?

It's very enriching for everyone, and sometimes it can be fun, sometimes it can be very gruesome. I remember we had a conversation with a vascular surgeon about applying our cameras to, say, a diabetic foot, or cases where they have to amputate the leg, and he said, oh, no problem, I can bring 10 amputated legs to your lab and you can just measure them. And no, I don't want 10 amputated legs lying around. So their perception of what is easy and our perception of what is doable are very different. My base education was in physics, but I moved closer and closer to physiology, and over the past 10 years I made the journey from being just a lab researcher to collaborating with the people from the company who made the new devices and with the clinicians who need to test those devices. And I don't even want to talk about the lawyers and the patents and all that; it's yet another whole field you need to learn. But ultimately, if you want to bring something from your lab into actual practice, you need to do it. It takes a lot of time: several months at least to explain to them what you can do, and to learn what patient population they can offer you and how they want to see the data. Many things where we said, okay, that's easy, let's just do that, they would answer, well, I only have two such patients a year, there's no way we'll get enough data. So that conversation is very important.

Thank you very much for the nice talk. Vladimir Ivanov from Innopolis. I have
maybe the same question as Elena, about the possibility of interpretation, because what you explained in the first part of the talk is this difference between how a human perceives color and the hyperspectral channels. As I understand, you, or some of the researchers, use CNNs, convolutional neural networks, and they solve these tasks with a certain precision. But do you know of any research on the interpretation of this kind of information? It's beyond human perception; maybe doctors feel it somehow, but how is research in this direction going?

I think it's a very important question, because there is always the question of ground truth. At this point we don't have better tools than going to, say, a pathologist who can say, this is the condition, or, in the case of hyperspectral imaging of lesions, staining with TTC, a dye which identifies necrotic tissue: you can clearly see it becomes a very different color, you take a regular image, and you compare the result of the HSI unmixing with that image, which again serves as ground truth. So at this point we don't have a better tool than what was already known and identified by a physician as damage; we need to start from there. Then, as the next step, we'll see whether hyperspectral imaging gives you something more. For example, there might be an area where the surgeon puts the knife and says, we need to cut here, but hyperspectral imaging says, no, half of the leg still has normal perfusion, stop, you don't need to cut that much. But we cannot take that step immediately, right? We can start by taking a population of patients: ten of them we treat the old way, ten of them the new way based on HSI, and we see what the outcomes are. If the HSI outcomes are better, that is justification for a larger clinical trial, and that's how it moves.
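One standard way to make the comparison against a TTC-derived ground truth quantitative is an overlap score between binary masks, for example the Dice coefficient. The masks below are toy values, and the choice of Dice rather than another overlap measure is an assumption of this sketch, not something stated in the talk:

```python
import numpy as np

def dice(pred_mask, truth_mask):
    """Dice coefficient between two binary masks: 2|A∩B| / (|A| + |B|)."""
    pred, truth = pred_mask.astype(bool), truth_mask.astype(bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # both masks empty: perfect agreement by convention
    return 2.0 * np.logical_and(pred, truth).sum() / denom

# Toy 4x4 example: lesion mask from HSI unmixing vs. mask from TTC staining.
hsi = np.array([[1, 1, 0, 0],
                [1, 1, 0, 0],
                [0, 0, 0, 0],
                [0, 0, 0, 0]])
ttc = np.array([[1, 1, 0, 0],
                [1, 0, 0, 0],
                [0, 0, 0, 0],
                [0, 0, 0, 0]])
print(dice(hsi, ttc))  # 2*3 / (4+3), about 0.857
```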
You're welcome.

Thank you for the interesting talk; I'll try to be brief. I have a more technical question: when you take a hyperspectral image of the human heart, are all the different wavelength filters acquired simultaneously, or filter by filter, as in satellites?

In the experiments I was showing you, we used an old-style hyperspectral camera, where you have just a black-and-white sensor, and in front of it a liquid-crystal tunable filter, a crystal which you dial through sequentially, so to speak. That's why we needed to go from the initial 151 channels to just three, to make it faster, because we work with a beating heart. Now we have acquired, for $22,000, a snapshot hyperspectral camera which can capture all 20 channels in the millisecond range, so you can do it much faster; the spectral resolution is lower, but it's much faster. As I'm saying, these techniques are coming from the photonics field, and we will have cameras which let you do it quite efficiently, basically at video rate.

Yes, but my question was more that if you obtain these snapshots at different wavelengths at different times, they may change a bit, because the heart is moving. Are there approaches or algorithms to merge this into something static?

Yes. What is done, not only in this field but in any cardiac imaging, is called gating: in parallel you have an ECG acquisition, and you gate your acquisition to the diastolic period of the heart, when it's not moving. There is not enough light in one diastolic period, so you sum over, say, 20 beats; you add these up, enhancing the intensity of each pixel, and that's enough to unmix. In the heart it's easier because it's a regularly beating organ, so you can basically gate to the ECG. But again, we chose the hardest object to study: it has blood flowing over its surface, so you need to
get into the heart, it's extremely small, and the catheter needs to be bent, which is also a problem for any fiber optics: you need enough bendability to get in. But eventually we hope it will be solved.

This one doesn't seem to work... it's not working here, but we can see the slides on the screen. Could you please reset the timer? Okay, I'll start from the beginning. Hello, I'm going to present our paper, which is titled "Paraphrasers and Classifiers: Controllable Text Generation for Text Style Transfer". First, let's talk about the motivation behind this research. The task we are solving is text style transfer, which is an important task for products that use NLP, because it makes these products more human-oriented, as it is connected with emotions. In recent years, text style transfer has seen great progress with large pre-trained language models, but they are often too big to fine-tune for downstream tasks. One solution to this problem is to use methods of controllable text generation, and more precisely the post-processing group of these methods, which do not aim at fine-tuning the language model but work in a post-processing manner, only during inference. Another point is that unsupervised approaches are preferable: for many text style transfer tasks, parallel data are not available, and therefore we should go with unsupervised approaches. So in our paper we adapt an existing CTG (controllable text generation) method called CAIF to text style transfer, and the advantage of our method is that it remains unsupervised. We apply ParaCAIF to a text style transfer subtask, detoxification, and we work with two languages, Russian and English. First, let's talk about controllable text generation in general. We can say that contemporary language models have acquired the capability of
generating human-sounding text, but we still lack control over these models, both because of downstream application specifics and because models trained on unclean web data are prone to generating toxic content. There are three broad groups of controllable text generation methods. The first two work with the original model and somehow interact with it, either retraining or refactoring it; the third group are post-processing methods, which do not interact with the original model at all, and that's just what we need. Here is an example of a post-processing CTG method called GeDi. The main idea is that if, for example, the task is to generate positive text, we train two additional class-conditional language models, and during generation with the main model we combine the signals from the two class-conditional models, steering generation toward the desired class. The method we are working with and adapting is called CAIF. It is close to GeDi, but the difference is that instead of a generative classifier (the class-conditional language models) it uses a free-form classifier. The idea is that at every generation step we assess the possible continuation tokens with the classifier: we apply the classifier to the possible sequences at that moment and choose the most appropriate continuation according to our goals. As you might guess, one problem with this method is that it is computationally expensive to apply the classifier to all possible tokens; the vocabulary can be very big. So the authors propose several tricks to tackle this problem. First, they limit the number of tokens being assessed to some number A, set to 100 in the experiments. They also apply an entropy criterion: they suggest applying the classifier only at steps where the entropy is high, that is, at points where
it is important to guide the model. Text style transfer, the task we are applying our method to, is important because it can be used, for example, in writing assistants and chatbots, as it can alter the text to your needs. The formulation of the task is that we have to change the style attribute of the original sentence to the target one. There are different subtasks and datasets for this task, including toxicity, and the important thing is that, as I have said, not many text style transfer subtasks have parallel data, so we have to account for that. The particular subtask we are working with is detoxification, which is relatively new but very practical, because the internet has provided space for toxic content. As you might guess, the task is to transform an original toxic sentence into a neutral one. One possible application could be that if a user writes toxic content, we could at that moment suggest a non-toxic rewrite of the text they have just written. Speaking about research in this area: work has been done for English, but parallel data are lacking, and the first parallel corpus for this task was proposed only in the previous year. As for Russian, much less work has been done; last year the first such competition was organized, the first in the world, not only for the Russian language. Now let's talk about possible methods for detoxification. The closest method to what we propose is ParaGeDi. In that work the authors adapt GeDi to text style transfer, and more specifically to detoxification; the main idea is that they substitute the original regular language model with a language model capable of paraphrasing. So now let's finally talk about our method. It's called ParaCAIF, similarly to ParaGeDi, and the main idea is the same: we replace the regular language model with a paraphrasing language model, and we also generate several candidates,
as is common for generation tasks, and at the final step we sort the candidates according to style transfer accuracy and semantic similarity. Here is the algorithm for sorting; I will not spend time explaining it, but I can if there are questions. The main idea is that we try to balance style transfer accuracy and semantic similarity. In our work we experiment with both Russian and English detoxification. For Russian, we take the data from the detoxification competition, a parallel dataset, and employ the evaluation setup of that competition: the paraphrases are assessed on three metrics, style transfer accuracy, content similarity, and language fluency. As for the models in the Russian setup, our method requires two models: the generative one and the classifier that guides it. For the classifier we use a rubert-tiny model that we train on the train subset of the competition, and for the paraphraser we explore a line of generative paraphrasers proposed for the Russian language, including a GPT-based model and T5-based models. For English, we use test data from the paper that proposed ParaGeDi and employ roughly the same evaluation setup as for Russian. For the classifier we use a RoBERTa model trained on 1 million examples, which comes from the ParaGeDi paper, and for the paraphraser we use the T5 baseline from the same paper. Now let's proceed to the results. Here you can see the table with the Russian results. This column is a joint metric that combines all three evaluation metrics, and we can say that all ParaCAIF models nearly doubled the joint score. We also see that the T5 models are better at preserving content, which follows logically from the architecture of the model, but we can also note that the performance of the ParaCAIF models
remains lower than the supervised baseline, mainly because of insufficient content preservation and fluency of the output. However, in style transfer accuracy one of the ParaCAIF models outperforms the supervised baseline trained by the organisers; just as a side note, none of the competitors surpassed that supervised baseline. Here you can see examples of detoxification in Russian. I forgot to say that there is an alpha parameter which controls the style strength: the lower the alpha, the stronger the style transfer. We display examples with alpha equal to minus 5 and minus 1. With minus 5 the style transfer is stronger, and we can see that all severely toxic words are cleared out by the model, so you don't see any severely toxic words in the results; with alpha equal to minus 1, some toxicity remains, and some severe words stay in the output. Next, we perform a kind of ablation study over our model: we compare plain paraphrasers with paraphrasers to which we added re-ranking (sorting of the candidates), and we can see that just adding the re-ranking makes the joint score higher. But it's not that simple: if we take a deeper look and perform a more fine-grained comparison of ParaCAIF and plain paraphrasing, sampling 10 candidates for every sample in the test set and aggregating the metrics over them, we can see that the overall toxicity of the plain paraphraser is much higher. This means that just using a plain paraphraser is not enough to detoxify sentences. Moreover, we look at something we refer to as relative toxicity: on these graphs, the x-axis shows the source toxicity of the test samples and the y-axis shows the toxicity of the resulting paraphrases, and if we draw a
regression line on these graphs, we can see that the slope coefficient for ParaCAIF (you probably can't see it from here, but trust me) is lower than the coefficient for the plain paraphraser, which means that the ParaCAIF model copes better with the task of detoxifying the sentences; moreover, the regression line itself lies lower, which means that the overall toxicity of the samples produced by the ParaCAIF model is lower. Now let's take a look at the alpha parameter. Here, unexpectedly, we see a rise in style transfer accuracy as the alpha parameter rises; I'll remind you that the higher the alpha, the weaker the style transfer, so this finding needs more thorough investigation. However, at alpha equal to minus 1 we see a predictable and explainable drop in style transfer accuracy, and it goes even lower with no CAIF sampling at all, that is, plain paraphrasing. Next, we also compare results by the entropy threshold, and here we also see unexpected behavior: as the entropy threshold rises, the style transfer accuracy rises too. This is actually good, because the higher the entropy threshold, the more rarely we apply the classifier, so it is more efficient in terms of computation; for example, the peak of accuracy is achieved at an entropy threshold of 1.5, and that sampling was 1.4 times faster than with a threshold of zero. Now let's look at the results for the English language. Here we can also see that the ParaCAIF models are much less toxic than the plain paraphraser; we can also note that the ParaCAIF model outperforms the ParaGeDi model in terms of style transfer accuracy, and it outperforms the second-best baseline from the ParaGeDi paper in terms of the joint score. To conclude: we have adapted an existing CTG method to text style transfer, we illustrated its applicability by applying it to a subtask of text style transfer, detoxification, in two languages, and we also note
that for Russian this is, to our knowledge, the first application of a CTG method to Russian detoxification. We note that ParaCAIF significantly reduces the toxicity of the generated paraphrases; however, it remains inferior to supervised approaches, mainly because of insufficient content preservation and fluency of the outputs. On the other hand, ParaCAIF is more broadly applicable because it remains an unsupervised approach: we do not need any parallel data to train on; we can train the classifier that guides the model on any classification data for the desired style, so we just need examples of the source style and the target style, and that's all. As for future work, it would be important to assess ParaCAIF with human evaluation, because previous text style transfer research has shown that human evaluation cannot be fully replaced by the automatic evaluation we performed. Secondly, it could be beneficial to add support for beam search to ParaCAIF, which to date works only with sampling; we have seen that simply looking for the least toxic candidate from a plain paraphraser can work quite well, so assessing longer candidates with beam search could be more promising, and also more attractive in terms of computational complexity compared to applying the classifier at each sampling step. And lastly, it could be interesting to add support for two classifiers to the CAIF model. That would benefit CAIF itself, because we could, for example, control for two styles at the same time, and it would also benefit ParaCAIF, because we could assess content preservation during generation rather than after it. That's all I wanted to say.

Thanks for a very interesting talk. First of all, I would like to ask: let's say you want to do this detoxification in Armenian, so you don't have a parallel corpus, but
you have a parallel corpus for, say, Russian or English. Could you adapt your approach to this cross-lingual or multilingual setting? Would it be hard to do?

Basically, first, of course it would be better to have a corpus in the target language; however, multilingual models have shown the ability to work with new languages in few-shot or zero-shot regimes, and we could also, for example, look at automatic translation of the corpus.

Okay. We are actually going to organize a follow-up shared task on multilingual detoxification at CLEF, so in case you would like to test some of these ideas, you're welcome. Let's thank the speaker again. And the second talk should also be yours, I believe.

Okay, good morning. The topic of our work is controllable story generation based on perplexity minimization. Natural language generation is a field of computational linguistics that deals with the construction of computer systems which can generate understandable texts in English or other languages. Natural language generation technologies have a wide range of applications, including dialogue and question answering systems, story generation, product description generation, and others. Making text generation controllable is an important fundamental issue in natural language generation. Controllable text generation, or CTG, is the task of generating natural language texts that meet certain control constraints set by a human, such as topic, sentiment, keywords, and so on. There are two types of CTG, soft and hard: the aim of soft CTG is to provide the desired sentiment or topic of the generated text, while hard CTG requires ensuring that the text contains explicit constraints, for example certain keywords. In this work we solve the hard CTG problem. The table on the slide shows an example of such generation: the first row gives the storyline consisting of plot phrases, and the second row gives the generated text containing the plot
phrases in the order they appear in the storyline. The problem statement is formulated as follows: given a vocabulary V and a prompt sequence X consisting of the tokens of the prompt, the goal of controllable text generation is to generate a target text with respect to a control condition C by maximizing the conditional probability P(Y | X, C); the condition can be a sentiment, keywords, and so on. Controllable text generation methods can be classified into four categories: prompt engineering, retraining, refactoring, and post-processing. Our method belongs to the post-processing category, which has the following advantages: there is no need to create a training corpus and no need to perform a training procedure, which is difficult, expensive, and time-consuming. The goals of our work are: development of a plug-and-play CTG method which allows generating stories in accordance with a specified sequence of guide phrases that make up the plot of the stories; conducting experiments on controllable generation of stories in Russian using ruGPT-3 large, ruAlpaca, and Saiga models on a text corpus containing stories with extracted storylines; and evaluating the quality of the generated texts using automatic and human-centric evaluation methods. The idea of our method is as follows: first, we generate several random short token sequences leading from the prompt toward the guide phrase; then we estimate the probability of the guide phrase following each generated subsequence; finally, we choose the most probable subsequence. We will describe the principle of our method using the example presented on the slide. The blue color indicates tokens generated by some generation step i; orange indicates the guide phrase to which we want to provide a coherent transition. At every step of the generation process, we generate several random token sequences of some fixed length, for example 3 tokens long; examples of such sequences are marked in red on the slide. We compute the probability of the guide phrase following each sequence and select the sequence with the highest
probability. At the next step, we repeat the process of generating and selecting sequences. The method can be applied to any autoregressive language model for which the probability of a token sequence is decomposed using the chain rule; the task of generation is to decode sequences of tokens from the distribution p. An important component of the generation process is the decoding algorithm; examples of such algorithms are top-k sampling and nucleus sampling. We consider a token sequence x, where x with indices from 1 to i-1 is the prompt, x with indices from i to i+k is a connecting sequence, and t is the guide phrase. Theoretically, it is possible to find the connecting sequence by exhaustive search over tokens from the model vocabulary; however, such a search is exponential in the length of the connecting sequence and is not applicable in practice. Therefore, in order to reduce the number of variants, we propose a heuristic technique for generating and evaluating connecting sequences. First, as continuations of the prompt, r different token sequences of length k+1 are generated using some decoding strategy. Next, for each subsequence x with indices from i to i+k among the r sequences, the probability of the guide phrase t following it is determined as the product of the probabilities of the guide phrase tokens. Then, at the current generation step, the subsequence with the maximum probability is selected, and subsequences of length k+1 are generated again. We want to fulfill the condition of the explicit presence of the guide phrase in the text, so if after the completion of generation the guide phrase has not appeared in the text, we insert it by force; its position is determined by the maximum probability over the entire generation. After the phrase is inserted, generation continues toward the next guide phrase. To conduct experiments, a text corpus was formed from fairy tales in Russian with extracted storylines.
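The generate-and-select loop just described can be sketched in a few lines. The toy language model, vocabulary, and the parameters r and k below are invented for illustration; a real implementation would score tokens with an autoregressive LM such as ruGPT-3:

```python
import math
import random

def guide_phrase_logprob(lm, prefix, guide_tokens):
    """log P(guide phrase | prefix) under an autoregressive LM,
    computed token by token via the chain rule."""
    logp, ctx = 0.0, list(prefix)
    for tok in guide_tokens:
        logp += math.log(lm(ctx).get(tok, 1e-12))
        ctx.append(tok)
    return logp

def best_connector(lm, prefix, guide_tokens, r=5, k=3, seed=0):
    """Sample r candidate connecting sequences of length k and keep the one
    after which the guide phrase is most probable."""
    rng = random.Random(seed)
    best, best_lp = None, -math.inf
    for _ in range(r):
        cand, ctx = [], list(prefix)
        for _ in range(k):
            dist = lm(ctx)  # next-token distribution given the context
            tok = rng.choices(list(dist), weights=list(dist.values()))[0]
            cand.append(tok)
            ctx.append(tok)
        lp = guide_phrase_logprob(lm, ctx, guide_tokens)
        if lp > best_lp:
            best, best_lp = cand, lp
    return best

# Toy LM: after the token "cave", the token "dragon" becomes likely.
def toy_lm(ctx):
    if ctx and ctx[-1] == "cave":
        return {"dragon": 0.7, "the": 0.2, "cave": 0.1}
    return {"the": 0.4, "cave": 0.3, "dragon": 0.3}

conn = best_connector(toy_lm, ["once", "upon"], ["dragon"])
print(conn)  # a 3-token connector chosen to make "dragon" probable next
```

Because only r candidates of length k are scored instead of all |V|^k continuations, the exponential search collapses to r forward passes per step.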
The corpus is made up of fairy tales posted on a website, each no longer than 5,000 characters; in total, the training corpus contains 562 fairy tales. In each fairy tale, plot phrases were extracted: phrases that determine the main events of the storyline. To do this, first, keywords and phrases were selected in each fairy tale using the YAKE and RuTermExtract methods. A plot phrase was then determined as a syntactically related event set (v, o, m), where v is a verb, o are objects related to the verb, and m is a modifier, prepositional object, or indirect object. The objects and modifiers were selected from the set of extracted keywords; the verbs were determined from the parse tree. For example, in the phrase "dragon holds princess in a cave", "holds" is the verb, "dragon" and "princess" are objects, and "cave" is a modifier. The minimal number of phrases in a plot is 1; the maximum is the base-2 logarithm of n, where n is the number of sentences in the text. The slide shows the distribution of the number of phrases per plot in the training and test corpora; the number of plot phrases varies from 1 to 8. We used 25 storylines from the test corpus and generated two samples per storyline; the storylines contained from 1 to 7 plot phrases. The quality of the generated text was evaluated using automatic and human-centric evaluation methods. Four measures were used for automatic evaluation: perplexity, repetition, Self-BLEU-5, and word inclusion coverage. Perplexity is calculated as the exponential of the average negative log-probability per token under a language model; a separate ruGPT-3 medium model was used to compute the perplexity. The repetition score calculates the proportion of repeated 4-grams in the text. Self-BLEU-5 evaluates the syntactic diversity of a given set of texts; it is defined as the average overlap between all generated texts. Word inclusion coverage shows the percentage of plot words included in the generated text. Three measures were used for human-centric evaluation: coherence,
relevance, and interestingness. Coherence shows whether the story is consistent in terms of causal relationships in the context. Relevance shows whether the story corresponds to the plot, that is, whether the events in the story unfold in accordance with the storyline. Interestingness shows how much the user likes the story, whether it is interesting. The proposed method was compared with three methods of controllable text generation: constrained beam search, few-shot learning, and prompt engineering; the prompts for these methods are shown on the slide. The table on the slide shows the statistical characteristics of the generated texts. The few-shot method with the ruGPT-3 model on average generated fairy tales three times shorter than the other three methods; it should be noted that when generating longer tales, the first tale was often interrupted and a new tale began. Similarly, the prompt engineering method with the Saiga model on average generated short fairy tales: Saiga was trained as a chatbot, so when we asked it to compose a tale, it generated short but complete tales, and they corresponded well to the given plot. The automatic and human-centric quality scores are presented in the table. The values of word inclusion coverage show that our method ensures that more than 93% of the words from the storyline events appear in the text; the texts generated by our method met the requirement of matching the storyline to the best extent. Analysis of the perplexity values allows us to conclude that our method shows almost the largest perplexity; a lower perplexity value makes the generated text look more natural, so the increase in perplexity indicates that the control process is unnatural for the model and causes it to be more surprised by the tokens observed in the text. The Self-BLEU value shows that our method with the Saiga model produced the most diverse texts among all methods. To calculate the human-centric measures, the generated texts were evaluated by three annotators for coherence, relevance, and interestingness. The
assessment was carried out on a 5-point Likert scale. According to the annotators, the proposed method allowed us to generate texts that were most relevant to the storyline. Our method performed best when using the relatively small ruGPT-3 model, receiving the highest scores on all three human evaluation measures; at the same time, the ruGPT-3 model generated less coherent and interesting texts than Alpaca and Saiga. The table on the slide shows an example of a fairy tale generated by our method using the ruGPT-3 model; the storyline consists of four plot phrases, and all four appear in the generated text. The guide phrases end up in the positions with the lowest perplexity values, which seems quite logical; experiments show that our method induces the model to shift the content of the text toward the plot phrases. Several examples of the generated fairy tales are shown on the slides. We obtained the following results. In our work we developed a method that allows generating stories in accordance with a user-specified sequence of guide phrases that make up the plot of the story; we formed a text corpus containing stories with extracted storylines; we conducted experiments on controlled fairy tale generation in Russian; and we calculated various automatic and human-centric quality measures of the generated texts. The proposed method performs best with the ruGPT-3 model, receiving the highest human scores among the compared methods; for the larger models, it can be used as a complement to other methods to increase the relevance of the text to a given storyline. Thanks for your attention.

A nice presentation and perfect timing. We have time for one or two questions. Any questions here?
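The perplexity and repetition measures used in the evaluation above are easy to state precisely. Below is a minimal sketch; the repetition definition here (share of 4-grams that repeat an earlier 4-gram) is one common reading and may differ in detail from the authors' implementation:

```python
import math
from collections import Counter

def perplexity(token_logprobs):
    """Exponential of the average negative log-probability per token;
    lower values mean the scoring LM finds the text more natural."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def repetition_score(tokens, n=4):
    """Proportion of n-grams in the text that repeat an earlier n-gram."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)

# A text where the LM assigns every token probability 0.25 has perplexity ~4:
print(perplexity([math.log(0.25)] * 10))
print(repetition_score(list("abcabcabc"), n=3))  # > 0: "abc" repeats
```

This makes the paper's observation concrete: forcing guide phrases into the text raises the negative log-probability of the surrounding tokens, which shows up directly as higher perplexity under the scoring model.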
Thank you for the great talk. I actually have a couple of small questions about evaluation. I wonder, did you have any measure of whether the model generated something irrelevant, something extra? I couldn't get it from the 14th slide: maybe some completely irrelevant text, or would you allow that, since the model naturally invents something? Repeat, please. I wonder whether it is at all important for the task if the model invents something really irrelevant to the plot, some plot twist that is just impossible, and how that relates to relevance to the storyline. I also have another question, about a different metric on the next slide. On this slide, how did you measure coherence exactly, that is, the causal relationships? What were the annotators doing, placing some scores or tagging something? Just explain, please, about coherence: how exactly did you measure it, on some scale, or did you set some numbers? We used a 5-point Likert scale from 1 to 5, where 1 is bad and 5 is good. Thanks. Let's thank the speaker again for the good talk.

Hello everyone, my name is Polina, and I will present a paper about using taxonomic information for hyponym prediction with large language models. I will start with the definition of taxonomy. A taxonomy is a particular case of a knowledge graph: it is a tree-structured lexical database based on is-a relations. Every node in the taxonomy is a set of words with similar meanings, and for every node it holds that its parent nodes are its hypernyms and its child nodes are its hyponyms. Taxonomies are applied to a wide range of natural language processing tasks, so there is a need to constantly update existing taxonomies, since language changes rapidly. However, manual extension of taxonomies seems to be infeasible, since it requires a lot of human labor and deep knowledge, so there is a large number of approaches to
automate this process. However, most of them are based on measuring the distance between non-contextualized embeddings, which leads to two problems. The first one is the fact that we need direct access to large text corpora, or to a large set of pre-encoded embeddings, in order to capture really rare words. The second one, even more important, is the fact that static embeddings do not allow us to resolve the homonymy problem, so we cannot see the difference between identically spelled words with different meanings. However, both of these problems can be resolved with the use of large language models. There are several studies exploring BERT's acquisition of is-a relations, and all of them show that BERT is able to predict hypernyms and hyponyms at a quite decent level. As for the approaches presented in these studies, the first one is prompting, where we expect BERT to predict hypernyms or hyponyms in place of the [MASK] token, and the second one additionally provides BERT with information about the taxonomy structure by projecting graph embeddings into BERT's embedding space. However, there is no such research for decoder-based models. So in the current study we pose the task of taxonomy enrichment as a task of conditional generation and apply decoder-based models to predict the child nodes of a target node. Also, inspired by the very high performance of decoder-based models in solving zero-shot text generation tasks, we aim to formulate a textual input that also provides information about the taxonomy structure to the model. There are also additional parameters stored in the taxonomy beyond the graph structure, such as definitions or sense numbers. In the first part of our research we try to find the best form of input data to provide the model with information for the taxonomy enrichment task, and in the second part we fine-tune decoder-based language models and try to predict
hyponyms. Usually, when we speak about conditional generation, we use natural input prompts or direct instructions. However, natural language input is not very suitable for our task, since we need to predict hyponyms for a very large number of different terms, so it is quite impossible to formulate a truly universal prompt. For example, if we take the simplest possible natural prompt, like 'X is a ...', we face the problem that even the choice of article highly impacts the expected outcomes. And if we take more extended patterns that mark is-a relations explicitly, we can also get inappropriate statements: for example, a pattern like 'my favorite X' makes no sense when X is a disease. The second problem with natural prompts is that we are still not able to resolve homonymy; for example, we would need to specify handwritten context to define which meaning is presented. In order to overcome these hindrances, we propose to create an artificial input. The main idea behind it is to linearize the graph structure and mention the hierarchical structure in the input in a flat form: we mention the grandparent, parent, and target nodes in order and expect that the model will understand the pattern and then predict the child nodes of the target. The advantage of this approach is that we can embed information from the taxonomy automatically for any target node. There are three main node features included in the WordNet data which we use in our experiments. The first one is the definitions of the terms in each node; the second one is lemmas, which are synonyms of the title of the node; and the third one is sense numbers, which mark the order of a particular sense of a word: for example, 'bat' as a wooden club and 'bat' as an animal would have different sense numbers. Based on these parameters we create eight formats of artificial input, and you can see some of them on the slide. The first
one is the shortest: it only contains a mention of the parent of the target node; the basic one also contains a mention of the grandparent node; and the most extended one contains all possible information that we can get from the taxonomy. We then use these eight artificial input formats to fine-tune GPT-2 and T5 base models. To evaluate our experiments, we use two datasets for each language. One of them is bigger and consists of 1000 randomly selected preterminal WordNet nodes. This dataset is very suitable for evaluating the taxonomy enrichment task, since it resembles the real data of this task; however, it does not allow us to evaluate hyponymy acquisition by the decoder-based models, since it contains very rare and specific terms which might not be represented well enough in the training data. To overcome this problem, we also created two manual datasets which consist of very easy and frequent terms, for example 'beverage' or 'cheese'. As for how we calculate our metrics: for each sample from the test data we generate 50 sequences using top-k sampling and then rank the comma-separated outputs by frequency. We believe this approach is more robust and reliable than greedy search. Here you can see the contents of the English and Russian manual datasets. We tried to find matching terms in both WordNets; however, this is not always possible due to the different graph structures, so you can also see some replacements. Here you can see the results of the selection of the best form of artificial prefix. Surprisingly, the fullest and most extended input format, in terms of information, shows the lowest results. We connect this to two factors. The first one is the simple reduction of the correct answers that the model can see: as we make the prefix longer, the model sees a smaller number of correct examples. The second one: we assume that a large amount of unstructured information, such as definitions, makes it very hard for the model to capture
the main information. As for the comparison of the models, we can see that GPT-2 shows higher recall scores, while T5 leads in precision. We can also observe from the precision@10 scores that GPT-2 is more sensitive to the input format compared with T5. For the next stage of our experiments, we use the default input format with sense numbers for English and the default format for Russian, since the Russian WordNet does not have such sense numbers, and fine-tune three models for each language: the first model is a decoder, the second one is an encoder-decoder, and the third one is an instruction-tuned decoder. Here you can see the results for the easy manual datasets for both languages: for both languages, the instruction-tuned models outperform the others by a large margin. As for the comparison of the base models, we get mixed results and cannot say which one, decoder or encoder-decoder, suits our task better, since GPT-2 is better on the Russian data and T5 is better on the English data. And here are the results for the big random dataset. We can see that the scores are quite low in comparison with the previous slide, since this data is really hard both for humans and for models. We also see that for English, Dolly still outperforms the smaller models, but for Russian, Saiga shows lower scores than GPT-2, and we connect this to the fact that the base model for Saiga is LLaMA, so Saiga has seen much less Russian lexical diversity than GPT-2. Here you can see some prediction examples for English for GPT-2 and T5-large, and here are the results for the target nodes which show the best precision scores. To sum up, we found that decoder-based models show a really high level of acquisition of is-a relations, and that the most useful information from the taxonomy is what points to the higher levels of the taxonomy: a mention of the grandparent node is much more important than the definition in helping the model resolve homonymy and predict correct
answers. However, despite the high results on the manual datasets, the low scores on the hard one show the need for further investigation in this direction. As for future work, we think that prompt tuning of large models may improve our results, and we also expect that our approach could be extended both to other languages and to other taxonomy enrichment tasks, not only hyponym prediction. Thank you for your attention; I'm ready to answer questions.

First, thank you for the talk. I wanted to ask what the errors of the models look like: do the models produce very specific words, are they off topic, are they similar to the word we are trying to find hyponyms for, or just something irrelevant? No, there are no really irrelevant results, but we face the problem that the formulation of terms in a taxonomy is very specific. For example, as humans we expect for 'beverage' the word 'water'; however, there is no 'water' in the taxonomy, there is only 'drinking water', so we can get lower scores because of this fact. So the metric is exact match of the word, right? Yes. Maybe there is some room for exploring the metric: something like semantic similarity instead of exact match would be nice. Yes, and also maybe we will perform some human evaluation. Thank you.

The prompting approach works in almost all cases, but not in all; can you comment on what might be the reason why it degrades performance for some of the models? We do not really use a prompting approach, since we use an artificial input, and the model can understand what we want only from its pre-training information. Also, for the artificial input we see a large difference in scores between the models, and I think that is really connected with the amount of information the model has seen during training, so the larger models tend to perform better. So the large models don't need this input or output post-processing? No, they all need it: since we use
fine-tuning, all the models learn the expected format correctly and produce terms separated by commas, so we do not need much post-processing, only splitting and sorting. Any more questions? I have one question. To sum up, you compare two large models and a huge model with six billion parameters. Can you say something about how the size of a model actually matters? Because the large ones are less than one billion parameters, and the big one is much bigger. I think that the size difference really matters for the hard dataset: for the bigger dataset the difference in scores between the smaller models and Dolly is about twofold, while for the smaller dataset we do not see such a huge difference in scores between the smaller and larger models. Okay, let's thank the speaker. Okay, thank you.

Hello everyone, it's a pleasure for me to be here. Today I'm going to present our joint research on static, dynamic, and contextualized approaches to discovering semantic shifts. Basically, today I'm more of a presenter, because the main contributor is Veronica Niganova, who is here with us today on Zoom, and after my talk she will also be glad to answer questions with me. Once again, the research is mostly hers, and I'm more of a messenger than a presenter today, because unfortunately Veronica was unable to come here offline.

As for the plan for this talk: first we will talk about the goal and the motivation of the research, then briefly walk through the related work and the datasets. Then we will discuss the models we are going to explore throughout this work and the experimental setup. After that we will look at the results, understand the applicability of the studied approaches, and discover the words which experienced the most significant semantic shifts. And of course we will talk about future work and the possible development of the research.

In this work we focus on semantic shifts, in other words on changes in word meaning, which are analyzed by
studying the contexts in which a particular word is used in different time slices. Studying diachronic semantic shifts is important for both theoretical and practical reasons. On the one hand, it helps researchers in historical linguistics understand how word meaning evolves across different time periods and provides linguists with data-driven evidence for updating and improving dictionaries. In practical terms, such models can be applied in natural language processing tasks such as information retrieval, sentiment analysis, and machine translation, helping to improve the accuracy and relevance of these systems, especially with historical texts or multilingual data.

The main goal of this research is to discover semantic shifts in the selected datasets and to compare the performance of different approaches. Namely, in this work we compare three main approaches: static, represented by word2vec; dynamic, for which we use dynamic word embeddings; and contextualized, for which we use BERT. We apply these models to two tasks: discovering semantic shifts and detecting known shifts. These tasks are quite similar; we just pose the same problem from slightly different angles.

Now let us briefly talk about the most important research in this field. The first work is 'Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change', where the authors use three different algorithms, namely PPMI, SVD on top of PPMI, and word2vec with skip-gram negative sampling, and compare the results. In the second paper, the model is trained on all the time periods simultaneously, and a joint optimization problem is proposed that comprises both the embedding learning and alignment problems.
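The alignment these papers deal with is commonly solved with orthogonal Procrustes: find the rotation that maps one time slice's embedding matrix onto another's, then rank words by cosine similarity across slices. Below is a minimal numpy sketch of that idea; it is our illustration under the assumption of a shared vocabulary, with function names of our own choosing, not code from the cited works.

```python
import numpy as np

def align_embeddings(src, tgt):
    """Orthogonal Procrustes alignment: rotate the `src` embedding matrix
    (n_words x dim) into the space of `tgt`, so that vectors of the same
    word from two time slices become directly comparable.
    Rows of both matrices must follow the same shared vocabulary."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return src @ (u @ vt)

def cosine(a, b):
    """Cosine similarity of two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_shifted(vocab, old_aligned, new, k=20):
    """Words with the lowest cosine similarity between the aligned old
    vectors and the new vectors, i.e. the strongest shift candidates."""
    sims = {w: cosine(old_aligned[i], new[i]) for i, w in enumerate(vocab)}
    return sorted(vocab, key=sims.get)[:k]
```

When the old matrix is an exact rotation of the new one, the Procrustes solution recovers it exactly; on real corpora the alignment is only approximate, which is precisely what makes the residual cosine distance a shift signal.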
In the third paper, the authors apply contextualized models, which provide a separate word embedding for each occurrence of a word depending on its context, to the semantic shift problem, namely the ELMo and BERT models; it is also interesting for us since it is dedicated to the Russian language. And the fourth paper provides us with the dataset and the baseline results for the semantic shift detection task.

As I've already mentioned, in our work we use two datasets, namely a news corpus and a social media corpus. For the news corpus we used already collected data from Lenta.ru, covering the period from the year 2000 until 2019. The social media corpus was collected specifically as part of this research from the VKontakte social network: we collected posts from the year 2007 up to the year 2019. We use these datasets to identify social, cultural, and political shifts rather than linguistic semantic shifts, first because the time period is too short for considerable linguistic changes, and second because the nature of news and social media implies reflecting cultural and political events and processes.
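Once per-period embeddings are trained on such corpora, a candidate shift is typically inspected through a word's nearest neighbors within each time slice, which is also how the examples later in this talk are presented. A small numpy sketch of that lookup, illustrative only and with a function name of our own choosing:

```python
import numpy as np

def nearest_neighbors(word, vocab, matrix, k=5):
    """Top-k cosine neighbors of `word` inside one time slice.
    `matrix` holds one embedding per row, in the order of `vocab`."""
    idx = vocab.index(word)
    # Normalize rows so a dot product equals cosine similarity.
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit = matrix / norms
    sims = unit @ unit[idx]
    order = np.argsort(-sims)
    return [vocab[i] for i in order if i != idx][:k]
```

Comparing the neighbor lists of the same word in, say, the 2000 slice and the 2019 slice is exactly the kind of evidence the annotators in this study were shown.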
Well, as already mentioned, we used three different approaches. The first one is static: we use the most classic model, word2vec with skip-gram negative sampling. Then we use dynamic word embeddings based on the PPMI matrix. Finally, for the contextualized approach we use the BERT model, namely the classic ruBERT-base model from Sberbank AI. We conduct two types of experiments: we call the first the task of discovering semantic shifts, and the second is a classification task of detecting known shifts. Basically, in the first task we aim at revealing semantic changes from the data, while in the second task we are given a labeled dataset, and the task, in the form of binary classification, is to predict whether a word has experienced a semantic shift or not.

For discovering semantic shifts we use the following pipeline. First, we train or fine-tune the embedding models on our data for the different time periods. Then we align the embeddings, mapping them into the 2019 vector space for word2vec, and calculate the cosine similarity between the embeddings of each eligible word in the earliest year and in 2019; for BERT we use prototype embeddings, namely we average the embeddings of all occurrences of a word in the appropriate year. Finally, we obtain the top-20 words with the lowest cosine similarity between these time periods and analyze the revealed semantic shifts in terms of their validity and actuality by finding each word's closest neighbors.

As for the second task, namely the binary classification task of detecting known shifts, we take the embedding models from the previous task and calculate the cosine similarity measure for each of the words in the classification list. Then we train a random forest classifier, which is a classic classifier, using the cosine similarity measure as the obtained feature, and evaluate the model with quality metrics: F1 is the main metric, and we also compute metrics such as precision,
recall, and accuracy scores.

Now let us proceed to the results, starting with the task of discovering semantic shifts in the news corpus. Here we see the top-20 words with the lowest cosine similarity between the studied years, which means that these words have experienced the most significant semantic shifts. Let us take a look at the most interesting examples. For the word2vec model there are words like 'naryad' and 'video'. In the beginning of the 21st century, 'naryad' was associated with the police (a police naryad, if I may say so), while in the year 2019 this word was used to mean an outfit, like a beautiful dress or something. There is also an interesting change in the word 'video': in the beginning of the 21st century it was more like a TV commercial, and then it shifted towards the video clip, which signifies the development of smartphones and the growing popularity of things like YouTube videos and reviews, and so on and so forth. As for the dynamic word embedding model, the word 'polotno' in the year 2000 meant the railway track, while in 2019 it referred to a painting. As for BERT, it captured the same technological change as word2vec, with a different word, 'kadr': this word was used in reference to staff and personnel in the year 2000, while in 2019, with the advance of modern technologies like smartphones, it is used to mean a photo. On this slide you can see the visualization: we can actually see that, for example, the word 'kadr' has shifted a lot, and here are 'polotno' and 'naryad'.

As for the social media corpus, namely the analysis of the VKontakte posts, in my opinion it is even more interesting. Here we see the word 'norma' and the letters 'OOO'. This is really great, because in the beginning of the period 'norma', or 'norm', meant something like 'good, okay', while at the end of the time period it was used in connection with the legal sense, a legal norm. As for the letters 'OOO', in the beginning it was like 'ooo', an exclamation, more like an
interjection, while in 2019 it usually refers to a company, since OOO is the Russian abbreviation for a limited liability company. We can suggest that this change is due to the fact that the audience of VKontakte grew up: in the beginning of the time period it consisted mostly of school students, while now VKontakte is mostly used by adults. The same can be noticed for the dynamic word embedding model: the word 'frantsuzsky' (French) at first was used in connection with the school subject, like a French lesson, while at the end of the period it is associated with other languages. As for BERT, we can highlight the word 'narni' here: in the year 2007 it was used in the meaning of 'alternative', while in 2019 it is used to refer to 'polar', as in the North Pole. Once again, on this slide you can see the visualization of the semantic shifts for the words discussed on the previous slide.

This table shows precision scores for all the models on the discovering task, and we can see that the BERT model showed the best performance. The second table shows the results, in F1 score and other metrics, obtained for the classification task. As I said, for the second task we used our embedding models to produce a feature for the random forest classifier that solves this binary classification task, and we see that our models show approximately equal performance, with word2vec giving the best result. It should be noted that since our training corpus is smaller than the one used in the baseline research, we cannot compare the results directly with the ones from the original research, but we can note that our F1 scores are close to the baseline scores.

So what takeaways can we draw from this research? Basically, word2vec is a simple but rather effective model, and the major problem with this approach is that we need to align the word embeddings. The dynamic approach solves this problem by optimizing and
aligning the embeddings at the same time during training, but this approach is sensitive to the choice of hyperparameters and is rather memory-consuming. The BERT model provides us with contextualized word representations, which automatically solves the alignment problem; however, in our research we have seen that the BERT model shows slightly poorer performance, and one possible reason why BERT shows poorer results in comparison with the word2vec model is that it is not very good at detecting semantic shifts for polysemous words.

In this work we presented a new social media corpus, compared different approaches, namely the static, the dynamic, and the contextualized one, on the semantic shift discovery and detection tasks, and conducted an interesting analysis of the discovered semantic shifts. In the experiments, the tested models revealed political, cultural, social, and technological changes in the Russian language, with the BERT model showing the best quality of 80% for the news corpus. While analyzing the discovered semantic changes in the social media corpus, we suggested that some shifts can be connected with the fact that a large part of the users who wrote the VKontakte posts grew from school students into adults. There are two main directions for future research: we are planning to extend the scope of the studied models, namely to use other contextualized BERT-like models, such as ruBERT-large and multilingual models, and to explore larger text corpora as datasets and use different time slices. So thank you for your attention; now we are ready to answer your questions, and I hope that Veronica is here now. Veronica, if you are here, say something. And could we please show Veronica on the screen, is it possible somehow? Test, test, can you hear me? Yeah. Andrei, let's ask a question. I have two questions. The first one: maybe I missed it, but for the classification task, you said it was a binary classification task; was it about whether the sense of the word shifted or not?
One second. Yeah, Veronica, could you answer the question, please? I didn't hear it very well due to connection problems; I heard that it was about the classification itself. Yes, the first question is about the classification task: was it a binary classification of whether the sense of the word shifted through time or not? Maybe this one will work. Andrei, do you hear us? Yes, I can hear you perfectly. So Veronica can't hear us. Now I hear you; could you please repeat the question? My first question was about the classification task: I understood it was a binary task about whether the sense of the word has shifted through time. Yes. Well, actually it was a dataset with several words, and we checked whether each word shifted: we had ground truth, zero and one, whether there was a shift or not, and we had the words with the appropriate years, so the question was whether the word had shifted from one year to another, whether there was a change, and we used our embeddings to assess this. Okay, thank you. My other question is about an interesting matter in general: do you know where these results can be applied, in production or in some applications? Well, as we said in the beginning, there is practical applicability of such models, for example for information retrieval, sentiment analysis, and machine translation: we can give the models additional information about different time periods, about the changing word meanings over time, as a helper to improve the accuracy of other models. Does it actually help when you add this information to the model for the task? Well, I didn't check it personally, but it should help, yes. Okay, thank you. More questions from the audience? I think we have a question from Andrei Kutuzov; the third question will be from Andrei. Thanks, Maria, can you hear me?
Yeah, go ahead, Andrei. Thanks, Maria and Veronica, for a very interesting talk. I have many questions, and I'm not sure we will have time to address all of them. The main one is about the evaluation of the findings in your discovery step. Am I right that you asked some experts, as you mentioned in the paper? How exactly was it done: you showed them these top-10 or top-20 most changed words and just asked the experts whether these words really changed, was this the procedure? We gave them the words with their closest neighbors, five closest neighbors in each of the periods, and yes, the decision was made primarily on this information, but broader context was also available if they wanted more information. Yeah, but still, you showed the experts only the words from the top of the ranking? As I said, the decision was made mainly based on these five closest neighbors from each of the periods, but broader context was also available. My question is, did you show the experts also some random words from the bottom of the ranking? Don't you think it would be more fair, because essentially the evaluation method you use then measures only the precision, but not the recall? As we said, it is precision we use as the metric for this task, and we understand that it is rather subjective; that's why we added the second task, which is more objective and makes it possible to rate the quality of the models more objectively. The first task was more about the data, to reveal the semantic shifts from the data and to see, more out of interest, what words we could reveal from these data, maybe something interesting; and to make it more objective we added the second task. Okay, thanks. About this classification task: why did you decide to use this dataset from Fomin et al. 2019, when we now have the dataset from the RuShiftEval shared task, which is much more methodologically sane? Thank you for your question. It's because we had
practically almost the same dataset as the one the authors of that annotated dataset used: they also used the Lenta.ru source. That's why we used it; I think there would be more missing words in our dataset if we used, for example, RuShiftEval, since our dataset is not really that large, and this way we could do our best to compare the results with the baseline research. Okay, thanks. It's a pity that I'm not at the conference right now; I would like to discuss it with you more. And maybe the last question: the social media corpus that you released is very interesting, and thanks for releasing it, but what are the legal terms of use for this corpus? Is it legally possible to redistribute it and train models on it? We don't impose any conditions on using it; of course it's free from our side. And since we don't say which user posted which post, I think there are no legal concerns from VKontakte, because we collected it through the free API, so I think this corpus can be used. But you didn't get in touch with VKontakte about it? Because in the past... No, we didn't. Sometimes datasets were removed from the net because VKontakte got in touch with the creators of the datasets and made them remove the data. They didn't get in touch with us, and we didn't violate any rules when we were scraping this corpus, so if they contact us, of course we will have to remove it, but I don't think that should be a problem. Thanks a lot.

Now we have the last talk of this session. Okay, so hello again, it's me again, and I'm going to present a paper on probing, which is titled 'Less than Necessary or More than Sufficient: Validating Probing Dataset Size'. First, let's do a short introduction. It seems something is wrong... Yeah, so first, let's do an introduction into probing itself. We have all witnessed the success of black-box language models, and this success has sparked interest in what is
inside these black-box language models, and the area of probing emerged. A bright example of this area is the dataset named SentEval, which provided some of the first probing datasets for the English language. Probing itself is the task of detecting the capabilities of a language model: we ask, for example, whether the model understands the notion of grammatical number. Diverse probing studies exist; for example, there are studies which draw graphs based on the experiments, displaying a language capability of the model depending on the layer of the model. For example, with a 12-layer model, different domains of language are captured at different layers of the model; this picture is from SentEval. There are lots of datasets for probing, but probing datasets are quite difficult to collect, because they contain real language data, and their size also matters for computational reasons, as always. So in our paper we propose a method called fraction probing, which is used to determine the right size of probing datasets, and it consists of two tests: a data sufficiency test and a data redundancy test. The data sufficiency test is used for existing datasets, to find out whether they could be smaller in size, and the data redundancy test can be used when building new datasets, to find the point at which to stop collecting samples. The method is based on comparing probing graphs and their similarity, both visually and using computational metrics, and it builds on learning curves, which I will tell you more about. Speaking of the related work for this study, it of course includes probing; it includes work on probing dataset size, although this area is quite understudied; it includes sample size determination, which is used for statistical experiments and is very common when starting new experiments; and the related work also includes learning curves, which initially come from psychology but are used, for example, for training
models. The related work also includes progressive sampling, which is quite close to sample size determination but is used when we are building the sample and keep running experiments on it.

Now let's talk more about our method. In the picture on the right you can see the original SentEval graphs, which demonstrate the capabilities of the model. For example, the blue one is TreeDepth, which means the model has to understand how deep the syntactic tree of the sentence is, and the graph displays the performance of the model at different layers. The way we interpret it is that the middle layers are more capable of this task than the first and the last ones. In the other two pictures you can see what happens if we take a small fraction of the original SentEval; this one is 40% of the original, and you can see that at 40% the graphs are quite similar to what we get when we experiment with the full dataset. The only thing left is to figure out how to compare these graphs directly: by eye we can see that one graph doesn't look quite the same, while another one is quite close. So what do we do? We measure graph similarity. We keep the possibility of comparing the graphs visually, but we also add metrics. We suggest three and experiment with all of them: Pearson correlation, Euclidean distance, and the Fréchet distance. We apply these metrics, first, to vectors: if we have 10 tasks, we compare 10-sized vectors, column by column, to compare the ordering of the graphs. Second, we compare the graphs themselves directly, for example this curve with that one, which gives 12-sized vectors. We expect that Euclidean distance is better at comparing the absolute position of the graphs, while the Fréchet distance should capture both the shape of the graphs and their absolute positioning.
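The three similarity measures can be sketched in a few lines of NumPy. The two curves below are hypothetical layer-wise scores, and, for simplicity, this discrete Fréchet implementation compares score values only, leaving the layer index implicit; both are my assumptions for illustration, not the paper's exact setup.

```python
import numpy as np

def pearson(u, v):
    # shape similarity; ignores the absolute level of the curves
    return float(np.corrcoef(u, v)[0, 1])

def euclidean(u, v):
    # absolute position of the curves
    return float(np.linalg.norm(u - v))

def discrete_frechet(u, v):
    # Eiter & Mannila dynamic program over 1-D curve samples:
    # the "minimum leash length" between the two curves
    n, m = len(u), len(v)
    ca = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            d = abs(u[i] - v[j])
            if i == 0 and j == 0:
                ca[i, j] = d
            elif i == 0:
                ca[i, j] = max(ca[0, j - 1], d)
            elif j == 0:
                ca[i, j] = max(ca[i - 1, 0], d)
            else:
                ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]), d)
    return float(ca[-1, -1])

full = np.array([0.62, 0.70, 0.74, 0.76, 0.75, 0.71])  # hypothetical 100% curve
frac = np.array([0.60, 0.69, 0.73, 0.75, 0.74, 0.70])  # hypothetical 40% curve
print(pearson(full, frac), euclidean(full, frac), discrete_frechet(full, frac))
```

Identical curves give correlation 1 and both distances 0, so a smaller fraction whose metrics approach those values reproduces the full-data graph.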
The Fréchet distance is designed to compare curves and is usually explained like this: imagine a man walking a dog. Each of them can walk along any curve they want, and they can stop, but they cannot go back; the Fréchet distance is the minimum leash length between the man and the dog that allows such locomotion.

Now let's proceed to the tricky part: how do we decide whether the graphs are similar or not? For the data redundancy test on an existing dataset, we compare all the smaller fractions with the original 100% and draw a so-called learning curve. On the OY axis you see the metric that measures the distance, and you can see that, moving towards 100%, the metric gets smaller, which means we are getting closer to the original dataset. The question is where to stop, and we use a form of the elbow method: we say that if we stop at the beginning of the plateau, we are in the right place. So that is what we do with existing datasets: we continuously plot the metric for different fractions and find the elbow, the place where the plateau starts. Things are trickier for the data sufficiency test, for datasets that are being built at the moment, because we don't have any dataset to compare with; we don't have that imaginary 100%. So we simulate the setup of an existing dataset: we take the data we have right now, imagine that it is our 100%, and compare the previous fractions with it. The problem is that we obtain the next points of the graph continuously, and just by looking at the graph we cannot say whether we have already reached the plateau or not. So in addition to computing the raw values of the metric, we also compute the first and second discrete differences of the metric: the first difference shows the absolute change of the metric, and the second difference shows the speed of that change.
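As a toy illustration of the elbow idea: the learning-curve values below are invented, and the tolerance rule is my own crude stand-in for a plateau criterion, not the paper's.

```python
import numpy as np

# Hypothetical learning curve: distance between each fraction's probing
# graph and the 100% graph, for fractions 10%..100%.
fractions = np.arange(0.1, 1.01, 0.1)
dist = np.array([0.30, 0.18, 0.11, 0.07, 0.05,
                 0.04, 0.035, 0.032, 0.031, 0.030])

d1 = np.diff(dist)   # first difference: absolute change per step
d2 = np.diff(d1)     # second difference: how fast the change slows down

# Crude elbow rule (an assumption): stop at the first fraction where
# adding more data improves the metric by less than some tolerance.
tol = 0.008
elbow = fractions[1:][np.abs(d1) < tol][0]
print(f"plateau starts around {elbow:.0%}")
```

On curves like this, a near-zero second difference confirms that the improvement has not just paused but stopped accelerating, which is why the paper tracks both differences.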
So that is the method; now let me show its applicability. We work with SentEval, the most famous probing suite. It consists of 10 tasks of 10,000 samples each; we create fractions of it of increasing size, perform the data redundancy test on the existing SentEval, and also simulate the data sufficiency test as if we were building the original SentEval. We experiment with BERT and RoBERTa and use logistic regression as the probing classifier. The results of the data redundancy test you have already seen: at 40% of the original SentEval, the graphs are quite close to what happens at 100%. Here you can see the table of all the results for the BERT model. There are lots of figures, but it covers both the data redundancy test and the data sufficiency test for the three metrics we compute; the figures are the fractions of the original dataset recommended by each metric. You can see from this table that the SentEval dataset could actually be massively reduced, although the exact numbers differ across tasks. Some conclusions from this big table. The visual method shows that every task could be reduced without losing its explanatory power. The tasks differ in how they behave as their size increases: for Word Content, for example, the absolute values of its curve change rapidly across the different fractions, while other tasks don't show such behaviour; we call this score growth, so there is a group of score-growth tasks and a group of tasks with no score growth. When applying the metrics task-wise, we conclude that the Fréchet distance shows the lowest mean error, while Pearson correlation doesn't really look at the absolute values of the curves.
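The probing setup itself, a frozen representation plus a logistic-regression classifier, can be sketched as follows. Here random vectors with an injected signal stand in for real BERT/RoBERTa layer representations, so the data, dimensions, and accuracies are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for layer representations: random 32-d vectors with a weak
# linear signal for a binary property (e.g. singular vs plural).
n, dim = 2000, 32
y = rng.integers(0, 2, size=n).astype(float)
X = rng.normal(size=(n, dim)) + 0.7 * y[:, None]

def probe_accuracy(X, y, steps=300, lr=0.1):
    """Train a logistic-regression probe with plain gradient descent."""
    split = int(0.8 * len(y))
    Xtr, ytr, Xte, yte = X[:split], y[:split], X[split:], y[split:]
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        z = np.clip(Xtr @ w + b, -30, 30)   # numerical safety
        p = 1.0 / (1.0 + np.exp(-z))        # sigmoid
        g = p - ytr                         # gradient of the log-loss
        w -= lr * Xtr.T @ g / len(ytr)
        b -= lr * g.mean()
    pred = (Xte @ w + b) > 0
    return float((pred == yte.astype(bool)).mean())

for frac in (0.1, 0.4, 1.0):                # probe on growing fractions
    k = int(n * frac)
    print(f"{frac:>4.0%} of data -> probe accuracy {probe_accuracy(X[:k], y[:k]):.2f}")
```

In the fraction-probing spirit, once the probe's accuracy curve stops changing as the fraction grows, collecting more samples is unlikely to change the probing conclusions.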
So if we want to compare only the shape, and the absolute values aren't really relevant for us, then we can use Pearson correlation. For the layer-wise way of applying the metrics, we see that, in general, preserving the ordering of the curves requires more data than preserving the shape. For the data sufficiency test, simulated on SentEval, we see that the discrete differences consistently recommend higher fractions, so they are in a way stricter than the raw metrics; however, they are highly correlated with the raw metrics, and the results of the simulated data sufficiency test resemble the results of the data redundancy test on the actual SentEval. Another observation is that the second difference produces less error than the first one, which can be explained by the fact that it is less strict and looks at the change of the change itself. I already showed you this graph. Looking at the particular linguistic tasks of SentEval, we can divide them into two groups: first, tasks that can be reduced to the minimal fraction of 10% for both BERT and RoBERTa, and second, tasks which require more data. Interestingly, this division cannot really be explained by the linguistic content, the linguistic sense, of the tasks, so it needs more thorough investigation. One thing we can already note is that the standard classification parameters remain relevant: the Word Content task that you see here is different from all the other tasks because it has a massive number of classes, 1,000, compared to the two or three classes of the other tasks. We experiment with two models so we can compare between them, and we find that RoBERTa consistently requires more data. There could be different explanations for this; we go with the explanation that RoBERTa was developed on the basis of BERT and has more high-quality data encoded in it, and therefore it needs more probing data
to find out what is inside. Similarly, RoBERTa needs more data to preserve the ordering of the tasks; these numbers are higher for RoBERTa than for BERT. So, to conclude: we proposed a novel method for determining the right size of a probing dataset. It consists of two tests: the data redundancy test, which is applied to existing datasets to find out whether they are actually bigger than they need to be, and the data sufficiency test, which is applied to datasets that are being built at the moment. We experimented with SentEval and applied the method in both setups. As for further work, it would be interesting to look more deeply at the learning curves we work with: we call them learning curves, but they are not the learning curves usually meant by that notion, since they are created from artificial data, and it would be interesting to find out whether they follow the inverse power law that was shown for usual learning curves. Another important point for future work is a numerical definition of the plateau, which for now we determine by the visual method; in some cases it is tricky to find where the plateau is. And another point is to apply the proposed method to other existing probing datasets, because our results imply that the existing probing datasets could be smaller, or even much smaller, than they are at the moment. Thank you.

Is your methodology for estimating the sufficient size of datasets generalizable to other tasks, not only probing but other classification tasks? I think it would be quite valuable for pretty much every task.

Thank you for your question. It is applicable to any probing task that draws curves, but, developing on your question, I can say that the method could be applied to any task that produces vectors, because the novelty of the proposed method is that we come up with a means of comparing
the shape and the relative positioning of the curves. So, answering your question: the method could be applied to any task that produces vectors of numbers, and such tasks are pretty generic.

Did you look for alternative methods that may exist in the literature for estimating the size of datasets, maybe not specific to probing but in general for machine learning? Do such methods exist?

Yes, they exist; determining dataset size is actually a big, classical area in machine learning. But those methods are usually applied to tasks that produce either a single number or just a classification, and our task of comparing curves is more complex in that way.

Thank you. More questions? OK, let's thank the speaker and go for lunch; we will come back at three. Speakers who do not have a lunch coupon, please approach me and I will give you the coupons.

Dear colleagues, I think it is time to start the session; please take your seats. In this afternoon session we start with our second keynote talk. The talk will be given by Hakim Hacid, who is a principal researcher at the Technology Innovation Institute in Abu Dhabi, United Arab Emirates, and an honorary professor at Macquarie University in Australia. Hakim will be talking about HAI, or some movement towards HAI. You're welcome.

Thank you, Maxim, and thank you, everyone. Good afternoon. The talk right after lunch is always complicated, so we will try to make it as light as we can. Initially I was planning a more classical talk (I will still be talking about HAI) where I wanted to touch on the different methods and strategies that we are using, but then I thought that was slightly too classical. So I tried to gear the talk towards something more industry-inspired: we have been meeting with many industry people these last weeks, and I wanted to share the small
experience we had in relation to HAI and generative AI models, or LLMs. I hope this will be useful for the different profiles we have here in the room, and maybe it will inspire some ideas on topics that could eventually be treated at the fundamental level. Just before I start: I come from TII. TII is a research institute located in Abu Dhabi; it is the research arm of the ATRC, the Advanced Technology Research Council. TII is composed of ten research centres, going from materials, autonomous robotics, and biotech to space and directed energy. We are in the AI centre, which is in the Digital Science Research Centre. These days we are focusing a lot on LLMs, large language models, but we are not doing only that: we also work on image processing, on theory and fundamental issues (with Maxime, who is here), on edge computing, and so on. My presentation is organised this way: some discussion about generative AI; edge machine learning, in two aspects, inference on the edge and learning on the edge, on which we are trying to focus; three use cases we are working on, just to illustrate this edge AI; and then we will conclude with the future of generative AI. The logic behind my talk today is to bring together these big generative models and also to open the door to this edge AI, hoping to convince some people here that having bigger and bigger models is not necessarily the right option to follow, and that there are other options out there. Now, about the generative AI that is making all the buzz today: it is not new, and I am sure everybody in the room is aware of that. The RNNs and LSTMs that came before the transformers are probably older than many of the people in this room. But we have seen a lot of evolution in these last years, after
the transformers. And the transformers were not the only reason: the computation power, the physical layer, became much more powerful, which allowed these models to be executed and exploited. Now we see the different models we keep hearing about: LLaMA, GPT-4, and, for example, Falcon, which comes from TII. Now, AI is not a new thing; it has been around for a long time, and I think it goes together with computer science and computing in general. But starting from the end of the 80s, we began looking into artificial intelligence more closely, hoping that we could have really intelligent systems. At some point, though, the objective was too big for what the systems, the equipment, the physical layer, could provide at that time. You may remember the expert systems of that era, when people were promising that those systems would solve all the problems we have at work and in industry; that didn't happen, it fell short, I would say. Then came machine learning, in the mid 90s, where we focused more on statistical analysis and simpler programs. Then we got deep learning, which I think came at the right time, because the physical layer had become much more capable. And nowadays we are talking about generative AI, which is basically an AI that is able to generate content. This is very important to keep in mind: this AI generates content, which means it cannot, for example, do reasoning, to which I will come back later. You have different players in different domains; I cannot go through all of them, but generative AI is the combination of deep neural networks and a much higher computation capability. And then we got into this competition that we see, at
least from our side, almost every day: who will build the biggest model? There is a huge competition there, and this competition, I would say, is not healthy at the end of the day. We keep trying to build the biggest model, thinking that the bigger we are, the better we should be. But people have started looking at this problem in a different way. Different scaling laws have come up, demonstrating that the quality of a model is not necessarily related to the size of the data that you put inside; it is related more to the quality of the data. Just to give you an example: when TII released Falcon with 40 billion parameters, we did better than LLaMA, and honestly, the architecture we had inside was not much more sophisticated than LLaMA's. The only thing that was done was to take the data and clean it in a better way: we used a smaller portion of the data, but the team cleaned it much better, and the results became much better than what LLaMA had at that time, so we were ranked first on the different leaderboards. Then LLaMA came back with its second version, which was better, but then we came again with Falcon 180B, which is better still. So we don't know where things are going, but there is this new vision saying that maybe we shouldn't continue growing the size, and we need to look into other things. This is justified not only by the researchers working in the area; this is why I said in the beginning that we had a lot of meetings with different industrial partners and players, who are actually complaining about this size. You can imagine: if you have a model with 180B parameters, running the inference is also costly, and those businesses are not ready to spend that amount of money to run models. We can add to that the fact that those models are too generic; they are not necessarily specialized in
the business problem that they want to solve. So the equation is becoming more complicated, and we need to look into other options. In this slide I tried to look back and see whether there is a parallel between what happened in the past in the hardware world and what is happening today with the LLMs. In the hardware world, we started by building big computers, and at the time the logic was the same: if my computer is bigger, it should be more performant. But then we found that it doesn't necessarily follow that logic, and we started building smaller computers, personal computers, and so on, to the point where nowadays we have phones that can compute more than some laptops, and we have the Internet of Things and different other things. I believe this generative AI will follow a very similar trend, going from "the bigger the model, the better the quality" to "the smaller the model, the better, or at least equivalent, the quality". This opens the way to looking into this edge. What is the edge? Basically, the edge of the network: the small devices out there that are nowadays used mainly to capture data and display results, instead of having any sort of computation happening on them. So, can a single large model sort out everything? This is the question we are asking more and more nowadays, and I think the simple answer is no. We are facing a lot of issues when it comes to the practical dimensions. The businesses, as I said, are not ready to spend a lot of money on a general model that is not necessarily serving the objectives of the business. I hear that we can do fine-tuning, for example; yes, but that is also costly. So people are questioning the use of these models. You have different other aspects related to this answer of no, or that justify
the no answer. You have domain specificity, and then you have data availability: as of now, or at least as of three days back, all the LLMs are built on historical data, while businesses are interested in real-time data; how do you process that real-time data, how do you integrate it into the LLM to be able to exploit it? You have the multimodal tasks: people are not interested only in text, and we see more and more multimodal models coming out, but we still have the same issues in terms of generality of the content, issues with their fine-tuning, and so on. There is the cost related to the LLMs, or these generative models: building a model is not, I would say, open to everyone for the moment. You have big players who are building such models: OpenAI, Google, Meta, and TII, which is spending a lot on that. But if we go to small companies, or just companies who are not in that business, the cost may be extremely high, and it is too risky for them. You have the issues of privacy and security: your data goes into the LLM, you don't know what is happening with your data, and your competition can get access to the data, for example, without you knowing anything. Customization: for those who are working, for example, on the web and on recommender systems, there is no personalization added by default to the LLM or the generative model. Scale and complexity: here again, the LLMs and the generative models are specialized in generating content, right? They learn patterns and then they try to give back those patterns; there is no reasoning there, as I said. And then you have issues related to the ethical and bias considerations and the carbon footprint, which many of the actors I mentioned before are trying to work on, but there is still a lot of effort to be done. Just to give you an example, coming back to Falcon 180B: the amount of computation that was used, correct
me if I'm wrong, Maxim, I think was 4,000 GPUs for six months. That's huge, and in terms of energy, I think it is more than the cars circulating outside consume over some period. So, the generative AI process when it comes to the domains: the conclusion from the previous slide is that we need to think of more specialized models; we need to rethink the approach that we have taken to building these models. Instead of building general, big models, maybe the idea is to look into smaller models exploiting smaller devices. For those who are not familiar with generative AI: one part of the system is getting your data, be it text, images, speech, whatever you have; then you do some preparation of the data and you do your training, and this is your foundation model. But when you want to apply it to a specific domain, say energy, education, or finance, you need to do some fine-tuning. Fine-tuning is the adaptation of the general model to a specialized domain or a specialized task. This is of course costly; this part is less costly than the pretraining, but it will still cost you something, the adaptation itself and then, usually, the hosting of the inference. To get to specialized models you have this fine-tuning, but we can also think of building more specialized and much smaller models. Some arguments push us towards thinking of other strategies. Privacy and security concerns: most of the companies and businesses are now raising these issues, and you even have governments setting rules that business data should not leave the country. This is what we have in the UAE, for example: your data as a business should not leave the country and should be processed and stored in the country, so using services like ChatGPT is against the law, and none of the companies is using these kinds of things. I mean, this is an argument that will come up more and more, to push us
I mean, push everybody, either to build their own LLM or generative model, or to think of strategies that will help in building smaller models in the future. Then there are the high cost constraints: everybody is complaining about the cost, and whenever you mention the cost, people are not prepared for that. You also have the awareness problem: people don't understand the energy or the computation needed to run these kinds of models; but still, the costs are high and everybody is complaining about that. You also have some opportunities that are opening up at the edge level: the computation power there is getting more and more interesting, so we can do some things at the edge. At the edge you basically have your computers, your phones, for example; they are not at the same order of magnitude as the GPUs or the clusters of GPUs, but maybe they can be exploited for such a thing. And there is the high demand and expectation on performance: people want less latency, for example. Good. So this is the architecture, I mean the very high-level architecture, of what we have as infrastructure today: we have the cloud, and we have edge devices that are connected to the cloud. As I said before, the edge devices are currently used mainly for capturing the data and displaying the results that you may get. With most of the techniques we have, we use these devices to capture the data, we send it to the cloud, where all the computation is done, and when we have the result we send it back for display. We believe that the edge layer is not that much used, I would say; we could use that layer in a better way, to sort out these privacy issues, these cost issues, and these latency issues, and these are the reasons that pushed us to start thinking about the edge. So, edge machine learning: what is it, basically? It is a combination of edge
computing and machine learning. The objective is to build and execute machine learning models directly on the edge. Of course we have to start from somewhere: we will not start by building an LLM directly on the edge, but what we are trying to do currently is to build at least traditional machine learning, or some small deep learning models, on the edge. To go a little bit further: this edge machine learning can be operated on one device or on multiple devices. The constraint is no sharing of the data: we don't want to share the data with the cloud, for example, or with other devices, or at least we should have control over that sharing, and we hope that we can offer capabilities similar to what you get from a cloud system. Some questions that motivate the use of the edge: is there a real need to share the data with the cloud; is there a need to let your data leave your device and go somewhere else? Then, how can we allow training to happen directly on the edge? This is a very important issue because of the limited computation that you have on the edge. And how can I efficiently execute those models on the edge when it comes to the inference? We have written a paper on this; it is a long paper, and I will try to summarize it in four or five slides. We have tried to understand the different requirements of edge machine learning, divided into three parts: the machine learning requirements, the edge computing requirements, and some overall requirements that relate to everything. From the machine learning perspective you have low task latency; high performance, which we always need when it comes to the computation; generalization, which again we need when we train our models; enhanced privacy and security; and independence from labeled data. From the edge computing side you have efficiency of the computation, optimized bandwidth, offline
capability, and low communication latency. This offline capability matters because people may have issues with the network and still need to use their services; that is a very important thing, and it should be enabled by edge machine learning. And you have the cost and the energy, which I would say are related to everything: we have to play with all these parameters when we think of doing edge machine learning, since we do not have unlimited resources as we have in the cloud, for example. There are three parts to edge machine learning: learning on the edge, inference on the edge, and some work that is done around the preparation of the data directly on the edge. The paper I was referring to is this one, for those who are interested; it was published a few weeks back, and it is a long paper, around 50 pages, but it goes through the different aspects related to edge machine learning. On inference, I have tried to summarize things in this slide. We have different ways of bringing a big model to run on a smaller device. The first one, which I believe everybody is familiar with, is quantization; we are working a lot on that. The idea is to change the encoding of the model's data, to reduce the encoding, so that the model becomes smaller but we still keep its quality; we are able to quantize models from 16 bits down to 4 bits with very similar performance. You have people working on weight reduction; we are not handling that part for the moment, but the idea is a sort of pruning strategy. You have knowledge distillation and activation approximation; I am going quickly over these, as we are not doing much there, but the last two we are exploring a lot, combining them with model compression. And we have caching, when it comes to the exploitation of large language models on the web.
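The 16-bit to 4-bit quantization mentioned above can be illustrated with a toy round-trip. This is a minimal uniform-quantization sketch with a single per-tensor scale and invented weights; real LLM quantization schemes use per-group scales, calibration data, and other refinements.

```python
import numpy as np

def quantize(w, bits=4):
    """Uniform symmetric quantization: map floats to small signed integers."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for 4-bit
    scale = np.abs(w).max() / qmax                # one scale per tensor
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(256, 256)).astype(np.float32)  # fake weights

q, scale = quantize(w, bits=4)
w_hat = q.astype(np.float32) * scale              # dequantize for inference
err = np.abs(w - w_hat).mean() / np.abs(w).mean()
print(f"mean relative error after the 4-bit round-trip: {err:.3f}")
```

Storing 4-bit codes instead of 16-bit floats shrinks the weights roughly fourfold; the quality question is whether the dequantization error stays small enough not to hurt the model's outputs.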
So, some reflections on inference on the edge. There is a variety of methods out there, with promising results: you have methods that are able to reduce models 32 times and still keep a very similar quality. But there is a lack of real public implementations: we don't have many things that are stable and that can be commercialized, for example, which means we still have a lot of work to be done at this level. There is a lack of automation: to do this quantization, for example, you need to have people working on it; it is trial and error, so you see what is not working and you try again and again until it works. This is also related to the diversity of physical architectures: it depends on your processor, it depends on the physical capacity that you have, so you need to have people behind it to control it. Again, we are investing a lot in this part, trying to bring solutions that automate the quantization, because it is very important, and again, it should bring an answer to all those businesses who do not want to see their data moving outside their company. We don't have a killer application yet, so more work is needed; for those who want to work there, there is really a lot of effort that needs to be spent. Learning on the edge started more recently, basically. You have different approaches again, but the objective of what we want to do is to build the model directly on the edge: we don't want the help of the cloud, we want to build directly on the edge, and you have tight constraints that you need to work under. There are different methods, which I am listing here, but the most used one, I would say, is distributed learning: we try to distribute the learning across different devices. Everybody is aware of federated learning, I guess, but you have other methods in that same space as well. Well, we are exploring it, but here we are
exploring it from a theoretical perspective, in the theory team; when it comes to the application, the concrete deployment of these kinds of things, we get a lot of physical constraints again that need to be solved. We are trying to work on that, especially when you use, for example, heterogeneous devices; that is not an easy setting to have, whereas in theory we tend to abstract all those constraints away and not take them into consideration. So we have federated learning; we have split learning, which we are also exploring; and transfer learning. Again, I will share the slides for those who are interested, so as not to go too far into technical things. Then we have the summary that we have tried to build of edge ML. This is the taxonomy; again, it goes back to edge inference, edge learning, and data preprocessing, and you have a lot of methods in there. In terms of research papers, there are a lot out there; I think the paper we have published has more than 250 references. But when it comes to the industry and the platforms that support these kinds of things, this is very limited, and I think there is now a need for investment on that side to make things happen, because again, this will help bring these LLMs to where at least the inference can be done at a lower cost, and to generalize the use of these LLMs. This is just a short summary on the learning part: the generalization and the adaptation are complicated to do when you do the learning on the edge, and we need to set the theoretical foundations of that. On the architectures, we are always facing the issue of heterogeneous devices, so this is an important thing that I believe needs to be taken into consideration when we build that theoretical foundation. You have the hybrid approach, which in my opinion should be exploited: hybrid in the sense that I can use the edge, but I can also use the cloud, as a collaboration between these two things.
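Federated learning, the most familiar of the distributed methods mentioned above, can be sketched as a toy FedAvg loop on a linear model. The data, client count, and step sizes below are invented for illustration; the point is that only model weights, never raw data, leave each simulated device.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])         # ground truth shared by all devices

def local_update(w, n=200, steps=20, lr=0.1):
    """One client: fit a linear model on private data with a few SGD steps."""
    X = rng.normal(size=(n, 2))
    y = X @ true_w + 0.1 * rng.normal(size=n)
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / n
        w = w - lr * grad
    return w                            # only the weights leave the device

w_global = np.zeros(2)
for _ in range(5):                      # five communication rounds
    client_ws = [local_update(w_global.copy()) for _ in range(4)]
    w_global = np.mean(client_ws, axis=0)   # FedAvg: average client weights

print("global weights after 5 rounds:", w_global)
```

Averaging weights instead of pooling data is what keeps each device's samples private, at the cost of extra communication rounds to converge.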
Data quality and assurance: any business you talk with will tell you they have issues with this, because we tend to assume the data is good, but in reality it often is not, so we need methods that verify data quality — we have a PhD thesis working on that. And then standardization needs to come to help us with this diversity of devices. I brought three use cases here just to show you quickly some of the work we are doing on the edge side. This is George, who is trying to show us something. What we are trying to do in this application is to demonstrate the possibility of building a machine learning model directly on the phone, illustrated on activity monitoring. George has a small model running on his phone; he just showed that the recognition works — we can identify the activity — and then he tried a new activity that is not known to the model. What he will do now is record some samples of data for that activity and then rebuild, or rather update, the model directly on the edge. Here he is collecting samples, and shortly he will run — now he got tired, so he can start the training. The important thing to keep in mind is that the training is happening on the phone: we are not sending any data to the cloud; everything happens on the phone. It takes a little time — the point was not performance but to show the possibility of building and updating the model directly on the edge. Now the model has been updated and he will test it: he starts the inference to see whether the new data has been incorporated, whether the model has been updated or not — and he shows that it has. Those who work a lot with neural networks know the huge issue of catastrophic forgetting: as you
learn new things, you forget the previous ones. We have addressed that in the system as well: we are able to incrementally update the model without forgetting what has already been learned. The second use case is the reinforcement learning algorithms we use to help with drone navigation. Here we combine image processing with reinforcement learning; the idea is to help the drone autonomously explore an area without colliding with the obstacles around it, and the final objective is of course to run these models directly on the edge — on the drone itself. The nice thing here is that we learn some situations — it's a reinforcement learning strategy — and then the environment can change, but the model keeps working properly and the drone still avoids collisions with the walls. Let me jump to the scenario where we added — it's this one, normally — some obstacles: you see the black obstacles are new, they were not in the initial learning environment, but the system is still able to recognize them and continue. In the third use case we integrate these models into web navigation. We built an extension for the browsers you use: on any page you visit, you can request a summary, and you can also discuss the page, interact with it through questions and answers. This could work with the cloud, but what we wanted was to bring it to run on the edge. So, building on our quantization work, we quantized the models and brought them to run on CPUs, and when you install the extension you actually get an instance of the model on your computer, so whatever you discuss about any page on the web, nothing goes outside. Here we asked for the summary and then we started
asking questions on the content of the web page — for example, "How much funding is allocated toward low-carbon solutions?" — and after some time you get an answer. We asked different questions, and the system draws the answers from the page, not from the LLM itself.

Just to conclude, on the future of generative AI: many people see it as building ever bigger models than we have nowadays. From our side, we believe the future lies not in one model but in a combination of smaller models — and not only that: generative models will not solve all problems by themselves, so they should be mixed with what we call traditional machine learning models. Then you can have hierarchies of models, coordination of models, as we do with web services for example, and build a more complex system that brings solutions to more complex problems and is better adapted to the situations we face today. So the future of generative AI should be multi-modal — that is important to keep in mind: we do not want only text or only images, we need combinations of different types of data. Specialized models should be at the heart of the ecosystem — we don't want huge models that are too generic, or at least not only those — and we need to build collaborative strategies between all those models. We need reasoning capabilities, which we do not have nowadays, and security and privacy need to be taken into consideration. The last point, and I think the most important one, is the actionability that needs to be attached to these generative models. As of today the models only generate data: they can recommend things for you — things to do, things to watch, whatever — but when it comes to the action, you have to act, you have to follow up and do the real work. There is a need to integrate some actionability into these models to make them really
supportive for the end user and for the business. I think that's all from my side, thank you.

I have a question. Some phones are starting to have accelerators specifically for this kind of computation — for example, the Neural Engine in iPhones. The question is how accessible they are and what the perspective is, because as I understand it, they are not very accessible now. What do you think about the next three years?

Well, that's a good question, thanks, Kirill. These things always start with the same pattern, I would say: the device is expensive at first, but as we move on, new technologies come and they become accessible. To be fair, it's a matter of how much funding people have — for those who can budget 4,000 GPUs for six months, it's accessible, right? So we have started acquiring some of those devices, trying to play with them for the moment and see what we can do. But I think they will be generalized soon with the new architectures that are coming — look at NVIDIA and the other companies coming up with new devices and new architectures. In my opinion, it will become accessible much faster than we think.

Could AI be used in genetic engineering and nanorobotics, and if yes, how?

That's a good question. For nanorobotics, I believe we are not at that level yet — we need to explore. It's a matter of how much computation we can exploit at those scales, and my understanding is that we don't have a lot of computation at that level, so we need to look into it. I don't have a direct answer; I need to explore a little more — that would be my answer for the moment. It would be much lower in terms of edge, right, so we need to think about that.

Thank you, Hakim. Any more questions? I think you had to turn it on.

Yes — as far as I can tell, what edge AI is about is
that computation happens on the user's device, computing what the user asks for. Is there any room for exploring computing everything on a network of devices that are signed in to use the service — something similar to a torrent network? Yes, there could be privacy issues, but maybe encrypting the data would help.

Yes, definitely. The edge alone cannot solve all the issues, right? So I think there should be a collaborative approach where the edge — I mean the small devices — collaborate together, and they also need to collaborate with the cloud. The message here is not that we should eliminate the cloud from the equation; it's that devices have to collaborate with each other and with the cloud, because at the end of the day some things may need to be collected at the cloud level and computed there. But there could be a peer-to-peer protocol where different devices collaborate to build something.

Thank you, and thank you for the talk. Any more questions?

Following up on a previous question: what do you think about a multi-agent approach to LLMs, where you don't have one giant boss LLM but rather a protocol of communication between different LLMs — some specialized LLMs that know how to better answer certain questions, and others that answer questions in another domain or another language — similar to the distributed systems in human society? Does this more or less fall into this edge computing paradigm, or is it something different?

Well, I totally agree with that; I'm really aligned with it, and that's why I tried to build the figure we have here. What you call an agent is basically a model, and the models have to collaborate between them. This collaboration can happen one-to-one, for example between a generative
model and another one, but you could also use some more classical machine learning components, because the generative model will not solve everything. Then you will have other layers coordinating and building the more complex logic that needs to be executed.

Human language is the most complex system ever invented, and it is a mechanism to communicate. But for LLMs, or artificially created systems, what should this protocol be? Do you have any idea — should it also be human language, or an artificial one? What would be an efficient protocol, an interface, for this communication?

I would say that's an open question; we need to work a little more on that.

Thank you. Well, it seems there are no more questions — ah, still some. Go ahead.

I wanted to ask: are you concentrating on general tools and technologies, or do you also work on requests from companies about their specific tasks?

That's a good question. I would answer in terms of the organization we have: on the TII side we do the R&D, building those generic tools, optimizing, working on the edge, for example. But then we have another entity, called VentureOne, that focuses more on interacting with customers, doing the fine-tuning and that kind of work.

So a company may come to you with a request, and you will tell them what you can do to optimize their process. Okay, thank you. I'm pausing to be sure — I think we are good. Let us thank Hakim again.

In about five minutes we start the next session. I remind you that we have two parallel sessions: the one on theoretical machine learning and optimization will be in a different room; those who came for NLP stay here for the NLP session, and the session on machine learning and data analysis is in the other hall. Let's start with the first talk.

Good afternoon, everyone. It is my
pleasure to present a work on multi-label topic classification for the Kyrgyz language — joint work with Sergey Nikolenko of the Steklov Mathematical Institute in St. Petersburg, among other affiliations, and Gulnara Kabaeva of the Kyrgyz State Technical University named after Razzakov. A little introduction and motivation: Kyrgyz is a language of the Kipchak branch of the Turkic family, and several million people call it their mother tongue — mainly, of course, in Kyrgyzstan, but also in China, Tajikistan, Pakistan, Uzbekistan, Afghanistan, and Russia. However, despite a certain corpus of research with a computational linguistics flavor dedicated to Kyrgyz, the number of open language resources for it is rather small, and anyone trying to solve an applied problem involving Kyrgyz often meets obstacles due to this lack of resources — so one can say the language is definitely a low-resource one. Until recently — and here I think we can say thanks to Sberbank — there were no general-purpose LLMs for Kyrgyz. However, as with many other languages, among those hundreds of families of multilingual models trained on Common Crawl and other large bodies of text, Kyrgyz is one of the covered languages, so one can attempt to use XLM-RoBERTa Base, XLM-RoBERTa Large, etc.,
BERT Multilingual Cased, and so on. As we've just discussed, although specialized solutions may arise in the nearest years, current NLP still leans towards universal models. Still, reliable evaluation for any language is necessary, and it is arguably impossible without manually annotated resources. That is why we decided to make a first step and develop the first dataset for an applied task that could be suitable for fine-tuning LLMs for an applied Kyrgyz problem — just to find out whether it is possible at all; the answer to this question is not as evident as one may believe. After that we will publish a benchmark — a competition is ahead, so we haven't yet, but we will publish it with all the results, all the models, and all the data. We built our own dataset with the kind permission of the editors of the 24.kg agency, a Kyrgyz news agency: we scraped the Kyrgyz-language section of the site, which yielded some 23 thousand news articles. The site had no topical labels for news in Kyrgyz — as you can see at the bottom of the slide, certain topics are present, but only the Russian texts are annotated with them, and they are not quite suitable anyway: they are too general, and some of those so-called topics are actually multi-topic. So we had to decide on our own labels. One could try some general-purpose taxonomy like those used in advertising — the IAB taxonomy, or DMOZ in the older days — but those are also too broad. We attempted some zero-shot approaches, and unfortunately nothing worked; and in private conversations practitioners told me many times that when working with topical data from a single source, the label set unfortunately has to be custom. So that's what we focused on. But one can't just invent labels out of thin air, so we did a so-called exploratory annotation: we sampled 500 texts, translated all the titles into English, obtained sentence embeddings (Sentence-BERT) for all of them, and grouped them by hierarchical clustering. Then, in the manual annotation, for every cluster we tried to invent a title general enough to cover most of the news in the cluster, and added extra topical labels for multi-topic texts. After that first pass we re-annotated everything again from the start, because some labels introduced on, say, the second half of the data were not available at the beginning. Here is an example of a cluster and the proposed annotation: clearly all the texts are about certain crimes and fines, but sometimes with a political flavor, sometimes with an ecological one, and so on. Having done that, we decided the label set was established and processed two more 500-text batches the same way, which yielded a dataset of 1,500 texts; the first two batches were used as the training set and the last 500 as the test set. And here are the label statistics for the first two batches: one can see that the label distributions are arguably quite similar, which arguably also signifies that the annotation procedure and the annotation scheme stayed relatively consistent throughout — so, hopefully, usable.
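The exploratory annotation above — embed the translated titles, cluster them hierarchically, then hand-label each cluster — rests on standard agglomerative clustering. As a toy illustration of the grouping step only (this is not the authors' code: real inputs would be high-dimensional sentence embeddings, and all names here are mine):

```python
def agglomerate(points, n_clusters):
    """Toy average-linkage hierarchical clustering: repeatedly merge
    the two closest clusters until n_clusters remain. Points are
    tuples, standing in for title embeddings."""
    clusters = [[p] for p in points]

    def dist(a, b):  # L1 distance between two embedding vectors
        return sum(abs(x - y) for x, y in zip(a, b))

    def linkage(c1, c2):  # average pairwise distance between clusters
        return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters.pop(j)
    return clusters

# 1-D "embeddings" with two obvious groups of titles
groups = agglomerate([(0.1,), (0.2,), (5.0,), (5.3,)], n_clusters=2)
# → two clusters: {0.1, 0.2} and {5.0, 5.3}
```

In practice one would use scipy's `linkage`/`fcluster` or scikit-learn's `AgglomerativeClustering` rather than this quadratic loop; the point is only that each resulting cluster becomes a candidate topic label.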
This yielded 20 topics. The experimental setup was pretty standard, but with a multi-label twist: we had to split the data carefully so that the label distributions in the splits would be similar, so we used iterative stratification. For the bag-of-n-grams models we used two-fold cross-validation, because the dataset is rather small; for the neural approaches, which are computationally harder, we used a simple train/dev split, but with the same splitting procedure. For tokenization we used NLTK, and we are lucky to have a morphological analyzer for Kyrgyz from the Apertium project, which we used for something one could call stemming and basic word tokenization — solely because it seems to be more or less standard for Kyrgyz. Now for the models. Since we are about to build a benchmark, we tried some very basic bag-of-n-grams models with extensive hyperparameter tuning. There were two groups of methods. One group is based on linear models — logistic regression, and stochastic gradient descent over what are basically linear SVM and logistic regression objectives. The first approach, which we call independent classifiers, is just a set of independent binary classifiers, one per label; the chain approach is also standard: the classifiers are trained in a row, and the training data for the next one in the sequence uses the predictions of the previous one. The other approaches may not be a great choice for large sparse vectors, but we included them since they are truly multi-label and yet classical enough — the nearest-neighbors flavor of multi-label classification. And here are the results. For evaluation we used several metrics; probably the most descriptive one is the Jaccard index — the sample-wise Jaccard index, computed for every pair of
the set of predicted labels and the set of gold-standard labels, averaged over all samples. The numbers show that the overall performance is not that great. We also computed exact-match measures, Hamming distance, and F-measures — the micro-averaged flavor and also the sample-averaged version — and we added the percentage of samples for which at least one label was guessed correctly. What we see here: the first approach is a simple bag of raw token n-grams with different hyperparameters — these are the best groups of hyperparameters found by grid search for each family — and the results are not great; clearly, as expected, the nearest-neighbors methods perform poorly. When we move from raw tokens to character n-grams, quite expectedly, we get a boost in performance. When we move to stemmed tokens — just compare the numbers here and here — we also get an improvement, again quite expectedly, since Kyrgyz is a morphologically rich, agglutinative language, and removing almost all affixes gives a boost. An interesting observation is that taking stems and also converting them to character n-grams improves the results further. That's it for the very basic approaches, but probably the most important thing we planned to do with this benchmark, apart from publishing it, was to try fine-tuning some multilingual language model. We failed miserably with multilingual BERT — it just couldn't produce any reasonable results, however hard we tried — but we achieved certain success with XLM-RoBERTa and its BPE tokenization, and as you see, it outperforms all the previous approaches by a large margin. It is also important to note that for the bag-of-n-grams and bag-of-stems approaches we did quite an extensive hyperparameter sweep; with RoBERTa that was not the case — we just found a proper number of epochs, and that's it, more or less default parameters. So even without any proper
hyperparameter search we beat the previous models by a large margin, which sort of proves the point: multilingual models are suitable for fine-tuning for Kyrgyz tasks. To conclude, the main outcomes of this research: we have manually annotated, and will very soon publish, a dataset for multi-label classification of Kyrgyz news, and we have shown very clearly that multilingual approaches are feasible and outperform the earlier, more classical, simple approaches by a large margin. Everything will be published. Of course, you may have noticed that this annotation approach, the size of the dataset, and everything else is not without flaws, and we are going to add more evaluations to the benchmark. First of all, we will try the model that appeared just before the submission deadline — the Kyrgyz mGPT. Also, as the reviewers rightly noted, it would be nice to automatically translate the Kyrgyz texts into English and apply some English-based models like BERT with fine-tuning — this is clearly something that should be done for low-resource languages, to test what can be achieved with the state of the art. We also believe that the zero-shot and prompt-engineering approaches, which were not quite successful when we tried them, should not be thrown away; maybe we should return to them, converting articles into something suitable for prompting. Most importantly, I guess — and due to certain recent developments in Bishkek this is now easier — we will annotate more data with the help of multiple native speakers, and we will do it properly, with a fixed instruction and several annotators per sample. Since I have some time, just a short note: this work, at least the annotation, was carried out mostly in 2022, but something changed in 2023 — a large data science community evolved, a large community of volunteers appeared in Bishkek, and we have done other things as well: some named entity recognition datasets, with more efforts on the way, such as yet another Kyrgyz corpus. So there is more work ahead, and it should of course be done, with or without our participation. I think that's it — thank you for your attention.

Thanks — a lot of work has been done. One question I'm interested in: is a transfer learning strategy through multilingual language models preferable over pure translation of datasets? Say you don't have resources for Kyrgyz but you have a machine translation system. One way is to just translate the datasets, train something on them, and train a language-specific model, like the Kyrgyz mGPT, getting a certain quality. The other way is to take a multilingual model that supports a lot of languages, train it on, say, an English dataset, and assume that somewhere in the model there is a separation of knowledge — the model learns to, say, detect positive or negative sentiment, and then transfers that to Kyrgyz. In practice, if you had to choose, which would in your opinion work better out of the box?

Thanks for the question. Let me clarify: the first part was about translating something into Kyrgyz — that's actually a great idea, and something colleagues in Kyrgyzstan are doing for other tasks. But could you clarify the last part of the question?

The last part would be not to translate — assuming translation is prone to errors — and instead train on a clean dataset, not only in English, using not a monolingual model but a multilingual one that was pre-trained on Kyrgyz, so it knows Kyrgyz exists, but is fine-tuned on English or Russian datasets and then starts solving tasks in Kyrgyz. The question is which is the more preferable strategy:
to translate, with errors, but still train a model on data in the target language, or rather to go with this knowledge-transfer approach?

Yeah, thank you for the question — again, it's a good one, and unfortunately I don't have a good answer, only beliefs. I have met the second option — training on English with knowledge shared in a multilingual model — in information extraction, and we have seen that effect. But from my experience on other tasks of this sort, mostly with Russian, I believe the first option is preferable: translating some standard dataset into Kyrgyz, if possible, and fine-tuning on it should work better, I think — based on my experience on tasks that are only somewhat related.

That's also kind of my experience — it's a very strong baseline. I just wanted to thank you for your talk; I have several questions. The first one is just an idea: normally, if you don't have a lot of data for some task, you search for another language that is close to this one. Do you know which languages are close to Kyrgyz, and whether there are models for them? Maybe in future work you could rely not only on the multilingual models and not only on translation — as with Russian, Belarusian, and Ukrainian, which are related, you could probably tune some models.

Thank you for the question. Speaking of similar tasks in similar languages: for this particular task, text classification, even without the multi-label twist, to the best of my knowledge there are some datasets in Turkish and in Kyrgyz written in Arabic script — the Chinese variety of Kyrgyz — and nothing else that I could find. Speaking of translation from similar languages, there is that work of 2020 by the Turkic Interlingua community — which is now, I guess, a special interest group on Turkic languages at ACL — and their model could probably be used. But for this task I really don't see
any datasets of the sort in other languages — though of course the Turkic language processing community does exist, and while Kyrgyz is rather scarce in terms of resources, the situation is a bit different with Turkish, Uzbek, and Tatar. Still, for this task I haven't found anything else.

Okay, thank you, and one more quick question. The dataset you created is very specific: it's multi-label and it's in Kyrgyz, and you said you're going to reuse it for Kyrgyz benchmarks and other things — correct me if I'm wrong. Are there any particular reasons why this specific dataset is multi-label? There are many other tasks you could start with — was there a specific reason, some application, maybe?

Okay, thanks. Well, I would say there was no specific motivation or inspiration apart from the fact that, in my opinion, topic classification is one of the tasks an average data scientist is likely to meet in practice — something like sentiment analysis or topic classification; named entity recognition is already something on a different level. It is also one of the classic tasks of information retrieval. So I just decided it was a good idea.

Thank you very much. Excuse me if I missed something in the beginning, but can you elaborate a bit on the annotation quality check? How do you guarantee quality, how many annotators do you use, do you use crowdsourcing, or do you have plans for this?

Yeah — the descriptive phrase in the beginning was "quick and dirty", so that's it. I was actually planning at some point — I have access to experts who are more proficient in Kyrgyz, to native speakers — to have it checked yet again; I mean, one of the authors is a native speaker, but still. So the only more or less
sophisticated procedure for establishing and guaranteeing some quality was the one I described for choosing the label set; as for the quality check itself, that was just reading the material multiple times. Thank you.

Thank you for the work presented. Now for the second talk. Okay, let me check — yes. I'll start straight away with a brief introduction to the problem of summarization. There are two approaches: extractive summarization leverages existing text fragments to select a set of highlights, while abstractive summarization improves on the extractive one by employing additional language resources to paraphrase and combine the selected fragments into concise sentences. The main approach to abstractive summarization is sequence-to-sequence, where an encoder extracts contextual information and a decoder generates the summary in accordance with it. This preference is justified, since GPT models that have several times more parameters fail to reach the level of performance of specialized encoder-decoder counterparts, even after fine-tuning, as shown in the second plot, which comes from the official OpenAI article on summarization: they showed that T5 performs better in terms of human evaluation than their fine-tuned GPT-3. Now there is evidence that the classic sequence-to-sequence approach is not enough: several works have shown that integrating extractive summarization into the training and inference loop improves quality substantially, especially for full-transformer models. What distinguishes full-transformer models from other architectures is that, besides the encoder-decoder bridge, attention is used for all intermediate embeddings in all layers, so the overall impact of attention is much larger — to the extent that attention patterns are now part of summarization models. Models that are better aligned with ground-truth extractive labels happen to perform better and converge faster to more optimal results, since
they spend less time searching for important sentences and just learn to paraphrase and combine them. Many researchers have therefore argued that it could be beneficial to correct this attention. The first approach is local binary masking of the attention mechanism: it works by selecting the important parts of the sequence, but the problem is that it is equivalent to token removal — the model would not attend to the masked parts, nothing would be propagated from them, and, mainly, the informational centrality of the context would shift, meaning the optimal summary would be different. To alleviate the issue, researchers came up with the idea of applying the content-selection masking only to a subset of layers and attention heads: they search for the layers responsible for saliency evaluation and apply there a mask obtained from some content selector — maybe an extractive summarization system, maybe a query from the user. The only existing alternative to that approach is to complement the existing attention mechanism so that it receives more complex guidance signals. Relevance attention, the latest state-of-the-art approach, uses a semantic query-document matrix, applies some simple linear transformations, and aligns it with the cross-attention weights using simple interpolation, to guide the decoder to query-relevant positions. We hypothesize that there is actually no benefit in tampering with the existing attention mechanism, because it still interferes with the natural information flow, and for an alternative solution we looked for inspiration in the image processing area — namely, the DALL·E text-to-image model uses one interesting technique with no established name; they just call it blending. It is based on the idea that the model uses CLIP embeddings, which map both text and image embeddings into the same vector space, meaning that if we take two different prompts — two different text sequences — encode them, and then
If we take two different prompts, two different text sequences, encode them, and take a weighted average, we obtain some intermediate image embedding, and the result is quite stable. Following the same idea, we derived the biased encoder mixture. It is quite simple: we do not use different inputs, only different attention masks. One full attention mask goes through the original encoder with the original input, and then we derive the salience mask and process it through our extension. The theory is that an encoder that is more sensitive to masking, more focused on the masked positions, provides amplified signals that better guide the original embedding toward query-relevant positions, so the decoder produces better results. To test the method we derived two masking strategies: the first is based simply on ground-truth extractive label statistics, while the second is dynamic, for the case where we plan to use it in practice, and is based on an extractive summarization system. In either case we just need some distribution over the sentences denoting the salience of each position, and to obtain the mask we take the top-scoring positions of this sentence-position distribution. We evaluated the method on four domains, including news, science, and dialogue, using the respective state-of-the-art models, and we tuned everything with a simple grid search on the validation part. The quantitative results are quite promising: the biased encoder mixture seems to outperform every attention-modulation method in almost every scenario, and in the best case it brings up to an 8% improvement over the original model. In terms of quality, there was an important question: does it violate coherence or relevance? It seems that none of the methods violates these constraints. Here is a sample from the news dataset: the reference, the original model prediction, and then a set of alterations.
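A minimal sketch of the biased encoder mixture as described, with hypothetical names and a toy top-k salience mask; the interpolation weight `alpha` stands in for whatever the grid search on validation data would select:

```python
import numpy as np

def salience_mask(scores, k):
    """Binary mask keeping the k highest-scoring positions.

    `scores` is any salience distribution over sentence positions, e.g.
    from an extractive summarizer or ground-truth label statistics.
    """
    keep = np.zeros(len(scores), dtype=bool)
    keep[np.argsort(scores)[-k:]] = True
    return keep

def biased_encoder_mixture(h_full, h_masked, alpha):
    """Blend hidden states from two encoder passes over the same input.

    h_full:   (seq, d) states from the pass with the full attention mask.
    h_masked: (seq, d) states from the pass with the salience mask.
    alpha:    interpolation weight, chosen on validation data.
    The decoder then cross-attends to the blended states.
    """
    return (1.0 - alpha) * h_full + alpha * h_masked
```

The key design point is that, unlike attention modulation, nothing inside either encoder pass is altered; only their outputs are mixed, leaving the natural information flow intact.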
Relevance attention is based on the semantic similarity matrix, and it just injects named entities into the original generation. Attention masking does the contrary: it deletes some entities; for example, there was "China and other nations" and now it is just "China", and "the UK" is replaced with "official state", which is quite questionable. The biased encoder mixture is the most different of the bunch: it simply forces the model to revise the whole summary, since it creates a new embedding and the model understands that the text is likely to be different; the new summary tells the story about the submarine drones that can work independently and travel over thousands of miles, and basically the biased-encoder-mixture version is more aligned with the reference summary. The alteration patterns are also different. The original generation can disagree with the reference at any position, meaning the generated summary can be quite different from the reference. The attention-based methods do not take into account mistakes at the initial positions; their changes scale with the article length, appearing closer to the ending, and attention masking is practically guaranteed to revise the ending. The biased encoder mixture is more radical and more uniform, yet, as we have seen before, it is more aligned with the intended reference distribution, and it can force the model to completely revise the generated summary. In terms of semantics, the attention-based methods are quite conservative: relevance attention, since it uses the semantic similarity matrix, keeps everything close to the original generation; attention masking is braver, with a wide range of semantic changes; and the biased encoder mixture is the bravest, able to completely diverge from the original meaning while still being better aligned with the intended extractive summary. This concludes my presentation; I'd be glad to answer your questions.

Well, are you referring to the news datasets from
the multilingual collections or the one that was collected by Gusev? They are quite noisy, and that is the reason I chose CNN/DailyMail for the ablation and case study: it is a dataset whose summaries were written by editors, so they are well aligned. Gusev used automatically extracted summaries, and sometimes they are filled with automated artifacts, so they are not so reliable for these experiments. Besides, I used BRIO and PEGASUS, summarization-specialized models, and there are simply none of those for Russian. Thank you.

I have a question; it's more of a theoretical question. Am I right that the dataset you are using summarizes only one article at a time rather than multiple articles? Do you think it would be possible to transfer this summarization to multiple articles? Is it applicable or not, and what could be the difficulties?

You're asking about a technique. Well, one of the main approaches to multi-document summarization is to treat the input as one long document with multiple chapters: using some special markers, special tokens, we can produce special embeddings for these chapters, so it becomes indistinguishable from one long sequence. And yes, the biased encoder mixture can be applied to any model that has an encoder, so even if you have multiple encoders, which is another of the approaches, an individual encoder for each document, you can still apply the same method in the multi-document case.

Thank you. Because normally the problem of multi-document summarization is that if you just concatenate everything into one long document, the model looks only at the beginning and the end of the text, since that is where it expects the relevant information; that is why in some experiments I have seen, it doesn't work when you simply combine all the sentences into one large text. That is why I
was wondering how it could be done. Well, yes, the context window is limited and the attention patterns just get lost, so that could happen. Now I recall: yes, one of the first works, before transformers, really did use multiple encoders and interpolated their embeddings, not with a simple interpolation but with a fully connected layer that combined the embeddings into one and passed them to the decoder, and they reported it worked better than passing one long sequence, because those were recurrent neural networks. Now we have GPTs that claim to accept contexts of thousands of tokens, so I just don't know; I never tested it. Thank you.

Good afternoon, my name is Ekaterina Zalivina, and today I present my work on automatic detection of dialect features of Pskov dialects. The purpose of my research is to create a model for transcribing dialect speech; in this work we focus on Pskov dialects and provide researchers with a tool to detect the dialect features characteristic of these dialects in the speech of informants. Why is this important? As a field researcher I know firsthand that field data is processed manually, so our tool allows us to reduce the amount of manual work and concentrate on analyzing linguistic phenomena. We also present experiments with Russian dialect data, which are not so common now in automatic speech recognition. What about the steps? First we collect the corpus data into manifests for the automatic speech recognition task and annotate it for feature detection; then we fine-tune models for speech recognition, try three approaches for detecting features, choose the best ones, and combine them into one big pipeline. What was the data? We must mention that there are two corpora, with data collected during expeditions, and on the map you can see the locations of the villages where the data was collected. It
is worth noting that the Opochetsky villages are located closer to the Latvian border, and although both groups belong to the Pskov dialects, the dialectological descriptions note quite a lot of differences between them. How do we preprocess the audio data? We take the audio and TextGrid files from the corpus, export the annotation to TextGrid so that all files have a similar format, segment the audio based on sentence length, convert everything to lowercase, and generate a manifest in the format presented on the slide. You can also see a little statistics we gathered: these are very low-resource dialects, and we work with a small amount of data. Now about the text data annotation. For binary classification, each annotation was assigned 1 if it demonstrated the realization of one or more dialect features, and 0 otherwise. For the rule-based and token-classification approaches, each token in the annotation was labeled in the IOB2 scheme, where B is the beginning of a feature sequence, I marks the second and subsequent tokens, and O the absence of a dialect feature; a sample annotation is shown on the slide. About the fine-tuning: there were four approaches. We have two corpora; in the first approach we train on the first corpus and test on the same corpus, and the second approach does the same with the second corpus. The third and fourth approaches use two iterations of fine-tuning: in the third we fine-tune on the Zapadnodvinsky data, then on the Opochetsky data, and test on the Zapadnodvinsky data; the fourth is trained the same way but tested on the Opochetsky data. The basic metrics for speech recognition are word error rate and character error rate, and for dialect feature detection precision, recall, F1 score, and accuracy. For speech recognition we selected models pre-trained on standard Russian, using three architectures. The common mistakes of the first two models are combining two tokens into one, splitting one token into two parts, and inserting characters.
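The two recognition metrics just mentioned are standard and easy to state precisely. Here is a small self-contained sketch with our own helper names: both rates are the Levenshtein edit distance divided by the reference length, computed over words for WER and over characters for CER.

```python
import numpy as np

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, by dynamic programming."""
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return int(d[-1, -1])

def wer(ref, hyp):
    """Word error rate: word-level edits divided by the reference word count."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    """Character error rate: character-level edits / reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)
```

Note that the error types listed above map directly onto edit operations: merging two tokens into one costs a substitution plus a deletion at the word level, while an inserted character costs a single insertion at the character level.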
And to correct such errors we use the Yandex Speller spell checker. The results are on this slide: for the Zapadnodvinsky data the best model is the first one, trained and tested on that same data, but for the Opochetsky data the best result comes from the fourth approach, where we fine-tune on the Zapadnodvinsky data and then on the Opochetsky data. For the detection of dialect features we use three approaches: binary classification of the entire sentence, binary classification of tokens, and multiclass classification of each token. For the rule-based approach we took a list of phonetic, morphological, and syntactic dialect features from a grammatical sketch and wrote a function for each rule. For example, to determine the realization of yakanye in the first pre-stressed syllable we follow this algorithm: we use a dictionary-parser library to obtain transcriptions of the Russian text, then determine the backness and height of the pre-stressed vowel and the presence of a palatalized consonant before it. For morphological and syntactic features we use pymorphy2 and Natasha. We found that this approach successfully handles the identification of dialect features at the level of phonetics and morphology, but variability is not taken into account: every position in which a dialect feature can be realized is marked with a tag, while we know that variability is now very common in dialects. About entire-audio classification: for each audio fragment, mel-frequency cepstral coefficients were calculated, and three classifiers were tried in the experiment. As a result, we see that this method is more useful as one of the intermediate stages; it does rely on the audio and it does account for variability, but the problem of classifying tokens remains unsolved. For token classification we fine-tuned XLM-RoBERTa on the two tag sets, binary and multiclass, and we see a strong influence of variability on the choice of classifier, but
we see that this approach still cannot cope with lexical and syntactic features. Next we ran the experiments on the Opochetsky data. The rule-based approach shows higher results than on the Zapadnodvinsky data, which may simply indicate a more consistent realization of the features in the informants' speech. For entire-audio classification, it is important that good results are impossible when training only on the Zapadnodvinsky data; it is necessary to train on the target dialect too. And finally, for token classification, we see that the models detect very few dialect features without fine-tuning on the target data, but give rather good results after fine-tuning on it. We then created a pipeline from the best models. An audio recording in WAV format is accepted as input; we convert the audio to a single channel, divide the recording based on areas of silence, dropping fragments whose level is below a special threshold, obtain transcriptions with the best model, assign a tag to each token, and finally generate two output formats, for the Praat and ELAN programs; on this slide you can see how it looks in ELAN. In conclusion: if the goal is a model that recognizes one selected dialect, fine-tuning a model that has already seen a close dialect gives better results than fine-tuning a model trained only on standard Russian, as we see in the case of the Opochetsky data; thus the phonetic, morphological, syntactic, and lexical differences between close dialects do not impair recognition quality. However, the last data used for fine-tuning should be the target dialect for which the model is intended, not some close one, otherwise the quality will be lower, as we see in the case of the Zapadnodvinsky data. And a new hypothesis has been put forward for further research: to create a universal dialect speech recognition model, it is necessary to fine-tune the model on the entire sample of dialects at the same time. Thank you.
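The silence-based splitting step of the pipeline can be sketched roughly as follows; the frame size, energy threshold, and minimum segment length here are hypothetical defaults for illustration, not the values used in the talk.

```python
import numpy as np

def split_on_silence(signal, frame=160, threshold=0.02, min_frames=3):
    """Split a mono signal into voiced segments separated by silence.

    A frame counts as silent when its RMS energy is below `threshold`.
    Returns (start, end) sample indices of segments at least `min_frames`
    frames long; these fragments would then go to the ASR model.
    All parameter defaults here are hypothetical, for illustration only.
    """
    n = len(signal) // frame
    rms = np.sqrt(np.mean(signal[:n * frame].reshape(n, frame) ** 2, axis=1))
    voiced = rms >= threshold
    segments, start = [], None
    for i, is_voiced in enumerate(voiced):
        if is_voiced and start is None:
            start = i                                  # segment opens
        elif not is_voiced and start is not None:
            if i - start >= min_frames:
                segments.append((start * frame, i * frame))
            start = None                               # segment closes
    if start is not None and n - start >= min_frames:  # trailing segment
        segments.append((start * frame, n * frame))
    return segments

# A toy signal: 800 silent samples, 800 loud ones, 800 silent again.
sig = np.concatenate([np.zeros(800), 0.5 * np.ones(800), np.zeros(800)])
segments = split_on_silence(sig)
```

In the real pipeline the resulting (start, end) spans also anchor the per-token tags when exporting to the Praat and ELAN formats.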
That's all.

For the Pskov dialects, could you summarize the differences between the two? I don't know exactly what the difference is.

Yes, they have a lot of similarities, but differences too: different realizations of phonemes, phonological differences. For example, in the third-person present forms of verbs the Zapadnodvinsky data has a final consonant, but in the Opochetsky data there is no final consonant at all. We have a list of features, and on most of them they differ a little.

Could one artificially generate one dialect from another known one in order to increase the amount of data?

It's a good question. Sadly, I think it is possible, but nobody has done it yet.

Maybe a bit of a generic question: could these linguistic expeditions be modernized with, let's say, modern technologies, mobile phones and LLMs or something like this? You install a certain application and ask some people to interact with some dialect agent, and then the data is collected in a more or less distributed way, very cheaply. Do people actually start to think about this kind of practice for linguistic data collection?

Right now it is not popular at all to use modern instruments in expeditions. I was on an expedition this year, and we still go there with a microphone, record speakers, and then manually annotate their speech. But with this work I believe I have made a step toward modernizing this process, and I believe we will go to the expedition next year and use this tool. Thank you.

Quite condensed; it's good that you found a new hypothesis. My question is about this proposal for a universal dialect speech recognition model: how should it operate, what is your vision of this model, and what is its purpose? Should it predict, say, multi-label, one out of a hundred or maybe ten, I don't know how many dialects, and why do people need this model?

I think
that we already have dictionaries and other sources in which dialect features are compared across different dialects, and I believe we can find more if we try to optimize this process. I should also say, about variability, that some features have died out, and we should see what still remains in the dialects.

Can this task be solved just by building some reference of all the dialects? Why do we need machine learning?

All those resources were collected fifty to seventy years ago and should be modernized, and I think this is one possible way to do it.

Okay, questions regarding the practical application of this interesting instrument. First, thank you for the talk. You have already said that it hasn't been used in the expeditions, but it could be interesting: even if the quality is not one hundred percent, maybe the tool can first be used to guide the scientists, who could then correct its mistakes. And second, it could be interesting to look at whether the model and the scientists make mistakes in the same places or in different ones, so that the model could help them specifically where they are unsure.

Yes, of course, at the first step the output should be corrected by the experts, and then I think the model can reach a better result if we train on the entire sample of dialects; for me these are two parallel processes. And about the second part, can you repeat?

Yes: it could be interesting to look at whether the model and the scientists make mistakes in the same places or not.

I analyzed only the model's errors, but it is interesting to look at the annotators' errors and compare them, so I think it would be good to do that in the future.

Okay, thank you. Thank you for the talk. I'm curious, what are the most notable differences between the Pskov dialects and standard Russian?

They have unique phonetics, for example what I said about the vowels of the first pre-stressed syllable; they also have unique syntactic structures and verb forms that we don't have in the standard Russian language, for example what looks to us like disagreement between the auxiliary
verb and the main verb. So there are a lot of differences, which we can also see in constructions and in morphology.

Thank you. And do I understand correctly that these dialects are influenced by the Belarusian language?

We have some research about that; of course, the main influence now is the standard language, because of TV and the recently born generations, but in some cases the Belarusian influence is there. Thank you.

Hello everyone. Today we are talking about compression of large language models based on transformer architectures, where the compression is provided by matrix and tensor decompositions. The field of natural language processing has made significant progress with the development of large transformer language models, but transformer models share a common challenge of expanding scale, which presents an obstacle to model deployment and training, especially for small research groups. So in our work we decided to reduce the size of language models, for example BERT and BART, by compressing their internal layers using tensor and matrix decompositions. BERT and BART are models based on the transformer architecture, and several works show that many parameters in these models are redundant, so we decided to select several layers and compress them. Let us see how many parameters are contained in the different layers inside the model. Every transformer architecture consists of MLP blocks, embedding blocks, and attention blocks, and as you can see in this table for BERT and BART (this is out of the scope of this presentation, but it is also true for GPT-2 and for encoder-based models in general), the largest number of parameters is in the MLP block. So we decided to take the MLP block, whose sublayers consist of two fully connected layers, and apply several decompositions to it. We chose one matrix decomposition, the singular value decomposition; the same matrix decomposition with Fisher information; and one tensor-based
decomposition, the tensor-train matrix (TTM) decomposition, together with the TTM decomposition with Fisher information. We implemented these variants and replaced the fully connected layers inside the architecture with the corresponding representations. As baselines we take the full model; the model obtained by downcasting the full model to FP16 in PyTorch, so the weights are stored in 16-bit rather than 32-bit floating point; and block pruning, where we select several MLP layers or several heads in the attention block, prune them, and obtain a model of smaller size. How can we apply singular value decomposition to a fully connected layer? Every fully connected layer has a weight matrix W of size d_in by d_out, and we can represent W in singular value form through two factor matrices U and V^T and a diagonal matrix Sigma containing the singular values. To get a compressed, truncated version, we select the r most significant singular values in Sigma, the corresponding r columns of U, and the corresponding r rows of V^T. Having the initial linear weight W, we thus obtain two new weights, W_1 = U_r sqrt(Sigma_r) and W_2 = sqrt(Sigma_r) V_r^T. The compression rate is the following: in the numerator we have the product of the input and output dimensions of the initial matrix, and in the denominator the sum of the parameter counts of these two new weight matrices. To understand how tensor compression works, we need some tensor notation. A tensor is a multi-dimensional array, a big cube with n axes, where n is the number of dimensions.
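Before moving on to tensors, the truncated-SVD replacement of a fully connected layer just described can be sketched as follows; this is a toy numpy illustration with our own function names, not the authors' code.

```python
import numpy as np

def truncated_svd_layer(W, rank):
    """Replace a dense weight W (d_in x d_out) by two thinner factors.

    W ~= W1 @ W2, with W1 = U_r sqrt(Sigma_r) of shape (d_in, rank) and
    W2 = sqrt(Sigma_r) V_r^T of shape (rank, d_out), so one linear layer
    becomes a sequence of two smaller linear layers.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    sq = np.sqrt(s[:rank])
    W1 = U[:, :rank] * sq              # scale columns by sqrt(Sigma_r)
    W2 = sq[:, None] * Vt[:rank]       # scale rows by sqrt(Sigma_r)
    return W1, W2

def compression_rate(d_in, d_out, rank):
    """Original parameter count divided by the factored parameter count."""
    return (d_in * d_out) / (rank * (d_in + d_out))
```

For example, compressing a hypothetical 3072 x 768 MLP weight at rank 64 keeps only a 1/9.6 fraction of the parameters, and when the weight's true rank does not exceed the chosen rank, the factorization is exact.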
This big object can suffer from having a lot of parameters, so we can apply some tensor compression technique to represent the tensor in a more compact way. In other words, we represent the tensor, a multi-dimensional array, as a set of factor objects which usually have fewer dimensions than the initial tensor, and because these objects have fewer dimensions, we obtain compression compared to the initial tensor. Some notation: a tensor is denoted by a calligraphic letter, with n its number of dimensions; a tensor entry refers to one element inside the tensor; there is the usual notation for the core tensors of a decomposition, the low-dimensional objects into which we decompose, and the usual letters for matrices and vectors; and there is the definition of the mode-n matricization, or unfolding, operation of a tensor. When we unfold, we create a matrix from the tensor: we have a tensor with n dimensions (on this slide, three), we select one dimension, and we stretch the tensor out along the selected dimension, like an accordion, getting a matrix that depends on the initial tensor and on the axis along which we stretched it. So if our tensor has three dimensions, we get three different matrices by unfolding it along each dimension. When we want to represent a tensor in a more compressed way, the first step is to define the format in which we will represent it, and the next step is to fill this set of containers with values in the proper way. We decided to represent our matrix in the tensor-train matrix format, which is an extension of the tensor-train format. What is the tensor-train format? It is when we represent the initial tensor by a set of core tensors, every core tensor having no more than three dimensions: the outer core tensors have two dimensions, and the inner core tensors have three. The
number of core tensors in this sequence is equal to the number of dimensions of the tensor; here we have a tensor with three dimensions, so we have three core tensors. To compute an entry of the tensor, for example the entry with indices (2, 3, 1), we select the second slice of the first core, the third slice of the second core, and the first slice of the last core, multiply them, and after the multiplication we obtain an object with a single element, the proper entry; this is the formula we use to calculate every entry of the tensor. I should say that every core tensor has no more than three dimensions, and the outer dimensions are the ranks, the dimensions along which neighboring cores are multiplied, so the trailing rank of one core must equal the leading rank of the next, and the first rank of the outer core equals one. The ranks are in general different, giving a tensor-train decomposition with different ranks, but for simplicity of the formulas and of the computation, all ranks are usually set equal to each other. So we have selected the form in which we represent our tensor, and now we should understand which values go into the core tensors. How is that done? There are several algorithms for tensor-train compression, and one of the most famous is TT-SVD, proposed by Ivan Oseledets. We have an initial tensor with d axes and we perform d steps. At every step we unfold the current tensor over the selected axis; then we apply SVD to the obtained matrix, getting U, Sigma, and V^T truncated to the corresponding rank, which should be the rank of the tensor-train decomposition; we obtain the corresponding core tensor by reshaping the factor matrix U; and we multiply the remaining factors together, carrying the product to the next step, where we unfold it instead of the initial tensor.
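The TT-SVD procedure just described can be written compactly. The sketch below uses our own naming, a single shared rank, and no error control; it decomposes a dense numpy tensor and then reconstructs entries by multiplying core slices, exactly as in the indexing example above.

```python
import numpy as np

def tt_svd(tensor, rank):
    """Decompose a dense tensor (n1 x ... x nd) into d tensor-train cores.

    Every core has shape (r_prev, n_k, r_next), with the first and last
    ranks equal to 1; a single shared `rank` caps all inner ranks, as in
    the talk's simplified setting.
    """
    shape = tensor.shape
    cores, r_prev = [], 1
    rest = tensor.reshape(r_prev * shape[0], -1)
    for k in range(len(shape) - 1):
        U, s, Vt = np.linalg.svd(rest, full_matrices=False)
        r = min(rank, len(s))
        cores.append(U[:, :r].reshape(r_prev, shape[k], r))
        rest = s[:r, None] * Vt[:r]          # product of the remaining factors
        r_prev = r
        rest = rest.reshape(r_prev * shape[k + 1], -1)
    cores.append(rest.reshape(r_prev, shape[-1], 1))
    return cores

def tt_entry(cores, idx):
    """Recover one tensor entry by multiplying the matching core slices."""
    out = np.ones((1, 1))
    for core, i in zip(cores, idx):
        out = out @ core[:, i, :]
    return float(out[0, 0])
```

With a rank no smaller than the unfoldings' true ranks, the decomposition is exact; truncating the singular values is what produces the compression.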
We repeat this a number of times equal to the number of axes of the tensor and obtain a number of cores that is also equal to the number of axes. Unfortunately, we cannot apply the plain tensor-train format to neural network weights; for neural networks we should use an extension of it, the tensor-train matrix format. It is very similar to the tensor-train format: where the tensor-train representation is, so to speak, a tensor of scalar points, the tensor-train matrix representation is a tensor of matrices. There are no cores with at most three dimensions anymore; instead we have n cores with maximum dimension equal to four, our single index turns into an index tuple, and the formula for calculating an entry turns into this one, with two indices instead of one and core tensors of larger dimension than in the plain tensor train. How did we decide the form in which we will work? We have a matrix of size d_in by d_out; we factorize each dimension into factors, which become the initial shapes of our cores; then we reshape this matrix into a 2n-dimensional tensor and permute the axes of this object so that the paired axes from the index tuples are adjacent; and when we have obtained this tensor by reshaping and permuting, we apply the TT-SVD algorithm to it. The compression rate is as follows: in the numerator we have the product of all the factors, which in real life is just the product of d_in and d_out, and in the denominator the sum of the products of the dimensions of every core tensor. Returning to our BERT and BART models, we decided to examine their behavior at three compression rates, and for every compression rate we selected the proper rank for the truncated SVD decomposition and for the tensor-train matrix decomposition.
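The reshape-and-permute step that turns a weight matrix into a TTM-ready tensor can be sketched as follows (our own helper, written under the stated assumption that d_in and d_out are factorized into equally many factors):

```python
import numpy as np

def matrix_to_ttm_tensor(W, in_factors, out_factors):
    """Reshape a weight matrix for tensor-train-matrix decomposition.

    d_in = prod(in_factors) and d_out = prod(out_factors).  The matrix is
    viewed as a 2n-dimensional tensor, and the axes are permuted so that
    each (input-factor, output-factor) pair is adjacent; every such pair
    becomes the two middle modes of one 4-dimensional TTM core.
    """
    n = len(in_factors)
    t = W.reshape(tuple(in_factors) + tuple(out_factors))
    order = [axis for k in range(n) for axis in (k, n + k)]  # interleave axes
    return np.transpose(t, order)
```

After this step, running TT-SVD on the permuted tensor and regrouping each pair of adjacent modes yields the 4-dimensional TTM cores.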
For example, to obtain a 69-million-parameter BERT we should apply SVD with this rank and TTM with that rank. The SVD case is quite simple: we have two sequential linear layers instead of one. In the TTM case we decided to represent our matrix as a set of four cores with these shapes; note that in the TTM algorithm the number of cores can also be varied, so we could represent the matrix as three cores, four cores, or five if we wanted. The ranks we take from this table, and we choose the dimension factors so that the paired products are approximately equal across all the core tensors in the sequence. Okay. We were working with pre-trained transformer language models, and when we applied a decomposition, of course, the quality dropped significantly, so we decided to align the objective used to obtain the decomposition of the given matrices with the objective of the downstream task our model is solving. To do this, we inject Fisher information into the decomposition algorithm. Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter theta of a distribution that models X. In other words, we have a dataset consisting of objects; on a certain object the model gives some prediction, we can calculate the loss of this prediction, and this loss implicitly contains information about the parameters inside the model. So the Fisher information here is built from the partial derivative of the log-probability of the output that the model produces on an object of the dataset, taken with respect to W, the weight inside our layer, and we approximate it by calculating the loss on every object, taking the squared gradient, and averaging, which estimates the mathematical expectation. In this way we obtain I_W, the matrix of Fisher information, which has the same size as our initial weight matrix W.
We multiply the weight matrix by this matrix of Fisher information, compute the singular value decomposition of the product, and then also multiply the U factor by the inverse of the Fisher weighting; in this way we inject information about the model's output, about the task objective, into the singular value decomposition algorithm. What should we do for the tensor-train matrix algorithm? As you know, the TTM algorithm is based on the tensor-train algorithm, and the tensor-train decomposition algorithm is in effect a set of several SVDs. So if we have the Fisher matrix I_W, we perform on it the same operations that we perform on the matrix W to obtain our tensor; then we unfold the tensor built from W and the tensor built from I_W, and with these two matrices we do the same things we did above. Inside the TT-SVD algorithm there is a sequence of SVD steps, and at every SVD step we apply the same Fisher weighting; by this means we inject Fisher information into the tensor-train matrix approach as well. Okay, we have defined the four decomposition techniques described before, and then we started to evaluate them. The first evaluation point was BERT: we evaluated our methods on the natural language understanding tasks of the GLUE benchmark, which consists of linguistic acceptability, sentiment analysis, paraphrasing, semantic similarity, and natural language inference tasks. We first fine-tuned our model on every task from the benchmark for one epoch, then compressed the fully connected layers using one of the four techniques, then fine-tuned the model again for one epoch, and obtained the scores. The scores show that at the high compression rate, this part of the table, the best performance comes from the TTM and Fisher-weighted TTM approaches, but at the medium and low compression rates the best performance comes from the SVD and Fisher-weighted SVD approaches. Also, whenever we add Fisher information to either the SVD or the TTM algorithm, we usually see an increase in performance, in model accuracy or the other scores.
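The Fisher-weighted SVD step can be sketched as follows. This is one common formulation, which collapses the Fisher matrix to a per-row weight; it is written by us as an illustration rather than taken from the talk, and with a full rank it reproduces the original weight exactly.

```python
import numpy as np

def fisher_weighted_svd(W, fisher, rank):
    """Truncated SVD of W weighted by accumulated Fisher information.

    `fisher` has the same shape as W (squared gradients averaged over the
    task data).  It is collapsed to a per-row importance i_hat; we then
    decompose diag(i_hat) @ W and fold diag(i_hat)^-1 back into the left
    factor, so rows that matter most for the task are approximated best.
    """
    i_hat = np.sqrt(fisher.sum(axis=1))                # per-row importance
    U, s, Vt = np.linalg.svd(i_hat[:, None] * W, full_matrices=False)
    W1 = (U[:, :rank] * s[:rank]) / i_hat[:, None]     # undo the weighting
    W2 = Vt[:rank]
    return W1, W2
```

When the rank is truncated, the reconstruction error is no longer minimized uniformly over the matrix but preferentially where the Fisher weights are large, which is exactly the alignment with the task objective described above.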
The next evaluation point was the sequence-to-sequence model BART, and the first task for BART was paraphrasing: detoxification on the ParaDetox dataset. In this dataset we have pairs of sentences, where the first contains toxic wording and the second, the one we want to obtain, expresses the same content more politely. On this dataset four metrics apply: style-transfer accuracy, how accurately we hit the polite style; similarity, how close the meanings of the two sentences are; the fluency of the generated text; and the last metric, the joint score, which is the product of these three. On this dataset the best score at every compression rate usually comes from Fisher-weighted SVD, which gives a large increase in performance relative to the other approaches. The last evaluation point is also the sequence-to-sequence model BART, which we train to do summarization on the XSum dataset, consisting of several hundred thousand BBC articles, each paired with a single sentence that summarizes the article. The results on this dataset mirror the results on GLUE: at the high compression rate the best score is provided by Fisher-weighted TTM, so adding Fisher information to the decomposition algorithm boosts the scores there too, while at the medium and low compression rates the best score comes from Fisher-weighted SVD. Here is a set of plots for the different tasks, for the GLUE benchmark and for BART, and on it you can see the main tendency of the whole work: red is SVD, blue is Fisher-weighted SVD, green is TTM, yellow is Fisher-weighted TTM. Usually the best score belongs to Fisher-weighted SVD, but on several tasks at the high compression rate, which is described by the right part of every plot,
the best score goes to TTM or Fisher-weighted TTM. These are the rest of our GLUE tasks, and this is the average over all of GLUE. So, as a result: we took four different techniques for compressing the fully connected layers in BERT and BART models. At different compression levels, different techniques can give better or worse results. Usually, for BERT on the GLUE benchmark and for BART on XSum, at high compression rates TTM and Fisher-weighted TTM provide the best score, and in the other compression regimes Fisher-weighted SVD provides the best score. Aligning the task and the decomposition objective by injecting Fisher information into the compression algorithm significantly improves the performance of the compressed model. Thank you for your attention.

I think that distillation is quite a good technique, and distillation is good when you train your distilled model towards the desired task. For example, you can say: I want a distilled BERT, and I train the distilled BERT on the task of the initial BERT, the task of natural language understanding, and on natural language understanding distillation can provide a good score. But when you run it on another type of task, distillation can provide a worse score, while our method gives approximately the same result over the whole set of tasks. Yes, and for TTM too.

I don't think so. You mean compressing different layers with different ranks and different shapes? It can make sense. There is some research, out of the scope of this presentation, showing that some layers inside BERT or GPT can be less compressible by SVD and TTM than others; it can be seen from the singular value spectrum. Of course it makes sense, and it can give some boost, but it's quite difficult, and we decided to set one rank for the whole model.

So, I have a question. We are in the age of large language models, and the popular method of compressing them is quantization. There are no studies, but
there is evidence that larger models are easier to compress. Have you conducted any ablations across scales? I saw results for base models; maybe you conducted experiments with BART-large, maybe it was better for compression? I mean, can a model with a larger number of parameters be compressed better?

No, I haven't seen such a paper that provides such an experiment. I have only seen a paper from Meta about the relationship between the number of parameters in the model and the number of tokens in the dataset on which you train it; there are different scores with different metrics, and people showed that this function is logarithmic. But unfortunately I haven't read any paper that directly shows that larger language models are more compressible.

A follow-up question about all these decompositions. There is some much older related work on SVD, and they showed that after decomposing the model, if you run about one epoch of pre-training, it will perform better. Have you done something similar in your ablations?

Yes, and it does help, absolutely. It's very dependent on the size of the task we selected: people usually evaluate large models on not very big datasets, and one epoch is really enough to bring the model to a good point.

Another question. Thank you for your talk. I'm also curious how your work compares to quantization, because in the field of large language models quantization is very popular nowadays, and I know that models trained in 16-bit precision can be compressed, for example, to 4 bits almost without losing quality: basically we lose something like 1-2% of perplexity for 25% compression rates. So I'm curious how your approach compares to these advancements.

You asked about a comparison of quantization and my approach, or whether we can apply our approach together with quantization?

A comparison: which one is better?
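The Fisher-weighted SVD at the center of this talk can be sketched as follows. This is a minimal illustration, not the authors' code: `fisher_rows` is a hypothetical per-row importance vector standing in for the diagonal Fisher estimates, and the TTM variant is not reproduced.

```python
import numpy as np

def svd_compress(W, rank):
    # Plain truncated SVD: the best rank-`rank` approximation of W
    # in the Frobenius norm (Eckart-Young theorem).
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

def fisher_weighted_svd_compress(W, fisher_rows, rank):
    # Weight the rows by the square root of a (diagonal) Fisher estimate
    # before decomposing, so reconstruction error is spent where the task
    # is least sensitive, then undo the weighting. The result still has
    # rank <= `rank`, since row scaling preserves rank.
    w = np.sqrt(fisher_rows)[:, None]
    return svd_compress(w * W, rank) / w
```

Storing the truncated factors instead of the dense product is what yields the compression: for a `d_in x d_out` layer the parameter count drops from `d_in * d_out` to `rank * (d_in + d_out)`.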
I suppose we have that information in the slides. We didn't compare with 4-bit quantization, but with 16-bit floating point, and we can see that our approach provides a better score than the fp16 evaluation. In this row, for paraphrasing, something changes and fp16 provides a better score than any of our approaches.

I think one issue here is that this is naive quantization: basically all the weights were in float32 and were naively converted to float16. But the new approaches are smart: for example, they take outliers and quantize some percentage of outliers differently, and so on. It may be interesting for you to compare with them, because they provide really good compression rates without any significant harm to quality. Now that we have more sophisticated ways of quantization, I think it's really interesting to compare with those. Thank you.

Let's turn to the next talk. Hi Dmitry, can you hear us?

Yes. Do you hear me?

Yes, we can hear you. Please share your slides. You have 20 minutes for the talk and five for the Q&A session.

Okay, just a second, please. Please let me know if you see the slides.

Yes, now we can see the slides.

Hello, dear colleagues. Today I'd like to present to you the results of our research, which was devoted to the development of an idiom recommendation system. The title of our paper is on the slide; the authors of the paper are me, Dmitry Hrnoklev, and my colleague Pavel Prostovsky. Let us begin with the main idea behind our study. As most of us probably know, idioms are quite an essential part of many languages, and native speakers usually tend to use them in certain circumstances, because idioms enhance the fluency and expressiveness of speech. However, non-native speakers may sometimes struggle to find an appropriate idiom for some contexts. I'd also like to mention that automated writing assistants are becoming more and more popular nowadays; some of them, powered by AI, like DeepL and Grammarly, are able
not only to correct grammar or spelling mistakes, but also to improve style and suggest continuations for the entered text. So here we come to the motivation of our research: we aimed to build a system that is able to recommend an idiom for a given context. In the upper image you can see an example of a writing assistant, and below there is a simplified scheme of what the desired automated system should look like and how it should function. For the purposes of our study we obviously needed data, and by this I mean a set of idiomatic expressions and contexts which contain these idioms. Based on a literature review we have chosen the EPIE dataset, which is the base corpus in our study. We have chosen it for several reasons: firstly, it is the second largest freely available dataset; another reason is that it contains definitions for the idioms presented in the corpus. The table at the top of the slide contains several examples from this dataset. However, this dataset required some preprocessing, which is described in detail in our paper. After that we split the resulting dataset into train and test subsets in an 80-to-20 ratio with stratification, just to make sure that all the unique idioms are represented in both sets. I'd also like to mention that the test set is fixed for all the configurations I'm going to describe further, so that we can fairly compare different configurations. During the experimental part of our research we fine-tuned one of the considered models, and in order to make it more robust we required additional data. To get more contexts for our idioms, we used the Guardian API, which provides a handy interface for parsing articles published in the Guardian newspaper. This allowed us to obtain almost 25,000 additional sentences, and therefore we increased our initial corpus by more than two and a half times. The graph on the right side of the slide presents two distributions of the number of different
contexts per idiom, before and after the parsing process. On this slide you can see a scheme which represents our proposed approach. It is based mainly on the semantic similarity search ideology. We obtain embeddings for the input sentences and for all of the collection sentences; input sentences are also called queries in terms of the semantic similarity task, and the collection sentences are called documents. We obtain embeddings using some model for both the input sentences and the collection sentences, then we rank the collection items by cosine similarity to the query and take the corresponding idioms as recommendations. The main metric in our research was mean reciprocal rank; in case someone is unfamiliar with it, here is the formula. Before we move further, I'd like to mention one more point related to the test set. It's obvious that when our approach is used at inference time, we receive sentences without idioms, so we remove the original idioms from all of the test sentences; it's illustrated on this scheme. Now let's return to our main scheme. We can conclude from it that our algorithm has only two key parameters which can be varied: the collection on which the semantic search is performed, and the model which is used to obtain the embeddings. So, using different collections and models, we obtain various configurations of our main approach. Let's discuss the different collections first. In our research we consider four collections. The first is called idioms, and it consists only of the idioms from the initial dataset themselves. Then we have the idioms plus defs collection, which consists of the idioms with the corresponding definitions of these idiomatic expressions concatenated. Then we have the sentences collection, which consists of the sentences from the train set. And finally the sentences++ collection, which consists of the instances from the sentences collection, so it's basically just the sentences collection which was enlarged or extended
with examples from the Guardian API. In the tables on the slide you can see some examples from the collections I've mentioned. Now let's switch to the models. In our study we employ a word2vec model pre-trained on the Google News dataset, from the Gensim library, as a baseline model, just to establish some kind of reference against which we could compare the other configurations. In the case of the word2vec model, to obtain sentence embeddings we averaged the embeddings of the words in the sentence. But the main model in our research was the Sentence-BERT model, since it is considered a state-of-the-art model for the task of semantic similarity search. The scheme on the right side of the slide presents the Sentence-BERT architecture at inference, just as a reminder. In our research we use several Sentence-BERT models straight out of the box from the sentence-transformers framework, including a MiniLM model, a Sentence-BERT based on MiniLM, a Sentence-BERT based on DistilRoBERTa, and MPNet. Besides, to achieve better results, we also fine-tuned a Sentence-BERT model based on DistilRoBERTa, which is denoted as DistilRoBERTa+. Now let's talk about the fine-tuning process of the DistilRoBERTa model. As I said earlier, we parsed additional data, so we joined our initial train set from the EPIE dataset with the contexts we collected from the Guardian. Then we split this new set into a new train set and a validation set in the ratio of 90 to 10, again with stratification. Then we create so-called positive and negative pairs. To create a positive pair for a sentence from the new train set, we matched the sentence with another random sentence from the new train set with the same idiom, and then we removed the idiom from the first sentence. The process for the creation of negative pairs was identical, except that we matched two sentences with different idioms. This process is illustrated in the table at the bottom of the slide. On the right side of the slide we can see the hyperparameters we
have used for fine-tuning, and the graph which illustrates the dynamics of the accuracy on the train set across five epochs. As we can observe, at the fifth epoch accuracy reaches a plateau, so we stopped training there. Now let's take a look at the final results. This table contains the MRR scores for all of our configurations. As we can see in the green cell, the fine-tuned DistilRoBERTa model achieved the highest result overall, on the idioms collection. As I said earlier, we consider the word2vec model with the sentences collection as the baseline configuration; therefore we see an 80% gain compared to the baseline, and a 46% gain over MPNet with the sentences configuration, which achieved the highest MRR score before the fine-tuning process. We can also draw the conclusion that, since MRR is higher than 0.5, on average the correct idiom is ranked no lower than second. On this slide you can see examples of simple and difficult idioms for our best configuration. Simple idioms are characterized by a high reciprocal rank averaged over all the corresponding sentences from the test set, while the average reciprocal rank for difficult idioms is close to 0. As a possible explanation for these variations in performance, I can mention that some idioms might be used in a wider or more complicated context; another possible reason is related to our evaluation protocol, because we assume that there is only one suitable idiom for each sentence, which might be overly strict. However, these hypotheses require further research, and we plan to examine this phenomenon in the future. As for the main results of our study, we'd like to highlight that, first of all, we automatically expanded the EPIE dataset by more than two and a half times, and therefore we have created basically a new dataset for the task of idiom recommendation for the English language. Secondly, we present a Sentence-BERT model fine-tuned particularly for the task of idiom recommendation. Thirdly, we present a novel approach which is
based on semantic similarity search and the use of Sentence-BERT, and we have also examined the suitability of several neural models, including word2vec and Sentence-BERT, for the task of idiom recommendation. As for the future plans: in the foreseeable future we plan to further expand the dataset, because we have a hypothesis that showing the model even more varied contexts at the training stage might result in even higher performance. Secondly, we would like to add filters to prevent some kinds of inappropriate recommendations. We would also certainly like to analyze the impact of the context length on the performance of our approach and to experiment with contexts longer than one sentence. And finally, we plan to use some kind of tool to filter out sentences which contain idioms used in the literal sense, because in the initial EPIE dataset some idiomatic expressions were used almost exclusively in the literal sense. On this slide you can see a QR code which leads to the code, models and extended dataset, which are all freely available; we invite all of you to check them out. That's it, thank you for your attention. I'm ready to answer any questions.

Maybe we have time for one quick question, because we're slightly over time.

Okay, good. We have many, but I will try to be quick. My question is about the test set, because there are multiple ways to split it: sometimes the test sentences may contain the same phrases as the train set, and sometimes those phrases are kept unseen in the test set. So can you please elaborate a little on how the test set was split and created? Thanks, and thank you for your talk.

It was just a random split. We didn't have any duplicates, because we removed them, if I heard your question right. So it was just a random split: we have some contexts containing idioms in the train set, and some contexts without idioms in the test set, for which we want to find the correct idiom.

Okay, thank you. And one more quick question: do you think that the scores that
are higher after this fine-tuning step because the structure might be quite evident to the model? Some phrases, like food for thought, can be used only with some preposition that you don't hide, and some phrases use a verb that, according to the grammar, evidently cannot be used in that position. Have you thought about that? Maybe it's going to be future work, or maybe there are some examples that were correct but not identified by the test set? Thanks.

As for the first part of the question: yes, it's up to further research. We have thought about it, but we haven't had enough time to check all the hypotheses. Some of them are described in our paper, but it's hard to put them in a few words, so I can just invite you to read our paper. As for the second part, as I mentioned, our evaluation protocol isn't quite ideal, because we don't consider the fact that some contexts may contain more than one appropriate idiom, so there is not just one correct answer. It means only that we have a kind of lower estimate for our approach; it might be even better.

Okay, thank you. Let's thank the speaker again. The session is over; see you all tomorrow.
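The retrieval pipeline and metric from the idiom-recommendation talk (rank the collection by cosine similarity to the query, then score with mean reciprocal rank, MRR = (1/|Q|) * sum over queries of 1/rank of the first correct answer) can be sketched as follows. The toy vectors and idiom strings are invented for illustration, not taken from the paper's data:

```python
import numpy as np

def rank_idioms(query_vec, doc_vecs, doc_idioms):
    # Rank collection sentences by cosine similarity to the query
    # embedding and return their idioms, best match first.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    order = np.argsort(-(d @ q))
    return [doc_idioms[i] for i in order]

def mean_reciprocal_rank(rankings, gold):
    # MRR: average of 1 / (position of the correct answer, 1-indexed).
    return sum(1.0 / (r.index(g) + 1) for r, g in zip(rankings, gold)) / len(gold)
```

An MRR above 0.5 means the correct idiom sits, on average, no lower than second in the returned list, which is the reading given in the talk.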