to once again say thank you to the useR! organizing group and all the helpers, and especially all the people with R-Ladies who have made all of this possible. We are very happy and excited to be giving this tutorial on predictive modeling with text using tidy data principles. You will be seeing me and Julia as speakers, and we have all the links and information right here.

That's right, we're both really excited to be here, so thank you all so much for all the work that has gone into getting us to this place. The plan for today is that we're going to have a little less than one and a half hours of tutorial. We'll be doing a case study, talking through it with slides, and we will do a couple of breakout sessions where we'll go through some of the code and see a little more of what we're working with. After the tutorial you can look at the material on GitHub, and you'll be able to run it all yourself. It is most likely not feasible for you to run the code alongside us, as some of the code chunks take quite a bit of time to run. Right, off to Julia.

So if you have come here and are interested in what we're doing today, you have probably, somewhere in your working life as an analyst or data scientist, run into some text and wanted to learn something about that text. That is very common for so many of us, whether we work in tech, in health care, in finance: we often get into the situation where we have text and we want to learn something from it. The case study that we're going to work on today uses complaints that are submitted to the United States Consumer Financial Protection Bureau. So how did this data set get created? What is it? The way this works is that there's a person, and that person has a problem: they're using some financial product or service, like a bank or their credit card, or they try to get a loan, and something goes wrong. After they have that problem, they can submit a complaint (you can go to the next slide, yeah) to this organization, which is a government bureau in the United States. When they submit this complaint, they explain what happened; they actually go and type in what happened to them. So this is text that is submitted to this government organization, and it turns out we get a lot of other information along with it. They submit the complaint, and then what happens in real life is that the complaint gets sent to the company, which can respond or dispute. What the Consumer Financial Protection Bureau is in charge of is trying to make sure that, in the United States, things like banks, loans, credit cards, and mortgages are fair and consumers are treated well. So that's the idea here.

Let's look a little bit and think about what's going on here. This data set is publicly available; you can go to this website and download it. It's pretty big, so for the purposes of what we're going to walk through, we've actually subsampled it down, and we show by how much here, so you can get an idea of what's going on. This is a real data set, and in this case study we're going to walk through how the text has latent information in it that can be used to learn something.
We have other information too: we know when the complaint was submitted, we know what company it's about, we know the zip code of the person it came from, we know when it was sent to the company, and we know what the company did about it. So we have all these different kinds of information about what is happening here.

Let's just look at a few of these; we're going to really dig into this a lot during the course of this case study, but this gives you a sample of what the actual text is. The text column here is called consumer complaint narrative. Here it says things like, "I opened a bank account with Chase bank in XXXX. A week after I deposited a check..." and then they go on to explain what happened and what went wrong. Or, "I had a Nordstrom Visa credit card and it was closed..." They're describing things that went wrong. This gives you an idea of the text that is here, and it is an example of the kind of text that gets generated by all kinds of processes in so many of the different domains that we work in. Just glancing at it, we are already starting to learn some things about the text we have. For example, this particular text has been censored, or cleaned, in ways to protect people's personally identifying information; all those X's are there to protect things like dates, people's credit card information, locations, things like that.

So let's take a step back and ask what it is we're trying to do (can you go to the next slide, Emil?) when we do something with this kind of text. With text like this we can do exploratory data analysis, and we can build unsupervised models and learn what's in there, but text like this can also be used for supervised or predictive modeling. Much as we are used to using nice rectangular data made of numbers to build some kind of predictive model, we can use unstructured text data to do this as well. Just as we are used to building regression or classification models with other kinds of data, we can build those same kinds of models with text data. In the case study we're going to walk through together in this tutorial, we're going to build a classification model; it's a binary classification model that we build today, but we can build regression models too, and you can build multi-class classifiers. Many of the options available to us in our general modeling toolkit are available to us when we use text. The hump to get over is that we have to use the structure that is in language, the ways language exhibits organization, to be able to create features for modeling. We have to do the work of feature engineering, or the work of data preprocessing, depending on what kind of language we want to use, to create the features and get to the modeling step. So there are specialized techniques, and specialized packages that we're about to talk about, that get us from the unstructured data to features that we can use in our well-tested, familiar modeling approaches.
And one important note here is, as we saw before, that the field the text sits in can be very varied in length, and we still need to make sure that we get back to a rectangular data set. Even if the text has 10 words or 10,000 words, we still need to find a way to get it into the same rectangle so it will fit in the models we're using. Almost more importantly, we need to get numbers out on the other end, because that's what the models actually need to use.

For this we will be using the tidymodels framework, which is a collection of packages developed to help and ease modeling and machine learning workflows, and it uses tidyverse principles. It is, in a way, a successor to the caret package. It doesn't implement a lot of the statistical methods themselves, but it adds bindings and hooks, and a lot of the workflow problems are eased by using tidymodels. Then, to turn our text features into numerical features, we will be using textrecipes, which is a package I have been developing that handles a lot of the different ways you would turn text into numerical features, in a way that fits perfectly into the tidymodels framework. Our general workflow is: we'll be starting with some preprocessing, so we get some data in, split it, and do some preprocessing; then we go to a model, which we will train, say a logistic regression or a random forest; and then, as the last step, we want to do some validation: we want to see how well our model actually performs at the task we specified.

Yeah, and one thing, when you look at tidymodels as an ecosystem: you can use a mental model that is similar to the tidyverse, where it's a meta-package that contains a lot of smaller packages, and each one is modular, so that if you only need functions from one, you can load up just that one. They're broken apart so that each part of a modeling workflow is in one separate package. Much like, if you're familiar with saying library(tidyverse) and understanding that this gets me ggplot2 and dplyr and tidyr, if you do library(tidymodels), in a similar way it gets you all these different packages that each have specific emphases and functions.
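(For reference, the setup the tutorial assumes looks roughly like this; loading textrecipes separately is an assumption, since it is not attached by the meta-package itself.)

```r
# library(tidymodels) attaches the core packages of the ecosystem
# (parsnip, recipes, rsample, tune, workflows, yardstick, and friends),
# much like library(tidyverse) attaches ggplot2, dplyr, and tidyr.
library(tidymodels)

# textrecipes extends recipes with steps for turning text into numeric features.
library(textrecipes)
```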
Yes, and this is all very nicely accompanied by a long-standing project Julia and I have been working on. Starting a little over a year ago, I approached Julia with an idea for a book, and we have been slowly working on it ever since, and we're very happy to officially announce the book. The book is called Supervised Machine Learning for Text Analysis in R, and we have a rough draft of the first two-thirds of it. Since the name is a bit of a handful, you can find it at smltar.com. Yeah, and Emil, how do you like to say it? I like to say it "small tar". Yes, and we almost bought the domain without the a; that's true, we did, we were like, oh no, there's an a. So a lot of this tutorial is based on material in the book, and we won't go nearly as much in depth as we do in the book, so if you need a deeper explanation, you can go after the tutorial and find the corresponding chapter.

Yeah, that's great. There's one question so far that I think now might be a good time to address: there's a question here about how well this works with the stm package. The stm package is a package for topic modeling, which is an example of unsupervised machine learning, and what we're going to focus on today is supervised machine learning. A way in which they're similar is that both work well with a tidy-data-principles approach to your text analysis, for example if you want to use broom-style tidying of your output. A difference, however, is that you would use stm if you have unlabeled data and you want to do unsupervised modeling, and you would use the approaches we're talking about today, textrecipes and supervised machine learning, if you have labeled data and you want to train a classification or regression model. That actually is exactly what Emil is about to talk about: dealing with the labels on our data.

Yes, so one of the first things we need to find out is what we want to predict from this data set, and in this case we have a hypothetical scenario where someone forgot to label what type of complaint it is. In this data set there are eight different categories: you have complaints about mortgages, student loans, credit cards, prepaid cards, and a bunch of other categories, but as we can see right here, there's not an even number of complaints within each class. So for the remainder of this tutorial we will be lumping: everything in the most common class will be one class, and then everything else will be lumped into our second class, so we turn it into a binary classification task.

Let me just quickly share my workspace. As a note, both Julia and I have already pre-run all the code in these chunks, because some of them take quite a bit longer. We can see right here that we have the original data, which takes a little while to load and a little while to view. We have all the data right here: we have date received, we have the product, which is the class we are trying to predict, and we have a lot of different fields. What we'll be doing now is taking our complaints and creating a factor with two levels: one level if the class is this very long "Credit reporting, credit repair services, or other personal consumer reports", and then labeling everything else "Other". That way we have a factor variable with two levels. Additionally, to save a little room in the code, we are renaming consumer complaint narrative to text, because it's quite a long name and we started having a little problem with room on the slides. You can see we now have the same data as before, and if we take complaints2class and count the product, we see we have a roughly even distribution between Credit and our Other class, which is really convenient for us. Yeah, and it's not that you can't do a multi-class classification, but for the purposes of keeping this tutorial fairly short, we are simplifying our task a little bit; the same data set could be used to train a multi-class classifier, exactly.
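(As a rough sketch of the recoding just described, assuming a data frame called complaints with the columns named as in the CFPB data, something like this produces the two-level outcome:)

```r
library(tidyverse)

# The most common category, which becomes our "Credit" class.
credit <- "Credit reporting, credit repair services, or other personal consumer reports"

complaints2class <- complaints %>%
  mutate(product = factor(if_else(product == credit, "Credit", "Other"))) %>%
  # a shorter name, so the code fits on the slides
  rename(text = consumer_complaint_narrative)

# Check the class balance of the new two-level outcome.
complaints2class %>% count(product)
```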
So now that we've said okay, we are going to approach this as a binary classification task, pretty much the first step after we've gone through exploratory data analysis, when it's time to start the modeling process, is to think about splitting our data. Many of the machine learning approaches that we can take with any kind of data, including text data, can be very powerful, and they can memorize features in our data, and we want to have as good an estimate as possible of how good a job the model is going to do when we apply it to new data. To do that we need to have held-out data: some data that we're going to train on, and some data that we will use to test how well our model is doing. In almost any realistic situation we have a given amount of data, and we have to think about that data as a resource that we have to spend in some way; one way that I like to talk about this is that we have to spend that data budget wisely.

The little visualization that we have here is a typical test/train split: you start with a certain amount of data, we randomly assign some amount of it to your training set, and that is the data that we'll use for training all the models we might want to try out. The test set is held to the very end; the test set, as we've said here on the slide, is a precious resource that can only be used once. The purpose of the test set is not to compare models, not to decide which one is the best to use; the purpose of the test set is to estimate performance on new data, to say how my model is going to perform on new data.

So let me go here and show a little bit. Let's say I have that complaints2class data that Emil just made, where the product has now been dichotomized into these two classes: is this about credit, or is it about any one of those other kinds of things? What we want to do is split it, so let's look at this split object. What is this thing that we made, complaints_split? It has class rsplit, and it's a Monte Carlo split. It's not a data frame; if you're used to using tidy data, what is this thing? When it prints out, it says: okay, I started with this much data, that's the total, and then I put this much into the analysis set and this much into the assessment set. The data isn't in here; what this object is doing is keeping track of which observations go into training and which into testing. Analysis is like training and assessment is like testing, and the way I actually get the data out is with these functions. I'm going to run this one, and then this one, and now let's look at them and see how big they are. Complaints_train has almost 9,000 rows, and complaints_test has just shy of 3,000 rows. The default for initial_split is to put three quarters in training and one quarter in testing, but that's something you can change if you have different needs for your particular modeling question; that's why it ended up splitting the way it did. Notice that the split object itself does not have data in it, it just keeps track of which observations go where, but the training and testing sets both do have data in them, and we've used these functions to say: I know where the data is and I need to go get it, and they are ready to go.
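(A sketch of the splitting code being run here; the seed is an arbitrary assumption, and strata = product is explained a little further on:)

```r
library(tidymodels)
set.seed(1234)  # the split is random, so set a seed for reproducibility

# Defaults to 3/4 training, 1/4 testing; strata = product keeps the class
# proportions the same in both sets.
complaints_split <- initial_split(complaints2class, strata = product)
complaints_split  # prints <Analysis/Assess/Total>, not the data itself

complaints_train <- training(complaints_split)  # the ~9,000-row training set
complaints_test  <- testing(complaints_split)   # the ~3,000-row testing set
```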
So now we have these data sets; they're just data frames. But now we as modelers have to decide what we're going to do, and in this case we have both text and other variables, so we have some choices to make, don't we, Emil?

Yes. So we have these 18 different variables, and we need to find out which of them we will be using in our modeling, and for this we have a very short feature selection checklist. The first, and almost most important, question is: is it ethical, or even legal, to use these variables? Using variables such as race and sex and gender may very often not be ethical, and in some cases may even be illegal for you to use in your predictive modeling, and this even starts to include things like zip code, which is often used as a proxy for race. So we need to be aware of what we are allowed to do, but, I would say almost more importantly, of what the ethical thing to do is, because laws tend to be very slow in that regard. The second thing we need to make sure of is: is the variable we use to train our predictive model available at prediction time? Since this data set was created after the fact, some of our variables are created after the point at which we would actually be able to use them, once the model is put into production. And lastly: is the variable likely to contribute anything to the explainability of our model?

So, keeping these three checks in mind, let's take a look at our data one more time to see what we can say about it. We already talked about how we have the zip code; it's very unlikely to be a good idea to use it, because the model would most likely just use it to find racially segregated areas, which we're not allowed to do and shouldn't do, and in some ways the state has a little bit of the same feel: it seems weird to think that one state should be more likely to do something than another state. Then we have: will it be available at prediction time? We're assuming that this model is put into production and applied when the complaint is received by this government entity, so a variable like date sent to company is not available at prediction time, because that happens later on; the same with the response from the company after they have received it; and the same with timely response, since we'll be making our prediction well before that. And lastly, here we have a variable that may or may not carry anything valuable: the complaint ID. This is most likely just a unique identifier of the complaint itself, which doesn't carry any information at all; at most, if it's sequential, it could be a proxy for time, but we already have a better variable for that in date received. So we will be using the three predictors right here: we have the date received; we have the tags, which is a low-cardinality factor variable, with missing values and then values like "Servicemember" and "Older American"; and lastly we have the consumer complaint narrative, which is the text.
Before I hand off to Julia, I want to make a note that we are not only using text data; we are using it in addition to more conventional predictors.

Yeah. We had a question that I think would be great to address; if you can go back just a couple of slides to where we show the initial split. In the initial split, we call initial_split with strata = product, and that conducts stratified sampling. Remember, product is our class, our credit-versus-not variable, so when you say strata = product, what we're saying is: I want to make sure that my test set and my training set have equal proportions of this variable, which is our outcome variable, the thing we're going to predict. That is what the strata argument does there. This particular data set is actually pretty even in the strata, so it probably would have turned out mostly okay without specifying it, but what this does is make sure that it is the same: it divides the data set by strata and then samples within each stratum after dividing.

Awesome. So we've decided what features to use, and now it is time for us to do this transformation, to preprocess our data and get it ready. Wow, look at that pile of code; that's an enormous pile of code that we've got there. I invite you to think about this in a variety of ways, depending on what you find helpful. You can think about it as a way to specify how to preprocess your data: your data came to you in one form, and you can preprocess it to get it ready for modeling. Or you can think about it as feature engineering: your data came to you in some form, you are building a machine learning model, and you need to engineer the features. These are equivalent, in my opinion. What we're doing with this huge pile of code is, in a principled, reproducible way, making it so that you can get your data from the form it came to you in into a form where it is ready to go into some kind of machine learning algorithm.

Before we move on, there's a lot going on here, but one thing I want to draw your attention to: look at that, it says tune(). What does that mean? I don't know; we're going to come back to that one. Okay, let's keep going. And let's notice one thing that we did here: we're combining text and non-text features in our model. I'm going to walk through it first via live coding, so you can see what's going on, and then we'll show the slides again.

So let's start with this first line: we start by defining a recipe. A recipe in this ecosystem is a way of saying: I want you to preprocess my data and get it ready to go into a machine learning model. The first part of it is a formula; if you use R for modeling, you have probably seen this kind of formula before, where we say: this is the outcome, these are the predictors, and here's the data. Notice that we're using the training data here.
That's because, much like a machine learning model has to learn from training data and then be evaluated on testing data, data preprocessing also has to be learned and then evaluated, because otherwise we have data leakage. So this is how we start.

First let's look at these date steps; these three rows deal with that date column, the date the complaint came in, using this function step_date. What we're doing here is adding steps to our data preprocessing specification, our feature engineering. We're saying: I've got a date column, which turns out to be kind of a weird thing; what am I going to do with it? I need to make something that can go into a machine learning algorithm, something I can do math, do linear algebra, on. So I'm going to take that date and get out some features: I want the month and the day of the week. Instead of having a date column, I'm now going to have the month and the day of the week. What I'm getting at here is: hey, model, are people more likely to submit problems with their credit consumer products during different months of the year, or on different days of the week? By doing this feature engineering, I'm taking the date and getting out different features. Then, since I don't want to keep the date itself, I'm getting rid of it with step_rm, and then step_dummy creates indicator variables, binary numeric columns, from a factor column.

So if I do this, and don't stress out about this next part too much, I'm demonstrating it, but if you're like, whoa, what's going on, just look at the outcome; I'm just showing you what it's doing. It's not handling the tags or the text yet, and the product is still here as a factor, but look: it's made columns like August and September, it's made columns like Tuesday and Wednesday. So that's what these first couple of steps are doing.
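(Here is a sketch of just the date-handling steps in isolation, with the peek at the result; the live demo used juice(), for which bake(new_data = NULL) is the current equivalent:)

```r
date_rec <- recipe(product ~ date_received + tags + text,
                   data = complaints_train) %>%
  # pull month and day-of-week features out of the date column, and give the
  # new columns the role "dates" so we can select them as a group later
  step_date(date_received, features = c("month", "dow"), role = "dates") %>%
  step_rm(date_received) %>%          # drop the raw date itself
  step_dummy(has_role("dates"))       # 0/1 indicator columns per month/weekday

# prep() learns the preprocessing from the training data; bake() applies it.
date_rec %>% prep() %>% bake(new_data = NULL)
```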
So now let's do the tags. So, Julia, before you move on, we have a question about data leakage in general. Ah, okay: what is data leakage? Data leakage is when you are training a model, and you get some data to train it with and say, okay, here's my training data, here's my testing data, and accidentally you end up using some information from what you thought was the testing data when you train the model. You then end up with an overly optimistic estimate of how your model is going to do. It turns out it is very easy to have data leakage. Would you like to explain it in any other way, Emil? That was kind of short, what I just said. Yeah, one of the ways I think about it is: if you have data leakage, you're using information from the testing data set, which in a way is using information from the future to affect how your model behaves now, and this is only going to be a problem once you push the model into production and it doesn't know what happens in the future, because the model you fit assumed that it would. Yeah, that's another great way to put it. I think that in machine learning practice and modeling practice, we've gotten a lot better at talking about data leakage in the modeling part, with the data we put into the machine learning algorithm, but we still don't have great practices around data preprocessing when it comes to data leakage, and that's one of the real benefits of using tidymodels, and recipes specifically: it makes a lot of that very explicit.

So, we've got here the tags; now I'm dealing with another column. I'm going to do step_unknown, which handles what to do when I don't have a value there, because remember, we saw we had a bunch of NAs, so I give those a new value; and then I do the same thing where I create the indicator variables. I can do the same thing here, prep and juice (don't worry about that part too much), and now we have the columns tags older Americans, tags servicemembers, tags unknown. That's what I'm doing with these steps: I'm getting ready to make all these columns that I'll be able to put into a machine learning algorithm.

I'm not going to run it for the text one, because it gets a little bit hairy, but let me just show you the results. If we put it all together, what we're doing is: I'm creating date features and getting rid of the date; creating indicator variables, which means making those yes/no columns, like is it in November or not, and the same thing for the tags; and now we're getting to the text, which is the thing we're actually talking about, right? We tokenize the text, which means breaking the text apart into pieces; we'll talk more about this later. We remove stop words from the text, words that are not very informative; we'll talk more about this later too. We find n-grams. Then we basically decide how many tokens we are going to keep in this model, and at the end we weight this by tf-idf, which is a statistic that you can measure about a word in a document within a collection of documents. Tf-idf weights up words that are important in one document in a collection of documents, and it weights down words that are in all of the documents; it's a way of weighting the words. So at the end, what we have is this data coming in, and a way of processing it so that it's ready to be used in a machine learning algorithm. Emil, I talked about a lot of steps there, including stop words, didn't I?

Yes. I just want to make one last note on this: specifying this recipe barely takes any time at all, because we're not actually doing the calculations; that happens later on. Specifying it takes a fraction of a second, because we're just saying what we are going to do, not actually doing it, which is one of the cornerstones of tidymodels: just specifying what we will do. All right, and we had one more question: what is role = "dates"? That's basically saying: we give all the variables coming out of this step up here the role "dates", and then (it's not on this slide, but later) when we do step_dummy, we say apply dummification to all the variables that have the role "dates". So it's a way of selecting multiple variables at the same time. Yes.
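(Putting it all together, the full recipe looks roughly like this; the tune() placeholder for max_tokens is the one we keep promising to come back to:)

```r
library(textrecipes)

complaints_rec <- recipe(product ~ date_received + tags + text,
                         data = complaints_train) %>%
  step_date(date_received, features = c("month", "dow"), role = "dates") %>%
  step_rm(date_received) %>%
  step_dummy(has_role("dates")) %>%
  step_unknown(tags) %>%                     # give the NA tags their own level
  step_dummy(tags) %>%                       # indicator columns for the tags
  step_tokenize(text) %>%                    # break the text into word tokens
  step_stopwords(text) %>%                   # drop common, low-information words
  step_ngram(text, num_tokens = 3, min_num_tokens = 1) %>%  # unigrams to trigrams
  step_tokenfilter(text, max_tokens = tune()) %>%  # keep only the top tokens
  step_tfidf(text)                           # weight the token counts by tf-idf
```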
Let's keep going to stop words in the interest of time. Yeah. So we talked about removing stop words. Stop words are, very loosely defined, words that have very minimal meaning, so hopefully they are words that we can remove with little to no loss of predictive power. Think of words such as "and" and "of" and "just", the kind that just need to be there to make the sentence sound nice, but hopefully don't add too much information. It turns out there are about as many stop word lists as there are practitioners; there are many different ones, there are problems with all of them, and they're very specific. Here, if we load the stopwords package and call stopwords(), which defaults to the snowball stop word list, we can see the first words right here. At a glance they do look right: most of these words don't seem that meaningful. But we still have some words that may or may not be interesting. For example, we have a lot of pronouns, like "she" and "he", which you may want to remove or may not want to remove, depending on what you're doing.

Yeah, before we go on to the next slide: there's actually pretty good support for stop words in other languages, and there's a kind of interesting plot in one of the chapters of the book about removing stop words for a bunch of different languages, so if you are a practitioner working with text in another language, that's an interesting thing to look at. And I think it's very important to note that if you use a stop word list, you should manually check every word on that list before you use it, because it might have words that you don't want, and some of these lists have been computationally translated and may or may not have been checked by a native speaker of that language.

So here we have an upset plot of three of the commonly used stop word lists in R. In the lower left we have a histogram of the number of stop words in each list: we have snowball down here with, I think, about 170 stop words, so it's fairly short, and then we have the ISO stop word list, which has nearly 1,300. So we have quite a big gap between the number of stop words in different lists: some conservative ones, and some quite liberal ones that may bleed into important information. And we see that the stop words they add along the way overlap, but not perfectly. Actually, the most important thing on this plot is this little 10 right here in the fifth column, which is basically saying that there are 10 stop words in the snowball list, which is the smallest list, and in the ISO list, which is the biggest list, that don't appear in the SMART list. So they are not perfect subsets of each other, and this is just looking at three; many more lists have problems.
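(The lists themselves are easy to inspect; a small sketch with the stopwords package, where "snowball", "smart", and "stopwords-iso" are the three sources compared in the plot:)

```r
library(stopwords)

# The snowball lexicon is the default source.
head(stopwords(source = "snowball"))

# The three lists differ a lot in size and are not subsets of one another.
length(stopwords(source = "snowball"))       # the short, conservative list
length(stopwords(source = "smart"))          # mid-sized
length(stopwords(source = "stopwords-iso"))  # the long, liberal list
```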
Another important note is that stop words are domain-specific. The lists we've shown so far are trying to be general-purpose English lists, but once you go into a specific domain, certain words stop having meaning. If we built a classifier on documentation from R packages, the word "function" stops having any meaning, because it appears basically everywhere, whereas the words "he" and "she" maybe suddenly become very important, because why would they appear in R documentation? The lists will also have bias in them, to various degrees, and again that's a little bit domain-specific: if you use a stop word list defined in one domain and try to transfer it, you can carry some bias over in that way. Another thing you might find is that if the stop word list was generated on unbalanced text, you get problems like having the male pronouns in the list while not having the female pronouns, because it was derived from uneven data. You can of course modify lists, or create your own stop word list for your specific needs: if you know from your domain knowledge of the text you're working with that certain words, like "package" and "function" and "namespace", don't add anything, you just add them to a list and say we don't want to look at these words. And as always, there's more information in chapter three, where we look more into how stop words are used.

Before we jump into models, there's one question I want to address right now, which is: how well does this data preprocessing extend to other languages? The question was specifically about recipes, and the infrastructure is general; the infrastructure for data preprocessing is general. One real goal that we have had, in the book and in what we talk about here, is to point out that the specific preprocessing steps do have different outcomes for different languages and different kinds of language, and that's because they have typically been developed by people from dominant cultures and backgrounds. So, for example, the stop word lexicons that are the best quality are the English ones, and when you go to use the ones for other languages, they're okay, but less good. When we talk about tokenization in a little bit, one of our points is that tokenization works best for English, and that's because everyone who has made these algorithms tends to work in this domain. So the recipes infrastructure is general; the inequity that we see, when it comes to what are called high-resource languages and low-resource languages, comes from the preprocessing steps, and we talk about that in several places in our book: what to be aware of, and how to, at the very least, acknowledge the limits of what it is that we're doing. We're going to address some of the other questions as we go along.

Right now we're going to shift to talking about models. We've been spending some time talking about getting your data to the point where it can go into a model, into some kind of algorithm that's going to do math, some kind of machine learning algorithm. What we want to talk about next is what kind of models work well for text: when it's time to train a model, what should I do, what do I need to know? The most important thing to keep in mind is that text data is sparse: sparse in the sense of a sparse matrix, sparse in the sense that most documents don't use most words. There's a special name for this kind of relationship in language: it's called Zipf's law. Most words are not used many times, and a few words are used a lot of times, so we end up with these sparse representations, and this is important when it's time to pick a model.
For text you want to pick models that work well with sparse data. One of my go-tos, the go-to actually that we're going to use in this case study, is a regularized linear model. In R, the big one is glmnet, or "glim-net"; I hear it pronounced both ways. How do you pronounce it? I spell it out. You spell it: g-l-m-n-e-t, yeah. Support vector machines also work quite well for text data. Naive Bayes is also a really good fit for text data; that's because you end up with a ton of features (the features in this case are words, or tokens, or other kinds of things like n-grams), and naive Bayes treats them all independently, so it can chug away and fit and train.

What about tree-based models, things like random forests? No: tree-based models do not perform well with text data. This is actually something to remember, because in general random forests are a mainstay of machine learning, right? Some data comes along and you're like, what are we going to do? Throw a random forest at it. There have been studies showing they perform really well in so many different circumstances, but tree-based models do not perform well on sparse data; they can't take advantage of the sparsity very well, and they just don't tend to do great. This kind of information is important to keep in mind when we are choosing what kind of model to try for our text data.

So I keep saying that text data is really sparse, but does text data have to be sparse? Is it always sparse? There is a whole subfield of NLP whose point is to use the information in text to transform this high-dimensionality sparse space, where we have the language data, into a lower-dimensionality dense space. There's a quote from John Rupert Firth, who was a linguist: "You shall know a word by the company it keeps." This is the idea of word embeddings. We can learn, from a large data set of language, which words are used together, and then we can use that to transform the data: instead of a tf-idf-weighted, high-dimensionality, sparse data set, we can transform it into a lower-dimensionality dense data set that uses how words are related to each other.

I saw that there was a question like: why use tf-idf and weight it that way? Why not use word embeddings? Chapter five of our book is about word embeddings, and I recently gave a webinar for Why R? that focused all about that. This is definitely an option that's out there, and one where you can use tidymodels and textrecipes, but it is for sure a step that should be taken with pretty full knowledge of the risks involved, because, more so than other text preprocessing steps, this is a preprocessing step where historic bias in large data sets ends up built in, baked into your model. So we're not going to talk any more about word embeddings today, but we'll point you to these other resources. So that's a little bit about models.
Emil, let's talk about how we actually, practically, go about building a model. Yes. So when we're building a model in tidymodels, we have three main steps. First we need to set the model we will be using: that would be our linear model, naive Bayes, glmnet model; just which statistical model we want to use. Then we set the mode: the mode is what kind of modeling you want to do, and the classic ones are regression and classification. And lastly, we need to decide the engine: what package or library is actually doing the calculations later on. Here we have the wonderful art from Allison Horst. We'll be using parsnip to do all the modeling, and parsnip adds a standardized interface for statistical modeling: it doesn't actually run any of the models itself, but it attaches hooks into other libraries, in R and outside of R, to handle the computation for you. And at tidymodels.org/find/parsnip there is a searchable area where you can find the different models and model types and what package they come from, with more being added along the way.

So first we specify a model. Let's say we're doing a support vector machine with a radial basis function, and then we pipe in and say: I want the mode to be regression; this would be a model specification that does regression. Or we could instead say: this model needs to do classification. We need to say up front what kind of task this model will be doing. We also need to set the engine: in the case of our support vector machine, we have different packages that can perform these calculations, and we need to say up front who we want to do the computation. Here we have kernlab as one of the ways to do it, and we also have liquidSVM, which is another available engine.

So when we are specifying our model, as we said earlier, we are going to use a lasso regression: we start by specifying a logistic regression, we set mixture to one since we're doing a lasso, and then we pipe in and say I want to do classification, and I want to use glmnet as the backend to do the calculation. And this takes no time at all, because it's just a specification saying: this is what we want to do. Then we have a model object with these settings.
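(The lasso specification being described, as a sketch; penalty = tune() is the placeholder we'll resolve shortly:)

```r
# Regularized logistic regression: mixture = 1 makes it a pure lasso, and
# penalty is the amount of regularization, which we don't know yet.
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
  set_mode("classification") %>%   # a binary classification task
  set_engine("glmnet")             # glmnet does the actual computation

lasso_spec
```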
And we noted that we have a little tune() up here, and a little tune() down there; all that tune(). What are we going to do, Emil? What if we took our break now; do you think a break would be a good time? Yeah. That tune(): we're going to leave you right now like, oh no, tune(), what is it, what are we going to do? We've been sitting here for a long time, and so have you, so the next thing we're going to do is take a five-minute break; let's come back at 15 after the hour. I'm going to turn off my video and mute for the next five minutes, and I recommend everyone get up, stretch, reach up, get a glass of water, and then we'll come back, and the next thing we are going to talk about is the tuning: what does that mean, and what is it that we're going to do about it?

All right, we're about ready to start again. Yes. So we are roughly halfway through, but we're taking a little longer than we initially thought we would. We have a couple of open questions, but I feel like we should just wait with those until the end, the ones that I see here. I agree with you, Emil, I think that's a good idea; the couple that I see, we'll take at the end. So we're a little bit past where we'd like to be, so we will see how much we can do in the next 20 minutes or so and then take questions. Does that sound good? Yeah. I will probably skip, maybe, I think we should skip tokenization. Okay, well, let's do that fast. Do it fast, yeah, that sounds good. And maybe skip the live coding for variable importance. Yeah, sounds good. Okay, all right.

So, welcome back, everyone; we are so glad you're here. When we train models, some of the model parameters can be learned from data during fitting and training: if you've fit a linear model to some data, you're like, yep, I learned the slope and intercept from the data. But it turns out some model parameters cannot be learned during training, and we call these hyperparameters. So how do we go about learning these? What we do is train lots of models, with different combinations of these hyperparameters, and we compare them and see which ones did better, and we use that to estimate which parameter values are the best ones, the right ones, for our data. That's what tune() means: when we wrote tune() in those places, that was us saying, I don't know what the best value for this is; instead, I want to try different values of it. And now it's time for us to go ahead and do that: we're going to make a grid of possible hyperparameters.

Here's how I can go about setting this up. Think about this as us, the modeler, saying: I am doing a regularized linear model, but I don't know what the right amount of regularization is, so I want to try a range for the regularization penalty; and I am putting tokens into this text model, but I don't know what the right number of tokens is, so I'm going to put a range in here too, and then we're going to try a number of different levels. Then we end up with a grid that looks like this, which says: okay, here are the values for the penalty I'm going to try, and here are the values for the number of tokens I'm going to try. Then we're faced with: great, here are the values I'm going to try, but how am I going to compare and evaluate these different models? What is it that we are going to do? And I just have a quick note: we specified the ranges here, but all of these hyperparameters have fairly decent default ranges, so you don't necessarily have to specify them.
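(A sketch of building such a grid with grid_regular(); the exact ranges and levels here are illustrative assumptions, chosen to match the 500-to-2,000-token range discussed later:)

```r
param_grid <- grid_regular(
  penalty(range = c(-4, 0)),          # regularization amount, on the log10 scale
  max_tokens(range = c(500, 2000)),   # vocabulary sizes to try
  levels = c(20, 4)                   # 20 penalty values, 4 token counts
)

param_grid  # one row per penalty/max_tokens combination to try
```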
All right. So imagine up here at the top we have all our data, which we have already split into our testing data set and our training data set. We want to try all these different combinations, but we can only use our testing set once, so we need to find a way to assess all these different hyperparameter combinations without touching the testing data set. We'll do this by resampling our training data set to create analysis and assessment data sets, which are a little bit like training and testing: we fit on the analysis set and evaluate on the assessment set, and we do this a bunch of times to find out how well we're doing. As we said before, we want to make sure we're spending our data budget very efficiently, so we are doing something called v-fold cross-validation. Imagine you have your training data right here: we split it into different folds, so here we have five folds. In the first split, we make one of our folds the assessment data set and all the others the analysis set, and then we slide along, so that we get a bunch of different resamples of analysis and assessment data. In this way we are trying to spend the data wisely, because what we have in our training data set is a limited resource, and this allows us to assess multiple different variations without having to touch the testing data set. So now we have the resamples (the resamples are the cross-validation sets), we have the features, and we have the models, so let's see what we do next.

That's right. The next thing that we're going to do is create a workflow. We have our resamples; in this case those are the folds, and this is tenfold cross-validation. These are splits, which are the same kind of thing that a testing/training split is, in that they keep track of which things go into analysis or assessment. Then we have a recipe, which we talked about earlier; this is our data preprocessing recipe. And we have our model specification, and notice that some things are being tuned. So it's time to put those things together, and we put them together in something called a workflow. A workflow is a convenience function to put pieces of models together. For example, if I just added the data preprocessor, it would say: the preprocessor is a recipe, and model: none. It has slots; it's like Legos, where you put things together, and it gets everything ready for you to fit a model. It's a convenient way to carry around the different pieces of a modeling pipeline, or a modeling workflow. And once I add a recipe and I add a model, it says: ah, okay, you gave me a preprocessor and you gave me a model, so now I'm ready to go. The benefits of using a workflow are its convenience, its protection against data leakage, which we talked about before, and how it fits into this ability to do tuning, which is what we've been talking about: we don't know what the right values are, for example, for the regularization, or the right number of tokens.

I'm going to go back to the slides to show the tuning, because that is slow. So now we're going to tune, and you can tune in a couple of different ways; we're going to show a straightforward way to tune. Now it's time to put all these pieces together: the workflow, which says what the preprocessor is and what the model is; the resamples, which are the data that we're tuning on; that grid, remember, which is the possible parameters we're going to try; and then some details about the modeling, a control: in this case we're going to save the predictions so that we can explore them later.
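(Pulling the pieces together, a sketch of the resamples, the workflow, and the tuning call; the seeds are arbitrary, and control_grid(save_pred = TRUE) is what keeps the predictions around:)

```r
set.seed(123)
# 10-fold cross-validation on the *training* data, stratified by class.
complaints_folds <- vfold_cv(complaints_train, v = 10, strata = product)

# A workflow bundles the preprocessor and the model specification.
complaints_wf <- workflow() %>%
  add_recipe(complaints_rec) %>%
  add_model(lasso_spec)

set.seed(42)
# Fit and assess every penalty/max_tokens combination on every resample.
lasso_rs <- tune_grid(
  complaints_wf,
  resamples = complaints_folds,
  grid = param_grid,
  control = control_grid(save_pred = TRUE)  # keep predictions for ROC curves
)
```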
The output of this will look a lot like the cross-validation folds, where we have the splits and the folds, but now we have some new columns, like metrics and predictions. So these are the results that we have here. One thing you noticed as we went through this is that we did quite a number of preprocessing steps, including tuning the number of tokens that we kept, and that we were using n-grams, and that's something we haven't talked about yet; it's what we're about to cover.

Yes, so I'll be talking very briefly about what we really mean by tokenization. As a very general term, tokenization, in the domain of NLP, is splitting some text into smaller pieces of text, which we will refer to as tokens. The most common token to work with is generally a word, but you can also tokenize to other things, like subwords, sentences, or even characters. It is an essential part of most text analysis: almost any time you do something with text, you want to turn it into tokens, so you have something to count; you want a smaller unit than the whole text by itself. And, as always, there are a bunch of different options for how you do tokenization.

Here I have a little example of a complaint we had in our data, with some slight modifications. The first and most naive way of tokenizing is splitting this string of characters by whitespace, that is, by spaces, tabs, and newlines. We notice that punctuation and things like that stay in: we get "me." with the period, because the period is not removed; it's not whitespace. Another popular option in the R ecosystem is the tokenizers package, which, if you're using tidytext, is the main backend for tidytext. That has a more sophisticated tokenization procedure, which takes the same text but turns it into a different kind of token: the first immediately obvious thing is that we don't have periods, and everything is lowercase, and we wrote out the steps this algorithm goes through in the book; it's quite complicated. And as another example we have the spaCy library: this is a Python library for NLP, and we have a backend for it via the spacyr R package. They have a different algorithm again for splitting text into tokens, and if we look at all of these side by side, we see there are a lot of differences in how they do it: the tokenizers package by default turns everything to lowercase and removes punctuation, but spacyr keeps the case, keeps a hyphenated word's hyphen as its own token, and splits "doesn't" into two tokens. It's not necessarily a better tokenizer, but it's a different one, which may or may not affect the way your tokenization turns out.

So we have a lot of considerations when working with this: how do we deal with case, how should we handle punctuation, and what do you do with a word like "doesn't"? With hyphenated words, do you keep the hyphen, do you split on it? There are a bunch of different ways. You also have the idea of multi-word expressions: if you're looking at text from American politics, you probably want "White House" to be its own token, because it's one unit by itself, but you need to find a way for the tokenizer to accept those. And, luckily in one way and unluckily in other ways, tokenization of English text is quite a bit easier than for most other languages, simply because the idea of a word is fairly cleanly separated by spaces: you can get decent performance just by using a whitespace separator, because the notion of a word is so entrenched in English. You have a lot of different problems in other languages, and we touch upon that in the book as well.
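(A sketch of those tokenization differences on a made-up string; spacyr needs a Python spaCy installation, so only the first two are shown:)

```r
txt <- "I didn't receive my re-issued card. Help me!"  # invented example text

# 1. Naive whitespace splitting: case and punctuation are kept ("me!" stays).
strsplit(txt, "\\s+")

# 2. The tokenizers package (the default backend for tidytext):
#    lowercases and strips punctuation.
tokenizers::tokenize_words(txt)

# And since the recipe uses n-grams: bigrams (n = 2) from the same package.
tokenizers::tokenize_ngrams(txt, n = 2)
```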
other languages simply because the idea of a word is very cleanly separated by spaces so it's very easy to you tend to try the performance by just using a white space separator because the notion of a word is so interdult in english but you have a lot of different problems in other languages and we touch upon that in the book as well and then since we're using n-drams in the model let's just briefly tell over what it is so an n-dram is a sequence of n-situential totans where n would be the number of sequential totans so doing this captures the words that appear close together and we're even able to detect gradations so if the words not happy appear a lot together we can find that in the n-drams one of the downsides of this is once we start counting pairs of sequential totans the tarnality draws while light draws probably normally or worth the number of n in our n-dram and here's just an example where you would think that the totanizer's part is so we see up here with n-drams one is just the same as words but once we do n equals two which in another way also called by-dram we see that we have every pair of words and we start seeing right here in the middle have identity theft as being one unit that's worth counting which seems like for me it seems like very good units to be able to count that in our final model and down here we have a long time and then it's worth kind of feel like they tohesively fit together and at then we have more in our board in chapter two about how to do tokenization awesome so let's talk about when we have tuning results what do we do with them like how do we get from we tuned to we know we know what to do next so the the tuning has results in a metrics column and remember we said we were we trained a whole bunch of models we're going to evaluate and see which ones we didn't we did best and the metrics are what we use to say which models did best in this case we use the default metrics for classification which are ROC AUC and accuracy and so if we do collect metrics it just gets those out for us and we can see for every combination of regularization parameter like penalty and number of tokens what what what did we get there so that gets out that gets out the metrics but if we want to but it's better to visualize it and be able to see what what actually is going on and so there's a you know we can you we can get a pretty nice visualization out here and the we can see what what's going on here so as the regularization changes the ROC AUC changes from you know very good to quite bad like right that that's it's down there by like point five and we can see that there there's there we also see some change with the number of tokens there as we go from like 500 up to 2000 and so we can see the change that we get that we get out there we we can get the actual best ones or we or we can see we can see okay of all of those that we had there which ones are the best so these are the five top performing models notice in this case that they do all happen to have the same regularization parameter and that they have different number of tokens also notice the top tokens you know they're not necessarily the biggest biggest models in terms of tokens they have a lot of tokens but not necessarily the top so which is kind of a sometimes interesting to see so these are some of the the results that we see and we can actually get out okay what is the best one we can get out one model because really that's what we need from tuning results at the end is we say I got to get one we can choose 
We can choose among them in different ways: if you're using regularized models, you might say, I want a simpler model that's within one standard error of the numerically best one, and you can get that out. But for this case let's just say we want the best performing model, and you can get that one with a function called select_best — there are other selectors available to you, but in this case let's get out the best one. So we tuned, we got the best one, and we can also get the predictions, and if we have the predictions we can evaluate in a different way, visually, and actually look at ROC curves. In this case we had ten resamples, so we can get ten ROC curves, one for each resample: when we resampled this data set of complaints, we trained and tuned the model for each resample, and these are the ROC curves we got. We can visualize what they look like, and they are very helpful for evaluating how the model did. So we're getting here to the end: we trained, we explored, we tuned, and then we can update the workflow with the parameters that gave the best ROC AUC, and then the model is ready to go — we can save it and use it with new data. Yes, so lastly, in the last couple of minutes, we'll be talking about how the model is actually deciding what to do. Here I'm using the vip package and its vi() function, which extracts the importance of the different features in our final model, so we can start visualizing how the text influences our predictions. We have the top twenty or so tokens from the text and how important they are for determining whether something is a credit complaint or an "other" complaint. We see "identity theft" on the "other" side, which for some reason is not treated as having to do with a credit complaint — it's probably associated with one of the other specific product groups — and on the credit side we have tokens like "agency", "experience", "fair credit", "credit", and "services", so a lot of these words really feel like they make sense given what the model is doing. Then we created a little visualization: we take one complaint and color each of its tokens (this model uses only unigrams) according to its variable importance. So here we have a credit complaint, and we see some very bright tokens like "identity theft" and "credit score" and "credit" spread around, and they are pulling it toward being a credit complaint rather than "other". We have another one right here where "credit report", "reporting", and "identity theft" pull it toward being a credit complaint, and it's the same deal again. And then we have one with a lot of non-highlighted words — that's because this one happened to be not credit related, which is why we're classifying it as "other". Awesome, I love those visualizations; I think they're so helpful for understanding what the model is doing, which is so valuable. Okay, so we updated the workflow, we understand something about what the model is doing, and so finally, finally, we're going to do something with the testing data. There's a function called last_fit, which is very convenient, because what it does is say: okay, I'm going to fit the model one last time using the training data and then evaluate it on the testing data.
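A sketch of that endgame — selecting the winner, inspecting per-resample ROC curves, finalizing, and handing over to last_fit; here `tune_rs`, `tune_wf`, `complaints_split`, and the column names `product` and `.pred_Credit` are assumed names from the case study setup:

```r
library(tidymodels)

# Pick the single best candidate by ROC AUC
best_auc <- select_best(tune_rs, metric = "roc_auc")

# One ROC curve per resample, from the held-out predictions
collect_predictions(tune_rs, parameters = best_auc) %>%
  group_by(id) %>%                          # id identifies the resample
  roc_curve(truth = product, .pred_Credit) %>%
  autoplot()

# Plug the winning parameters back into the workflow
final_wf <- finalize_workflow(tune_wf, best_auc)

# Fit once on the training data, evaluate once on the testing data
final_fit <- last_fit(final_wf, complaints_split)
```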
That's what it does: it takes that finalized workflow, it takes the testing/training split, and says, we're going to do this one last time. And notice, this is the first and only time we have used our testing data — we had not touched it before now, and we didn't use it to compare models or to decide on the regularization parameter or anything. What we can do with the output is look at metrics: these are the metrics evaluated on the test data, and fortunately for us, in this case we can look back and see that they are about the same as what we evaluated on our resampled training data, which means we have done a pretty good job here — or at least we are not doing a terrible job with our fitting. We can also get out the predictions on the test set, and we end up with another ROC curve showing that we did similarly well here too. So at the end of all of it: we started from very raw, unstructured text data, we transformed it into features that can be used by a machine learning algorithm, we used resampled data to tune, we talked about how to evaluate, and we ended up with predictions on our test set that we can use for a final evaluation. So, Emil, what do you want to say as we wrap up? Well, I would like to thank all the organizers, both from R-Ladies Argentina and the useR! 2020 organizing group, and all the sponsors, and of course I'd like to thank all of you viewers who came to listen and watch us talk about everything we've been talking about. Yeah, I agree with that so much, and I know we're a little bit over time already from what we said, but Emil and I do have a little bit of time to take some questions. I know some of you probably need to go, but as long as the organizers have a little bit of time, we can stay and answer some questions. We just want to thank you all so much — thank you to the organizers, and thank you all who came — and we can take questions as they come in here at the end. Sounds fun? Yes, awesome. All right, so I'm going to open up the Q&A and go up a little bit. We did go over variable importance very briefly — super briefly — and somebody expressed some interest: hey, that went by really fast, what is variable importance and how did we do it? So I'll start by saying one thing, and then maybe, Emil, you can chime in a little bit. Variable importance is one way to approach the question of model explainability: what is driving your model's predictions, what makes a model more likely to predict one way or the other, or higher or lower, or something like that. In some cases, variable importance is directly related to the kind of model that you have. That is true for linear models, including regularized linear models, so we can get the importance out pretty directly. It's also true in the case of tree-based models: you can directly get out a measure of variable importance because you know how all the trees are voting. In other cases, you can't get a measure of variable importance out directly.
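For the regularized linear model in this case study, the importance can be read straight off the coefficients; a sketch with vip, assuming the `final_fit` and `best_auc` objects from above (the `.workflow` column is where last_fit stores the fitted workflow):

```r
library(vip)
library(workflows)

# Pull the fitted parsnip/glmnet model out of the workflow fit by last_fit()
final_fit$.workflow[[1]] %>%
  pull_workflow_fit() %>%
  vi(lambda = best_auc$penalty) %>%   # coefficients at the selected penalty
  head(20)                            # top twenty tokens by importance
```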
When you can't get it directly, you have to go to model-agnostic measures: if you're training something like a support vector machine, it's more like a black box, right, so you have to use simulation-based approaches to get importance measures out — but there are approaches to do that, and we're going to develop a little more about how to measure variable importance for text models. Do you want to add anything more about variable importance, Emil? I'm going to add a little bit by answering one of our questions: we have one about the colored-text visualization, which is essentially visualizing the variable importance. For now there aren't a lot of tools in R to visualize these things; you can use the code from the slides, which is available in the repository, and I'll probably get around to writing a more formal package one day to easily assess models with text features coming out of tidymodels. Okay, I'm scrolling up now, and going down: there's a question about whether we should try more than one stop word list and compare performance. There's actually an example — it's called "Case study: stop words" in the machine learning regression chapter — where we show exactly that, and yes, I think that would be a recommended course of action: should I go really conservative and only remove a few stop words, should I remove a lot, should I take one of the big stop word lists? One of the great things about the kind of approach we're showing here is that it's very fluent to be able to do that and then compare the performance, because you don't have to just guess — you can try it and then see what works best. So yes, absolutely. And to add to that: with stop word lists, there isn't a universal way of defining what a stop word is, and it all depends on your domain. You can start off with a very conservative premade stop word list and modify it, and then you can start looking at frequencies — just look at the most frequent words in your training data set — because those are also the words most likely to actually have an effect on your model if they appear a lot, so you can easily weed out words that don't seem to really add anything. Emil, related to stop words, someone asked: why would we consider, or not consider, those censoring tokens — things like the xxx — as stop words? So, we haven't treated them as stop words because they do convey some kind of information. We don't necessarily know what it is, but once we start looking at multiple censored tokens at a time, we can start to see: oh, this looks a little bit like a censored date, or it could be a name, or a credit card number. We have a section on that in the book, too, where we explore a little of what's in the censoring. Yeah. Let's see — there's a question about the spaCy back end, so I'll just share that with you; it's an easy one. If you go to the spaCy website, spacy.io, and click on "Models", you have a list of the different languages they support right now — this question asked whether one particular language was available — and they're slowly adding more, too, but it's quite a hard problem to create these language models. You can use those, you just have to download the models separately, and then you can access them through R as well.
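Circling back to the stop word question above: a sketch of how one might compare stop word lists by building otherwise-identical recipes that differ only in the list, assuming textrecipes and the variable names used earlier in the case study:

```r
library(tidymodels)
library(textrecipes)

# Identical preprocessing except for the stop word list
make_rec <- function(source) {
  recipe(product ~ consumer_complaint_narrative, data = complaints_train) %>%
    step_tokenize(consumer_complaint_narrative) %>%
    step_stopwords(consumer_complaint_narrative, stopword_source = source) %>%
    step_tokenfilter(consumer_complaint_narrative, max_tokens = 1000) %>%
    step_tfidf(consumer_complaint_narrative)
}

rec_snowball <- make_rec("snowball")  # small, conservative list
rec_smart    <- make_rec("smart")     # larger list
# Fit the same model spec with each recipe on the same resamples,
# then compare collect_metrics() output to choose a stop word strategy
```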
There was also a question about neural networks: we talked about non-neural-network-based models, so what about neural-network-based models, and when are they good or not? I have used some of these neural-network-based models in my real-life day job, and absolutely, they are useful; they can learn from language in ways that these other models cannot always. If you've gotten a chance to look at the website for our book — maybe not, since you're sitting here with us — the third section of the book talks about how to train and evaluate those kinds of models. So for sure they're an important tool for predictive modeling with language, but they have upsides and downsides: how long they take to train, how interpretable they are. For a lot of purposes they're not always the best answer, so they're great to have as options, but they're not the right answer for every purpose. Anything you want to add? Yeah, we touch upon where the different kinds of models have their time to shine in the book too, and one thing worth adding is that any kind of deep model starts to struggle once you don't have that many observations, whereas the kinds of models we've looked at today work quite well on small data. Before this, I had an example with about 200 abstracts, and it worked flawlessly with glmnet, whereas you would probably have a very hard time building a deep neural network to classify that. There are a couple of questions about tuning: what is it, what does it mean? Do you want to start, or shall I? I can start. So what is tuning? If we started to train a model — say, a model on text data — and we got to the point of deciding how many tokens to put in the model, we could either just decide, pull some value out of the air, or we could try many values and pick the best one. That's a preprocessing step, and that process of trying many values and finding the best one is the process of tuning. There are also models themselves that work that way: the glmnet model we showed is an example, and if you've ever trained an XGBoost model, that's one whose parameters need to be tuned, or a decision tree model — it needs to be tuned because the defaults are often not the best for any given data set, so you have to try lots of values and then find the best one. And there are some models that don't need to be tuned, like naive Bayes, right, Emil? Yeah — I think it has a couple of minor adjustments, like Laplace smoothing, but more or less it doesn't have any tunable parameters.
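To make "tuning" concrete, a compact sketch in the style of this case study — marking the values we can't learn from the data with tune() and trying a grid of candidates on resampled data (object names assumed):

```r
library(tidymodels)
library(textrecipes)

# Mark the unknowns with tune(): how many tokens to keep, how much penalty
rec <- recipe(product ~ consumer_complaint_narrative, data = complaints_train) %>%
  step_tokenize(consumer_complaint_narrative) %>%
  step_tokenfilter(consumer_complaint_narrative, max_tokens = tune()) %>%
  step_tfidf(consumer_complaint_narrative)

lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

# Try every grid combination on every resample; keep metrics for comparison
tune_rs <- tune_grid(
  workflow() %>% add_recipe(rec) %>% add_model(lasso_spec),
  resamples = complaints_folds,
  grid = grid_regular(penalty(), max_tokens(range = c(500L, 2000L)), levels = 5)
)
```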
And so for a text data set like this you don't really need to tune it. So tuning is the process of: I have some model, and there are things in it I can't learn from the data itself, so I have to try a whole bunch of options and see which one performs best. The options you have are to try a whole bunch and see which one comes out best, or some more sophisticated approach, like a Bayesian one — try this, oh, it's better, go this way, go this way — some sort of smarter approach to try to get to the right answer. So those are the ideas of tuning. All right, I know we've been here a long time, and I really appreciate everyone who stayed. Maybe one last one — I think we got most of these — there was a question about why we went down to two categories, and what we would do if we did not want to have just two categories. Emil, we actually talk about this in the book, right? Yes, we have a section where we do the same thing as we did today, but trying to build a classifier that assigns a complaint to any of the eight product classes. Yeah, and we talk about how that is different from having just two. So, since everybody's been here a long time, let's call that good for questions. You probably saw the GitHub repo; I think there was one package missing from the R Markdown there, so we'll fix that and push it up — thank you very much for the note on that. We also have some notes there if you want to work through this on your own, and some places to ask questions, so if you have other questions that you want answered, Emil and I will both be watching for that. The organizers also asked me to remind you to please not forget to fill out the useR! 2020 survey, so please do that. And organizers, do you want to share anything here at the end? Yeah — well, I think we had a really great tutorial. Thank you so much, Emil and Julia, for this, and thank you so much to all the people who came to join us. We had — I don't know how to say this in English — more than 130 — can you help me, Julia? — over 130 people connected today. That's amazing, that's amazing. And so thank you so much; we really enjoyed it. We are going to share all the links — for the GitHub repo, for the slides, for the R Markdown — on Twitter, for all the chapters that are here today; we are nine R-Ladies chapters from Argentina, so don't worry, you are going to get all the material you saw here today. So thank you so much; I think everyone here deserves an applause, a big applause. Thank you so much, thank you, thank you. And yes, we have a mini R-Ladies member there — yes, it's like that! So thank you, thank you so much. We are recording this, and we are going to upload the video to the YouTube channel for the R Consortium for useR! 2020. Please fill out the survey — it's really important for improving this kind of conference, especially because next year it's going to be global, so maybe we can have some of these events here in our region again. So, well — does anyone else want to say something? Please unmute. Well, thank you so much, everyone. See you soon. I'm going to end the meeting now.