Hello everyone, and welcome to our talk, "Clojure's Emerging Data Ecosystem: An Incomplete Tour at the REPL." My name is Ethan, and I'm here with Sami. Hi, Sami.

Hey, Ethan.

So, I'm a full-stack developer. I currently work at Primary, a company that makes bright and simple clothing for children. In the last few years I've also been involved in helping to organize the SciCloj community, which is devoted to improving the experience of working with data in Clojure, educating people about these tools, and being as inclusive as possible about getting people involved. That's a little bit about me. Sami, do you want to introduce yourself?

Sure. I'm currently also mostly a developer. I work with a company in Sweden called KP System; we develop map-based tools for municipalities, all in Clojure. I'm also the founder of 8-bit-sheep, a consultancy that does all kinds of things around data. And I've been busy with SciCloj for a couple of years now, helping to organize things. Ethan, do you want to talk a little about what we actually want to do today — why we're here?

Yeah. Our goal today is essentially to showcase some of the libraries in the emerging data ecosystem for Clojure: to show what it's like to work with them, how they fit together, and to give you a sense of what working with data in Clojure feels like. We've called this an incomplete tour at the REPL because we are by no means showcasing all of the libraries that are available. There are many wonderful libraries we won't have a chance to touch, but we've tried to select some that are commonly used and central to doing data science.

Yeah, and we're going to do it by working with some real data — we'll say more about what that entails in a second. Sitting at the REPL, showing some libraries, working with real data. Should we get to it?

Let's jump in. We already have a bit of a start here in the buffer, so I'll just begin evaluating. Ethan, can you explain a little of what we're seeing now?

Sure. What we've got here is some setup to get started. Right off the bat we're requiring a few libraries. One of them is called Notespace. Notespace lets us work in the buffer while also — as you can see on the left of the screen — giving us visual tools to look at the data more conveniently. We'll be working with this tool throughout the session, so you'll become quite familiar with it. Sami, maybe you can demonstrate a simple form of how it works?

Yeah. We have this range here, and immediately after we evaluate it, it shows in the last output. And when we save the file, everything in this buffer — we haven't gone through all of it yet — can be seen rendered here. That's the notebook aspect of Notespace. We've also preloaded the data, so I'll load it here. Ethan, do you want to say a few words about what we're actually loading? What is this data?

Yeah. This is data that Sami pulled from the Clojurians Zulip API: message data from conversations the SciCloj community has had on the Zulip platform.
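As a rough sketch — not the exact buffer from the talk; the namespace, aliases, and file path here are assumptions for illustration — the setup might look like:

```clojure
;; a minimal sketch, assuming these aliases; the data file path is hypothetical
(ns zulip-tour.analysis
  (:require [tablecloth.api :as tc]
            [notespace.api :as notespace]
            [clojure.edn :as edn]))

;; the raw Zulip export: a sequence of maps, one per message
(def raw-messages
  (edn/read-string (slurp "data/clojurians-zulip-messages.edn")))

;; wrap it in a tablecloth dataset (backed by tech.ml.dataset)
(def messages (tc/dataset raw-messages))
```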
In its raw form it's a sequence of maps, so we can take a look at one of those and see what's inside. You can start to see how the Notespace context is useful, because we get a nice presentation on the left. What you see there is a map with a bunch of different keywords — this is the data.

We might want to pause here to connect the data to the actual UI. Many people are familiar with Slack, and Zulip is similar, but it has at least one important difference: within a given channel — or stream, as it's called in Zulip — conversations are broken down into a middle tier of topics, within which the messages occur. So within a stream you have individual topics, and the conversations happen inside those. Looking back at the data, we can connect this up: this map represents a single message. It's within the stream "data science", and the topic — called here a subject — is "hello". That's a rough description of the data we're working with.

Sorry if I missed it — maybe you said it already — but the message lives under content, right? The message content is there?

Yeah, I did miss that: the message body is under the content key.

So, I think what we want to do is start to explore our data a bit more. Before we can do that, one thing we may want to do is think about extracting some ideas out of the basic data we have — a kind of feature generation. How can we take information like the timestamp, or the content, and define other characteristics of each message? One straightforward place to start: we have a timestamp, but maybe we can identify other time-based characteristics of each message. What hour did it occur? What year, what day, and so on? We can use tablecloth to add additional columns to the dataset we already have, and I believe we can use an emerging library I've worked on a little, called tablecloth.time, which has convenience functions for modifying dates and extracting components of a time. Then we can think about other features. Can we extract information from the content — for example, sentiment? Can we think about how the messages relate to each other? Is the next response a rapid response? Is each message part of an active conversation, or is the conversation limping along?

That's true, and it fits well into a larger scenario. In data science you often have a few steps: you start with wrangling — which includes the feature generation you were describing — then you explore, trying to learn about the data, and sometimes you continue with prediction. We hope to show at least something of each. Although before we do that, is it okay if we look at some very basic facts about this dataset? tablecloth's shape is a nice little utility that shows the size of it: 18,000 messages — 18,000 rows — and five columns.
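Concretely, a single raw message looks roughly like this — the exact key names are assumptions based on the description above:

```clojure
;; one raw message (illustrative values; key names are assumptions)
{:stream    "data-science"        ; the Zulip stream
 :subject   "hello"               ; the topic, which Zulip calls a subject
 :content   "<p>Hi everyone!</p>" ; the message body, as HTML
 :sender-id 10042                 ; hypothetical sender identifier
 :timestamp 1585000000}           ; epoch time (in seconds, as we discover later)

;; dataset dimensions, as [row-count column-count]
(tc/shape messages) ;=> [18000 5], roughly
```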
Let's look at one of these — let's just look at the first message.

Yeah. If we're just taking the first one, we can use tablecloth's rows function, specifying that we want map output instead of dataset rows, for convenience. Like this — now we can see the first message.

Mm-hmm. We can parse the HTML in our heads. Let's also quickly check: we know we have 18,000 messages. Let's check the number of people. How many Clojurians do we have here?

Well, you can pull one of the columns just by using the key the column is named by, put it into a set to get the unique users, and then count.

Yeah, let's save that, so we have it. This is just very basic stuff, of course, but now we know we have 18,000 messages and almost 250 users. We don't know that much more currently, right? To find out more — to do more exploration — we need to start creating more columns. Should we start doing that?
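Those quick looks might be sketched like this — the :sender-id column name is again an assumption:

```clojure
;; the first message, as a plain map rather than a dataset row
(-> messages (tc/rows :as-maps) first)

;; unique users: pull the column by its key, make a set, count it
(def n-users
  (-> (messages :sender-id) set count))

n-users ;=> almost 250, per the talk
```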
Yeah, so now we can think about adding additional columns to our dataset in order to build out additional features and characteristics of each message. One place to start is to extract some features from the time. Let's add a column to the dataset using tablecloth — as you're writing here, we'll use a function called add-column. We have a timestamp; let's add a column, just for convenience, that expresses it as a local date-time.

This function takes the dataset as its first argument, and one of the nice things is that we can use the threading arrow to transform the dataset in a series of steps. So we're using the thread; the second argument is the name of the column, whatever we want to call it; and the last argument is a function that takes the dataset and generates the data for the new column. What we need to do here is convert our timestamps to local date-times. We can start with another thread inside this function: pipe the dataset through, select the timestamp column, and map over its values. To convert each timestamp we can use a function from tablecloth.time called milliseconds->anytime. Do you want to write that out, Sami?

Yeah. This function takes the time — here we use the percent sign, the anonymous-function argument — and then we specify what we're converting to, using a keyword: :local-date-time. That should do it. And now we have a dataset here... if we're lucky we might see — hmm, we were not so lucky. I mean, the seventies was a good decade, but it's not when these messages were written, I'm afraid. Something's wrong with our time conversion.

I think what happened is that our original timestamps are actually in seconds, not milliseconds. So we need to scale the timestamps up before we call milliseconds->anytime. Now let's have a look. Okay, that looks better, right?

There's something more we could do here, though. One of the secret sauces of tablecloth and tech.ml.dataset is that it's super efficient and fast, right?

That's right. Underneath the hood — underneath even tech.ml.dataset — we have a library called dtype-next, which lets us operate very efficiently on typed data. Because dtype-next knows the type of things, it can operate very efficiently, and it gives us functions for doing things like mapping well. Here we've used the regular Clojure map function, which works, but we end up with a column whose type is unknown, and that can mean a performance hit. Instead we'll use a function called emap, which lets us do mappings while staying within this world of typed entities. We specify that we're working with an int64, and that in the next step we're converting an integer to a local date-time — a typed entity — so we declare that that's what we'll get as a result. We won't see any change in the data, but this is the correct, more efficient way to map over a column and do column transformations.

You know, I have to say, I've worked a lot with dplyr in R, and there are some similarities. R is a very ergonomic language, especially the tidyverse side of it; it also has a pipe, and it's intuitive and easy to reason about. But I have to say, this is actually nicer and cleaner, and it's super nice to write. What do you think?

Yeah, I agree. The threading arrow is fundamental to the Clojure language, and here we can just take advantage of it to do very readable transformations of data — layering on more and more related information in a way that stays extremely readable. I really love it.

Okay. Next — because this takes time, and even though we tried to get a longer slot, the organizers wouldn't accept it, so we have to stay within forty minutes — a bit like a cooking show: you don't have to watch the oven for an hour and a half to see how it turns out. We'll spare you that part, but we're going to generate some more features... and voilà, we're ready. Fast coders — if you need to hire fast coders, you can find two here. Do you want to describe, at a high level, what all this is?

Yes. A second ago we created one single datetime column, as you can see at the top, and then we started to derive the other features, partly from that, along with a whole bunch of others. Each of these calls is simply an add-column call — the same threading, piping the dataset through and adding column after column. So it's a lot, but it's actually really easy to understand.
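For reference, the corrected datetime column from earlier — seconds scaled up to milliseconds, mapped with emap so the result stays a typed column — might look like this sketch; the tablecloth.time alias is an assumption:

```clojure
(require '[tech.v3.datatype :as dtype]
         '[tablecloth.time.api :as time-api]) ; alias is an assumption

(def messages-with-datetime
  (-> messages
      (tc/add-column
       :datetime
       (fn [ds]
         ;; emap declares the result datatype, keeping the column typed
         (dtype/emap
          ;; raw timestamps are in seconds, so scale to milliseconds first
          #(time-api/milliseconds->anytime (* 1000 %) :local-date-time)
          :local-date-time
          (:timestamp ds))))))
```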
There aren't a lot of ins and outs. We've packed this new dataset, with all these columns, into a new variable called messages-with-features, and we can look at what the column names are.

Generally speaking, we added three types of features. One we already alluded to: we created a datetime, and then pulled out additional time components that might be useful to explore the data by. Another broad category captures the flow of the conversations. One of the things we're interested in is how active a conversation is, so we've added a bunch of columns for that. One of them, for example, is the next response time: if I wrote a message, how quickly did the next person — if somebody did write after me — respond? It gives us a measure of how quickly people respond to a given message, and from there some other columns define a notion of activeness. And finally, Sami generated some really lovely measures of sentiment based on the message content: trust, surprise, joy, positivity, negativity, and so on. So we have a bunch of new features, all derived from our initial data, that may be very interesting to explore, and that's what we'll do as we move forward.
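Listing the columns shows the three families of features at a glance — the names below are illustrative, not the exact set from the talk:

```clojure
(tc/column-names messages-with-features)
;; => (:datetime :year :month :day-of-week :hour       ; time components
;;     :next-response-time :active                     ; conversation flow
;;     :joy :trust :surprise :positive :negative)      ; sentiment scores
;; illustrative output; the actual dataset had more columns than shown here
```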
What do we do next? We've been wrangling and feature engineering, and next we want to start to explore. But I might have to go and sleep in between... I'll do it quickly, I promise. Is that okay?

That's fine. I'll see you in the morning.

Okay, so I'm back — I've slept. This is the fun of working in a community like this: it's a nice Sunday morning here, I'm refreshed, and I've almost finished the whole pot of coffee. What about you? You're close to Seattle, right?

Yeah, and it's midnight, so I'll need to go to bed. But let's do a little bit now, and then we continue in your morning again — my evening. It's kind of like a relay race. Now we want to do some exploration.

Yeah. We'll continue to use tablecloth, and tablecloth.time — which extends the idea of tablecloth with additional utilities for thinking about time — and then this new library, viz.clj, by Ashima Panjwani, which gives us a very simple way of generating plots. I'll require them here so we can use them.

So the stack is: underneath, we're actually utilizing Vega and Vega-Lite; then there's this beautiful library called Hanami, a templating engine for Vega-Lite in Clojure, which is super powerful; and this last library you've been discussing is built on top of that. It just makes quick plotting easy — that's kind of its value proposition.

Yeah, and you can see a similar pattern here to tablecloth. Underneath tablecloth we have tech.ml.dataset, and underneath tech.ml.dataset we have dtype-next. Things become lower and lower level as you go down, but at the top there's the simplest possible API. This is the equivalent — a similar stack, but for visualization.

Yeah. And I've just pasted in a very simple dataset of random numbers, generated here, just to show how it works. First we need to — if I can type; maybe we should increase the font — the first thing we need to do is format the data so viz.clj can deal with it. That's kind of the minimum to get something visible, if we're lucky. The next thing is to decide the plot type. You can see we also required something from Hanami, because there are some nice ready-made templates; let's just take a basic point chart, probably the most basic thing we can do. When we call the viz function, it generates the plot. And — if we're lucky — there it is.
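That little demo, assuming viz.clj's fluent API roughly as its documentation shows it, might look like:

```clojure
(require '[scicloj.viz.api :as viz]          ; viz.clj
         '[aerial.hanami.templates :as ht])  ; Hanami's ready-made templates

;; a throwaway dataset of random points
(def random-points
  (repeatedly 50 (fn [] {:x (rand) :y (rand)})))

(-> random-points
    viz/data                  ; put the data in the shape viz.clj expects
    (viz/type ht/point-chart) ; pick a ready-made Hanami template
    viz/viz)                  ; realize and render the Vega-Lite spec
```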
Yeah! Let's start looking at the real data — we did all this preparation work, after all. Although, a little intervention here: what time is it there now? Almost one a.m.?

You need to go to bed. But luckily you don't sleep very long in Seattle either, so I'll see you in just a minute, right? Okay, so we come back to it. Have you slept? Did you sleep?

I did, thank you — very well.

Good, good. So you're refreshed; I'm not so much. We did quite some work there — this morning for me, last night for you — but we were not focused, so we decided to rewind a little and do it again. What do we have in mind? What do we want to do next?

Yeah, maybe we could look at the message volume over time. One of the concepts in tablecloth.time is that if you have a dataset, you can set one of the columns as the index. So we'll set the datetime column as the index, and then we can modify the frequency of the series and look at the results. We can use a function called adjust-frequency that lets us modify the time series by different groupings — let's start with, say, hours. What we're doing here is saying: okay, we have a time series; let's summarize it on an hourly interval, and then do some aggregation. Essentially we're just counting the number of messages at this frequency, using tablecloth's row-count as our aggregation function. So you can see these libraries are designed to work with each other. Then there's one final step before we use viz.clj to visualize: we need to adjust the column — a simple operation, mapping over the values to convert them — and this needs to be a map. Sorry about that. Yes, there we go. Now we have our aggregations.

And then we can plot this. This time we'll use a line chart, so we'll specify the Hanami line-chart template, and then we can add some color — we'll call viz's color function and specify the year. Oh, and we also need to specify what our x and y are: in this case the x-axis will be the datetime column.

Great, and with any luck, if we add the viz call on the end there... no? Oh — I think we need to specify that the datetime is a temporal value. Okay, now — wow.

This is kind of interesting, because we don't see a lot of shape; we just get a sense of the volume. One thing that might help would be to look at things at a coarser resolution of time, so we could change our frequency to, say, days.

So we're starting to see a little bit of a line there. Let's move out further — weeks?

Yeah, we can do weeks-end. And then maybe even one more. It feels like things are going from being out of focus to in focus.

Okay, and now — yeah, that's cool. There's one big peak per year, and in 2020 the peak was early in the year; usually it's around this time. It's normally a lot of work to format the data, get the plots, and change the resolution and everything — this is super, super handy.
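Sketched out — treating the tablecloth.time names (index-by, adjust-frequency, the ->weeks-end converter) and the viz.clj option syntax as assumptions from memory — the volume-over-time pipeline looks something like:

```clojure
;; count messages per week: index on time, coarsen the frequency, aggregate
(def messages-per-week
  (-> messages-with-features
      (time-api/index-by :datetime)
      (time-api/adjust-frequency time-api/->weeks-end)
      (tc/aggregate {:n-messages tc/row-count})))

(-> messages-per-week
    (tc/rows :as-maps)                  ; viz.clj wants maps here
    viz/data
    (viz/type ht/line-chart)
    (viz/x :datetime {:type :temporal}) ; mark the x-axis as temporal
    (viz/y :n-messages)
    (viz/color :year)                   ; assumes a :year column survives
    viz/viz)
```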
What should we do next? A different plot, where we look at those yearly lines superimposed on each other — and we've pre-prepared that plot here.

This is interesting: they're pretty similar. And it highlights that the peak in March–April of 2020 is a little bit unusual, because both 2019 and 2021 are on the decline after March, into the summer months.

That year, of course, we had our first SciCloj conference in Berlin — lots of energy — but it was also the start of COVID, and as I recall there was some interest in talking about the data associated with COVID.

That makes sense. It also looks like there's always a kind of slump in the summer months, which makes sense too. Okay, that's good. What next?

We were saying that hopefully at the end we'll have time to do some modeling. To do that, it's good practice when you do analysis to split your data into groups.

Yeah. In preparation for some sort of machine learning, you need to split your data into a training set, which you use to train your model, and a test set, where you check whether your model predicts the values well. So we split the data here.

And how does it look? Next response time is the time until the next person responded, so it's a kind of measure of activity. Here we're drawing a histogram to show the distribution of those values, and we can see it's quite skewed. We can probably get a better representation of the actual distribution by normalizing these next response times — taking a log — and that's what we'll do here, right?

Yeah, let's do that. But before we do, let's give a shout-out to this: it's Chris Nuernberger's library tech.viz that we're using now — it also uses Vega, in this case to build the histogram. And now we're using the Hiccup thing that was mentioned before, placing the plots next to each other. It makes more sense now: it doesn't exactly look like a Gaussian, but it's much closer.

Next we could do some more exploration of our other features. Exactly — so let me zoom in a little. What's going on here? In the x column of this first plot we have the feature we're interested in, and we're plotting it against the median next response time — an aggregation showing, for the year 2019, the typical response to each message. And we can see there's quite a big difference in the median response time, which is to say an increase in active, dynamic conversation.

What can we say about joy? We see a little bit of correlation between a more joyful sentiment and a quicker response.

And this one's interesting: joy is good, but positivity — don't be too positive. It's kind of boring; you won't get a response. And you could imagine that if somebody's very angry, you want to stay away — that's intuitive. Here we have the different streams, and there are differences between the different discussion themes.

Okay, to summarize: we tried to look for patterns, and we do get some insights from this. It looks like there are some relationships in the data, especially around this notion of activity — so maybe we have a chance of modeling something successfully.

Yes. We did some more exploration and found a plot that looks really promising in terms of how we might start modeling. What we found is this: if we look at the next response time in relation to the response time of the message that initiates it — in other words, if I write a message quickly in response to something else, how likely am I to get a quick response? — the cloud is tilted to the right, so we have what looks like maybe a linear relationship. It's the kind of thing we could model with some sort of linear regression. One curious thing to note on this graph is the odd line across the top, where it looks like there's a kind of hat on the data — and that's exactly what there is. We put a cap on the longer response times, because there are some outliers that are relatively infrequent compared to the mass of the data. Like the log transform, it's a way to remove some data points from consideration in order to get a better result.

The response time is an interesting variable, so let's look at the correlation between those two as well. We see a relationship in the graph, but we can also measure it statistically, using a library called fastmath, which is by the same author as tablecloth. Here we're using a function called correlation, which gives us what's also known as the r value — a value ranging between negative one and one, where negative one is a perfect negative correlation and positive one a perfect positive correlation. If we run this measure on these two variables, we get some kind of relationship. It's not a perfect one — and we wouldn't actually want that — but it's about 0.5, so there's some correlation there.
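Measuring that relationship with fastmath might look like this — the column names are assumptions, and train-ds stands for the training split made above:

```clojure
(require '[fastmath.core :as m]
         '[fastmath.stats :as stats])

;; log-transform the skewed response times (hypothetical column names)
(def log-response      (map m/log (train-ds :response-time)))
(def log-next-response (map m/log (train-ds :next-response-time)))

;; Pearson's r, ranging from -1 (perfect negative) to 1 (perfect positive)
(stats/correlation log-response log-next-response) ;=> ~0.5 in the talk
```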
So it's there's some some correlation there and this means that we may be able to See a result if we try to model Generate a model that can predict the next response time as a as a from the response time of the the message the current message So let's go with the Linear regression because response time is a continuous variable. So then we will we'll do we go with Linear regression for that. We're going to use a tool here called the cyclosh ml Maybe you Ethan want to talk about that and I'll start typing one of these pipelines that you're going to explain How it works, but there's a bit of typing there. So I start I go ahead already and start some But yeah, what is cyclosh ml? Cyclosh ml is a library by karsten bering and it Defines some useful concepts for running models And defining kind of pipelines and the pipeline here So sammy's defined a pipe called regression pipe and then he's calling a function from cyclosh ml called pipeline And this pipeline function is roughly analogous to the arrow macro in that within the pipeline We can describe a series of transformations And the first transformation that sammy's adding is we're not using the tc alias Which was what we used when we were using tablecloth, but we're using mm, which is one of cyclosh ml's name spaces and These functions have just been modified to essentially do the same thing But instead of receiving a data set as their first argument They receive what we call a context and we'll explain what that is in a second But for the moment, we can just think about this, you know, essentially the same kind of call that we saw before We and so here we're adding two columns then we're going to Select these columns Because that's the only data that we want to feed to our model, which is the end of this transformation Okay, so now we've selected the columns that we're going to feed to our model but we need to prepare the The data set so that the model can know what we think we're trying to predict on So we set the inference target to the log next response time, which is what we're trying to predict Now here we're adding a kind of special value and it looks like a map and all we're doing here essentially is saying This keyword model is what we're going to use to look up our results later And then in our final step, we declare what it is that we're going to model with and we're going to model with the ordinary least square model, which is a linear regression and You should note that also these models that we're using come from java So one of the things cycloge ML does in addition to giving us the notion of a pipeline is it allows us to use Models that have been defined in this library smile, which offers a whole lot of different Modeling tools now here what sami has just pasted in is a really important step Okay, so now now that we are missing out. We need the percentage. I need to be awake. I'm I'm focusing more on typing than thinking but it doesn't help my typing apparently. Yeah, so that's what we did here is we actually We ran this transformation But let's look a little bit quickly at what it is. 
In our first step we defined our pipe. In the second line we defined a variable, regression-trained-context-1: we used our pipeline as a function and provided it with what's called a context. This context is just a map, but it expects some specific keys. You can see :metamorph/data — that's how we give it our dataset. And then we also specify the mode. Because at this point we're training our model — fitting the model to the data — when we call regression-pipe this first time, we call it in the mode :fit. That's how you use a pipeline.

We've evaluated it, and now we can use that :model keyword we specified in the pipeline to pull the model out of the trained context, and use the explain function to look at the coefficients of the resulting linear regression. We can see on the left, in Notespace, that our model has a bias of 1.72 and a coefficient for the log response time of 0.47. So it looks like a linear regression.

Now what we want to do is examine how effective our model is at predicting. We'll run our pipeline one more time, but with a different context. If we look at what we've done there: we've called the same pipe, but this time we've provided it with a modified context. We take the trained context from before, overwrite :metamorph/data with our test dataset, and change the mode to :transform, which is the predicting mode. When we run this, we end up with some predictions, and we can actually look at what those are.

Yeah — sorry about that — okay, here we can see the result of our transform run: a series of log response times and the predicted values of the next response time. And because we have the actual values alongside the predicted ones, we can score this. So now we have a big chunk of pre-written code that will generate various measures for scoring the predictions... like that. We got there.

Okay — Sami, do you want to explain these different measures?

The long and short of it is that this doesn't look very good. A good model, if you look at the R-squared numbers, should be close to one — it's roughly the proportion of the variance that's explained. Here the R-squared for the logs is 0.28, which is not very good, and the R-squared for the actual values — when we try to predict the response time in seconds, outside the log — is negative: -0.042. So it's pretty bad.

It looks like a fail. But no, it's not a fail — we learned something. We were seeing some possible correlations that we could use for prediction, and it didn't turn out; we'd have to work more on it, and maybe we will. This was the process, though, and you can use this same process for more or less any model. It's quite standardized.
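Mechanically, that fit-then-transform flow is roughly the following sketch — train-ds and test-ds stand for the split made earlier, and the exact explain call shape is an assumption:

```clojure
;; fit: train the model on the training data
(def trained-ctx
  (regression-pipe {:metamorph/data train-ds
                    :metamorph/mode :fit}))

;; the fitted model lives under the :model id we declared in the pipeline
(ml/explain (:model trained-ctx)) ; bias ~1.72, coefficient ~0.47

;; transform: reuse the trained context, swap in test data, predict
(def predicted-ctx
  (regression-pipe (assoc trained-ctx
                          :metamorph/data test-ds
                          :metamorph/mode :transform)))

;; predictions end up back in the dataset of the returned context
(:metamorph/data predicted-ctx)
```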
So let's do the logistic one. So far we've talked about time lengths in seconds, but we've already categorized the results as active or non-active. That's based on a heuristic, of course — we decided that under a certain number of seconds a response counts as active, and over it, not. But let's try it; maybe we can get better results there, because my intuition is that the problem here is that the very long response times really break the linearity in the distribution that we need. Is it okay if we jump to the classification part? I'm not going to bore people by typing everything — I'll just copy-paste these models, if that's okay.

All right, so we're going to use more pipelines. We'll define a different pipeline — this time a classification pipeline — and the model we're going to use, as you can see at the bottom, is a logistic regression. The other interesting thing here is that instead of using the next response time as our inference target, we're using this column called active that we defined, which labels each message by whether or not it's part of an active conversation. We've now trained, and then run our test of, the classification pipeline.

Okay, so now we have a new function that lets us score the classification pipeline, and here we're mostly interested in the accuracy: how often did the model correctly predict whether or not a conversation is active?

This time — and also previously — these are just convenience functions, so we can get all the numbers for the measures into one nice map like this.

Yeah, okay. Here we have a confusion matrix showing the relationships, and then our accuracy measure. How would you evaluate that, Sami?

Let's start by saying this looks much better. And the confusion matrix is interesting, so let's look at it a little. The false-false and true-true cells are the correct ones — the accurate predictions — and then we have the false positives and false negatives, and there's a fair amount of those. So it's still not — I mean, if the accuracy were in the nineties I would be blown away. But especially considering we didn't get much out of the linear regression, I'm pretty happy with this, given that we didn't spend a lot of time on it.

What we could still do, if that's okay, Ethan: right now we're using just one variable — one column, one feature — to feed into the model, so we're not using all the knowledge we have about the messages to inform it. Let's try that now: instead of just the log response time, we add a lot of these different features, in the hope that it improves into something more interesting. There's not so much new here; it's more or less the same. We are using a helpful function for one-hot encoding — dummy variables, where you turn categorical variables into ones and zeros — but we don't need to go into that too deeply in this presentation. Otherwise it's pretty much the same, right, Ethan?

Yeah, the same basic structure — we've just added some additional columns, plus the transformations needed for the model to understand them.
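For reference, the basic classification pipeline and its fit/transform calls might be sketched like this — column names assumed; the richer version adds more feature columns and the one-hot step:

```clojure
(def classification-pipe
  (ml/pipeline
   (mm/select-columns [:log-response-time :active])
   (mm/set-inference-target :active)
   {:metamorph/id :model}
   ;; a logistic regression, also from Smile
   (mm/model {:model-type :smile.classification/logistic-regression})))

;; usage mirrors the regression case: :fit on train, :transform on test
(def cls-trained-ctx
  (classification-pipe {:metamorph/data train-ds :metamorph/mode :fit}))

(def cls-predicted-ctx
  (classification-pipe (assoc cls-trained-ctx
                              :metamorph/data test-ds
                              :metamorph/mode :transform)))
```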
Now we do the training part, and a bit more of the test part. And now we want to see some results.

Yeah, we can call the same helper function we have to score this model.

Okay, so now — wait, is this okay? Doesn't it look exactly the same? If they came out the same, I didn't rerun that one... I'll take them here. We have the first context here, so that's 73. No, they're not exactly the same — we're just not seeing any improvement.

No, we're not seeing any improvement; they're very close. So all our hard feature-engineering work didn't help that much. The result here, actually, is that so far we have one variable that predicts activity in the discussions, and it's how fast you reply. That's the domain-level conclusion. I'm sure we'd find much more if we spent more time on it and dug deeper — and maybe we will; this is pretty interesting.

And this tool is quite amazing, actually, if you look at how elegantly the pipeline is built. I like this a lot; it's super nice. There are also advantages — maybe not for this talk — around generating a lot of models programmatically, comparing them, and collecting statistics about them, which is worth a lot when you start doing real heavy-duty modeling.

Yeah. So, this was it — a super nice journey. We should mention that this material was prepared in cooperation with Daniel Slutsky, who also gave a workshop. So if you're interested in going into a bit more detail on most of the code we used today, it's in that workshop.

And here's a list of the libraries we used. Please research them, have a look — they are fantastic, and the people behind them are fantastic. We're very thankful for all their effort; they made the journey we presented today possible.

I should add that what's worth stressing here is the simplicity of this approach, and the functional approach, which matters a lot. There were some details we skipped during the baking phase — so to speak, while the code was in the oven — that require looking into the data at different levels of granularity, grouping and ungrouping things, which you do all the time in a data scientist's workflow. Clojure and tablecloth make that very easy, which is not obvious at all on other platforms.

Let me also add that this is a constantly evolving ecosystem. People are working on these libraries, and we're always looking for more people to help. Help can come in the form of writing a library, but even more so in organizing things, working on the website, and doing all the other kinds of work that's necessary for this kind of project. So if you're curious about data science, if you're curious about these tools, please join us. Zulip, as we have seen, is the main place where we meet. And here is a link to a repository with all this code, so if you'd like to dig in more deeply, look at the code, play around with it, and possibly even build a model that predicts better than ours — that would be great. Though that's not possible; we worked hard. Okay, thank you everyone.
This was a blast. Thank you!