Thank you very much, and welcome to this presentation. Today I want to talk about data unit tests: what they are and why you need them. The presentation itself should take around 20-25 minutes, and what I'm really looking forward to is the Q&A afterwards. So with that, let's dive into the agenda.

I will begin with a short intro, then I will define what data unit tests are and explain their importance, especially when building data products. Then I will present some frameworks you can use to perform data unit tests and show some live coding. After that, I will describe a bit how we tackle it at my company, GetYourGuide, and then conclude and open up for questions and discussion.

My name is Theo. I'm a data science manager at GetYourGuide. GetYourGuide is a marketplace of travel experiences. My team in particular is responsible for the ranking of the experiences on the platform, and to do so we rely heavily on data. In other words, we are building data products, and that is the real reason why I want to talk about data unit tests today.

If you think about data products, they are a combination of code with data; that is how you build a data product. In classical software development it is common practice to have some kind of automated tests to ensure the quality of your code; everyone is pretty aligned on that. But the focus there is on testing the code itself. To ensure the quality of your data product you need to validate the code, of course, but also the data, because the data is part of what ends up being used in the end. And one way to validate the data is to have tests on it, data unit tests for example.

If you take a step back: if you work in an organization that already has multiple engineering teams, the code that you own and change day to day is usually owned by your team, and hopefully they have full context when they change it. However, the data that you are using is very likely not produced by your team; it is produced by some other team, and maybe yet another team is transforming it before you use it. Those teams very likely do not have the full context of how you are using the data. As an example, we combine around 10 different data sources to be able to rank all the activities, and our recommendation system ends up using roughly the same ones, so that is quite a lot of different sources of data we depend on.

So, in this sense, what are data unit tests? They are there to verify some kind of expectation that you have on your production data. I want to make a clear distinction from testing your algorithm: is your function logic able to handle null values? That is a unit test; you want to make sure the code will not break. Testing that your dataset in production does not have nulls, or does not have too many nulls, is a data unit test.
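To make that distinction concrete, here is a minimal sketch in Python; the function, column names and thresholds are hypothetical, purely to illustrate the two kinds of test:

```python
import pandas as pd

def click_through_rate(clicks: pd.Series, impressions: pd.Series) -> pd.Series:
    """Toy transformation: per-row CTR that tolerates missing values."""
    return (clicks / impressions).fillna(0.0)

# Classic unit test: does the *code* handle null values without breaking?
def test_ctr_handles_nulls():
    clicks = pd.Series([10, None, 5], dtype="float64")
    impressions = pd.Series([100, 50, None], dtype="float64")
    assert click_through_rate(clicks, impressions).notna().all()

# Data unit test: does the *production data* meet our expectations?
def test_events_data_quality(events: pd.DataFrame):
    # `events` would be loaded from the production table in a real pipeline.
    assert events["impressions"].isna().mean() < 0.01          # at most ~1% missing
    assert (events["clicks"] <= events["impressions"]).all()   # CTR can never exceed 1
```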
Things that you might want to test as part of your data unit tests are, for example, that the mean, max or average of a certain column is okay: if you are computing some kind of conversion rate or click-through rate and it comes out higher than one, there is probably something wrong somewhere. Very similarly, you can verify that you have no missing values, or not too many missing values. You might verify that you do not have duplicates in certain columns. You might also verify that the number of samples you have is reasonable: if you always get around 10,000 rows every day — random example — and suddenly you get 1 million, probably something is really wrong.

Okay, so I think you now have an idea of what data unit tests are, and I want to give a short overview of the frameworks I have found online. The first one is Great Expectations; I also put some stats around it: it is the most active project, it supports most of the formats, and it renders data documentation — human-readable documentation of your data. It is the most feature-complete. The other one I find pretty interesting is Pandera, which is also quite active. As you can guess from the name, it was originally built for validating pandas, but last year they introduced a pyspark.pandas check and very recently also a Spark SQL check, so you can have your Spark DataFrame checked by this library. It focuses more on validating the schema, though, and there is no data documentation or visualization like in Great Expectations. Another interesting one is TensorFlow Data Validation, which is part of the whole TensorFlow Extended ecosystem. It is surprisingly not very active, and it is pretty tightly integrated with the TF ecosystem, so if you are not doing TensorFlow it is probably quite hard to integrate into your stack; it does provide some data documentation and visualization on top, though. Finally there is Soda, which I think is similar to Great Expectations but started more recently — I would call it a contender. Here I think there is a pretty clear winner: Great Expectations, as the most feature-complete library, and that is the one we will dive into now.

For Great Expectations I will give you some key concepts. The first one is expectations, which are assertions about your data — exactly your data unit tests. What is pretty nice is that the most common use cases are already implemented: if you want to make sure that a column has at most a certain percentage of nulls, or no nulls at all, or that the mean or max is in a certain range, all these common things are already there, so you do not have to write them yourself. And of course, if you have some custom logic, you can extend their objects to implement it.
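As an illustration of those built-in checks, here is a small sketch using the older pandas-dataset style (the exact setup API differs between Great Expectations versions, and the column names are hypothetical):

```python
import great_expectations as ge

# Wrap a pandas DataFrame so the built-in expectation methods become available on it.
events = ge.from_pandas(events_df)  # events_df: assumed to be loaded from production

events.expect_column_values_to_not_be_null("booking_id")             # no nulls at all
events.expect_column_values_to_not_be_null("platform", mostly=0.99)  # at most ~1% nulls
events.expect_column_values_to_be_between("conversion_rate", 0, 1)   # rates never above 1
events.expect_column_values_to_be_unique("booking_id")               # no duplicates
events.expect_table_row_count_to_be_between(min_value=5_000, max_value=50_000)
```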
The second concept is data profiling. Basically, you have a dataset that you know is correct because you manually inspected it, and you can pass it through a profiler to build a draft set of expectations. They are not perfect — they sometimes do not give you exactly the right expectations — but they get you started with something, instead of having to manually write down everything you expect from your whole dataset.

Then there is data validation. Once you have all these expectations on your data, when a new dataset arrives you want to validate it: you have a collection of all your expectations, and if at least one of them is not satisfied, you want to be alerted, via email, Slack and so on.

Finally there is the data documentation, which I think is really interesting: auto-rendered documentation of the data and of the expectations you have on it. You can think of it as a constantly updated data quality report for your data.

Okay, with that, code will probably give you a better understanding, so let's dive into some code. What I want to show you is how we can create data unit tests with Great Expectations. You may have guessed from my accent that I am French, so of course I will use a wine example. Is the font too small, or should I increase the size? Yes? Okay — it is never too big. So what we will do is first install Great Expectations, then we have some imports. We have two datasets, a white wine and a red wine. First we load the red wine and write some tests based on it, and then we load the white wine and validate that the white dataset is also correct. First we set up Great Expectations on Databricks; I do not want to show all of that here, for the sake of time — I think we want to focus on creating the expectations and then validating them — but feel free to dive in, there will be a link in the presentation if you are interested.

So we load the dataset — it is reading a CSV, it should not take that long. Okay, great, and here is the table; I do a display, so you can see the different columns, and quality is the one we will look at. With Great Expectations set up in Databricks, we can start building some data unit tests. The first one concerns the quality column we saw at the top: from the dataset's documentation we know it should be an integer in the range from 0 to 10, so let's put that down as an expectation on our data. We take the validator from Great Expectations and specify the expected range — great, looking good. Then we look at the sulphates column. Here what we want to do is perform a Kullback-Leibler divergence test. In other words, in practice we are looking at the histogram of the values: we expect the histogram of new data to be similar to the one we build now, and if it is too different, that means the distribution has changed. That is what we are trying to do.
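Pieced together, the two expectations from this part of the demo look roughly like this (sketched against the older pandas-dataset style rather than the Databricks validator used in the talk; the KL threshold is just an illustrative value):

```python
import numpy as np
import great_expectations as ge

red = ge.from_pandas(red_wine_df)  # red_wine_df: the known-good red wine data

# Expectation 1: quality is a score between 0 and 10.
red.expect_column_values_to_be_between("quality", min_value=0, max_value=10)

# Expectation 2: the sulphates distribution should stay close to this reference.
# Build a partition object (histogram bins + normalised weights) from the red wine data.
counts, edges = np.histogram(red_wine_df["sulphates"].dropna(), bins=10)
partition = {"bins": list(edges), "weights": list(counts / counts.sum())}
red.expect_column_kl_divergence_to_be_less_than(
    "sulphates", partition_object=partition, threshold=0.6
)

# Keep every expectation (even a currently failing one) in the saved suite.
suite = red.get_expectation_suite(discard_failed_expectations=False)
```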
That's what we are trying to do So here same thing So we'll build this Great and it shows some something looking good then now what we can do is save our expectation Now what I what I talk about and now you will get a better understanding is how we can generate and show some data documentation So here we are going to show this data and basically you get some kind of HTML so easily that you can then deploy here I'm just displaying it inside the Databricks but basically you have some HTML that you can then access and so here is like The list of the expectation that we have so we have to and what we expect is the quality to be in this range and the So fats to have this distribution Now, okay, great. We have all here. Of course, you would like to have more than that But here is for the sake of time. Let's go with just these two and Now we load this white wine data set And we have some if we have some wine some people we can guess Some wine lovers Yeah, this kind of property Maybe a bit different and so let's check that basically what we want to do now is to validate this data set to see if It's the same or have the same expected way. We want to validate the expectation and what it's telling us Okay, so let me Close this thing and they say, okay, it failed. Actually, we have one successful expectation, but one unsuccessful What happened is actually the quality is good. We indeed have the things in the values however, the so fats as you can see is we expected this this Distribution that was our expectation however observation was different. We saw this so it's too different given the constraint that we have so that's it we we have then basically you start your exploration by Having some kind of example and you see directly. What's what's wrong with your data set and what what could be? Where you have come some kind of head start when you debug Okay Yeah with that I Will go to the next part So I will explain how we are using using data unit test and get your guide, but First, let me share a bit some some kind some incident way that we had in the past that kind of trigger these initiatives So what happened in the past is for example, what once we have some kind of a change in data structure That led to have some duplication in our recommender system Then we also have a lot of we had some missing feature in our daily Prediction job. So basically the model was predicting at some empty some and for some feature that it was trained on was not present for the predictions and Also, we had some One one way use the wrong column with some mistakes. So basically the scoring was more or less random Of course as a following the good practice of incidents though We took actions to avoid reproducing them and so for example, that's how we introduce this weight expectation We use it to validate our recommendations and also we use it to validate our key event We also have introduced some health check as part of our continuous integration So basically we as part of the data set we verify that we have some property that are verified this kind of data we want to make sure that We have bookings for every platform that we support and think that that's in our data set before we Send it to our trade to train a new model with this and also more recently We are starting to exploring some SAS solution in the space. 
They are Pretty active But still in the valuation phase basically And yeah, so with that I would like to conclude so basically what I hope to Transmit during this talk is the that's the importance of the data data test and especially in data products I think it's pretty commonly accepted that you should test your data Sorry, you should test your code Not so much is discussed about testing the data, but I think it's as important or even more important And yeah, so I think it's an endowment that neither one is and yeah, I just Showed a bit in get your guide. So there is no silver bullet We just say like we'll roll out get your great expectation everywhere and we're all done. No, it's a bit more complex There are different use case and not everything is Yeah, we didn't find the silver bullets that can be rolled out everywhere at this point and Yeah, so we tested multiple approach and We'll see what leaders are pretty interesting to see how we'll see in the what's happening in the following year With that open for Q&A, I don't know how much time we still have but you'll have quite some time questions you can queue to these microphones actually and I'm taking the discord channel as well if you're Watching the live stream you can actually ask your questions from South Pole to be Okay, ladies first. Thank you for your talk first of all If I understand correctly you can use great expectations to identify data drift and the incoming data, right? Do you use it to retrain your models for predictions or like whatever you're using or in automatic manner? Or do you still go like your data scientists check out the data afterwards? So one of the sad solution is actually looking into these data drift in particular But in practice what we are doing and we have been doing for quite some time is Retrained regularly so as part of a pipeline retrain we have model that we retrain every day We have model that we retrain every week a Few that we retrain every month, but that's how we tackle this data drift what what we are more concerned I mean it on this white red wine like then seems like That drastic difference actually one more concern is the data drift that actually is like Big error like that go from things between in the range from 0 to 0.1 to suddenly the next day It's 0.9 to 1 that that's the kind of thing that that happened and it's more than drift basically I think Yeah, great talk. I really enjoyed it So my question is like can you talk more about the workflow that will happen? So as a data engineer, it's like the workflow for co-tests is like I kind of like merge requests and then you do all your tests, right? But for data testing is that supposed to be always on for a pipeline or is that like? Just the kind of workflow that like works best. Yes. Yeah I mean you have all this kind of CI things where you can test your your your code. That's kind of given testing your Testing your data. Actually, that's that that's where I get tricky You want to test your production data and you don't have that in your CI? So what what we do is actually we integrate the the CI by Sending jobs to so as you can guess we are using data bricks and actually we send job to data bricks And we ask them to validate this These things there and we also put that at parts of our pipeline that run every day as we say Like we run things every day or every week. 
Actually, we avoided an incident this way, with one of these health checks. We check some really simple things, for example around net revenue: we like to say that net revenue should be positive on average, which sounds pretty reasonable, right? And we actually avoided an incident in our system because of it: something had gone wrong upstream with conversion rates, we had negative net revenue on average in our dataset, and the health check failed in production. I never thought that particular check would fail, but it failed because of the incident, and we avoided deploying a model that would have been trained on negative net revenue. So my advice would be: try to integrate the checks into CI if you can — there are definitely possibilities — and also put them into production, because if you have the code, why not run it twice?

All right, thank you. I think you just answered my question, to be fair. What I wanted to ask is: when you test your code, you test it during development, and the tests are kind of isolated from the production code; they also assume a certain state of the data, while the data changes by definition. So I was wondering exactly this: would you have these tests in your production code, so that you halt what you are doing if something is wrong, for example in batch jobs? Is that the way you would plan it?

Yes — maybe to rephrase: we want to test both in development and in production. But then you have to rethink the way you test. It is no longer "I will give it these four input rows and I expect exactly this sum of bookings at the end". You take a bigger sample, a hundred or even a thousand rows, and you expect the numbers to be within a certain range. So you have to shift and be more fuzzy in your tests, or use common sense: net revenue should be positive, conversion rate should be below one. What we do, for example, is expect bookings for ranking to come from all the different platforms — we have desktop, we have mobile web, we have the apps — and if we take a reasonably sized sample, we expect to see at least some bookings on each of those. So it is not really a unit test in the classical sense; we call it a health check, and it has been really useful. Does that answer your question?

Yes — can I have a quick follow-up? I assume that with these tests you are still testing your code; you are not testing the data quality. Is that correct?

So, are we also testing the code in addition to testing the data? Yes, we do that too. But we found, especially for these data pipelines, that it was much more useful and easier for data scientists to have this kind of end-to-end health check that we maintain than to have unit tests where you have to set up inputs and outputs and so on. For some parts unit tests make sense, but we get much more — in terms of return on the investment of writing them, and in terms of confidence in what we put into production — from these health checks and end-to-end tests with production data. Thank you.
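A minimal sketch of the kind of health check described in these answers, with hypothetical column and platform names:

```python
import pandas as pd

EXPECTED_PLATFORMS = {"desktop", "mobile-web", "app"}  # hypothetical platform names

def health_check(bookings: pd.DataFrame) -> None:
    """End-to-end health check, run on a sampled slice of the production dataset."""
    # Every platform we rank for should contribute at least some bookings.
    missing = EXPECTED_PLATFORMS - set(bookings["platform"].unique())
    assert not missing, f"no bookings found for platforms: {missing}"

    # Common-sense check: net revenue should be positive on average.
    assert bookings["net_revenue"].mean() > 0, "average net revenue is not positive"

    # Fuzzy range instead of an exact number: a reasonable sample size, not a fixed count.
    assert 10_000 <= len(bookings) <= 1_000_000, f"unexpected row count: {len(bookings)}"
```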
All right, so I think you just answered my question. I was more interested in your health checks: what do they actually check? Do you run them against the production database, against the user data and so on?

Yeah, so basically what we do with these health checks is: in the CI, what we want to make sure is that the code change will not lead to a fatal error, or that we are not forgetting something. So, as I said, bookings per platform; we also look at the types of page that we expect bookings to come from — all these kinds of things we are checking. But for the sake of time — because if we did that on the full production data it would take hours — we do a sampling in advance, so we still have a reasonable amount, but not so much that it takes too long.

Okay. At our company we name things a bit differently: we call it a sanity check, and yes, we run tests against the production database. So if a user cancels their plan, then some other table should also change, and we check that kind of thing.

So what are you doing with this sanity check — you check your data?

We run it every day, and we just check that some business logic was not broken. For example, if from the data we know that a user cancelled their plan, then the user's state in the table should be cancelled. Maybe some task in between was broken — say we run the cancelling task just once per day and it fails for one particular user — and we use this sanity check to get notified the next day that this user should actually be cancelled but is not. The health checks you use are probably something similar, or not?

Yeah — ours are simpler, we do not want to put too much logic in there. Basically, we read the tables that we have and we expect them to be correct. What we care about is joining different events together to build our dataset, and then we validate that dataset with the health check. So it is the end table that we test; we do not test the production tables so much. But I am speaking from a data science perspective, from my team, where we are users of these event tables. Our core data platform team — we are working with them to also check those tables and have some checks there. So yes, pretty similar.

All right, thanks for explaining. And I have a question about time series data: do you have experience with data validation of time series, especially with seasonal cycles, seasonal trends or outlier detection?

Good question. So yes, we have some of that — working in tourism, as you can imagine, summer is a higher season, so it is somewhat of an issue for us. But we mainly work on recommendations and ranking, and while the number of bookings changes a lot, we transform the metrics into percentages: what we care about is things like revenue per visit or conversion rates, and those change a bit but not drastically, not to the point where we have this kind of problem. So we do not really test for these things.
But what we do, with the SaaS vendor, is have some automatic thresholds that trigger when the values change too drastically.

So you are looking to suppress the seasonal trends and perform your regular tests, instead of making really seasonal checks?

Yeah — no, we do not do any kind of check like "if it is winter, expect this; if it is summer, expect that", because it would not make sense from one year to the next. The business is growing quite a lot, and from last summer to this summer overall bookings grew substantially, so it would be a struggle to get the right numbers. We prefer to do it with these kinds of performance-based checks.

Hi. My question is about — and I think you touched a bit upon it — when you run unit tests on your code, you usually want them to be fast, because you want a quick feedback loop. When you are testing data, the data can potentially be very big, and you do not want your unit tests to take hours, because that makes development really slow. You said that in your case you would take a subsample of the data to make it faster. Do you have any suggestions if you cannot do that — if your data has to be the whole thing, for example if you are building a map or something like that? Do you just use a different schedule, something you run every night instead?

Hmm, good question. Luckily for us we can always sample: we have some core entities we can sample on, like a subset of our users, and we just do all the operations and check that everything looks fine on this subset; if we randomly sample those users, that is fine for us. If you have to do it on everything: first, have you tried a bigger machine? Trust me, you can get a lot out of that, and you can get some pretty huge machines; it depends how long it takes for you, but just doubling or tripling the size helps. When we do the cost calculation — is it worth spending more on a bigger machine, or is it better to let our developers wait — it is always the machine that is cheaper, so we always go with as big a machine as possible to speed things up. For the second part, I think it is great to have this as part of your schedule, for example every day. The thing I would struggle with is that, when you develop, you want to check that the code you are changing right now still works with your data, and if you cannot break the data down, I am not so sure. Luckily we do not have this problem at my company, because we can always sample by activities or customers; maybe for you it is some kind of location or country that makes sense. I would challenge it a bit — I am pretty sure most problems can be broken down — but I do not know your use case, so happy to discuss afterwards.

Okay, thank you.
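A sketch of the kind of deterministic sampling described here (hypothetical column name): hashing the user id means the same subset is selected on every run, so it can be computed once and reused for fast iteration.

```python
import hashlib
import pandas as pd

def sample_by_user(df: pd.DataFrame, user_col: str = "user_id", keep_pct: int = 5) -> pd.DataFrame:
    """Keep all rows belonging to ~keep_pct% of users, chosen by hashing the user id.

    Hashing (rather than random sampling) keeps the selection stable across runs,
    so the subset can be materialised once and reused for quick checks.
    """
    def bucket(uid) -> int:
        return int(hashlib.md5(str(uid).encode()).hexdigest(), 16) % 100

    return df[df[user_col].map(bucket) < keep_pct]
```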
Any more questions, from the room or from the stream? Okay.

Thanks, I liked the talk a lot. I was wondering how you would do something like this if you have way more data than you discussed in the examples. I often work with distributed systems on exabytes of data in the cloud, so I would be really interested in whether you do spot checks or sample checks or something like that.

Yes, so we sample the data when we want the quick feedback. But also — if you have exabytes of data, are you using all of it? At some point you probably want to check the whole thing, not as part of your CI, but as part of the big pipeline where you actually use the data: you might still have expectations about the precision of some columns or their null values, so you should check those on your whole dataset there. But for quick feedback during development we have these subsets that we use to speed things up. What we also sometimes do is compute the subset once, by taking a random sample, and then every time, instead of redoing the sampling, we just reuse this small subset that we can quickly iterate on. Does that answer your question?

Yeah, I think so.

Maybe one more thing: we mainly use Spark, and we have terabytes of data, but not more than that, and with Spark we basically just get bigger and bigger clusters. That is a pretty easy answer.

Thanks. Do you think this can also be useful for non-data products? For example, we have unit tests, and these unit tests use test data files that they load and process to check that things still work. Whenever we change the code, we also want to make sure that the test data is updated and that a developer does not forget this, and also that the test data contains all the edge cases that we want to cover in the future.

Yeah, good point. These kinds of test files — test data samples that you keep with your code, maybe five or ten of them — that you then use in your unit tests: I actually do not like them, because what ends up happening — I have seen it in the past and it was really annoying — is that you have your test set, you make it work for your use case, and you think it is working because it works on this subset. But then you go to production and, oh, this column has changed, it is not an int anymore, it is a float, and everything fails. It was working on your machine, but you did not test with your production data. So I think having this real, current data makes more sense. And maybe when the production system goes from int to float, it actually makes sense for your tests to fail, because suddenly your system is not working and you need to fix it. So I do not think static test files make much sense there, and that is what I do not like about them. The other thing I struggle a lot with is: how do you get the test file, and how do you make sure all your edge cases are present in it? We do have some of these, but I prefer the automated approach of going with production data. Does that answer your question?

Yeah, thanks.

Actually, I have a question back: I am happy to hear how you are doing it — how are you handling the same thing? Okay, so someone else is taking care of the CI. Yeah. Thanks.

I would like to thank the participants for queuing and asking questions. You can still catch up with the speaker in the Discord channel — the Python Discord channel, I guess — and ask your questions later on there. And I would also like to thank the speaker for this great, informative talk. Thank you.

Thank you.