Okay, so thank you once again for joining. Let me share my screen and say a little bit about what we're going to do in this course. It is obviously about data validation, so from front to back we're going to look at all kinds of ways in which we can check our data.

You can find all materials on the GitHub page. You can either clone the repository so you have everything locally, or use the download button to get a zip file that you unzip. The directory you get after cloning or downloading is the folder you should use as an RStudio project, if you work with RStudio; from there you can access all the scripts, the PDFs, everything you need, and some example data as well.

I also invite those of you attending live (not the people watching afterwards) to join the tutorial's validate channel on Slack, because quite soon we're going to do some exercises and I'll put everybody in breakout rooms, so you have a chance to discuss with your breakout partners what you see and what's happening. If you have questions for us during the breakout sessions, post them on Slack and I can come to your breakout room and help you, and so can Edwin. So if you haven't done so already, please take a look at Slack and join that channel.

At the top you see the topics. In the first hour, of which we have about 45 minutes left, we'll do an introduction to validate, just to get you up to speed with the workflow of the validate package. I will take the lead in that first part, which is about the workflow and about thinking a little bit systematically about how to validate data. In the second part Edwin will go deeper into expressing all kinds of data checks, the different types of checks you can find in validate. In the third part we will see how to automate data quality checking and how to follow a data set while it is being processed, using the lumberjack package. In the fourth part we take a step back again and think about the data quality checks themselves: when are those rules valid, and why? Every rule implies an assumption about your data, so data quality checks have a kind of life cycle of their own, and we have some tools and ways of thinking to work with that.

Edwin and I are both very much proponents of active learning, so during this course it is really our intention to put you to work: you're not just going to sit there listening to us, you'll also get some hands-on experience. For example, in the first topic we're not going to start with a presentation giving you the theory; we first let you explore the package, together with some people in breakout rooms, using a script that contains instructions. Only after that do you get some background, so we give you in-depth information about what you just did, and maybe you have questions.
If there's time after that, we have an extra, more theoretical assignment, and that's how we're going to treat most of the topics: you get started and practice a little so you see how the package works; after that we give you some background on the ideas behind the package, which helps you reason better about what you just did; and after that, depending on time and discussion, we can do an extra assignment. So one of the things I'm going to ask you when I put you in breakout rooms later is to go through a script together with the others in your room, and to write down any questions you might have, things that are unclear, things that come up. When everybody comes back from the breakout rooms, please put those questions in Slack, or in the chat if you don't have access to Slack, and then Edwin and I can address them before, or during, the more in-depth presentation, so we can hopefully cater to you as well as possible.

We have a lot of references; I'm sorry for the small print here. There are basically three papers and a book, and on the GitHub page you can find links to all of them. The three papers are available for free, either on arXiv or in an open-access journal; the book is something you would have to buy, but basically anything about validate can be found in the papers as well. On the left are the references that focus on how to work with the software, especially the Data Validation Cookbook that comes with the validate package; if you're more interested in the theoretical side, the paper and the book on the right are more suitable.

Okay, so without further ado, I'd like to invite you to download the materials from the GitHub page, if you haven't done so already, and open the file called intro_validate. And then let's have a look at a bit of background on data validation: what are the main thoughts behind how the package is built up, and why.
So one of the main lines of thought, when we are producing any statistical output, or really any output that you create from data, is to think in terms of a value chain. Whatever arrives on your desk is the raw data that you have to work with. It may come from outside your company or institute, somebody mails you an Excel file; it may be something you scrape from the web; it may be something a colleague from the same company hands to you, or you get access to a database. Whatever it is, for you this is raw data.

The first thing you probably have to do is clean up that data so that you can read it into R: put it in the right format so you can read it with, for example, read.csv, make sure every row represents exactly one entity (one person, one product, one whatever it is you're working with) and every column represents one variable. That is what we call input data, and this is where you do the technical cleanup: make sure all text is stored as text and all numbers are stored as numbers, so you can do calculations with them. The rule of thumb we always give is that input data is data you can read in a single statement, it is in the correct structure, and the values are of the correct type. Two further requirements: you can identify every row, so you know which person or which object each row is about, and you know what variable is in every column.

The second step is to check whether the content of your data is actually correct. You might have an age variable that is negative because something went wrong while measuring it, or a turnover variable that is negative. There may also be connections between variables: you saw the example where, if a company has staff, there should be staff costs, so relations between variables have to check out. That is where data editing, or data cleaning, comes in. After all those steps you have what we call valid data: data that can be trusted to the extent that you think it is good enough to base statistical statements on and to draw conclusions that are statistically valid. So you first look at the technical part, then at the content.

From valid data you create statistics, and after that you create output. Statistics are just the numbers you create: totals, summaries, model coefficients, maybe a trained machine learning model. Also there you have expectations you want to check; for example, if you do a regression you may expect a coefficient to lie in a certain range, or at least to be larger than zero. In the last step you do formatting and reporting and make sure everything becomes readable: maybe it goes into a report, a website, or a dashboard, but that is where you format the results so they are consumable by humans.

The idea is that each of these five steps should be thought of as a product. Raw data is a product you get from somewhere, but it satisfies certain data quality requirements; for example, if it is web-scraped you really want it to be valid HTML, so you can process it further with, say, rvest or xml2. Input data also comes with a guaranteed level of quality: you want it to be, for example, a valid CSV file where text can be read as text, numbers can be parsed as numbers, and factor variables only contain valid factor levels. Then there is consistent, valid data, where you use domain knowledge to define your quality demands. For statistics you also use domain knowledge, but typically at a higher level: you look at aggregates, not at microdata. And the same holds for output. In principle, in each of these steps you can use validate to define data validation rules and to check, step by step, whether your data actually satisfies all the quality demands you have.

I like this idea very much for several reasons. Conceptually it is quite simple, and in my experience it scales really well. When I'm doing a data analysis or some modeling, I get data from somebody and put it in a raw-data folder; if it is data I have to retrieve with a script, for example from an API or a database, then that folder contains the script that fetches the data and dumps the raw data there. (I'm assuming here that the data is small enough that you can move it around.) I have a second folder, input data, with a script that reads my raw data and does some technical things to it, maybe converts all strings to UTF-8, cleans up encodings, renames columns if necessary, and writes the result to the input-data folder, and so on. So basically I have five folders, and only in my output folder would I have something like a markdown file that creates a report by reading results from the statistics folder. You can do this on your own, and if your life is anything like mine, then once you have some first results and show them to the people who gave you the raw data, they will tell you "oh, but I gave you the wrong data", or you're missing a subset, or you got a couple of the wrong variables. The only thing you then have to do is overwrite your raw data and run everything again, script by script, and you have a new result quickly. That's why I like to do it this way personally.

In our office this way of thinking is also the basis for setting up much larger production systems. We have departments with dozens of people working together to produce a set or sets of statistics, and all these production systems are designed with this as the basic principle: you define a number of intermediate steps, each with a well-defined level of quality. The idea is that when you design such a system you start by defining the quality of your output, and that determines what quality your statistics should have, what quality your consistent data should have, and so on. We like to say that the data travels from left to right, but your data quality demands travel from right to left. Of course, in reality things go around a few times before you're really ready, but conceptually I think this is a good way to organize your thoughts when setting up a production system.
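To make the folder-per-stage idea a bit more concrete, here is a minimal sketch of what one step in such a chain could look like. It is not how our production systems are literally written; the folder names, file names, and column names are hypothetical placeholders:

    ## minimal sketch of one step in the value chain (hypothetical paths and columns)
    library(validate)

    # stage 1 -> 2: read raw data and do the technical cleanup
    raw <- read.csv("raw_data/survey_raw.csv", stringsAsFactors = FALSE)
    input <- transform(raw,
      age      = as.integer(age),        # numbers stored as numbers
      turnover = as.numeric(turnover))
    write.csv(input, "input_data/survey_input.csv", row.names = FALSE)

    # stage 2 -> 3: content checks before the data may move on as 'valid data'
    rules <- validator(age >= 0, turnover >= 0)
    summary(confront(input, rules))

Each folder then holds a product with its own quality demands, and the confront step makes those demands explicit before the next script in the chain is allowed to run.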
Okay. Now, since we are talking about data at different levels of quality here, it is good to have some kind of standardized way of defining what data quality means, and that is what we do with data validation rules; that is also the core of the validate package. Let me first give the definition of data validation that we use in our office; it is actually an international definition agreed upon by the European statistical offices: data validation is an activity in which one verifies whether a combination of values is acceptable. In principle this could mean somebody sitting with a rubber stamp, looking at data written on paper, giving it a red stamp if it's wrong and a green stamp if it's okay, but we like to automate these things. The definition is not difficult, but it is kind of subtle, because "a combination of values" can mean a lot of things. It can mean a single data point, like "is age non-negative?", which you can check by looking at one value; but you can also check "does turnover minus cost equal profit?", and there you already need three data points.

Matt, you have a question, I think. Matt: when you talk about your work in official statistics, do you have a workflow for these rules? I often have the problem that I, as a data engineer, have to define these rules and then people come back and say "that's not the right rule, you have to adjust it". It would be great if there were a workflow or process where, let's say, borderline end users could define these rules themselves. I saw in the first script that you can read a rule file from outside; do you use that workflow, where you tell colleagues with less programming experience "can you define this set of rules for me", and they do it? Did you manage to get that going, and what's your experience?

That is definitely the idea, yes: people who don't know R do not need to see the validate package, but especially for simpler rules they should be able to write them down. What we very often see is that people work with spreadsheet software, or they write something in Access, and they actually, quite consciously, do a lot of checks on their data, but those checks are all hard-coded: hard-coded in SQL, or in special Excel tables they look at, or in certain plots they make to see if everything is okay. What we try to do is have a conversation with people: can you make this explicit? Try to get all that domain knowledge out of their heads and solidify it into data validation rules. People are usually very happy with that, although it takes some getting used to that you can actually separate concerns here, that you can separate defining data quality from processing data. That is not something you think of first when you're used to working in spreadsheet software. Matt: great to hear; I could imagine variable names being a bit of a problem, but it's great to hear it's working.

It's working, yes. And sometimes, for example, we are working on a demo for one department; they didn't do hard-coded checks so much, they were mostly looking at important aggregates. So we said: we will create a number of rules for you first and show you how it works. They got enthusiastic and said "oh, you can also check this, and this, and this". Once they see how it works, you can give a small demo, basically what you just saw in the script: here are some rules, here is your data. It is important to show people their own data, so they can relate to it; then you show them the results, and they usually get quite enthusiastic, like "I didn't know it was that easy", and that often inspires them to come up with extra rules and extra checks and write them down. Edwin: what is also good is that you can document those rules, so you can provide extra metadata and describe why a rule is in place. That helps a lot, because sometimes you have a technical check but there is an intention behind it: for example, turnover cannot be negative, because otherwise it would not be an enterprise. So you can describe why the rule is there. Yes, that is what I was about to say: you can just use comments, and it is very important to point people to commenting not the what but the why. Of course you may comment the what as well, for instance to group one chunk of rules and another chunk into sections, so to say, but the why is essential, because they have the domain knowledge and you don't. Exactly. Okay, let me come back to the presentation; thanks for the question, by the way.

So, a combination of values. For age you can check it with one data point. For "turnover minus cost equals profit" you already need three data points. If you check something like "is the average profit positive?", you need a whole column of data points; it seems a simple rule, but it actually needs quite a lot of data. And you can even ask: does the profit-to-turnover ratio differ less than 10 percent from last year's? That means I need at least two columns from my current data set, profit and turnover, and I also need two columns from last year's data set, so even more data points. We'll come back to looking at this more systematically. The main point is that we think this definition covers basically all your data validation needs; it is a very general definition, and it can be formalized as well. We don't have time to discuss that here, but you can find it in the references.
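Just to make those examples concrete in validate syntax, here is a sketch; the column names (age, turnover, total.costs, profit) are illustrative and loosely follow the retailers example data:

    library(validate)

    rules <- validator(
      age >= 0,                            # needs a single value per check
      turnover - total.costs == profit,    # needs three values from the same record
      mean(profit, na.rm = TRUE) > 0       # needs a whole column of values
    )

The cross-year ratio check additionally needs data points from a second data set; we come back to that when we discuss rule complexity.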
Now, a few words on why you would want data validation rules in this form at all. You saw that in validate we really separate things: you define your data quality demands in a sort of list, the validator object, and you can work with that as a list, making selections and so on, and then you release those rules on the data. Why would you want to do that? I think there are three or four important reasons.

The first is that with such rules it is very easy to communicate what the quality of a data set is. It is nice if I send you a CSV file and tell you "here you have the age and educational level of 100 people", but if I also send you a set of rules and tell you "here are 25 rules, and I guarantee that this data satisfies all of them", then for me that data is suddenly worth two, three, maybe ten times as much, because now you know something about that data: you know which quality checks have been done. So we're talking about a value chain of data, but even just performing data quality checks enriches your data set with metadata; you can say "I have tested it against all this domain knowledge". And even if it fails some of those checks, you know more, so it gives you important information either way. You can also unambiguously communicate your data quality demands once you formalize them in this syntax. Try to do it in an email: "age must be positive". Do you mean strictly positive or non-negative; do you accept zero or not? This is something people often mix up in written language; if you put it in a script as a rule, it is a hundred percent clear what is meant.

Another reason is that these rules themselves have a life cycle; that is something we talk about in the last section of this tutorial. Every rule you create imposes an assumption on your data set. For example, we can say turnover cannot be negative; that is true when you ask a company for its turnover, as a statistics office like ours does, but for the tax office turnover can be negative, because they maintain a different definition of turnover. That means, for example, that if the tax law changes, the rules we have to apply to our data sets might change. It's a very simple example, but it shows that if you separate your rules, put them in a file, and describe them, then once in a while you can review them, or you can hand the rule set to a colleague and ask: "I created all these rules, with explanations; could you have a look at them as a sort of peer review? Am I too strict, am I doing too much, am I missing something?" So rules have a life cycle, and that means you want to treat them like data: maybe you want to select all the rules that involve a certain variable, or, if you have a big database of rules, throw some of them out. And you want to treat them like code: version control, what was the rule set last year, who changed it, why did they change it; everything we know and love about git you can apply to rules once you separate them from your code.

The last reason, which we're not going to see much of today, is that rules are input for algorithms that improve data quality: we also wrote packages that take the rules from validate together with the data and try to adapt the data so that it fits the rules.

One last slide, and then we'll have a break. The whole idea behind the validate package, I would say, is to separate thinking about data quality from actually measuring data quality, so you can separate, as Matt said, what your domain experts do from what you as a data engineer or data scientist do. With validator you can read rules from the command line, as you did in the example, or from a text file; you can also use a structured YAML file that adds metadata, or read rules from a data frame or a database. What you get is a validator object, which is basically a list: you can index it with square brackets, the elements are named so you can use names as well as logical or integer indices to make selections, and you can concatenate rule sets with a plus. So you can really manipulate rules as if they were data, because in validate they are data; they are first-class citizens. You can also extract information from them, for instance which variables are covered by which rule.

Once you have a validator object, a list of rules, you can confront it with data in a data frame or, as Edwin will show you later, with data in a database. What you get in return is an object of class validation, a confrontation, that holds all the validation results, all the TRUEs and FALSEs, and you can extract information from it with various functions: you can summarize it, get the raw values in an array structure, or use as.data.frame to get a data frame back, so that you again have data which you can filter and so on. And you can use a validation object, for example, to select all the records that violate at least one rule, so you have a kind of work list. Okay, so those are the basics behind the validate package.
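Put together, that basic workflow looks roughly like this (a minimal sketch using the retailers example data that ships with validate; the rule names are illustrative):

    library(validate)
    data(retailers)

    rules <- validator(
      st  = staff >= 0,
      to  = turnover >= 0,
      bal = turnover + other.rev == total.rev
    )

    cf <- confront(retailers, rules)   # confrontation object: TRUE/FALSE/NA per record and rule
    summary(cf)                        # one row per rule: items, passes, fails, NAs, errors
    head(as.data.frame(cf))            # results as a plain data frame again
    violating(retailers, cf)           # records violating at least one record-wise rule

The same validator object can be selected from, extended with +, or written to a file, precisely because it is treated as data.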
Okay, welcome back after the break. Remember that before the break I gave a very short overview of the ideas behind the validate package, and I told you that rules need varying numbers of data points to be evaluated: some rules use only one data point, like "age must be positive", and some use many, like "the turnover-to-profit ratio should not differ by more than 10 percent between this year and last year". It turns out there is a way to reason about rule complexity, and that is what I want to tell you about now.

The intuition I have tried to develop so far is that a rule is complex, hard to evaluate, if I need many different types of information to evaluate it. The question then is: what do I mean by "different information"? We have to quantify this somehow. You can make this mathematically completely sound; I'm only going to show you the results of that analysis, but the idea is actually quite simple. To talk about the different types of information you need, you have to think about how you label a data point: if I give you a number, say five, what do you need to know to understand what that number means? What is its metadata? Once we know what that metadata is, we know how it can vary, and then we can measure what "different types of information" actually means.

So let me make that explicit with a small model, the drawing you see here, of where a data point comes from. We have some kind of population; think of all the people in a country, or all the companies in a certain region. At some point in time a unit is born into that population: a person is born, an email is sent, a product is put on the market, a company is created. From that time on the unit has certain properties, and actually that set of properties is what defines it as part of the population. One such property we call X, and during the time the element of the population exists the value of X may vary; think of the income of a person, or the turnover of a company. That is something we want to measure, and at some point in time, labeled tau here, we measure that value and get, say, the number five. After a while the element may cease to exist or leave the population for some reason. So this measurement, at time tau, of variable X on unit u of that population gives us all the information we need to know what the data point means.

Let me summarize that in the next slide. The intuition is that a data point is a key-value pair, where the key labels the value so that we know what it means, and that label should consist of four pieces: from which population the measurement was taken; when the measurement was made (tau); which unit was measured, which person or company or email from that population; and which variable was measured. Once you know those four things, the claim is that you know exactly what the value means. This is the minimal set of metadata you need to label a value exactly. Now, there are situations where part of it is obvious, so you almost never do this completely explicitly: if you know a whole table is about people in a certain city, you're not going to label every data point with that city again, because the whole data set pertains to that population. But in principle these are the things you need. We have a mnemonic for it: the label is (U, tau, u, X), population, time, unit, variable; if you remember that, you remember which metadata elements you need.

Now we can start labeling how complex a validation rule is, and we do that by asking four simple questions. I'll start at the bottom, because that is the easiest. If I have some quality demand expressed as a rule, and I want to evaluate it on a certain data set to see whether it is true or false: do I need one variable or more? For example, do I only need age, or do I need cost, turnover and profit? If I need one, I denote an S, for single; if I need two or more, I denote an M, for multiple. So we count like some of the earliest number systems: one, or many; those are the only two options we have. Similarly: do I need one or more population units? To check that turnover minus profit equals cost I can look at one single company, so I need only one unit; but if the rule is that the average profit must be larger than a thousand, then I need a lot of companies to compute that average, so I need multiple population units. Do I need one or more measurements? Do I only need this year's turnover, or also last year's turnover, that is, information from two separate measurements of the same variable? Again, one gives an S, two or more gives an M. And the same with populations: do I only need entities from one population, for example only people, or do I also need information about companies? That can happen if you do very high-level economic studies, where you compare entities from different populations. So these are four questions; for each answer you write down an S or an M, and the number of M's is what we call the complexity level of the rule.

Here are the rules I gave in the beginning, with their complexity levels worked out. "Age larger than zero" has complexity level zero: a single variable for a single unit, measured at a single time, for a single population, so four S's and no M's. "Turnover minus cost equals profit" is about a single company, three variables, a single measurement, and a single population: one M and three S's, so complexity level one. "Mean profit" again needs multiple companies, but only one population, one measurement time, and one variable: only one M, so complexity level one. And then there is the complicated one, where we compare the mean profit-to-turnover ratio at time t with that at time t minus one: that has three M's, so complexity level three.

I think there are two points here. One is that these complexity levels correspond quite nicely to your intuition about how complex a rule is. The other is that you tend to check the simpler rules at the beginning of your production chain and the more complicated rules at the end, and doing this kind of analysis also allows you to compare, for example, two production chains: if in one chain you have to do checks that are all level-three rules, it is probably also much harder to clean the data, because you have to involve many more data sets, the data handling is more complicated, and it is a lot harder to find out what is actually wrong. If "age larger than zero" fails, only age can be wrong; but if the mean profit-to-turnover ratio at time t minus the same quantity at time t minus one is not smaller than the threshold, the problem can be in the profit column at time t, the turnover column at time t, the profit column at time t minus one, or the turnover column at time t minus one. Things can go wrong in many more places when a rule of high complexity is violated.
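As a side note, those example rules can be written down in validate with the classification added as comments. This is only an illustration: the column names are hypothetical, and the last rule assumes that last year's data is supplied as reference data:

    library(validate)

    rules <- validator(
      age > 0,                               # S,S,S,S              -> level 0
      turnover - total.costs == profit,      # M variables, rest S  -> level 1
      mean(profit, na.rm = TRUE) > 1000,     # M units, rest S      -> level 1
      abs(mean(profit / turnover, na.rm = TRUE) -
          mean(prev$profit / prev$turnover, na.rm = TRUE)) < 0.1
                                             # M variables, M units, M measurements -> level 3
    )

    # last year's data can be passed as reference data when confronting, e.g.
    # cf <- confront(this_year, rules, ref = list(prev = last_year))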
I see a hand raised, so maybe you can ask your question here. Question: I see that the uppercase U, the population, is always single in these examples, and I was wondering in what cases we would see multiple populations, rather than something you can capture one level down as a characteristic of the population.

Yes, so a case where you would compare two populations is, for example, if you wanted to know the number of companies per person in a city, or the number of people per dwelling: how many people on average live in one house or one apartment. Then you need information about people, numbers of people, and you need information about apartments or dwellings. Does that answer your question? Follow-up: yes, but if I take, say, persons per dwelling, I can just add dwelling as a new column to the individual-level data set; does that turn the uppercase U into a single again? Well, it depends a little. If you have a data set of persons and for each person you also have the dwelling where they live, then you can use a kind of group-by analysis to compute that average and compare it with what you expect, like "is it in a certain range". It depends on how your data is set up: it might also be that you have two different data sets and need to get data points from two different physical places. Follow-up: perhaps that group-by abstraction is what determines the S or M; it brings it to the level of a single data set but still keeps the one-to-many relationship. Edwin: I think the example Mark is giving about the dwellings and the people is quite good, especially if you have extra information on each dwelling or apartment, for example the number of toilets or the number of rooms, and you also have extra information on the persons, and you compare those. Then you are comparing, or combining, two different populations and checking whether the information in the two populations is in conflict or not. Does that answer your question? Absolutely, thank you. Yes, and in IT, in databases, people talk about entity types; whenever you need to compare data from two different entity types, that is when you have two different populations. It is less common, of course; these are typically very complicated rules, and often they are applied at the end, when you already have statistics. For example, you see the growth in turnover in one sector of the economy, say agriculture, and maybe you want to compare that with the growth in the number of livestock; that comes from a completely different field, computed somewhere completely different, but you are then comparing two different populations. Edwin: it also happens in demographics: income per household or per dwelling is commonly used, but so is personal income, and those can be in conflict when they come from two different data sources.

Okay, the last thing I want to say is that not all combinations of M's and S's are possible. For example, if you only need information on one unit, say one person, then that person comes from one population; you cannot have one unit that comes from two different populations. So these are the combinations that can actually occur. If you're interested in the theory behind this, you can find more information in the reference you see here. The main point is that there is more to be said about these data quality rules: you can label how complex they are, and you typically organize them from simple to more complicated, from the beginning to the end of your production chain. I think this gives some background on why that is; there is a kind of natural classification in that sense.

Okay, so Mark introduced validate; this is a brief overview of what validate offers. You can see validate as a domain-specific language for rule definition: you can define basically any check on your data using plain R, and the only requirement is that it has to evaluate to a logical, so TRUE, FALSE or NA. Furthermore, we treat rules as first-class citizens: you can create rules, read them, update and change them, or write them, and you can also do all kinds of summarization and plotting on the rules themselves, and of course confront them with the data. Once you have these rules you can apply them to a data set, store them separately, and make all kinds of plots of the confrontation; that is what you do most of the time, some aggregation or plotting on the confrontation. But you can also do things with the rules themselves: you can see how many rules concern a certain column or variable, which allows a bit of reasoning about the data quality of your variables separately, for example how many failures concern a certain column, say age, or which rule is most often violated.

For example, look at this code. There is a function in validate called check_that, which is a very easy syntax for applying rules directly to your data without loading or specifying them beforehand. Here we apply it to the retailers data: we check whether turnover plus other revenue equals total revenue, apply some other checks, and do a summary. We use chaining here, the magrittr operator, the pipe. The summary gives you an output describing which rules were evaluated; the rules are automatically named in this case (you can name them yourself, but here you can see they are named V1, V2, V3). You can see the number of items that were checked; in this case these are the numbers of records, but if you have a rule about the whole data set or about groups, these numbers will differ from the number of records. You can see how many records passed each check: the first rule checked 60 records, of which 19 passed, 4 failed and 37 returned NA. And you can see whether there were any errors or warnings for the checks. These behaviors can all be switched on and off, so you can also do very strict checking where an NA is considered a failure, for example. Furthermore, you can see that validate rewrites some of the rules: the first rule, turnover plus other revenue equals total revenue, is rewritten as the absolute value of turnover plus other revenue minus total revenue being smaller than 10 to the minus 8. That is because there can be rounding errors; the tolerance can be tweaked, made bigger or smaller, and the rewriting can also be switched off. This is something to be aware of when you do hard equality checks, because sometimes you need to allow for rounding of the numbers.
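In code, that example looks roughly like this (a sketch based on the retailers data; the exact rules on the slide may differ slightly):

    library(validate)
    library(magrittr)   # for the %>% pipe used on the slides
    data(retailers)

    retailers %>%
      check_that(
        turnover + other.rev == total.rev,   # balance check, rewritten with a small tolerance
        turnover >= 0,
        if (staff > 0) staff.costs > 0       # implication: staff implies staff costs
      ) %>%
      summary()

The summary has one row per automatically named rule (V1, V2, V3) with the number of items, passes, fails, NAs, errors and warnings.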
So we saw the check_that syntax, but normally you would use the validator syntax, so that you can specify the rules yourself and also put names in front of them. There was a question in the Slack channel; Mark, could you check Slack, because my screen space is a bit limited right now, and relay the questions to me. Mark: there is a question from Matt about the YAML files; he is asking whether you can also read rules from JSON files. Edwin: not at the moment. Matt: I'm actually writing something for that as well. Edwin: not at the moment, no, but I think you can use other tools to transform YAML into JSON and back. That's true. Okay, thanks Mark.

What you can also do is make plots of these summaries: you do a confrontation, you do a summary, and you can plot it. It is very simple: you can see how many records were valid, NA, or failed for each rule. This is just for three rules, but you can imagine that if you have many rules this gives an easy overview of which rules are violated and which are not.

You can also put the rules into a separate file, for example a plain R file which you can document with R comments: "staff and turnover have to be non-negative", a balance check with a comment, and so on. It is just plain R, and you can read these rules with the following syntax: validator(.file = "myrules.txt"). This is a very simple format; the YAML format we will see in a couple of minutes, and it will also be in the exercises.
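A small sketch of that rule-file workflow (the file name and the rules themselves are only examples, written here from R for convenience; normally a domain expert would edit the file directly):

    library(validate)
    data(retailers)

    # write a small, commented rule file
    writeLines(c(
      "# staff and turnover are reported directly and cannot be negative",
      "staff >= 0",
      "turnover >= 0",
      "# revenue components must add up to the total",
      "turnover + other.rev == total.rev",
      "# having staff implies having staff costs",
      "if (staff > 0) staff.costs > 0"
    ), "myrules.txt")

    rules <- validator(.file = "myrules.txt")
    summary(confront(retailers, rules))

Because the file is plain text, colleagues without R experience can maintain the rules, and the comments are the natural place to record the why behind each rule.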
So this is a domain-specific language, and one of the nice things about R is that you can quite easily create small languages on top of it, just like this validation language. You can do all kinds of range checks, for example whether job takes a value in a given set like yes/no, or whether turnover is non-negative; you can combine all kinds of multivariate and multi-record checks; and there are logical implications, as Mark already told you: the last one, "if staff is bigger than zero then staff costs must be bigger than zero", is rewritten into a vectorized statement, so staff and staff costs are checked at the same time, and you can also see this when you do a summary; again, that is in the exercises. The validation language allows all kinds of comparisons, of course, because every statement has to result in a logical in R, so all comparison operators are available. The latest version of validate also has some extra functions for checking completeness: you can check whether one variable is complete, or whether a combination of variables is complete, which is a check that is often required in many statistical processes. There are all the standard boolean operations (not, all, any, and, or, if-else), and you can check for text-formatting issues, which is more helpful for technical checks: you can grepl on your data to check whether a pattern is present, you can check field length or field format, so whether a column complies with a certain format, and you can check functional dependencies, for example that city plus zip code uniquely determines street name. And you can build up your checks using the dot operator, as Mark already showed in the beginning.

There are some extra features in validate. You can create intermediate variables, transient assignments as we call them, using the := operator, which will be a bit familiar if you are used to data.table; this creates intermediate variables that are used in your computations, and you can use them in your other rules. For example, you can define a median turnover and use that median in other rules, and these assignments can be as complicated as you want, because they are just R statements; that is kind of nice. Another feature Mark already mentioned is variable groups: often you have a requirement that holds for a number of columns. Suppose we have the columns staff, turnover, other revenue and total costs; you can define these as a variable group G and specify that G should be non-negative, and this rule then expands into four rules, meaning that staff, turnover, other revenue and total costs should each be non-negative. So this is syntactic sugar that saves you from having to specify all those rules explicitly.

There is also some error handling in validate. Suppose you specify the wrong variables: say we check that the women data set in R contains just the two variables height and weight, and suppose we misspell the height variable. Then validate will throw an error that the variable was not found, and mark that rule as errored. That is something different from failing: a fail means the rule was correct but a data record did not satisfy it; an error means the data the rule was confronted with did not contain the variable at all. That is quite helpful if you have large data sets, because spelling errors do occur.
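Before moving on, here is a compact sketch combining several of the constructs just described, using the retailers data; the size categories and the ten-times-median threshold are illustrative examples, not official rules:

    library(validate)
    data(retailers)

    rules <- validator(
      size %in% c("sc0", "sc1", "sc2", "sc3"),      # value-set check
      if (staff > 0) staff.costs > 0,               # implication, evaluated vectorized
      is_complete(turnover, total.rev),             # completeness of a variable combination
      G := var_group(staff, turnover, other.rev),   # variable group ...
      G >= 0,                                       # ... expands to one rule per variable
      med := median(turnover, na.rm = TRUE),        # transient assignment
      turnover <= 10 * med                          # intermediate value reused in a rule
    )
    summary(confront(retailers, rules))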
You can also name the rules. In validate you can name rules simply by prefixing them with a name, for example to_pos for a turnover positivity check or balance for a balance check. If you print the rules, they are decorated with those names, and the names are stored inside the rules object itself. You can also do rule selection: the rules object works like a list (it is not literally a list, but it behaves like one), so you can select rules by number, that is by their position, or by name if you have named them, and all the standard list selection mechanisms are available. These rules carry metadata: if you select a single rule with the double square bracket operator, you get an object of class rule, and you can see the expression, so how the rule is defined (turnover plus other revenue equals total revenue), and the name. We also provide a label, a more expanded name; the name is just a short reference point, which you often see in production systems. In our office these names are quite short, just two- or three-letter abbreviations of the rules, and the labels are a bit more expanded. Rules can also carry a description, the why of the rule, and you can see the origin of the rule: in this case it was defined on the command line, in the script itself, but if it was loaded from a separate, external file, the file it was loaded from is recorded, together with when it was created. And there is a slot for extra metadata, so if you have some very specific metadata of your own you can put it in the meta field.

You can also combine validators: adding two rule sets creates a new rule set. If you have a validator object with the single rule x greater than zero and another with x less than one, you can just add them and you get a new rule set, so you can combine all kinds of rule sets. You can also change these rules, store them in a data frame object, and load them from a data frame again with this syntax; validator allows loading rules from a file with the .file argument, and it can also read and write rules from data frames. Some of the production systems in our office load their rules from a database in this way: they have a separate table describing which rules the data should comply with, and they store those rules in a database table.

There are also all kinds of knobs on validator. You can make it stop at each error: in the example where we checked the women data and misspelled height, the rule was incorrect but the other rules were checked anyway and validate simply reported that there was an error. You can tell it not to catch such errors but to stop instead; if you set the raise option to "all", checking just stops instead of reporting the error, which can be helpful while you are developing rules. In a production system you typically do not want that, because it means the whole script stops, while usually you just want to know that one rule errored and still get the results of the other rules. There are some other options as well; I mentioned the rewriting of linear equalities: you can specify what the tolerance for that rewriting should be with the setting lin.eq.eps, and in this case it rewrites the equality into a statement that allows for a rounding error of 0.01.

Mark: maybe it is good to mention that you can set these options at different levels. You can set them globally, for example voptions(raise = "all") or voptions(na.value = FALSE), and then they hold for every rule set you create after that, for the whole R session. But you can also attach the options to a specific rule set, as in the second statement, and then they are only valid for that rule set; and in the third form you set them only during the confrontation, which is even more local. Edwin: yes, thanks. Mark: one more thing about na.value: you may have seen that if a value is missing, for example when you check that turnover minus cost equals profit and profit is missing, then the result of the check is NA as well; with na.value you can say that if something is missing you want the check to return FALSE instead. There are more options, but these are a couple of them.
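A quick sketch of those three levels of setting options; the specific values are only for illustration:

    library(validate)
    data(retailers)

    # 1. globally, for everything created later in this R session
    voptions(na.value = FALSE)              # count missing results as FALSE instead of NA

    # 2. attached to one rule set only
    v <- validator(turnover + other.rev == total.rev)
    voptions(v, lin.eq.eps = 0.01)          # allow a rounding error of 0.01 in equalities
    voptions(v, raise = "all")              # stop immediately on errors while developing

    # 3. only for a single confrontation
    summary(confront(retailers, v, lin.eq.eps = 0))   # exact equality, just this once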
Edwin: thanks. The other thing is that you can also store these options in the YAML file if you want to; so, as Mark says, you can attach options to a rule set, and those options can also be stored together with the rules in an external YAML file.

The last slide: sometimes your data is big and stored in a database. There is an extra package called validatedb, which can execute most validate checks on the database: the record-based checks, and in general all checks that can be translated by dbplyr, can be executed on the database, and many checks are that simple, so that helps. These checks are automatically translated into SQL code and executed on the database, and there are also features for extracting the generated checks and for keeping the check results within the database itself. Okay, so that was a brief and quick overview of the validate package.
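For reference, a minimal validatedb sketch; this assumes the validatedb, DBI and RSQLite packages and pushes the retailers example data into an in-memory SQLite database, so treat it as an illustration rather than a definitive recipe:

    library(validate)
    library(validatedb)   # confront() support for database tables
    library(DBI)
    data(retailers)

    con <- dbConnect(RSQLite::SQLite(), ":memory:")
    dbWriteTable(con, "retailers", retailers)
    tbl_ret <- dplyr::tbl(con, "retailers")          # lazy dbplyr table

    rules <- validator(staff >= 0, turnover + other.rev == total.rev)
    cf <- confront(tbl_ret, rules)                   # rules are translated to SQL via dbplyr
    summary(cf)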