That's part of why I change jobs all the time. So yeah, I'm going to talk about validating big data and machine learning pipelines. I think it's a super important topic, and I wish more people validated their pipelines. Oh, I need to speak up. OK, cool. I'm going to try yelling, and this is going to be a long day. So if my voice starts dipping and you can't hear me, let me know, and I will start yelling some more. Yay!

OK, my name is Holden. My preferred pronouns are she or her, and I'm a developer advocate at Google. I'm on the Spark PMC, which is why most of my examples involve Spark, even if it is a really bad idea. But that's OK, because you can apply the same techniques with other tools. I'm a co-author of two books on Spark. Neither of them talks about anything in this talk, but that should not stop you from buying my books. That is the more important part. One of the things that I've started doing that I think is kind of cool, if anyone is particularly interested in how projects like Spark do code reviews, is code review live streams, where you can watch me review open source pull requests live and try and not swear. I fail at the second part, but I still think it's fun to watch and join in.

OK, cool. In addition to what I am professionally, I'm trans, queer, Canadian, in America on an H-1B work visa, which they're debating whether or not they want to keep. It's a really great feeling to not know if you can renew your work visa or not. And I'm part of the leather community. This is not directly related to big data or machine learning. However, I think for those of us who are building machine learning pipelines, or even just tools with data, it's really important that we try and build diverse teams. And this includes us in the open source community. If we don't talk about where we're all from, and if we don't build diverse teams, we're just going to recreate yesterday's problems more efficiently. And that is not really what I want in life.
I want us to find new solutions to new problems, and I think diverse teams will help us do that. OK, that being said, I'm not going to talk about that. We're going to talk about how to avoid having everything catch on fire, and why you should do it, though given that you're here, you might be fairly convinced. And I promised at least one cat picture, and at least one picture of my scooter club, which is only tangentially related. But I do wish to try and expense my gas, so I'm working on making it more related to computers.

So hopefully you're nice. Is anyone here familiar with Scala? Yay, friends. Thank you. OK, cool. And if you're not familiar with Scala, that's totally fine. How many people are familiar with something like Spark, Beam, or Flink? OK, that's a good number of people. If you're not, it's OK. These same techniques apply to other systems. Generally they're a lot easier to do in non-distributed systems, so if you happen to be working locally, your validation tasks become way, way easier. And hopefully you're here because you want to make better software. If not, I can't convince you.

OK. So validation is really important. Driver's license tests can be very similar to how we test our software, which is to say, they're better than nothing. I would not want my friends to start riding a motorcycle without a license, as my friend is doing here. But at the same time, even if someone has passed their motorcycle license test, they're probably still not the safest driver. And it's just like your code: your tests cover the basics, but they're not going to catch everything. There's always going to be that strange SUV operated by some random drunk person, or you're going to have some null records in your CSV files randomly. One of the two is going to happen, and your pipeline is going to fail, or you're going to crash. And ideally, you want to know when something has gone wrong so you don't make it worse. So our tests are not perfect.
We are eventually going to get on the fail boat, and at some point, you want to minimize the impact of this. Does anyone here have to carry a pager? That is very, very few people. You are all very lucky people. I am kind of jealous. Someone else in your organization may be woken up at 3 o'clock in the morning to do a rollback of your data pipelines. You probably don't want them being very angry with you, because you're going to need things from your operations staff. So it is still worth saving other people from being woken up at 3 o'clock in the morning, even if it is not saving yourself.

So I did a survey of how many people have had Spark jobs cause a serious production outage. 15% said yes, 50% said no, and 30% were like, "I didn't have to update my resume, but we did lose a few million dollars," also known as "it depends on what you mean by serious." You really don't want to be in the 15%, and you don't want to end up in the 30% either. You don't want your pipelines causing really bad failures. And I have a survey if other people want to give feedback on this.

As more people deploy their pipelines automatically into production, and as more people start doing streaming data, you don't have the same time to do manual verification and validation that you used to. With scooters, the equivalent could be going home after an accident rather than checking your bones to see if they're all put together. A bit of a stretch; you'd probably notice if your arm was broken. But with computers, we might not notice that we broke a feature and cost our employer a few million dollars. I've done that. It was a really awkward one-on-one the next day. Like, I updated my resume. I did everything right. But I was really stressed about that, and you don't want to have that experience.

There was another time, much less stressful. I just assumed that everything was a coffee shop, and my only test query was the word coffee.
And then my boss got upset that we were returning Starbucks when he was trying to find a steakhouse. And that was an awkward call, but not that awkward, because we were a startup and we didn't make any money. So it was just like, man, whatever, it's jokes. And then there's other ones. Some words can have multiple meanings, and those meanings can be really awkward to have to explain to people. And if you have tools which do a good job of keeping minors from seeing inappropriate content, you really don't want those tools to break. And it's really easy to have that happen when your data changes.

On the money side of things in America, the Veterans Affairs agency couldn't pay a whole bunch of people because of data pipeline problems. And Bank of America, because of not validating data, foreclosed on a whole bunch of people's mortgages. And other generally really bad, terrible things have happened to specific people from us not doing a good job of catching data quality issues. So hopefully this is enough that you care and you'll pay attention. And if none of these problems are things that you care about, it's OK. The internet is working, somewhat. So let's do some validation. Yeah. OK, I'm going to take the unicorn horn off now.

So another thing that we might want to validate is: are our Slurpees drinkable? Do you have the concept of Slurpees here? It's an ice drink with more or less sugar and some food coloring and something resembling flavor. And occasionally these machines are at gas stations, and maybe they're not the most well-maintained machines. And you might not want to drink something from a machine which is fundamentally just a giant Petri dish when it's not operating correctly. So a good validation rule that we might have for consuming Slurpees is that our Slurpees should have the food coloring in them. If they don't, something has gone wrong, and I probably don't want to get sick. Maybe that could happen.
But the new Slurpee ghost white gummy flavor is going to break our validation rule, because now there is a Slurpee flavor without food coloring. And this is OK. I think it's totally fine to have validation rules that break occasionally, and I think this is perhaps different from tests. We really don't want tests to break occasionally; we want our tests to be deterministic. But with validation rules, since they're being used as a sort of last-resort catch, it's OK if they're sometimes wrong. They can't be wrong so often that people just start to ignore them. But if they're occasionally wrong, once a month, and people don't start tuning out their alerts, it's OK.

So how do we make validation rules? Hopefully, at some point, you've had software that worked. If you do not have any software that works at all, validation is not your problem. It is time to go and fix your software. But you've had software that works, and maybe you can collect some metrics about how that software is working. And then we can look at future iterations of our software, and if it's not looking like the previous iterations, maybe that's a thing that we can do something about. And we can do similar things for our inputs. We can look at: do today's inputs look similar to yesterday's inputs, or is the rate of change between the days' inputs within reason?

Right. OK. Does anyone in the audience have something equivalent to this, where you load some data, you try and parse it, but maybe the schema doesn't apply, maybe it's missing a field, maybe something's set to null, and, whatever, you just throw away the bad data? It's OK. We'll just keep the good data in our pipeline. Does anyone do this? Am I the only bad person here? OK. So a lot of people are raising their hands, and they're only half raising them, which makes me think more of you do it, but you don't want to be caught. And this is OK, right?
Especially for anyone who has to work with JSON data: if you required that all of your inputs were completely correct, you would never produce a result, right? That's just not going to happen. JSON data is garbage. So we're going to have something like this. And that's OK, but the problem is we might throw away 99% of our data when, historically, we've only been throwing away 1% of our data. And if that 1% is no longer a representative sample of my users, and I'm training a recommendation model, I might make some really bad recommendations or other kinds of decisions based on this.

So we could go ahead and write a check: is it valid? And then we can go ahead and count: is it valid? And we can say, if we have less valid data than bad data, we'll put some special business handling logic in here. If you're in Java, throw an exception, because everyone likes exceptions. And just do something. And this is technically a validation rule. It's not a very complex one, but it's a good start, and it's better than nothing. And at this point, your special validation rule can trigger, and since apparently all of you are very lucky and have operations people carrying the pagers, they'll come and take a look at it and then yell at you the following morning.

It's less fun, though, right? This code is less nice to write, and that's the thing I care about. In terms of working in Spark, there are some performance problems with doing these two counts: triggering two separate actions can mean recomputing the underlying data, which is kind of not great. And it's similar in other systems. Beam is different, because it has a sort of whole-program optimizer, to a degree, so if you're in Beam, this last part doesn't apply. But yeah. OK. But we could use counters. Yay, counters. And the other thing is we don't have to define all of the counters ourselves, right? All of these systems keep track of some metrics themselves already.
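As a sketch of that count-based rule, here's plain Python standing in for Spark. The `is_valid` predicate is hypothetical; substitute whatever "parses correctly" means for your data:

```python
# Sketch of the "count valid vs. invalid" validation rule.
# Plain Python stands in for Spark here; is_valid is a hypothetical
# predicate for whatever "parses correctly" means in your pipeline.

def is_valid(record):
    # Hypothetical rule: a record is a dict with a non-null "user" field.
    return isinstance(record, dict) and record.get("user") is not None

def filter_with_check(records):
    valid = [r for r in records if is_valid(r)]
    invalid_count = len(records) - len(valid)
    # The validation rule: if we threw away more data than we kept,
    # something has probably gone wrong upstream -- stop the pipeline
    # instead of quietly training on the leftovers.
    if invalid_count > len(valid):
        raise ValueError(
            f"More bad records ({invalid_count}) than good ({len(valid)})"
        )
    return valid
```

In Spark this shape becomes two separate `filter`/`count` actions, which is exactly where the double-computation cost mentioned above comes from.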
And these metrics can have a lot of really useful information that can tell us whether or not our job is operating normally. We can look at the number of bytes being shuffled around. We can look at the number of records being read. Execution time is super basic, but if you've got a machine learning pipeline that takes three hours to train, and today it takes 15 minutes, you do not want to push that model to production. There is something going on. I'm sure you just sped up your code and you're amazing, but maybe it's time to spot-check this one. And we can add counters for things that we ourselves have either had catch on fire or suspect might catch on fire. And we can still pretend it's nice functional code. We can hide all of this mutation of counters inside of the underlying systems and pretend that we're writing functional programming code, for those of you who care about functional programming.

And right, yes. So counters solve our problem in the same way that regular expressions can solve our problems, in that they produce a new problem for us to solve. But it's different, so it's like they solved the first problem. So we have a happy counter and a sad counter, and we see how sad we are at the end of our job. And if we're really sad, we won't do anything. We'll just go to, like, you don't have Taco Bell. Pizza Hut? Pizza Hut. We'll go to Pizza Hut and we'll get some pizza and we won't push our model to production. We'll take the rest of the day off. And that's better than nothing.

OK, but this is still really tightly coupled to our code, and that is less than great. Oh, right. And there's a bunch of problems with counters. Some of the problems are that Beam's counters are implementation-dependent. So you can change the runners for Beam, but doing that will change the behavior of your counters.
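The happy/sad counter idea can be sketched like this, again in plain Python with made-up names. In Spark these would be accumulators and in Beam they'd be metrics counters, but the shape is the same: count as a side effect during one pass over the data, then check once at the end of the job:

```python
from collections import Counter

# Sketch of the happy/sad counter approach: one pass over the data,
# counters updated as a side effect, checked once at the end of the job.
# In Spark these would be accumulators; in Beam, metrics counters.
counters = Counter()

def parse(raw):
    """Parse a raw line into an int, counting successes and failures."""
    try:
        value = int(raw)
        counters["happy"] += 1
        return value
    except ValueError:
        counters["sad"] += 1
        return None

def run_job(lines):
    # The transformation still looks functional; the mutation is hidden
    # inside the counters.
    results = [v for v in map(parse, lines) if v is not None]
    # End-of-job validation rule: if we were sad more often than happy,
    # don't push anything to production.
    if counters["sad"] > counters["happy"]:
        raise ValueError(f"Too many bad records: {dict(counters)}")
    return results
```

Because the counts piggyback on the single pass that produces the results, there's no second action and no recomputation, which is the advantage over the two-count version.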
So if you've got something that's working locally and start using it on a cluster, it's totally possible that the validation rules that you've been working on locally will just stop working. And that'll be a great experience. Or if you're trying to change runners, and you want to use your validation rules to validate that your new runner is performing reasonably, it might just not be, and you won't know. And Spark's counters have their own problems too, with data properties like double-counting when tasks get retried.

But where do we put our counters? Fundamentally, we have to understand our problem domain. And this is very similar to the problem domain of running a queer scooter club, which is: everyone loves glitter and bubbles and scooters, but it turns out that putting a bunch of soap on the road was a bad idea. Who could have foreseen that having a bubble machine would cause accidents during a parade? No one, obviously. But if we spend time thinking about our problem domain, we can add some tests, or sorry, validation rules, like: the road should not be more slippery than I'm used to. If it is, maybe let's turn off our bubble machine and stop riding for a little bit and think about what's going on.

So what do people do in practice for making their validation rules? Really depressingly, I ran a survey, and it turns out that the only things most people validate are execution time and the number of records they read in. They're like, you know what? If it took three hours today and it took three hours yesterday, and yesterday it read six gigs and today it read seven gigs, that's fine, right? Everything in between, whatever. It's probably the same. It took the same amount of time. How many people think that's enough? What, oh, please tell me you don't work at a bank. Oh dear, oh dear, at least not my bank. Okay, well, whatever, it's not like I have a lot of money, so not a big problem for me anyways. Maybe you'll lose my mortgage, that'd be great. Okay, so Spark Validator.
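Even that minimal execution-time-and-record-count check can be written as a reusable historic rule. Here's a sketch; the metric names and the 50% tolerance are made up for illustration:

```python
# Sketch of a historic validation rule: compare today's job metrics
# against yesterday's and flag anything whose relative change exceeds
# a tolerance. Metric names and the 50% threshold are illustrative.

def check_metrics(today, yesterday, max_relative_change=0.5):
    """Return a list of (metric, relative_change) pairs that violate the rule."""
    violations = []
    for name, old in yesterday.items():
        new = today.get(name, 0)
        if old == 0:
            continue  # no baseline to compare against
        change = abs(new - old) / old
        if change > max_relative_change:
            violations.append((name, change))
    return violations
```

The three-hour training run that suddenly takes 15 minutes, as in the earlier example, is a roughly 92% change and would trip this rule; a six-to-seven-gig day-over-day input change would not.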
So I did a proof of concept, and I actually have a second proof of concept as well. This one is integrated into your Spark job, and I have a second one which integrates into your Airflow pipeline. And then you can just go ahead, put in all of these counters, and then you can define historic rules on your counters and say how they should be related from today to yesterday, over time. And then it's nice, and you can have sort of decoupled stuff. Input schema validation is really cool; we can write our own input schema validation. We can look at the percentage of data that's changed. And there's also, come on. Okay: validation rules can be separate stages.

Right, and this is important, and it's the difference between Spark Validator and the second proof of concept that I made: to an extent, we can do data validation in parallel, in a separate process. That's to say, not all of our validation rules have to be about how our program is behaving. They can be about: do the summary statistics of our data look similar today compared to yesterday? Because a change there can be a really good sign that something has gone wrong. And then we can run that in parallel without slowing down our main job, which is really cool too. And in fact, there's a tool to do that that runs on top of Beam. It's called TFDV. Despite being called TensorFlow Data Validation, you can use it for things other than TensorFlow. It's also open source, and you can use it to compute some basic statistics about your data, compare them to the previous statistics, and also find anomalies based on your schema. And this is a really useful tool. There are some limitations on the places where you can deploy it right now, but it's open source, and you can take the ideas and apply them to your own system.

Software changes too. When your software changes and your data hasn't, you should run your old software against your new software and see if the results look the same. It is really simple and can save you a lot of time.
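The summary-statistics comparison described above can be sketched crudely in a few lines. To be clear, this is not TFDV's API, just the core idea: compute per-feature statistics for each day's data and flag features whose distribution has shifted. The feature names and the three-standard-deviations threshold are made up:

```python
import statistics

def summarize(rows):
    """Per-feature (mean, stddev) for a list of dicts of numeric features."""
    stats = {}
    for key in rows[0]:
        values = [r[key] for r in rows]
        stats[key] = (statistics.mean(values), statistics.pstdev(values))
    return stats

def find_drift(today, yesterday, max_shift_in_stddevs=3.0):
    """Flag features whose mean moved more than N of yesterday's stddevs."""
    today_stats = summarize(today)
    drifted = []
    for key, (old_mean, old_std) in summarize(yesterday).items():
        new_mean, _ = today_stats[key]
        if old_std > 0 and abs(new_mean - old_mean) / old_std > max_shift_in_stddevs:
            drifted.append(key)
    return drifted
```

Because this only needs the data, not the job's internals, it can run as a separate parallel process without slowing down the main job, which is the same design point TFDV takes on top of Beam.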
Even when you think that only unrelated changes have happened, because your unrelated changes might change your FORTRAN libraries, which, oh dear God, are somehow related to Spark. Okay. And we can put it all together in Airflow, yay. So if you use the second proof of concept that I have, you would put in your Spark submit operator, then a data validation task, and you can define it as dependent on the important business logic task. If you were doing the parallel stuff, you wouldn't define the dependency graph that way. You would instead just have it run in parallel, and then you can use the bash operator to just call the TFDV stuff that we showed you here. And that's all kinds of fun.

Some ending notes. You don't have to be perfect. Just do something. Something is better than nothing, even if all you want to validate is execution time. That's okay. It's better than not doing anything. Just start somewhere. Here's some related links. Here's some books. This one is unrelated, but that should not stop you from buying it. I also have a project to teach distributed computing to children. Probably not the children that you like, but the children who don't have your cell phone number, but you still want to convince them to join us in this wonderful adventure that we call programming. And this is not a joke. It sometimes gets confusing.

Okay, so this is, okay, thanks, bye. I think you said there were five minutes at the end for questions, right? Cool, so I've got time for a few questions if anyone has them. Yay, a question. So the question is: I've been talking a lot about machine learning being more and more integrated into distributed systems. What about algorithms which are not easily parallelizable? Do I see distributed systems fitting into that?
And I think that, yes, even for algorithms which are not easily parallelizable. My friends at a company which will remain nameless, because they know where I live, use distributed systems to do all of their feature prep, and they also do downsampling on their large data: they construct representative samples and do their training locally with non-parallelizable algorithms. And honestly, in real life, you tend to spend, I shouldn't say in real life. In real life, I tend to spend more time doing feature prep than cool machine learning stuff. And so you'll still have all of this big distributed system work, but then at the end you're still going to want to rent a really giant node for, like, the six hours it takes to train your fancy model. Because there is no magic wand of parallelization. If there was, I would make a lot more money. Or less, I don't know, actually, if people wouldn't need me. One of the two.

Okay, question number two. Oh, yeah. So I should have, sorry, I was rushing because I got the you're-out-of-time signal. But I'll come back here. So this data validation task, oh, right, sorry, okay. The question was: I showed an example in Airflow. Why was I doing the validation after the important business logic, rather than in parallel? So the Spark Validator example that I have looks at the counters that are output by your Spark job to do the validation. It doesn't look at the data independently. TFDV looks at the data independently; the Spark Validator tool that I made looks at the metrics produced by your Spark job. And that's why I have the dependency chain the way that I do here. Awesome. Cool, I think I'm probably out of time. Thank you all.