Cool. Hi everyone, thanks for joining me this late in the day for my talk. My name is Niels Bantilan and I'll be talking to you about enforcing data quality in data processing and machine learning pipelines with Flyte and Pandera. Before I hop in, just a little bit about myself. My background is in biology, dance, and public health, but I made a little pivot and now I'm a machine learning engineer at Union AI, a Seattle-based startup. As part of my role at the company, I'm one of the Flytekit maintainers. Flytekit is an open source package that's part of the Flyte ecosystem; I'll get into that in a little bit. I'm also the author and maintainer of Pandera, which is the other subject of this talk. It's an independent project that I've been developing over the past two or three years. In my free time, I like to build machine learning and data science tools and train whimsical models just for fun. The outline of this talk is as follows. I'll start with a problem statement, phrased more as a question. First, out of curiosity, just a show of hands: who here is familiar with data frames as a general kind of data structure? Cool, everyone knows what a data frame is. I'll then get into Flyte as a type-safe data pipelining tool and Pandera as a statistical data testing tool. Putting the two tools together, I'll show you a high-level architecture for an example model-training pipeline, walk through a little demo of that pipeline, and close with some takeaways and next steps, both for you and for me. So what's in a data frame? Since everyone here knows what it is: it's basically a table. Generally it's two-dimensional, though it can really be n-dimensional, and it's a table with human-readable, or sometimes non-human-readable, labels for the axes of the data structure. The intersection of, in this case, a column and a row gives you a value, and how each value is stored depends on the library.
It could be columnar or row-wise, but you can think of it as a table you're all familiar with. Just to ground this conversation: my background is in data science and machine learning, and I do most of that in Python, so in this talk I'll mostly be focusing on pandas data frames, an in-memory data structure with a lot of little niceties that make it really easy to do statistics, plotting, all that stuff, as long as your data fits on whatever machine you're using. In recent years, several projects have been trying to scale up data frames. Some of these try to emulate, or at least reproduce, the pandas API to make it easy for new users to onboard and minimally change their code to scale to multiple machines, or maybe GPUs. These projects include Dask, Modin, Koalas, Polars, and RAPIDS, and there are probably more that I'm not mentioning here. There are also projects that don't conform to the pandas API but nevertheless offer a data frame API where you can do similar transformations at scale. So, to sketch out the problem statement a little more, take a look at this innocent piece of transformation code. Super simple. It's a function called transform_data that takes in this poorly named thing called df, a data frame. Since this is Python land, anything can be anything, but here we're assuming df is a data frame. The transformation is simple: you assign column c to the sum of a and b, an element-wise operation, then you double c to get d, and you return the data frame. If the contents of this data frame are integers, the add and multiply operations are defined and we get the expected output: c is literally a plus b, and d is literally double c. And all is well.
This is the intended behavior I wrote this code for. But again, in Python land, if the contents of the data frame are strings, we don't get any runtime errors; the code happily executes, because the addition and multiplication operators are also defined for strings. That might be fine, but in this toy example I didn't intend for strings to be transformed in this way, even though it's a completely legal thing to do, at least in pandas. Now, if the contents of a are integers and b are strings, we run into a TypeError at runtime: the addition operation isn't defined for an int on the left-hand side and a string on the right-hand side. So here we run into trouble, and the middle case really highlights the thing that has burned me when doing ad hoc data analyses, or even writing code that might go into production at some point: errors that pass silently, so I don't find out until way after, say, training a model or doing a statistical analysis, that the data is not what I expected it to be. So the central question of this talk is: how might we enforce the quality of data as they flow through data processing pipelines, especially when they involve data frames? Data frames, at least in the pandas case, are kind of schemaless. The dynamic nature of Python and pandas means I can assign and mutate things at will, and if you've ever used these libraries, this can really bite you at some point. The first part of the solution, the way we can avoid these kinds of silent errors, is Flyte. I'll get into Pandera a little later, but as I said, Flyte is a data processing and ML orchestration platform. You can see here that you can pip install it; Flytekit is the Python SDK for Flyte.
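To make the three scenarios from the earlier example concrete, here's a small reconstruction in pandas (the slides' exact code isn't reproduced here; the column names a, b, c, d follow the talk):

```python
import pandas as pd

def transform_data(df):
    df["c"] = df["a"] + df["b"]   # element-wise sum
    df["d"] = df["c"] * 2         # double it
    return df

# Intended case: integer columns behave as expected.
ints = transform_data(pd.DataFrame({"a": [1, 2], "b": [3, 4]}))
# c == [4, 6], d == [8, 12]

# Silent failure: string columns also "work" (+ concatenates, * repeats).
strs = transform_data(pd.DataFrame({"a": ["1", "2"], "b": ["3", "4"]}))
# c == ["13", "24"], d == ["1313", "2424"] -- no error is raised!

# Loud failure: mixing int and str raises a TypeError at runtime.
try:
    transform_data(pd.DataFrame({"a": [1, 2], "b": ["3", "4"]}))
except TypeError as exc:
    print(f"caught: {exc}")
```

The middle case is the dangerous one: the code runs to completion and the bad data flows downstream.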
I won't get into all the nuances and features of Flyte; I'll just say that Flyte is like the stratosphere: it has many layers, and it can mean many things to many different people. If you're more on the infrastructure side, you might think of it as a Kubernetes-first workflow orchestration engine. If you're a data engineer, you might think of it as an automated data lineage tracking platform, which it certainly supports. But as a machine learning engineer who trains models and creates the datasets that supply the training data for those models, I think of it as a type-safe data pipelining language. And I think that's very powerful, because you can avoid many kinds of errors, one of which I highlighted earlier. For a quick tour of Flyte, I'll be referring to these simple toy examples: the transform_data function that computes a plus b and then c times 2, and another function that takes the output of that and aggregates the data by taking the mean of each of those columns. The first thing you do with Flyte is write tasks. Tasks are the main workhorse of the Flyte platform: isolated pieces of data processing. There's a lot going on under the hood, but you can think of everything under each @task-decorated function as isolated within a container, a Docker container that is. The second thing you can do is compose workflows. Workflows are tasks connected together in sequence or with some conditional logic. I won't get into everything you can do with Flyte when composing workflows, but in this simple example I'm calling transform_data on some input data frame to get transformed_df, then aggregating that and returning the output. You can see that I'm using Python's type annotations to indicate what the input and output types are.
That's informative in that you expect to input and output a data frame, but it doesn't really give you more than that. I'll get to this later; for now, I'll explain a little more about what Flyte provides. With Flyte, you can develop locally: all the code you saw earlier runs locally and executes as regular Python for the most part. Then you can deploy to scale. As I said earlier, I won't get into the deployment and backend aspects of Flyte; I'll focus on the local user experience before you try to scale things. As you see here, I'm calling pipeline, the workflow I defined earlier, and I get this result: a, b, c, and d with the mean, median, and standard deviation of each of those columns. There are a bunch more capabilities. Flyte, both locally and in the backend, caches task outputs. As I said, it tracks data lineage, so you can see, as your data passes through your workflow and tasks, what the state of the inputs and outputs is. You can also recover and resume partially completed workflows, and it integrates with an ecosystem of data tools that I won't get into too much. Just to highlight again, what I really value it for is that I can build type-safe DAGs. As an author, a developer, a data scientist, this tool helps me think about the types, so that in the future I don't forget what the types are, and my collaborators can see exactly what the intent of the code is. That's very valuable. There's maybe a trend toward adopting typing and type annotations in Python more broadly, and as that happens, I think many people in my data science and machine learning circles will see the value in this. Returning to the rationale: high-quality, clean data is critical for data analysis and modeling.
If your expectations about your training or analysis data are broken, and the break passes through silently, say a numerical column has negative values when only positive values are valid, and you're making business decisions or deploying a model trained on that invalid data, then you're wasting time training models. It's also challenging to isolate and trace the origin of bugs in these kinds of analyses and models: you potentially have to backtrack all the way to the very beginning of your pipeline. So again, Flyte is a strongly typed data pipeline framework. Here I'm trying to define a Flyte task, but I've neglected to provide a type hint for the df parameter. Oops. Flyte will complain: an untyped input is not supported in Flytekit. You have to provide something. It could be just pd.DataFrame, but at the very least you're saying what df is expected to be. And composing tasks with mismatched type signatures is something that has also burned me many times. Here I'm taking transform_data, which has no return annotation, and trying to hook it up to aggregate_data. Flyte will complain: it says it cannot pass the output from task transform_data, which produces no outputs, to a downstream task. This is another great feature to have, because now that you have type information, you essentially have function types for your functions, and your functions can be analyzed to determine what constitutes a valid set of operations. You can assess your workflow for validity just on the basis of the allowed input and output types. Now, when I said pd.DataFrame, that's great, but it doesn't give you more information than that. Flyte does have a built-in type for a data frame, called FlyteSchema, which I'm importing in this code snippet.
With it, you can express the columns and the types of those columns. Here you can just read it for yourself: a is an int, b is an int. And the aggregate output type is all floats, since we're collecting aggregate statistics; the ints become floats when you aggregate them. But what if I want to validate data frame properties beyond data types? This is where Pandera comes in: a statistical typing and data testing tool. As you can see, you just install Pandera and you can get started. Pandera adds guardrails to your pipeline. If your pipeline is this suspension bridge that looks kind of dangerous, and your data frames are these people traversing this dangerous thing, Pandera is the guardrail that lets you know when someone has fallen off, hopefully that doesn't happen. It helps you keep your data frames as you expect them as they flow through the pipeline. I mentioned two terms, statistical typing and data testing. You might be familiar with these, but I want to flesh them out. Statistical typing: I don't want to say I coined the term, and I've seen other terms for it, but I think it's a good phrase because it captures the concept really well: specifying the properties of collections of data points. Contrast that with a single data point, say a row in a table with certain fields, where you can already do a bunch of really useful checks, and this is still very valuable: you can check the data types, value ranges if you know them ahead of time, allowable values for categorical fields, regex matching for string fields, and whether the nullability assumption of a field is upheld. But you can do more when you have a collection of data points. First of all, with a tool like pandas or Dask, you can apply those atomic checks at scale.
All the checks I mentioned on the previous slide: if you have vectorized implementations of them in some underlying framework, you can apply them to potentially very large datasets. But you can also do things that you can't do with atomic validations. You can check for uniqueness in a column. You can check for monotonicity: is a value always increasing or decreasing? You can compute descriptive statistics and check that they fall within some range you expect; maybe a column follows a standardized normal distribution, so you'd expect the mean to be roughly zero. You can check other statistical distributions, and also fractional versions of those atomic validations, like saying 90% of my data are not null. The second thing Pandera offers is data testing. This is a term I've also seen in various data science and machine learning blogs; I'm not sure how widely used it is, but I see it as validating not only the actual data but also the functions that produce that data. In production, in the real world, you're grabbing raw data from somewhere, transforming it, and then you have some transformed data that you want to use for a downstream process, and you want to apply validations to that transformed data. But in continuous integration tests, you might have test cases to exercise the valid code paths of that transform function, and the invalid code paths too. Essentially, by providing these test cases and applying validations to the output, you're making assertions about the function and what it does to those test cases. Maybe it raises an error, and you can test for that too.
Pandera in practice involves encoding assumptions about data frames as schema types. You can extend schema types with custom validation rules really easily, integrate data frame types into your pipelines, get informative errors when something goes wrong, and unit test your pipelines with hypothesis strategies. I'll go into each of these aspects in a bit more detail. Here we have the types you saw expressed with FlyteSchema. The syntax is slightly different, but if you're familiar with dataclasses or Pydantic, this might look a little familiar. The transform input has columns a and b, both integers. You do have to use these generic types; the syntax will change as the library evolves, but right now you specify Series of int. A few things to note here. Python's inheritance semantics work out of the box: TransformOutput inherits from TransformInput, so all the properties you define for the input also apply to the output. This is convenient for reducing redundancy and keeping your type definitions concise. Pandera is primarily a pandas data frame validation tool, though the scope is slowly expanding along with adoption, and pandas has this notion of indexes, which are basically keys into the rows of your data frame, so Pandera supports validating those as well. Finally, it supports regex column matching. Here I define this property called columns in AggregateOutput, and I provide an alias. Usually the alias points to the actual column name, but if I want to turn it into a regex pattern, I provide regex=True, and this validation rule, which just says these columns should be floats, is applied to whatever matches the alias. Extending Pandera's built-in validation checks is super easy: it's literally just a method.
You just reference the property you want to validate. It's a classmethod, so the first argument in the method signature is the class, cls, and the second argument is the series, the column you're trying to validate, in this case a. This validation rule just checks whether a is even. Since this uses pandas as the underlying validation engine, so to speak, you can go crazy: as long as you have the series, you can do whatever you want with it. There's a global check as well, a data frame check, which gives you access to the entire data frame. That's a way to encode assumptions that span columns; maybe you want to condition a rule on some other column. So this gives you a really flexible way to define validation checks that are custom to your use case. Just to highlight: there are built-in checks, like this one that checks whether each of these columns is greater than or equal to 0, and then, as I said, custom checks are just methods. You can easily integrate with your pipeline via a function decorator; this works both with methods and regular functions. If you decorate your function with the check_types decorator, at runtime it will apply the validation rules to both the inputs and outputs that are annotated with this DataFrame generic parameterized by the schema type. One of the things I really wanted to get right with this library is informative errors, because I had used similar libraries in the past, and when there's an error, the output might not be super informative, or it might be a data structure custom to that library, like a list of dictionaries or something. So I wanted to design the library from its birth around the idea that you're working with pandas: the errors and failure cases should be really easy to identify, and when possible, failure cases are given as a pandas data structure so you can further manipulate and inspect them. So if a column has an unexpected data type, it will complain.
In this case, the validation rule applies to the entire column, so it just says series a is expected to be an int but got an object. It'll also complain if a column is not present in the data frame, an obvious feature to have. And if any of the assumptions in your specified schema are violated, in this case, the presence of negative numbers and odd numbers in this data frame, then, as I said earlier, you can inspect the error, and the .failure_cases property gives you access to all the failure cases in detail: the specific failing value, the check that caused the error, and the index in the data frame. Quite neat. Cool. Now, one of the really unique parts of Pandera is its integration with the hypothesis library. Tools like hypothesis in other languages might be called QuickCheck; the general term is property-based testing. Instead of manually handcrafting test cases to unit test your functions, property-based testing lets you specify the types in your test case, and hypothesis will generate data according to that type specification, basically the contract, and feed it to your function, in this case an add function. Then you can make assertions about the result. In this really simple example, you add two numbers and test whether the result is what you expect. Just to reiterate the problem: manually written test cases are subject to happy-path bias. I don't know about you, but as much as I like testing, I can get lazy and just test the happy path and maybe a few of the error cases. Hypothesis, though, will really scour the entire domain of the contract you've specified in this @given decorator, and it'll easily find bugs for you. The way hypothesis puts it is that it finds cases that falsify your assumptions about the test.
This helps you further refine both your tests and the implementation you're testing. Pandera hooks into this functionality, so instead of handcrafting data frames, which can be quite tedious if you've ever done it, I don't recommend it, it's probably not good for your health, you just call this .strategy method with the size of the data frame you want it to produce, and you can go crazy testing the various data it generates, all of which is valid under the constraints of your schema. In this code snippet, I've deleted the function body, and my transform_data function just returns the original data. Obviously, it immediately finds a failure case: probably the first thing it generated didn't have columns c and d, which the output schema expected. So you can find the error, see what was wrong, and go back to your source code and fix it. One thing I want to call out about the Pandera project is a limitation: it operates on pandas data frames, so it comes with the limitations of pandas itself. If you want to validate very large data frames or datasets, that's currently not really possible. You might come up with a solution that distributes it with some other framework; for example, Fugue allows for this: you can apply the schema to the partitions of your distributed data frame and scale that way. Coming in the next release of Pandera is support for Modin and Koalas. It's a work in progress, we'll see where that goes, but this is a way to apply exactly the same schema specification to these other data frame libraries. Okay, so let's put it all together to build robust data pipelines. I'll show you the only meme of my presentation, the epic handshake meme: Flyte and Pandera. To do that, I'll train a very simple model with sklearn on the UCI heart disease dataset.
This dataset contains 13 predictor variables and one target variable: one if disease was observed, and zero if not. You can see the other attributes and their various distributions. I did a little bit of homework before this presentation, obviously, so I came up with a schema that I'll show you in a second. The high-level architecture is pretty much a cookie-cutter machine learning training pipeline. You have a process that fetches raw data from some source; it might cache it or store it locally so you don't have to keep fetching it. Then you parse the data, do some processing to get it ML-ready, and split it into a training set and a test set. You train the model on the training set and get a model back, then you evaluate the model on the test set and get some metrics that you care about. Okay, so now I'll show you Pandera and Flyte together. If you're interested, there's a Google Colab notebook here that I'm going to pop into; hopefully this doesn't break the presentation. I'll go step by step, though I won't go through every single aspect of the code, just to give you a sense. Hopefully the demo gods will smile on me today. Traditionally, you might document your features or variables in some kind of markdown file, writing down each variable and what it means, especially if it's encoded in some non-human-readable way. But in this case, I can encode what the data is expected to be straight into my schema type, a Pandera schema model. We're working to make this even better, so that you can add a title or description to each of these fields, but for now you can get a sense of what the raw data is expected to be, and this is an artifact you can use to enforce your assumptions about the data.
Just looking at age: it's an integer, and I provide some reasonable bounds on it, and I do the same for the rest of these variables. Typically, you'd want to mess around with the data first and visualize some things, but for the sake of time I'm going to go ahead and define this fetch_raw_data function. What it does is grab the data using pandas read_csv from a URL of data that is somewhat clean but still needs to be preprocessed. I have a note here saying, for the demo, remove these lines to see what happens, so this is actually a case study in data that isn't what you expected. You can see I've annotated the output here with the raw-data data frame type. Let's see what happens. Okay: invalid literal for int with base 10. There's a string here, "0.0", that Python cannot coerce into an integer. This literally happened as I was writing this demo. So first I convert these two columns: this ca column, but also the thal column, which had the same problem. Oh, and there's also a question mark in one of these columns. The error message is not super informative, but trust me when I say the question marks are in these ca and thal columns. So let me handle that. Next: I'm trying to convert a null value to int, which in pandas is not allowed, so I'm just going to drop those rows, since I can't really train on those data points anyway. And okay, cool: based on what I know about what's defined as valid data, this function passes. That was a very clean narrative just for the demo, but in practice it's a lot more iterative than what I just showed. Typically, as a data engineer or data scientist, you build domain knowledge about the valid domain and ranges of the various variables in your dataset. Moving on, let's look at the parse_raw_data implementation.
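The cleaning steps I just walked through might look something like this (a sketch, not the notebook's exact code; the ca and thal column names come from the dataset):

```python
import pandas as pd

def clean_raw_data(raw: pd.DataFrame) -> pd.DataFrame:
    # "?" placeholders can't be coerced to numbers, so null them out first,
    # then drop the rows we can't train on anyway, then coerce to int via
    # float (strings like "0.0" fail a direct int conversion).
    cleaned = raw.replace("?", pd.NA).dropna(subset=["ca", "thal"])
    cleaned = cleaned.astype({"ca": float, "thal": float})
    return cleaned.astype({"ca": int, "thal": int})

raw = pd.DataFrame({"ca": ["0.0", "?", "3.0"], "thal": ["6.0", "3.0", "?"]})
print(clean_raw_data(raw))  # only the first row survives: ca=0, thal=6
```

Each of those steps corresponds to one of the errors the schema surfaced during the demo.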
In this case, if you notice, the target has values that aren't just zero and one. The raw data actually had more granularity: values from zero to four, where one through four indicate varying degrees of heart disease onset. Those have meaning in the original context, but for this machine learning task we're just binarizing it. So we write down this parsed-data schema that inherits from the raw-data schema but overrides the target to be just between zero and one; scrolling back up, before it was between zero and four. I'm overriding that field, and in the function I transform the data so that the target is actually binary. I'm not going to do what this demo prompt here suggests. I mean, okay, fine, maybe I will. If I just return the data unchanged, as you saw earlier, it's going to complain, and it shows the first several failure cases, which makes it really clear: oh, there are values other than zero and one in the target column. Excellent. A little aside: for machine learning, you kind of want the labels to be balanced. In this case they're balanced-ish, so I'm not going to do anything else on the modeling side besides training the model. The next thing we do is split the data; I'm just going to run this first. All we're doing here is splitting the data into a training set and a test set, specifying some test size. Here you'll notice a feature of Flyte: since you need to express all the types of your inputs and outputs, you can use a NamedTuple if you want to output multiple things. In this case I'm creating a NamedTuple of the training set and the test set, which both have the same type. Since they're just splits, you could define different schemas for the training and test sets, but that's not really necessary here. The second-to-last step is to actually train the model.
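The binarization and split steps can be sketched like this (a simplified stand-in for the notebook's code; the function and field names are mine, and the split here uses a plain pandas sample rather than whatever the notebook used):

```python
from typing import NamedTuple

import pandas as pd

class DatasetSplits(NamedTuple):
    training_set: pd.DataFrame
    test_set: pd.DataFrame

def parse_raw_data(df: pd.DataFrame) -> pd.DataFrame:
    # binarize: grades 1-4 all indicate disease, 0 indicates none
    return df.assign(target=(df["target"] > 0).astype(int))

def split_data(df: pd.DataFrame, test_size: float, random_state: int) -> DatasetSplits:
    # deterministic split given a seed, so pipeline runs are reproducible
    test_set = df.sample(frac=test_size, random_state=random_state)
    return DatasetSplits(
        training_set=df.drop(test_set.index),
        test_set=test_set,
    )
```

Returning a NamedTuple is what lets a Flyte task declare two typed outputs instead of one.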
We have a helper function here that grabs the features and the target separately. Notice that this get_features_and_target helper is not a Flyte task; it's just Python code, so you don't necessarily have to type-annotate it. Here I'm training a random forest classifier on the data. I have this really convenient joblib-serialized file type, so I'm going to train this model, and it outputs a model file; under the hood, Flyte determines where that output goes. Just to prove to you it's a model, I'll load it up again, and you can see it's a RandomForestClassifier. The last step is to evaluate the model: the first input is the model we just trained, and the second input is the test set. Again, I'm just executing this Python code interactively in the Colab notebook. Okay, cool, we're looking at accuracy. It's probably not the best metric to use in this case, but it's just a demo. Accuracy on the test set is 82%. I don't know whether that's good or bad; I'd have to look at the prevalence of heart disease and probably a bunch of other things to make that determination. And finally, here's the pipeline that composes all of the tasks we just defined and puts them together. The inputs of this pipeline are some random states, random seeds for both the dataset splitting and the model initialization, and this runs the entire pipeline. You can see it's deterministic because of the seeds: I get the same accuracy each time. If I change a seed, let's see, I get a different accuracy, because it's a different train-test split, which results in a different model and a different test set. And finally, I'm crossing my fingers hoping this works, but this is sort of a mock: you can imagine this code snippet living in your pytest or unittest suite.
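The train and evaluate steps can be sketched like this (assuming scikit-learn is installed; the helper and function names are my stand-ins for the notebook's code, and the joblib file-handling is omitted):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def get_features_and_target(df: pd.DataFrame):
    # plain Python helper, not a Flyte task: split the frame into X and y
    return df.drop(columns=["target"]), df["target"]

def train_model(training_set: pd.DataFrame, random_state: int) -> RandomForestClassifier:
    features, target = get_features_and_target(training_set)
    return RandomForestClassifier(random_state=random_state).fit(features, target)

def evaluate_model(model: RandomForestClassifier, test_set: pd.DataFrame) -> float:
    features, target = get_features_and_target(test_set)
    return accuracy_score(target, model.predict(features))
```

In the real pipeline these would be @task-decorated, with the model serialized to a joblib file between the two steps.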
Here I'm creating a positive-examples schema and a negative-examples schema. This is a little bit hacky, but it's a way to ensure there's a dataset with at least roughly balanced labels. In the future, there might be a way to use a single schema with a validation rule saying the labels must follow some distribution. All this does is test my train_model function: before, I would have had to handcraft a dataset with all the columns, but here all I do is use hypothesis, and it generates the data for me. I can at least test that my train_model function runs and that I can generate predictions from it that are within the domain I expect, which is predictions of zero and one. Okay, so let's run this. Really, the best demo would be to show you an error, but this is going to take some time, since I'm also testing different random states: the hypothesis routine is looking through different random states and different generated data frames, so maybe we'll return to this. But the tests will complete. Okay, are you done? No? It's fine, trust me, it'll work. So that's really it: using both of these tools, you can have type-safe data pipelines and be more confident in the correctness of your code. To summarize: Flyte is a flexible, type-safe DAG framework for orchestrating data processing and machine learning pipelines. Pandera provides an intuitive interface for specifying complex data frame types, including statistical checks. Used together, data scientists and ML practitioners can be more confident about their code and know when data quality checks are violated at runtime. Next steps for you: if you're interested, you can give Flyte and Pandera a try today. Next steps for the Flyte and Union teams and myself:
On the Flytekit side: continue streamlining the user experience; add more integrations for libraries you might want to use in your data processing and ML workflows; support additional data frame types, like Dask, Modin, and Koalas data frames, perhaps; and improve the Flytekit-Pandera plugin, the thing that glues the two together, to support better error reporting and visualization. On the Pandera roadmap, as I mentioned earlier: scale up data validation, and in the process make it an extensible API for validating non-pandas-compliant data frame libraries. There are many out there where the abstractions are still not quite there, but in theory you could separate the schema specification from the validation engine, the thing that actually does the compute. Once that's fleshed out, you could support even lists of dictionaries or JSON objects. Then a pytest-pandera plugin for tighter integration with pytest, giving you some nice convenience features like better reporting when an error happens in your test suite. And finally, Pydantic and FastAPI integrations; these are two popular libraries that I think would benefit from Pandera integration. Cool. There are going to be two more talks at this conference relating to Flyte. The first, you can see the information here, is on an integration between Flyte and Feast, and then there's an integration between Flyte, Spark, and Horovod. And that's my talk, thank you for your attention. I'll be around for questions. Just to call out: the Flyte and Pandera repos are over here. We are doing Hacktoberfest 2021, so if you're interested in participating, head over to the Hacktoberfest 2021 channel on the Flyte Slack and we'll help onboard you and see if you're interested in contributing. Thanks so much.