Thank you for the kind introduction. Hello, I'm Holger Peters from Leander, and today I'm going to show you how to use scikit-learn's well-designed interfaces to write maintainable and testable machine learning code. This talk will not focus on model development or on finding the best algorithm; it will show you a way to structure your code so that you can test it and use it reliably in production.

For those of you who don't know it: scikit-learn is probably the most well-known machine learning package for Python, and it's really a great package with all batteries included. This is its interface.

The problem I'm talking about in this talk is supervised machine learning. Imagine we have a table with data, here on the left: a season column (spring, summer, fall, winter) and a binary variable encoding whether a day is a holiday or not. Each row is a data point, and each column is what we call a feature. On the right-hand side we have a variable that we call the target. It is closely associated with the features, and it is the variable we would like to predict from them. So features are known data; the target is the data we want to estimate from a given table of features. To do this, we have one dataset with matching feature and target data, and we can use it to train a model that then predicts.

The interface is as follows: we have a class that represents a machine learning algorithm. It has a method fit that takes a feature matrix named X and a target array called y, and trains the model, so that the model learns about the correlations between features and targets. Then we have a method predict that can be called on the trained estimator and gives us an estimate of y for the given model and the given features X. This is the basic problem of machine learning. There are algorithms to solve it, but I'm not going to talk about those algorithms. I would rather focus on how to prepare the feature data X, and how to do that in a way that is testable, reliable, and readable for software developers and data scientists.

I'm sure you want to see how this looks in a short code snippet, and it's actually quite succinct. In this example we generate some datasets, X_train, X_test, y_train, and y_test. We create a support vector regressor, an algorithm I take off the shelf from scikit-learn. We fit the training data, we predict on the test dataset, and in the end we obtain a metric: we test how good our prediction is, given our input features X_test. So this is a trained model in scikit-learn: very simple, very easy.
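[To make that workflow concrete, here is a minimal sketch of such a snippet. The synthetic data, the train/test split, and the untuned SVR are illustrative assumptions of mine, not the exact code from the slides, and the import paths are those of current scikit-learn releases.]

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# Hypothetical data: rows are data points, columns are features,
# and the target y is closely associated with the features.
rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

est = SVR()                   # an off-the-shelf algorithm
est.fit(X_train, y_train)     # learn correlations between features and target
y_pred = est.predict(X_test)  # estimate y for unseen features
print(mean_squared_error(y_test, y_pred))
```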
The big question now is how we can best prepare the input data for our estimator, because the table I showed you might come from an SQL database or from other inputs, and it sometimes has to be prepared so that the model gives a good prediction. You can think of this preparation a bit like preparation in a factory: there are certain steps that are executed to prepare the data, and you have to cut the pieces into the right shape so that the algorithm can work with them.

One typical preparation for a lot of machine learning algorithms is scaling towards a standard normal distribution. Imagine that your data has very high numbers and very low numbers, but the algorithm would really like values that are nicely distributed around zero with a standard deviation of one. Such a scaling can easily be phrased in Python code: X is an array, we take the mean over all columns and subtract it from X (so we subtract the mean of each column from each column), then we calculate the standard deviation and divide by it. Now each column should be distributed around a mean of zero with a standard deviation of approximately one.

I've prepared a small sample for this. Above you can see an input array X with two columns. Let's first focus on the rightmost column: a feature variable with the values 32, 80, and 31. Of course, in reality we would have far larger feature arrays, but for the example a very small one is sufficient. Then we apply our scaling, and in the end that column has values that are centered around zero and very close to zero.

Now I introduce another problem we have in data processing: missing values. Remember the first slide, where I showed you an example with weather data. Just imagine that the thermometer measuring the temperature was broken on one day, so you don't have a value there, but you would still like an estimation for that day. In such cases we have strategies for filling in this data. One strategy is simply to replace the not-a-number (NaN) value with the mean of the feature variable: you take the mean of historic temperatures to fill such a missing temperature slot. This is necessary because if you apply our algorithm of subtracting the mean and dividing by the standard deviation, what you get in this example is a data error from our code: NaN values simply break the means.

So I've prepared code that does a bit more than before. Previously we just subtracted the mean and divided by the standard deviation; now we also want to replace NaN values by the mean. The reason our code failed before was that taking the mean of a column containing NaN values numerically just yields NaN. So here I replaced NumPy's mean function by np.nanmean, which yields a proper value for the mean even with NaN values in our array X. Then we subtract the mean and divide by the standard deviation as before, and in the end we execute np.nan_to_num, which replaces all NaN values by zero, and in our rescaled data zero is the mean.
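[A sketch of the two scaling variants just described, under the assumption that the function names and the small example array match the slides; the names are mine.]

```python
import numpy as np

def scale(X):
    # Naive standardization: breaks as soon as X contains a NaN,
    # because np.mean and np.std propagate NaN through the result.
    return (X - X.mean(axis=0)) / X.std(axis=0)

def scale_with_nan(X):
    # NaN-aware variant: nanmean/nanstd ignore missing entries, and
    # nan_to_num maps the remaining NaNs to 0, the mean of the
    # rescaled data.
    X = (X - np.nanmean(X, axis=0)) / np.nanstd(X, axis=0)
    return np.nan_to_num(X)

X = np.array([[1.0,    32.0],
              [2.0,    80.0],
              [np.nan, 31.0]])
print(scale_with_nan(X))  # both columns now centered around zero
```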
So we have replaced NaN values by the mean. How does this new code transform our data? It actually seems to work pretty well: the same data example with the new code results in an array where both columns are distributed around zero with a small standard deviation. This is an example of the kind of data processing you might apply to your data before feeding it into the estimator.

This small example has a few properties that are very interesting. Going back to our example: we transform our array X, using the standard deviations and means of all columns, before we call the estimator's fit. But what about the next call, when we call the estimator's predict? There is also an array X fed into it, and we have to process the data that goes into predict in the same way as we transformed the data that went into fit. Why is this? Because our estimator has learned about the shapes and correlations of the data we gave it in fit, so the data at predict time has to have the same distributions, the same shape, as the data the estimator saw during fit.

How can we make sure the data is transformed in the same way? scikit-learn has a concept for this: the transformer. A transformer is an object with a notion of a fit step and a transform step: we can train it with a method fit, we can transform data with a method transform, and there's a shortcut defined in scikit-learn, fit_transform, that does both at the same time. What's important is that transform returns a modified version of our feature matrix X, and during fit it can also see a y.

Now we can rephrase our scaling and NaN-replacement code in terms of such a transformer. I wrote a little class; I called it a NaN-guessing scaler, because it guesses replacement values for NaNs and it scales the data. I implemented a method fit that contains the mean calculation, as you can see, and saves the means and the standard deviations of the columns as attributes on self. Then it has a method transform that does the actual transformation: it subtracts the mean, divides by the standard deviation, and replaces NaN values by zeros, since zero is the mean of our transformed data. Using this pattern, we can fit our NaN-guessing transformer with our training data and then transform the data we actually want to use for predict in the very same way.
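[Here is a reconstruction of that class, a sketch rather than the slide's exact code. Deriving from BaseEstimator and TransformerMixin, which supplies fit_transform, is my assumption about the implementation; note that this version deliberately reproduces the subtle bug we are about to find.]

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class NaNGuessingScaler(BaseEstimator, TransformerMixin):
    """Standardize columns and map NaNs onto the (transformed) mean."""

    def fit(self, X, y=None):
        # Learn per-column statistics from the training data only, and
        # keep them on self so transform can reuse them at predict time.
        self.mean_ = np.nanmean(X, axis=0)
        self.std_ = np.nanstd(X, axis=0)
        return self

    def transform(self, X):
        # Subtract the learned mean, divide by the learned std, then
        # replace NaNs with 0, the mean of the rescaled data.
        X = (X - self.mean_) / self.std_
        return np.nan_to_num(X)
```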
Another opportunity here: since we now have a nicely defined interface, we can actually start testing it, and I wrote a little test for our class. I think you remember our example array; I create a NaN-guessing scaler, invoke fit_transform to obtain a transformed matrix, and then start testing assumptions I have about the outcome of this transformation.

And now the issue: this test actually finds that our implementation is wrong. If I calculate the standard deviation for each column and expect it to be one, I realize that the standard deviation is not one. That has a very simple reason. Looking back at the code, I calculate the standard deviation of the input sample before I replace NaN values with the mean. In this example, the standard deviation of the input sample is wider than the actual distribution of the data after replacing NaN values with the mean, because the mean is in the center: we map NaN values to the center of the data, and that makes the distribution narrower.
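[A sketch of such a test, written pytest-style and assuming the NaNGuessingScaler class sketched above; the exact assertions on the slides may differ.]

```python
import numpy as np

def test_nan_guessing_scaler():
    X = np.array([[1.0,    32.0],
                  [2.0,    80.0],
                  [np.nan, 31.0]])
    X_t = NaNGuessingScaler().fit_transform(X)
    # The column means are zero, as expected...
    np.testing.assert_allclose(X_t.mean(axis=0), 0.0, atol=1e-12)
    # ...but this assertion FAILS for the column containing the NaN:
    # std_ was estimated before the NaN was pulled into the center of
    # the data, so the transformed column is narrower than expected.
    np.testing.assert_allclose(X_t.std(axis=0), 1.0)
```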
So if we want to fix this code, we have to think about this transform method, and the solution is to make two transformation steps: first a step that replaces NaN values with the mean, and then a second step that does the actual scaling of the data. We want two transformations, and scikit-learn has a nice way to do this: it offers ways to compose several transformers. In this case we use a building block (and I apologize for the low contrast on the slide) called a pipeline. A pipeline is like a sequential chain of transformers. During fit, when we are training and learning from a feature matrix X, we take the first transformer and invoke fit_transform to obtain a transformed version of the data; then we take our second transformer and also apply fit_transform to the result of the first transformation, and finally we obtain a dataset that was transformed by several steps. A pipeline can have an arbitrary number of transformers. At predict time, when we have already learned the properties of the data, in our example the mean and the standard deviation, we just invoke the transforms and get a transformed X at the end.

Building a pipeline in scikit-learn is pretty easy. There's a make_pipeline function; we pass it transformer objects and it returns a pipeline object, and a pipeline object is itself a transformer: it has the fit and transform methods, so we can use it instead of the NaN-guessing scaler I just presented. We could go back and rewrite that class into two classes, one doing the scaling and one doing the NaN replacement. Or, maybe someone has solved this for us already. Indeed, just as Python has batteries included, scikit-learn has batteries included, so we can use two transformers from scikit-learn's library. One of them is called the Imputer, because it imputes missing values; here a NaN is replaced by the mean. And then we have the StandardScaler, which scales the data (represented in this example by the red distribution) to a dataset distributed around zero. These two transformers can be joined by a pipeline.

Here you can see this: we just put together the building blocks we already have. We use make_pipeline and pass it an Imputer instance and a StandardScaler instance. Then, if we fit_transform our example array, we can actually verify that our assumption holds and the standard deviation is one. We could also check the means here, and add further tests.

We have wrapped the data processing in those scikit-learn transformers, and we've done it in a way where we can individually test each building block. Even if these were not present in scikit-learn, we could write them ourselves, and the tests would be fairly easy. I think this is the biggest gain we can get from this. So if you're leaving this talk and want to take something away from it: if you want to write maintainable software and avoid spaghetti code in your numeric code, try to find ways to separate different concerns, different purposes, into composable units. You can test them individually, you can combine them, and then you can write a test for the combined model. That's really a good way to structure your numeric algorithms.

In the beginning I showed you an example of a machine learning problem where we just used a scikit-learn estimator that we fitted and predicted with. Now I've extended this example with a pipeline that does the pre-processing: with make_pipeline we use the Imputer and the StandardScaler, and we can also add our estimator to the pipeline. Now our object est contains our whole algorithmic pipeline: the pre-processing of the data, the machine learning code, and also all the fitted and estimated parameters and coefficients present in our model. So we could easily serialize this estimator object using pickle or another serialization library, store it to disk or send it across the world into a different network, then load it again, restore it, and make predictions from it.
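[A sketch of this composition. The talk used scikit-learn's Imputer; in current releases that class lives on as SimpleImputer in sklearn.impute, which is what I use here. The tiny example array and target are illustrative.]

```python
import pickle
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X = np.array([[1.0,    32.0],
              [2.0,    80.0],
              [np.nan, 31.0]])

# Pre-processing alone: now the fixed assumption really holds.
preprocess = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
X_t = preprocess.fit_transform(X)
assert np.allclose(X_t.std(axis=0), 1.0)

# The full pipeline, estimator included:
est = make_pipeline(
    SimpleImputer(strategy="mean"),  # replace NaNs with the column mean
    StandardScaler(),                # then scale to mean 0, std 1
    SVR(),                           # the machine learning algorithm
)
y = np.array([0.1, 0.9, 0.2])        # illustrative target values
est.fit(X, y)

# The fitted object carries pre-processing and model parameters, so it
# can be pickled, shipped elsewhere, restored, and used for predictions.
restored = pickle.loads(pickle.dumps(est))
print(restored.predict(X))
```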
So, to summarize what scikit-learn and these interfaces can do for you, and how you should use them: we found it really beneficial to use the interfaces scikit-learn provides when writing pre-processing code. Use the fit/transform interface for transformers, and write your own transformers if you don't find the ones you need in a library. If you write your own transformers, separate concerns, separate responsibilities: scaling your data has nothing to do with correcting NaN values, so don't put them into the same transformer; write two and compose a new transformer out of those two for your model. If you keep your transformers and your classes small, they are a lot easier to test, and if tests fail you will find the issue a lot faster when they are simple. And use features like serialization, because you can actually quality-control your estimators: you can store them and look at them again in the future. It's really handy.

In this short time I was not able to tell you everything about the composition and testing you can do with scikit-learn, so I just want to give you an outlook on what else to look at if you want to get into this topic. There are tons of other transformers and meta-transformers for composition in scikit-learn, for example FeatureUnion, where you can combine different transformers for feature generation. Estimators are also composable in scikit-learn: there are cross-validation building blocks, like grid search, that take estimators and extend their functionality so that their predictions are cross-validated according to statistical methods. A small sketch of both follows below.
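[A hedged sketch of those two composition blocks; the PCA feature branch and the parameter grid are illustrative choices of mine, not from the talk.]

```python
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# FeatureUnion applies transformers side by side and concatenates their
# outputs, e.g. standardized features next to a PCA projection.
features = FeatureUnion([
    ("scaled", StandardScaler()),
    ("pca", PCA(n_components=2)),
])

# GridSearchCV wraps an estimator (here: a whole pipeline) and
# cross-validates every parameter combination in the grid.
search = GridSearchCV(
    make_pipeline(features, SVR()),
    param_grid={"svr__C": [0.1, 1.0, 10.0]},
    cv=5,
)
# search.fit(X, y) would then select the best C by cross-validation.
```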
So I'm at the end of my talk. Thank you for your attention. I'm happy to take questions, and if you want to chat with me, you can come up to me anytime.

Q: Could you please describe your testing environment? Do you use a standard library like unittest, something like that?

A: Basically, we use unit testing frameworks like unittest or pytest; I personally prefer pytest as a test runner, and we structure our tests like we would structure unit tests in other situations. In its most basic form, testing numeric code is not fundamentally different from testing other code. It's code, and to test it you have to think of inputs and outputs, and you have to structure your code so that in most cases you don't have to do too much work to get a test running. We have some tools to generate data, and some tests go more in the direction of integration tests, but in general we just use the Python tools that non-data-scientists also use. Other questions?

Q: Do you also apply the transformations once you have done all the training?

A: If I understood the question correctly, you're asking whether we also apply the transformations to the test data, the data that is passed to predict in the first example. Yes, we do; that is exactly the purpose of splitting the transformer into those two methods. I'll just pull up the slide again. The whole purpose of splitting fit and transform is that we can repeat the transformation in transform without changing the values of the estimated parameters, the mean and the standard deviation. If we executed the code in fit again, we would not get into our algorithm the same kind of data that the algorithm expects. Any other questions?

Q: How do you track your model performance over time? In some of our applications we have data going back for years, and we have models built on it. The assumptions, the underlying probabilities of the data, change; we're using mostly Bayesian models, and the underlying probabilities are changing, and we want to re-validate on previous datasets, or versions of datasets, to see how the models are overfitting or underfitting depending on what we have. Are you doing anything across versions of datasets to make sure your assumptions aren't missing things, or picking up new things you didn't have before?

A: So you're asking how we actually test the stability of our machine learning models. This is done with cross-validation methods. For sample datasets we have reference scores, and if the scores get worse in the future, then tests fail, basically, and if that happens, one has to look into why things are getting worse. There's not really a better way than using cross-validation methods. But it's more of a monitoring thing. This talk was about testing the code, whereas your question was rather about testing the quality of the model, so I think these are two different concerns. You could say they're complementary, yes, definitely.

Q: I just got curious: when you do this, what do you work in? Do you work in an IPython notebook, or in separate scripts? What do you use for this?

A: Personally, I'm not using IPython notebooks that much. I write tests in test files and execute my test runner on them, and then use continuous integration and all the tooling around unit testing. The IPython notebook is really great at exploring things, but it's not an environment for test-driven development; there's no test runner in an IPython notebook. And I think that all the effort I put into thinking up a test assertion that I could type into an IPython notebook is better spent putting it into a unit test and checking it into my repository, where it runs continuously, over and over again. So I really prefer this over extensive use of IPython notebooks. I do use them if I want to quickly explore something.

Audience comment: This is just an add-on, not a question. Your talk was about testing, and that is really great with these modules, these small units; but of course it's also important to have reusability, because then you can really change a model or apply it to different problems, reusing parts of your pipeline.

Any other questions? Okay, thank you. Thank you very much.