So yeah, hello everyone. It's really nice to be here today. So, this talk is about Docker containers for machine learning. I'd like to introduce myself a little bit: I recently graduated with my bachelor's degree in computer science at Delft University, I'm currently doing a master's thesis in data science at Delft, and I work as a machine learning engineer at ING Bank. ING is a Dutch bank, but we operate globally. This is my very first time standing here giving a talk, so it's quite nerve-wracking, but it's awesome to see all of you here.

So, what I'll be talking about today: I'd like to give some context first. What does it really mean to run machine learning in production these days? Then some concerns about machine learning in production, and maybe some concerns about machine learning in general. And then I'd like to take you on a journey, starting from a very simple model: we'll gradually use Docker to containerize this model, and finally make the image distroless.

So, machine learning in production. I think many of you have at some point made a machine learning model. Maybe some of you have gone the length to also encapsulate it in an API and then expose this API, so that you've got a service running, and whenever you send a request to this API your model makes a prediction. This is awesome. But at large organizations, this becomes more of an issue. If you've got many teams and many models all running at the same time, how can you really manage this? We've got tens or maybe hundreds of models all running at the same time, each model has its own name, and you want some uniformity here.
The way we solve this is by making some kind of platform: a platform where data scientists can send in their model. That means either some Python code, or Python code along with their pickled parameters. This goes into a specialized pipeline, and out of this specialized pipeline rolls a Docker image that we can run on top of the platform, and through some service discovery we can reach the right model. Now, each of these models really runs in its own environment, and this is an excellent use case for containers: maybe some models are made for TensorFlow, some models may be made for scikit-learn. So yeah, containers are an excellent solution for this.

Though there are some concerns about machine learning, and that is that machine learning models tend to handle quite sensitive data. Some of the features used in our models might identify people, and this is very concerning. So we really want to be more aware of where we are actually running our models: are these containers we're using actually safe to run? And as much as we try to make data anonymous, this is extremely difficult. On top of that, a machine learning model itself can contain sensitive information. Think of parameters; think of some word2vec model that has a dictionary mapping maybe a name to some feature, which is not quite desirable. So what I really want to talk about today is how we can make sure that at least the environment in which we run these models is a bit more secure, instead of just taking a random Docker image.

For our little model I'll be using scikit-learn, Flask, and of course Docker, to take you on a journey in which we start with a simple model, slowly build it further, and eventually land in a distroless Docker solution. So this is our little model. It's a random forest classifier.
We use the iris dataset. Extremely exciting. If you don't know what a random forest classifier is, don't worry: think of it as a simple machine learning model. If you don't know what the iris dataset is, just think of a simple dataset.

Now, we use Flask to expose this. Obviously, in a production environment you wouldn't really want to do it like this. You would do some validation, you would perhaps come up with a schema, you would use data frames as a way to communicate, you would use some libraries. But I want to keep things simple, so we do this instead. It's simply an API exposing a /predict endpoint, to which we can POST an array, and with this array we can make a prediction. And it works; well, at least when I tried it, it worked: if I use curl, I get back the prediction of our model.

So, let's dockerize this. We start by picking our base image; in this case, we'll use the python:3 base image. We copy in our requirements, in this case scikit-learn and Flask, run pip install on those requirements, then copy in the files we need, and do the exact same thing as we did before. If we run this, you'll also notice a -p flag. This is because Docker doesn't expose any ports by default; you need to explicitly tell Docker to do so. In this case, we map port 5000 inside the container to port 5000 outside the container. Here you go. I could have just copied the slides, but this is really what happened.

So now I'd like to talk a bit about how we can actually say anything about this image. How can we scan this image?
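To make the model concrete before we get to scanning, here is a minimal sketch of what it might look like, with the Flask wrapper omitted; the hyperparameters and the `predict` helper are my own illustrative choices, not necessarily what was on the slides:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Train a random forest on the iris dataset, as in the talk.
iris = load_iris()
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(iris.data, iris.target)

def predict(features):
    """Return the predicted iris class (0, 1, or 2) for one sample."""
    return int(clf.predict([features])[0])

print(predict([5.1, 3.5, 1.4, 0.2]))  # a typical setosa sample; prints 0
```

In the talk's version, this `predict` function sits behind a Flask POST route at /predict, so a curl with a JSON array comes back with the model's prediction.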
How can we analyze this image for security vulnerabilities? There are many, many tools available for this. There are tools that do dynamic analysis, which look at your running containers and verify whether anything is going wrong, or we can use static analysis: before we even run our image, we can do analysis on the file system inside the image itself, because in the end an image is simply a zip file.

So we use Clair. Clair is a way to perform static analysis, and I'd like to reiterate: this is not *the* way to do it. There are many ways you can do analysis; in this specific example, I really like using Clair. There's a nice integration called clair-scanner, which basically allows you to run Clair in the background and then perform a nice little command in which we simply specify which image we'd like to scan.

Now, this is the result. We see the vulnerabilities broken down by severity. This is for the python:3 image we used earlier, with our model on top. Now, this doesn't necessarily mean that the Docker container we're using is vulnerable or unsafe to use. However, as a larger organization, for compliance and everything else, you don't want this; you don't want to have to explain this. So instead, what we could do is reduce our image: strip down this Python image we were using originally and take only the things we really need, which leads me to the next point.

There are some other not-so-nice things as well: the size is quite large, 1.1 gigabytes, and someone can attach a shell to this Docker container and execute commands inside it. Now, this is difficult to prevent, but you don't really want this. Ever.

So, distroless images are basically images that try to have the most minimal set of absolutely needed applications, services, and dependencies such that your application can run. This is a quote I took from Google's distroless repository, and there they
also mention that, okay, it doesn't have shells, it doesn't have package managers, etc.

So I'd like to go further. I'd like to take the model we had earlier and now use a distroless image. In this case, we use the Google-supplied distroless image. This is not *the* image to use; most of the time you would actually want to make your very own distroless image, but for these slides I prefer to show a quick example. So here we use the Python 3 distroless image, and when you run pip install, you'll notice that pip is not found. Because we don't have a package manager, it's suddenly more difficult to get these dependencies in there.

Now, thankfully, Docker has a very nice way to solve this issue: we can use multi-stage builds. A multi-stage build is simply where we take one container, do all the work there, and then have another container and copy over all the files we need into the new image. And it looks a little something like this. We take a Python image again, in this case Python 3.5 to be more specific. We copy in the requirements again, and now we can run pip install, because pip is present in our bigger base image. Then we use the distroless image, but instead of running pip install, we copy the files from the other image into our newer distroless image. Because these distroless images are so small, we might run into some configuration issues; here you also see that we have to set a flag for UTF-8, otherwise Flask doesn't run so nicely.

But yeah, if we scan this now, we get this, which is what happens if you use matplotlib and you don't plot any data. That's because there are no vulnerabilities. That's not to say that this image is super safe and you should totally use it, but it's nice, because now if Clair finds one vulnerability, the noise is gone. The noise turns into a signal.
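The multi-stage build described above might look roughly like this in Dockerfile form; the exact paths, file names, and the distroless tag are my assumptions, not copied from the slides:

```dockerfile
# Stage 1: a full Python image, which still has pip
FROM python:3.5 AS build
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Stage 2: the distroless image, with no shell or package manager
FROM gcr.io/distroless/python3
# copy the installed packages over from the build stage
COPY --from=build /usr/local/lib/python3.5/site-packages /usr/local/lib/python3.5/site-packages
COPY app.py /app/app.py
# distroless ships almost no locale configuration; Flask wants UTF-8
ENV LANG=C.UTF-8 LC_ALL=C.UTF-8 PYTHONPATH=/usr/local/lib/python3.5/site-packages
ENTRYPOINT ["python", "/app/app.py"]
```

The key idea is that pip only ever runs in the first stage; the final image receives the already-installed packages as plain files.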
We can use that signal to see what's wrong with the image. Furthermore, the image size has been reduced quite a bit: we've gone from 1.1 gigabytes to 250 megabytes, which is quite a significant reduction. Also, manipulating this container has become a bit more difficult, because we no longer have all these nice helper commands inside our Docker image. As you can see, we can still attach a shell, but listing files is not found.

But we can do a bit better, and this is a bit experimental. We could perhaps make the image a bit smaller still, because Python modules themselves could contain vulnerabilities. What if, instead of only reducing the set of Linux dependencies or Python dependencies, we also reduce the modules themselves? For this we can use PyInstaller, for example. This is a bit experimental, like I said, because PyInstaller is a bit difficult to work with in a production environment, and it also generates executable files which might trip some security scanners: the pattern PyInstaller generates flags some security tools, because it has sadly been used a lot maliciously.

So, coming back to the app: a small change here. I didn't really like the flask run bit, so I just created a main method. Now, it's important that we also upgrade pip, and it's important that we upgrade setuptools, for PyInstaller at least, otherwise we run into some issues where the existing dependencies in the python:3 container fail.

Now, if you run this, we suddenly hit an issue: PyInstaller doesn't always spot all dependencies. Here you see that cython_blas is not found. This is a bit awkward, because we have to keep going, specifying all the missing dependencies. However, in this case, I did do that.
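Declaring such a missing module by hand can be done on the command line with PyInstaller's hidden-import mechanism; the module path below (`sklearn.utils._cython_blas`) is my assumption of the one mentioned, as it is a well-known hidden import that PyInstaller's static analysis misses for scikit-learn:

```shell
# tell PyInstaller about an import its static analysis did not find
pyinstaller --onefile \
    --hidden-import sklearn.utils._cython_blas \
    app.py
```

Each run may reveal another missing module, which you then add as a further --hidden-import; once there are several, it is cleaner to collect them in a .spec file.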
So, we can help PyInstaller a bit by giving it a specification file. Here you can see that we help PyInstaller find some missing dependencies. I went through the process of running it five times, and every time it reported something missing, so in this case you can see that I specified some more files. Now we run it, it all works great, and the image has shrunk even more, to 97 megabytes.

Now, we could keep going on and on, because we could also strip down the distroless image itself a bit more: we could package only Python alongside our PyInstaller build. But this becomes quite a complex process. We could use a scratch image, where you completely build your own distroless image from the ground up.

So lastly, some Docker tips. I've been running my containers as root. This is not smart. Don't run your containers as root. Please don't do that. That's perhaps the most important thing, before even considering going distroless: don't run as root, ever. Also, use image hashes instead of tags: instead of python:3, use the SHA digest, the hash that you can find in your container registry. And don't use the existing distroless images: build your own distroless images. Building your own distroless images is quite a hassle, but that's a talk on its own. And also, if you're a larger organization, you might want to sign your Docker images: validate who made these Docker images, so that you can verify that these images were actually made by someone within your organization.

So, to summarize: be very careful about which images you choose for your models. You don't want to use just any container; you want to be a bit more careful about this selection.
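Two of the Docker tips above can be sketched in Dockerfile form; the digest is an obvious placeholder to be filled in from your registry, and the user name is arbitrary:

```dockerfile
# pin the base image by digest instead of a mutable tag
# (replace the placeholder with the real sha256 digest from your registry)
FROM python@sha256:<digest-from-your-registry>

# create an unprivileged user and stop running as root
RUN useradd --create-home appuser
USER appuser
```

The Google distroless images also ship nonroot variants that configure a non-root user for you.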
You might want to make your own images to ensure that all of this stays under control, and by using distroless images we limit the surface on which we can have vulnerabilities. So, thanks so much; here are some of the tools I used.

All right, fantastic. Thank you so much. Do we have any questions?

Hi, thank you for your presentation. I wanted to ask: what if it's really hard to strip out our dependencies? Because if we include NumPy, then we have to have BLAS installed for it to work faster. Can we even do something then to strip it all?

So yeah, you could entirely build your own distroless images from the ground up, and this is quite difficult, because someone has to manage this, right? If you're in a team, someone has to take on the responsibility of making sure this image is up to date. In the end, an image is just a massive zip file containing an entire file system. It's definitely possible to make these very minimal machine learning distroless images, specialized for perhaps TensorFlow or NumPy, but again, it's a lot of effort. It's definitely possible, but with the side effect that you need someone to manage it.

Okay, thank you.

Hello, thanks for the presentation, one question. You suggested using image hashes instead of tags. Why is that?

So, right now, if I take Python version 3 from Docker Hub, I wouldn't know which Docker image I'm actually using. It's for reproducibility. Someone might update python:3; someone can push a new image under that so-called version and name. So what can happen is that maybe it works once, but then someone pushes a new version, I run it again, and it doesn't work anymore, because something changed in between. But if you use the hash, you point only to that specific version, that specific build of the image.

Thanks for the talk, very interesting. What do you think about Alpine-based Python images?
Yeah, so Alpine images are very small, very nice. They actually get pretty close to what the distroless GCR images offer. I also ran some scans on them; they also contain very few vulnerabilities. I wanted to maybe show some of that in the slides as well, but Alpine images are really nice. They are not distroless, but they are very small. So I guess if you cannot use distroless, perhaps use Alpine images, or smaller images like python:3-slim, which are just reduced images. I do like Alpine images and I use them personally. You can definitely use them, but they're not truly distroless, in the sense that you completely strip out everything that's not needed, because inside Alpine images there's still a shell; there are still all these things you need for a functioning operating system, and they also have vulnerabilities, of course. Like recently, this thing with shadow, where you could become root inside your container. So, yeah.

Any more questions?

Okay, so I wanted to ask: when you mentioned building it from scratch, how different is it to build from scratch versus from a distroless image? What more does the distroless image have?

So, the distroless image has a lot of nice things, like certificates, users, privilege handling, a lot of very nice features, and if you build from scratch you would have to get them yourselves. There's just a whole list of things you would want to copy into your from-scratch image. But that doesn't stop you from making a scratch image and just really carefully looking through what you need. So, do you need users? Just slowly build up this list. And if you were to use a scratch image, you can still use a multi-stage build, where you build all the things you need and copy them into your scratch image. But like I said, you need someone to really maintain this.
It's a large job.

Thank you for the talk, one more question if you don't mind. So, I think we understand it takes a bit of additional effort to slim down your image and also to secure it. But in an enterprise environment, what's your perception, what was your experience: what's the right balance between taking that additional effort and stopping somewhere? Would you just take it to step two, step three? Where would you stop?

So, I think the right balance, for a larger organization, is for sure to have their own managed Docker containers: their own base images, or distroless images, whatever you want to call them, as they have the resources to do so, and for compliance and all these other things that come with large organizations. You definitely want at least these managed images. Well, not the PyInstaller part, because that's kind of experimental. And you would also use your own scratch images, because you need to know what goes into the image, and you need to be very strict about what is and isn't allowed in it. So yeah, that's for larger organizations.

Okay, thank you.

Do we have any more questions?

I'm just curious what kind of use cases you're deploying these machine learning solutions for. ING is a bank, right? Can you tell me a bit about that? I presume it's hosting some kind of API that gets called?

Yeah, so we have a machine learning platform, and on this platform you're able to say: okay, I want this model, I make this prediction with this model. We could be running tens or a hundred models at the same time, but the use cases for these models vary enormously. For example, we could be looking at some natural language processing, or... yeah, for a bank, they're all over the place. I could go into very specific details, but then... maybe we can talk after.
Yeah, okay, thanks. Any more questions? Okay, then let's give another round of applause for Thomas.