Thanks for joining me this afternoon. It's really wonderful to be back in Brno for DevConf. My name is Will Benton and I work at Red Hat, where I lead a team of data scientists and engineers. So that's all you need to know about me; I'd like to know a little bit about you. Who in here considers themselves comfortable with data science, or primarily a data scientist, or working on machine learning? Okay. Cool. How about people who spend most of their time developing software, or as a software engineer? How many people think of themselves as infrastructure experts or SREs? So we have a really great mix of people in the room.

The title of this talk is "Container Workflows for Data Science and Machine Learning," and it's a brief talk. My hope is that it will have something to offer each of you. If you're really comfortable with data science, you'll learn why containers are useful for what you do and about some high-level tools that will make your life easier. If you know a lot about containers, you'll learn how you can integrate data science and machine learning into your environment, really benefit people who work in those spaces, and make it easier to collaborate with data scientists and machine learning engineers.

To provide context for people who aren't as comfortable with data science, I want to start by looking at what data scientists do. Fundamentally, data scientists use statistics and machine learning to identify patterns in data and solve problems. But let's look at what that looks like with a concrete example. How many of you are already looking at a web page on your computer or phone? Be honest. Okay. So if you're looking at a web page, are you seeing an ad? Okay. The company that's showing you those ads is using machine learning somewhere to try to improve the chances that you'll click on those ads, so that they can make money and, I guess, keep showing you ads.

Let's say that we're a data scientist working on this problem. We want to decide whether or not to show a particular user an ad for a bicycle. We'll need to solve a few problems fairly early on, like how to define what succeeding at this problem looks like. Does it mean that we have a model for who's going to click on this ad that's better than some other model, or just better than random chance? Once we figure out how to evaluate our results, we can continue.

One of the fundamental problems that data scientists face is called feature engineering. This means taking all of the information that you have about a user and encoding it as a point in space, ideally so that similar users correspond to similar points. We call these points in space feature vectors. The information that we're encoding could be anything: it could be just the content on the page where we're going to show the ad, or we could be surveilling the user across the entire internet and building a profile of them so that we could target the ad to that individual. I hope we're not doing that in this room. But once we've figured out how to encode a user, or a page, or whatever information we have to target this ad, as a point in space, we can continue with the process of trying to understand the data. Ideally, this encoding is going to expose some structure in our data, so that when we look at users who are ultimately going to click on an ad (which we have as red points here) versus ones who aren't, we can see a clean way to separate them in space.
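To make feature engineering a little more concrete, here is a minimal sketch that isn't from the talk: the attribute names and the encoding are made up purely for illustration.

```python
import numpy as np

def encode_user(user):
    """Encode a user record as a point in space (a feature vector).

    The attributes and the encoding here are hypothetical; real feature
    engineering depends entirely on what data you actually have.
    """
    return np.array([
        user["pages_visited_today"],                        # raw count
        1.0 if "cycling" in user["interests"] else 0.0,     # one-hot interest
        user["minutes_on_site"] / 60.0,                     # scaled to hours
    ])

alice = {"pages_visited_today": 12, "interests": {"cycling", "news"}, "minutes_on_site": 45}
bob = {"pages_visited_today": 2, "interests": {"cooking"}, "minutes_on_site": 5}

print(encode_user(alice))   # e.g. [12.    1.    0.75]
print(encode_user(bob))     # e.g. [2.    0.    0.083...]
```

The hope is that users with similar behavior end up close together in this space.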
And in this case, we have a group of clicks in a certain area of this space of possibilities. So given a way to map from users to these feature vectors, and some historical data about what the users who clicked on an ad look like versus what the users who didn't click look like (basically a set of examples), we can construct a pipeline that processes the historical user data, generates feature vectors, and trains a model. Training a model means using a machine learning algorithm to automatically generate a function that can identify feature vectors that represent users who are likely to click on an ad and, by extension, ones who are not. Given that model, we can apply it to user data and identify which users we should actually show this bicycle ad to. It won't be perfect: it'll identify some users who aren't likely to click on the ad, and of course not everyone clicks on ads anyway; some people don't click on ads on principle. But if the model were perfect, it would mean that we had just memorized the data we'd already seen, and we wouldn't have generalized from it. We want to be able to make predictions about the future, not just the past. It's great to be able to predict the past accurately, but it doesn't do us a lot of good. So we'll trade a model that's perfect on the data we've seen for one that generalizes.

We can put all these steps together in a pipeline that winds up looking a lot like a software development pipeline. We start by codifying the problem and identifying the metrics that we care about. We move on to collecting data that's relevant, cleaning that data so that we can process it, and then turning it into feature vectors. As we go into the process of actually training the model, we may decide that a different feature engineering approach is going to get us better results, so we may go back to earlier steps in this pipeline at any point based on what we discover in another. Once we've trained and tuned the way that we're generating the model, we'll go on and validate it to make sure that it works well with data that it hasn't already seen. Again, this can cause us to reexamine earlier steps. Finally, we'll put the model into production as a service that an application depends on, and we'll continually monitor it to make sure that the data we trained the model on is still representative of the data we're seeing in the real world. For example, if we're targeting bicycle ads, something about the bicycle industry might change, causing a person who was unlikely to click on an ad at one point to click on an ad in the future. Maybe cycling becomes more popular because of a popular cyclist, or cycling becomes less popular because that popular cyclist is busted for doping. So every part of this workflow can feed back into an earlier part.

But the first few phases are special because they're focused on experimentation: really understanding the data and identifying patterns. They come before we put things into production, and data scientists often use a special environment for this kind of work. This is an interactive notebook; maybe some of you have seen these before. Basically, the idea is that we have a literate programming environment where we can combine prose explanations with code, and we can execute this code independently. We can run functions and see what they return, and then we can go back and edit these cells, modify the code, and rerun it.
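In fact, the train-and-validate part of that pipeline is exactly the kind of experiment you might run in one of those notebook cells. Here is a hedged sketch using scikit-learn on synthetic data; a real pipeline would start from historical click logs, and the model and metric here are just placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for historical data: feature vectors for users,
# plus a label recording whether each user clicked on the ad.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 2))             # feature vectors
y = (X[:, 0] + X[:, 1] > 1).astype(int)    # 1 = clicked, 0 = didn't

# Hold out data the model never sees during training, so we measure
# generalization rather than memorization.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LogisticRegression().fit(X_train, y_train)

# Evaluate against the metric we chose when we codified the problem.
print("held-out AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```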
So it's really an interactive way to experiment with things and see how they work, and it can be a really productive environment for this sort of exploratory work. You also get plotting in these notebooks, which is really handy if you're doing numeric work or evaluating experiments, because you can take some results and visualize them pretty easily. We'll talk more about notebooks a little bit later.

Okay. So now let's talk about containers. We talked about data science, and now we're talking about containers. Who here uses Linux containers in their work? Keep your hand up if you're willing to give me a single-sentence definition of a Linux container. Just a couple of people. So I want to get everyone on the same page and make sure we're all talking about the same stuff. Let's quickly cover what Linux containers do and how they do it.

If you run a regular Linux process (say, installing a library), the Linux kernel is going to maintain some data structures with information about that process, like the binary that you're running, the command-line arguments, the environment, and so on. Each process is going to have a set of identifiers for files it has open. Since different processes can open and access different files, these identifiers are private to your process: my file descriptor zero is going to be different from your file descriptor zero. Each process also has a private view of a certain portion of the physical memory on your computer. Your CPU and your operating system cooperate to give each process a virtual view of this part of memory. What this means is that a given address in one program is not going to point to the same location in memory as that same address in another program. Does that make sense? This turns out to be really important, because it means that processes can't interfere with each other's memory: they have no way to refer to it. If you used microcomputers in the 1980s, or if you've been to a computer museum, then you're aware of a world where misbehaving programs could crash your whole computer, because every process could access all of the physical memory and overwrite memory that another process depended on. But once virtual memory became ubiquitous, personal computing became a lot safer because of improved isolation between processes. If you have no way to refer to the physical memory that another process is using, you can't read or interfere with that other process's state.

Containers take this idea of having your own private view of the inside of the process and extend it to things outside the process. Some of the kernel resources that we can access are namespaced, which means that we can restrict what processes see. The root file system is what your process sees as the slash directory; in a conventional process, that's also the root of the whole system, meaning that if you access the root, you're accessing everything that anyone else on the system can see. The process table controls your view of the other things that are running on the machine, and again, in a conventional process, you can see everything else. Finally, network routes control how you can interact with the outside world: can you connect to anything, can you use any network interface, can you accept connections, and so on. And in a conventional process, you get access to everything.
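As a small aside that isn't in the talk, one way to see that these namespaces are real kernel objects is to look at the links under /proc on any modern Linux system; which namespaces appear depends on your kernel.

```python
import os

# Each entry under /proc/<pid>/ns is a handle to one of that process's
# namespaces. Processes that share a namespace see the same identifier here;
# a containerized process gets different ones for whatever it's isolated in.
for ns in ("mnt", "pid", "net", "uts", "ipc"):
    path = f"/proc/self/ns/{ns}"
    if os.path.exists(path):
        print(ns, "->", os.readlink(path))   # e.g. mnt -> mnt:[4026531840]
```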
When we run a process in a Linux container, we're essentially saying that we want to do something interesting with those namespaces. We want to restrict how the process can interact with the outside world: maybe it can only access things within a certain directory, or only see a certain set of processes, or only use certain network routes. You can also impose limits on processes running in containers, saying they can't use more than a certain amount of memory or more than a certain percentage of a CPU.

So this is low-level kernel functionality; to actually use these things to build applications, we need some interface built on top of it. Container platforms provide ways to build container images that package up everything we need to run an application, so we can put it in a directory that doesn't have anything else in it and run that application. We start with a base image that provides just some very basic functionality, like a C library and a shell. Then we run some configuration recipes to install our dependencies. Finally, we build the application that we want to deploy in that environment, and that goes on top. All of these things are layers that are immutable, and you can refer to each layer by itself, which means you can always replay your application build at some point in the future.

When we combine these containerized processes into an application, we usually use what's called a microservice architecture, where our components are lightweight services that only communicate via well-defined interfaces. The cool thing about these microservices is that you can scale them up by running them on the same machine, or scale them out by running them on different machines, and you don't have to change anything, because they only use these interfaces to talk to each other. They don't depend on running on the same machine or having access to the same files. If one of them turns out to be a bottleneck, you can replace it with a bunch of copies of itself running behind a load-balancing proxy.

To actually set up the configurations that our applications will use, we use what's called declarative configuration. This means that instead of telling our container platform how we want to set up an application and what we want to make it do, we just say: this is what it's going to look like when it's ready to run, and this is how these services connect together. This service needs to be able to talk to that service; that service needs to talk to another one. And we let the container platform do the work. This enables creating an application in one place, like on my laptop, and running it anywhere: in the data center, on someone else's laptop, in the public cloud, wherever you want. And the cool thing is that you can define connections between the different services inside your application, but you can also say that you want to publish one of these services to the public internet by exposing a route to it. So if I visit a URL, I can say that it's going to go to a certain part of my application.
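The talk doesn't show a concrete configuration, but as a hedged sketch of what "declarative" means here, you can describe what a deployment should look like as data and hand it to the platform. This sketch uses the Kubernetes Python client rather than the YAML you'd more typically write, and the image name, labels, and namespace are all hypothetical.

```python
from kubernetes import client, config

config.load_kube_config()  # assumes a kubeconfig pointing at a cluster you can use

# Describe what the running application should look like: three replicas of a
# hypothetical image. We don't script *how* to start them; the platform does that.
deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="ad-model"),
    spec=client.V1DeploymentSpec(
        replicas=3,
        selector=client.V1LabelSelector(match_labels={"app": "ad-model"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "ad-model"}),
            spec=client.V1PodSpec(
                containers=[client.V1Container(
                    name="model",
                    image="quay.io/example/ad-model:latest",   # hypothetical image
                    ports=[client.V1ContainerPort(container_port=8080)],
                )]
            ),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="my-project", body=deployment)
```

The same description is usually checked into the repo as YAML; the point is that you state the desired end state and the platform converges on it.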
So the last thing that container platforms provide that's really interesting from a data science perspective is continuous integration and continuous deployment. How many people have developed an application, given it to someone, and had it not work? So, yeah, maybe never; you're all more disciplined than I am. Continuous integration means that when I'm working on my application, I commit it to source control and there's an automated build in a controlled environment. If my application builds successfully, then I can create a container image and a deployment specification and automatically put that into production. So it's not a matter of "it works for me but it doesn't work for you"; it's a matter of "it works for the continuous integration system, so we know it will work anywhere," and we have a deployment that we can just install somewhere else.

So containers are awesome for applications because they provide these primitives that are really useful, but they're also really useful for data science workflows, and in the rest of the talk we're going to look at why that's the case. First, though, I want to talk about what we're not going to do. A lot of material on containers for data science starts with the assumption that the only thing missing from a data scientist's life is a bunch of YAML and Dockerfiles. I think most people who work in data science don't want to become infrastructure experts, so we're going to focus on high-level tools that make these sorts of workflows easier.

A big problem we're going to look at is the problem of reproducible research. We talked earlier about how data scientists do a lot of work in notebooks. Notebooks are really great for experimenting, but the notebook itself is not a great artifact for sharing or publishing. Imagine if you did software development by sending someone a screenshot of your terminal window as you're editing a program. That's basically what you're getting with a notebook. So instead of a notebook that I might give to two different people and get two different results, or maybe an error, I want something that I can build once and send to anyone and know that I'm going to get the same results everywhere, so that this work is actually reproducible.

A really easy way to do this is to use a service called mybinder.org. It's a hosted service; you don't actually need to do anything to use it. You just give it the URL of a Git repo, it builds a container for you, and it starts a deployment on Kubernetes that you can use. And it's open source, so you can use the service that someone is hosting out of the goodness of their heart, or you can run it inside your own company. Just by putting a notebook in a Git repo, you can share it with anyone in the world.

Binder is great for that reason, but a source-to-image build actually provides a framework for making services like Binder in a flexible way. Basically, the idea is that we have a builder image that knows how to turn a Git repository of something, whether that's source code, notebooks, or whatever, into a container image that we can run on Kubernetes. In this example, I'm going to use a builder image that knows how to take a repository of notebooks and turn them into a notebook server container image. By using this image, I can package up notebooks and make them reproducible without having to become a packaging expert. So let's see that in action and do those builds; here's what it looks like with the source-to-image tool. I can use this s2i command-line tool to build images no matter what Kubernetes distribution I'm using: I specify a source repository, a builder image, and optionally a tag, and then I could push the resulting image to a registry like Quay.
But because I'm using OpenShift here on my laptop, what I'm going to do instead is use a source-to-image build strategy, and that'll save us some steps in a quick demo. So I'm going to start this up and then watch the logs as they run by. Because I want to make this notebook reproducible, I'm going to start by downloading the entire universe and putting it in a container image; this is going to run for a while, and then we're going to have a notebook that we can run anywhere. Now, the cool thing about source-to-image is that it's not just a way to do something like this; it's a general framework for making things that work like this, so we'll see ways to be more general than just serving a notebook. The other thing I want to call out is that the builder image I've used here was developed by Graham Dumpleton, who is a Red Hatter. He's an OpenShift expert and a notebook expert, and because I used the image that Graham built, I could benefit from his expertise without bothering him, which, if you're an infrastructure expert, is exactly what you want: people benefiting from your expertise without having to talk to you. Anyway, this gets us the result of publishing a notebook.

So we just saw how I could use source-to-image to create an environment that has similar functionality to a Binder service and publish a notebook in a container on my laptop, without having to write any code or YAML or Dockerfiles to do it. But source-to-image is more general than that, because we can do anything that involves going from a Git repo to a container or to a service. A couple of examples I want to call out: the Seldon project is an open source project that has an S2I builder that takes serialized Python objects and turns them into services, so you can operationalize a model by putting an API on top of it. But let's look at a builder that does something a little more involved. Let's say we want to operate on a notebook that trains a model: loading a Git repository containing that notebook, post-processing the notebook to see which libraries it depends on, post-processing it further to generate a script that trains the model, running that script in the same kind of environment the application will ultimately run in, saving the model, and then creating a REST endpoint to serve the model. It's kind of a lot, and I don't want you to get lost during the demo. Everyone following? Okay.

So here I'm just going to fire off a build again using the build strategy, and this is an S2I builder that you can download from Quay. Basically what we're doing is taking a notebook that trains a model; while this runs, we'll look at what the notebook looks like. So I'm going to fire off the logs here and then tab over to examine the notebook. This is a notebook that you can view on GitHub. The idea is that we specify our requirements in a designated variable, so that we don't have to do any really exhaustive analysis of the notebook; we just say these are the libraries I'm going to depend on. We have some basic code that trains the model. This is a k-means clustering model; you don't need to know what that is, you just need to know that it takes a two-dimensional vector and returns a small integer. And then we have two distinguished functions called predictor and validator, and basically the predictor actually uses the model to make a prediction.
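Here is a hedged sketch of what a notebook like that might contain, showing both distinguished functions. The contract (the requirements variable and the predictor and validator names) follows the description in the talk, but the details depend on the specific builder image, so treat this as illustrative.

```python
# A notebook cell (illustrative): the builder reads `requirements` so it knows
# which libraries to install, without exhaustively analyzing the notebook.
requirements = ["numpy", "scikit-learn"]

import numpy as np
from sklearn.cluster import KMeans

# Train a small k-means model on synthetic two-dimensional data.
rng = np.random.default_rng(0)
training_data = rng.normal(size=(500, 2))
model = KMeans(n_clusters=4, random_state=0).fit(training_data)

def predictor(x):
    """Use the model to make a prediction: map a 2-D vector to a small integer
    (the index of the nearest cluster)."""
    return int(model.predict(np.asarray(x, dtype=float).reshape(1, -1))[0])

def validator(x):
    """Check that the input is what the model expects: a two-dimensional
    numeric vector, not a string or a vector of the wrong length."""
    try:
        arr = np.asarray(x, dtype=float)
    except (TypeError, ValueError):
        return False
    return arr.shape == (2,)
```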
The validator verifies that the thing you've passed into the model is the kind of thing the model expects. In this case, we have a two-dimensional vector, and we want to make sure it's not a one-dimensional vector or a string or something else. So I'm going to use the Insomnia REST client to interact with this thing I've just deployed: I pass in some arguments, a two-dimensional vector, and get back a small integer. But wait, there's more: I pass in a different vector and get back a different small integer. Now, what do you suppose happens if I pass in a three-dimensional vector? I get a descriptive error message. So you can use this sort of thing to operationalize models, enforce that people are doing error checking, and go from an exploratory environment to production very easily. Frameworks like source-to-image make it super easy for developers to publish applications, but they aren't just for applications; they really support a wide range of things.

All of the examples we've seen so far today are focused on developing analytic techniques rather than putting them into production. To talk about production concerns, let's look at the last two steps of the process and what it looks like to put machine learning into production. We've trained and validated a model on preexisting data; now we need to put it into production and continuously monitor its performance to ensure that it's still working as well as we expect. Predictive models are typically deployed as components of larger applications, and we need to manage their life cycles at the same time that we manage the life cycles of the other components of the application. And these applications are built by cross-functional teams who work on different parts of the problem. We've already spent some time talking about what data scientists do to find and exploit patterns in data, and that's an important part of the picture, but they depend on the work of data engineers, who make streaming and batch data accessible and consumable at scale, and of application developers, who tie all of this work together to provide a robust, scalable application with a great user experience. Some teams also have machine learning engineers, who focus on scaling the machine learning parts of the pipeline themselves and on model management.

The first community I want to talk about is called radanalytics.io, which is targeted at application developers and provides tooling and a downstream distribution to make it really easy to develop intelligent applications in containers on OpenShift. One of the cool things that radanalytics.io does is provide an S2I builder that takes a specification for an application and will deploy, in the same OpenShift project, both the application and a Spark cluster that it can use. So you have a compute cluster that's scoped to your application: when your application goes away, the compute cluster goes away too. It's not like you're borrowing time on a shared cluster; you just use OpenShift and manage all of it in one place.

Who here went to the talk where Vacek and Stevens talked about the Open Data Hub this morning? Okay, so watch the recording if you didn't raise your hand. This is a cool community project that shows how you can build a platform for data-intensive AI workloads in a cloud-native environment with a completely open-source stack.
The Open Data Hub sits downstream of projects like Strimzi, which makes stream processing manageable on OpenShift; Ceph, which provides scalable object storage; and JupyterHub, which provides an environment for collaborative notebook editing. You can deploy the Open Data Hub on your own cluster or use it in the public cloud.

The final community I want to tell you about is Kubeflow, and if you went to Marcel and Pete's workshop yesterday, you're already excited about Kubeflow. Kubeflow is targeted at machine learning engineers and provides a wide range of libraries and tools for deploying machine learning workloads on Kubernetes, including TensorFlow, JupyterHub, Seldon, and PyTorch. And this is really a great way to make these reproducible environments: if you want to deploy a machine learning workload on your laptop and then reproduce it in the public cloud, or across several clouds, it makes that very easy to do.

So I want to wrap up by quickly reviewing what we talked about today. We reviewed how Linux containers provide isolation between processes and discussed how container application platforms make it possible to develop applications composed of a bunch of containerized microservices. We learned about a data scientist's workflow to provide context for how containers can make it easier. We talked about source-to-image and some high-level tools built on source-to-image, which provide a flexible way to benefit from the work of packaging experts without having to become one yourself. We saw how source-to-image is general enough to support many use cases, including reproducible research, model serving, and compute cluster deployment. Finally, we learned about three great open source communities that are working in this space to enable intelligent applications, data engineering, and machine learning workflows, all in containers and all completely open source: radanalytics.io, the Open Data Hub, and Kubeflow.

I hope you're excited to get started with these projects and learn more about this space. If you scan this QR code with your phone's camera, I don't know exactly what your phone will do, but it may or may not take you to my personal website, which has some articles related to these topics. You can find me on either Twitter or GitHub as @willb, or email me at willb@redhat.com. I think we probably have time for questions. There is no talk up next, so please don't ask me 30 minutes of questions, but I'm happy to take some.

So that's just a Flask app. I mean, it's sort of a running joke in the machine learning community that you're just going to put a Flask app around your model, and that's just what that demo did. You could do something more scalable or more resilient in production, but this was just to show that S2I works for that application and that you can do a little more with S2I than people might think.

Thanks, Zach. Good question. So the question is: how do you monitor a model in production and verify that it's working as you expect? I think this is an interesting problem for a few reasons. The fundamental reason is that in a lot of cases you have no way of knowing the truth: you have no way of knowing whether you showed an ad to someone who was likely to click on it, because most people don't click on ads.
So you probably have to monitor something other than a comparison of results to the truth, but there are a lot of things you can monitor. You can monitor the distribution of the data that you're seeing, and you can monitor the distribution of the predictions that you're making. If these change significantly over time, if they diverge from what I expect (and there are metrics for measuring whether distributions diverge), then I know that the data I'm seeing in the real world is no longer similar enough to the data I trained the model on, and I have a problem. Now, how can we do this concretely? Well, a container application platform like Kubernetes or OpenShift has great support for metrics and monitoring time series, so really it's just a matter of exposing the metrics that you care about, whether that's the predictions you're making from a given model service, so that you can look at them as a time series or look at their distribution over time, or some way to characterize the data that you're seeing. I don't have a demo of that, but that's how you would go forward with things you already have.

Yes. Yeah, so that's a good question. The question is: with container images you have reproducible research because you always have an immutable image, but what if you're depending on data that's coming in from outside that container image? There are a few solutions to this problem. There's a really cool open source project called Pachyderm, which basically provides a versioned data store on top of another distributed file system, so that you can refer to something by a hash and know that you're getting a particular version of a particular dataset. There's also a project called Quilt, which is for managing and versioning datasets. In those cases, I think it makes sense to rely on something external and have those references as metadata in your container, but for small data you can certainly bake it into the container image as well.

Right, so the question is: if you start with a random model and you want to optimize the weights in the model so that you get to something you can use, how is that reproducible? In general, when you're using randomness in a program, you're not actually using a source of true randomness. It's not truly random; it's a function that has a very long period, a pseudo-random number generator. You start from a given value or set of values, and then you can replay a stream of numbers that satisfy certain statistical properties, and we call them random colloquially. So if you use the same seed, you'll get the same results, basically. And this has actually come up, because we've had notebooks that we distribute with output, and if there's some randomness in the notebook, we've always wanted to put a seed in, because otherwise people are very upset when they rerun it and get a slightly different answer from the output that was already in the notebook. Thank you so much.
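To make that last answer concrete, here is a minimal sketch (not from the talk) of seeding a pseudo-random number generator so a notebook produces the same "random" output every time it's rerun.

```python
import numpy as np

# Seeding the pseudo-random number generator makes the stream of "random"
# numbers reproducible: rerunning the notebook yields exactly the same output.
rng = np.random.default_rng(seed=1234)
sample_a = rng.normal(size=3)

rng = np.random.default_rng(seed=1234)   # same seed...
sample_b = rng.normal(size=3)            # ...same "random" numbers

print(sample_a)
print(sample_b)
print((sample_a == sample_b).all())      # True
```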