Okay everybody, let's welcome our next presenter. This is Matt Micene, a senior evangelist on Linux and containers, and he's going to tell us about containing complexity with reproducible workloads.

Dobré poledne, moc děkuji ("good afternoon, thank you very much"), and that's all of my Czech. Thank you very much. Yeah, I have a big long title, senior evangelist, blah blah. It really means I talk to people about what they want to do with Linux, and then I show people some of the interesting things people are doing with Linux. And we're going to talk about containing complexity, which really should be "for" reproducible workloads rather than "with", but, details.

So, how many people in this room know what containers are? Good, we're in the container track; I'm in the right room. Probably most of what you've heard today, on the internet, at DockerCon, at KubeCon, at Red Hat Summit, at FUDCon, wherever you go, you hear microservices, DevOps, CI/CD. These are the things most people talk about when they talk about container workloads. What goes inside a container? Three-tier applications, 12-factor applications, 12 steps to CI/CD, and massive scale with OpenShift. Right? This is not what I want to talk about today.

I want to talk about the batch job. And I don't mean the batch job from 1960s mainframe computing, exactly. I'm talking about single-run compute processes that do a thing on data and then exit. They could be massively parallel, but they're generally one-shot jobs that, run repeatedly, deliver a single set of output: some sort of data result, a web page, whatever that is. These are quite common sorts of jobs. We're talking about things like financial reporting: quarterly jobs that have to run to make sure the books balance and to put some sort of audit trail in place. Some sort of data analysis and modeling: data science is huge right now, lots of data science goes on, and we'll talk about how we can use containers to run some of these jobs. The same with bioinformatics, genomics, scientific computing in general. Petrophysics, oil and gas, where we're starting to see a lot of these same sorts of jobs. Or anything that does modeling, anything that takes data, spits results back out, and generates pretty pictures, like fluid dynamics, some of my old customers from the US Department of Energy.

These all have the same sets of base problems we talk about when we talk about containers for web apps. Right? We've got library dependencies. We've got hardware dependencies. We've got sprawl across various kinds of development environments, because these things are written primarily in Python or in R for the most part. So, how many people here write things in Python? How many people have run into situations where your virtualenv works, to a certain degree,
but then you run into dependency issues within the virtual environment you set up precisely to avoid dependency issues? With these sorts of environments, many people resort to virtual machines to contain all of the dependencies and make sure they get a consistent environment.

We've got some additional complexity with these sorts of workloads, mostly because of the class of computing and the results they need. Environmentally, these can be very sensitive to changes. We want our results to be the same on the same set of data, no matter what, so minor differences in library versions, or differences in interpreter versions, can cause massive problems with reproducibility. Reproducibility in and of itself makes for a strong requirement that these things run exactly the same and produce the same output. "Works on my machine", for a data scientist, could burn at least a day if not a week simply trying to reproduce a result a colleague got from an experiment.

We have portability requirements. We need to be able to run these jobs in multiple different environments, possibly run by different entities. We need to be able to take an experiment built and run by, let's say, an observatory somewhere in the US, ship it to Tokyo to be run against a similar set of data to see how the analysis works somewhere else, and be sure it runs exactly the same in both environments.

Our users don't necessarily know all the ins and outs of building virtual machines and building systems. They'll generally know how to install packages via something like pip, or CRAN for R, but they're generally not going to be able to build and manage a virtual machine, which puts a lot of that on the developers, right, on operations, to make those available. It can be difficult for them to get into a virtual machine to try to do their work.

One of the more interesting ones is peer review. There was a movement a few years ago for an idea called executable papers: when you publish the paper and you publish your data, you also publish an executable version of your code, so it can be peer reviewed. The problem was that all of these environmental changes would affect whether or not it was actually reproducible. If we were instead shipping Docker containers with our code, and perhaps a Docker container with a volume for our data, it would be much easier to do a peer review that carries along all the necessary requirements.

Reproducibility is important. There was a recent study done on drug studies: there are 53 published papers on new regimens in drug treatments for cancer. Researchers went back through the original data and the original experiments and found that only six were actually reproducible from the published materials. So we have 53 landmark studies that cannot be reproduced. This is a huge issue in many, many different environments, for very different kinds of compute workloads that do some very critical things.

To see how we can use containers to solve some of these problems, we're going to use a basic pattern to build our platforms and build our containers, and we're going to leverage the concept that we've got immutable layers that we know run over and over within our environment.
Specifically, today I'm going to be talking about Docker and docker build, but this is true for any of our container environments. If we're using CRI-O, if we're using rkt: the image format doesn't matter, the runtime engine doesn't matter. We're just using Docker today as the example.

First we start with our base. The base is generally upstream, vendor provided. If you have an environment that needs very specific sorts of hardware enablement, it could be something your operations team puts together. But this is where we start: a very, very limited set of runtime environments, just enough to get off the ground. From there we add what we're calling our platform. The platform is where we start adding the most commonly required framework elements for, say, Python. This is where you add the things in most common use for your particular environment. And from there, end users can use that container as their downstream for their application-specific dependencies and their code. A fairly common pattern. Anyone already using this for general web applications? No? Okay, well, hopefully we'll have learned something today.
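As a rough sketch, with illustrative image names rather than the exact ones from the repo, that chain amounts to three small Dockerfiles:

    # Dockerfile for the base: vendor provided, just enough to get off the ground
    FROM registry.access.redhat.com/rhel7

    # Dockerfile for the platform: the commonly required framework pieces
    FROM science-base
    RUN yum -y install numpy scipy && yum clean all

    # Dockerfile for the end user: their dependencies and their code
    FROM science-platform
    ADD experiment.tar /tmp

Each file builds on the image produced by the one before it, so the lower layers stay immutable while only the top layer changes with the experiment.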
So today we're going to talk about some scientific computing in Python specifically. I'm going to build a certain number of containers, walk you through them, and show you what the Dockerfiles look like. Commonly, for a data scientist, we're going to start with RHEL, because I have access to it and I built this for another customer. Generally there are two competing (competing is a bad word), two different options in major use for linear algebra libraries: one called ATLAS, one called OpenBLAS. This would be a complicated decision for a scientist to have to make, knowing how to select a linear algebra system within their code, so it generally winds up being a place where different virtual machines get made available to support the different linear algebra systems. Both work with NumPy and SciPy, which are very common Python mathematics libraries. We'll then add a layer for scikit-learn. scikit-learn is a machine learning library within Python that actually has some traction in data analysis, and in some other languages, because of the way they do matrix math and some other things. And then we'll drop a data analytics workload into that as an end user.

All right. So we've got our vendor-provided platform, where we'll install certain levels of required things. We'll build our internal platforms, talk a little about inheritance and how we can layer these to provide different experiences, and then we'll show how an end user might use these to do their daily work as well as some production-value work.

Okay. Most of this is going to be a demo, because it's not a particularly complicated story to explain, but hopefully we'll show you some value and how this works. This is all in a GitHub repo: nzwulfin is how you find me on GitHub, and the repo is called contained-science, so if you want any of the code, it's there.

Anyone heard of Kaggle? No one? Good. Kaggle is a public data science competition environment. People publish various sets of data they want analyzed, they put competitions together, and it's also a place for folks doing data science to share ideas and collaborate on various environments. I'm using a Red Hat competition that closed in September of last year, and that was actually not intentional. I went to Kaggle looking for scripts I could download for Python and for R, and it just happened that the two I found that were Apache licensed, that I could reuse and remix, were Red Hat based. So I wasn't looking for Red Hat material on purpose. There's a set of data, training data, and we're trying to do some area-under-the-curve sorts of analyses on the behavior of people based on that training data.

All right, let's switch over to a terminal and I'll show you what some of these things look like live, if I can get the terminal over there. You don't need to come to the front. Okay, sorry, I forgot to unpause the virtual machine, one second. Everyone can see that? Sure, I can make it bigger.

Okay, so this is the basic project. We'll go into the Python directory and take a look at some of our Docker containers. This was one of our base platform containers. For this particular project I used the RHEL 7 upstream directly, so I don't have anything fancy here. For this particular user we're going to install these packages: our NumPy, SciPy, and ATLAS, which is just distributed as BLAS within the current Red Hat family. We're also going to make sure we have everything we need for pip, so our developers have an easy way to add additional libraries to their environment without having to worry about whether something is available, or about installing GCC and those sorts of things. We clean up after ourselves, and then the Docker side says that our ENTRYPOINT, the thing we're going to run, is the Python interpreter, and our CMD, just to have something there, spits out the help text. So if I run it, it just spits out Python's help. We'll talk in a second about why we do that.

Our next layer says we're going to add the scikit-learn packages, so we can do a little deeper analysis on the data files later. This inherits everything we've got in our ATLAS container, including pip from the previous layer, so we can just say: install pandas, install scikit-learn, and install XGBoost, because they wanted matrix math that's a little different from what's available in the machine learning kit. Again, we use common things our data scientists are already aware of. They know how to use pip in a virtual environment, but they don't have to learn yum, they don't have to learn any of the operating system pieces they'd need if they were working in a bare virtual machine environment or something along those lines.
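Reconstructed as a sketch rather than copied verbatim from the repo, and with illustrative image names (the package set assumes the right yum repos are enabled), those two Dockerfiles look roughly like this:

    # Platform layer: NumPy and SciPy against ATLAS, plus pip tooling
    FROM registry.access.redhat.com/rhel7
    RUN yum -y install numpy scipy atlas python-devel python-pip gcc && \
        yum clean all
    # Everything runs through the interpreter; CMD just prints the help text
    ENTRYPOINT ["python"]
    CMD ["--help"]

    # scikit-learn layer: inherits the ATLAS stack, pip, and the ENTRYPOINT
    FROM science-base-atlas
    RUN pip install pandas scikit-learn xgboost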
Just to make things simple, I've said the environment is the same. Since I'm not changing the ENTRYPOINT, we can just inherit that from upstream, from the ATLAS environment. So we now have two layers of available platforms.

And this is what our experimenter environment would look like. We've got a pair of basic test scripts we're going to run for Python, tarred up into a single experiment delivery file, and very simply we say: start FROM that scikit-learn image we just built, make our work directory /tmp, and add our tarball, which gets expanded and made available. And you'll notice this does not have any sort of ENTRYPOINT; we use the Python entrypoint straight from our scikit-learn layer.
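A minimal sketch of that experiment layer, again with illustrative names:

    # Experiment layer: no ENTRYPOINT of its own, it inherits python from below
    FROM science-scikit
    WORKDIR /tmp
    # ADD expands a local tarball into the image automatically
    ADD experiment.tar /tmp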
So what that allows us to do is things along these lines, and this is all copy and paste, so excuse me a second. We know we have Python installed on this particular system, right? It helps if I hit the right keys. This laptop, this virtual machine, does not have NumPy on it. This is not a suitable environment for anyone to do development work in. However (apologies, you were not supposed to watch me type all this, I was supposed to copy and paste), within our ATLAS container we can see all of our NumPy configuration. You can see what we've built and how we've built it.

I've already gone ahead and built an OpenBLAS image, which is the other alternative we had for our linear algebra systems. So here, instead of the usual /usr/lib64/atlas, we're now using libopenblas, installed from source, and I'll show you that in a second. So we've got options, and as far as the data scientists are concerned, they don't have to worry about switching back and forth between ATLAS and OpenBLAS in code. They just select the right Docker container to start from when they build their experiment, and they get the right libraries. So far this is all fairly standard: no big deal, very small Docker containers, using things freely available in the standard RPM repos. And that's not very interesting, is it?

So let's take a look at OpenBLAS. OpenBLAS is something we're actually going to compile from upstream source, and we're going to look at how we might manage some of those environment issues, how we'd provide multiple versions of this to a user. Again, this is pulled straight from the RHEL 7 base, but the first thing I want you to notice is these three ARGs up top. People familiar with the ARG idea in a Dockerfile? One, two... okay, three. Arguments are exactly what they sound like: a way, within a Dockerfile, to pass things through the rest of the Dockerfile without having to screw around with environment variables and getting those picked up. They're locally scoped to the Dockerfile. We're going to use them to set up some labels, so we can inspect the image later and know what version we actually produced, because that's going to be important. We're also going to use them when we clone from GitHub, to make sure we check out the tag we're looking for. We install all the things we'd normally need to build from source (GCC, Fortran, setuptools, all that good stuff; git gets installed here), and then we make sure everything gets pulled from source at the right versions, so we tie together the correct version of OpenBLAS, the correct version of NumPy, and the correct version of SciPy as they get built together. That's always a big issue. And again we have the exact same Python ENTRYPOINT, so the OpenBLAS image just runs that Python help command as well.

The other thing build args let us do, back here... that ran much faster than I expected; these layers are all cached, so it's not actually going to do anything. So there was my docker build. Normally you run docker build with -t, and you can see the variables show up as arguments, and we grab and clone the right versions of all these things and build them. Note that the values don't get exposed in the output, so you have to be careful that you know what you're doing. But since these are settable from the command line as build args, we can override them at build time. If we set up our Dockerfile to take these sorts of arguments, we can simply pass the different values we need for different checkouts, to provide different versions.

The other thing we need to be aware of is that once we do that, we now have two different Docker images with two different versions in them. This is where not just labels but tags and tag maintenance become important. I tagged that one as v0.2.19, so if I look at what images we have available, we've got a latest, a v0.2.18, and the v0.2.19 I just built. But you'll also notice that latest is not latest: I built this originally on the 0.2.18 stream without giving it a tag. Docker's latest is always a convenience tag. If we want our latest to correspond to a particular version, we need to remember how Docker tags work, and not assume that latest will always point to the newest build.
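The shape of the trick, sketched with illustrative version numbers and label keys:

    FROM registry.access.redhat.com/rhel7
    # Locally scoped to this Dockerfile, and overridable from the command line
    ARG OPENBLAS_VERSION=v0.2.18
    ARG NUMPY_VERSION=v1.11.1
    ARG SCIPY_VERSION=v0.18.0
    # Bake the versions into labels so the image can be inspected later
    LABEL openblas_version=$OPENBLAS_VERSION numpy_version=$NUMPY_VERSION
    # Toolchain needed to build from source
    RUN yum -y install git gcc gcc-gfortran make python-devel && yum clean all
    # Check out exactly the tag we were given, then build and install
    RUN git clone -b $OPENBLAS_VERSION https://github.com/xianyi/OpenBLAS.git && \
        make -C OpenBLAS && make -C OpenBLAS install
    # NumPy and SciPy get the same clone-a-tag treatment against this BLAS
    ENTRYPOINT ["python"]
    CMD ["--help"]

And at build time the defaults can be overridden, with the image tag kept in sync by hand:

    docker build -t science-base-openblas:v0.2.19 \
        --build-arg OPENBLAS_VERSION=v0.2.19 .

    # Later, read the label back to see what an image really contains
    docker inspect -f '{{ .Config.Labels.openblas_version }}' \
        science-base-openblas:latest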
So someone's going to take a look at these and say, well, it should be 0.2.19, why is this not working? If we look at the latest version, it tells us the label is the 0.2.18 version. So when we start using versions and tags and these sorts of nifty tricks, we need to be careful that we use the right things. These tags are the ones that get set based on those build args. And this gives our experimenters and scientists more than the ability to look at tags: they can also use docker inspect to figure out which actual versions match, because those get picked up from the build args and should be correct based on the Dockerfile we built.

All right. So we've looked at how we might use these as inherited platforms, some of the issues around labels, and how we can use build args to make our work easier and to provide multiple versions of tools to our community.

One of the other things (I've got some of these pre-built, apologies) is that we can also work from alternate versions of base images. RHEL 6 and RHEL 7 are very similar; we've got distributions for NumPy, SciPy, and BLAS on both. But if we have people using older versions of software, we can make those available on newer versions of hardware and operating systems. If we have an existing set of experiments and derived reports all running on RHEL 6, we don't have to worry about completely migrating and rewriting them. We can just containerize them and run them in a new environment, something that may have access to faster networks, different disk, more CPU capacity, or is simply newer, and we isolate those changes to the experimental environment from the hardware layer. That lets us reproduce older experiments and older results on newer systems.

And just to show you that this stuff actually works as advertised, we'll play around with the environment. I mentioned we have some actual benchmarks and tests I put together from an upstream; I'm using the R benchmarks, which is just a standard benchmark. Um, I don't remember what I was about to say. We're going to use the R version, because the Python version takes about eight and a half gigs and about 20 minutes to run the same analysis as R, which is one of the reasons R is gaining popularity for data science over Python. So this runs through, saying: yep, this is our base container, it all works. It's just a standard R benchmark people use online for various kinds of things.

What I've got is the set of Python and R scripts, plus three different sets of data. We've got a training set,
we've got the actual test set we want to work against, and we've got a secondary correlation we're trying to get done. So we do have inputs and outputs. And that failed, because I didn't want to do that, um.

So what we're going to do is run the container and mount these input and output directories as volumes. We're running with SELinux turned on, which is why we need the little :z at the end of our volume mounts. It makes sure Docker sets the right contexts on those two directories. This pulls the two directories into the running container we built for the experiment, and... okay. Excellent, this is where it always breaks. This should not break. I know what I did, give me one second, sorry. All right, I apparently forgot to put my R script in the benchmark image. So you get a live demonstration, a very live demonstration, of how you might do this for an experiment. We don't actually have the R benchmark script in there, so let's skip ahead and just do the Python work. It's never a real demo unless something goes absolutely wrong, so at least I feel good about that.

Okay, so we've looked at how you might use these as an administrator: how we might provide environments to scientists, to data analysts, to various people attempting to have runtime environments that are consistent across multiple kinds of platforms. Has anyone actually tried to develop inside of a container? It doesn't work very well, does it? You lose... how do you actually get your running code in, test it, bug fix it, rerun it, those sorts of things, when you're working with your code inside a container? Pardon? You can mount it as a volume. That's exactly how we're going to do it today.

What we have here is the same environment, and this is actually why we set those ENTRYPOINTs straight into the Python interpreter. Typically the way folks build a local environment is: write a Python script, run python against the script, and have it execute. With the way we set up that ENTRYPOINT, we can set up an alias or something along those lines to make things easier, like this one. I've created a scikit-py alias, which runs docker exactly as was just suggested: it makes a volume out of the local current working directory and mounts it on /tmp. Remember, /tmp was our work directory, so anything we have locally we should then be able to run within this particular instantiation of the image.
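Spelled out, the two invocations look something like this; the image names, script name, and alias are illustrative:

    # Batch style: mount input and output directories, with :z so SELinux
    # relabels them for the container
    docker run -v "$PWD/input:/tmp/input:z" \
               -v "$PWD/output:/tmp/output:z" \
               science-experiment run_analysis.py

    # Dev loop style: mount the whole working directory over /tmp instead,
    # so edits on the host show up on the next run
    alias scikit-py='docker run -it -v "$PWD":/tmp:z science-scikit'
    scikit-py run_analysis.py

Because the ENTRYPOINT is the interpreter, whatever follows the image name gets handed to python as its argument.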
So this is the one that takes too long, which is not the one I wanted to run, but we'll run it and I'll kill it. As you can see, it's taking the test and training data, plus the correlation data from the individuals, it's figured out the various shapes of the data, and it's now doing analysis, trying to fit a nice curve to figure out what sorts of correlations we're after. So this is one of the tricks you can use to provide this complex environment to an end user, so they can use exactly the same interpreters and libraries they'd find in the runtime environment, on their local machine, without actually having to install anything.

Was anyone just in Owen Taylor's talk about purple-egg? He's taking a slightly different tack on this exact same problem: how do I develop against containers from a local environment, for things like Python and some other languages? It has some real interesting promise, and it handles a lot of the underlying "how do I build a container, how do I run local content inside a container against a different environment" work. It's still in heavy development, so he's looking for people who have use cases, who have questions, who have various things they might want to contribute. purple-egg on GitHub is the name of the project.

And this is actually running through and doing some things. All right, I'm not going to let that finish, because that's going to take way too long.

So that's a basic run-through of some of the things you might be doing. But as you can see, there's a lot of manual work, and there are obviously things that can go wrong. There are other things we haven't talked about. A lot of these sorts of environments and jobs are usually run in distributed systems, so those are the pieces I'm leaving for you. MPI works; I've seen MPI in a Docker container. It doesn't have any issues as far as runtime, but there are questions about how you actually get it up and running, masters and workers talking to each other, and how you get it coordinated. Schedulers for these sorts of jobs: how do we get the parallel processing?
This could be something for Kube; it could be something that understands the workload, like Moab or one of the more typical HPC environments.

GPGPUs and FPGAs are very, very useful in these sorts of calculations, and within a Docker container they're a lot easier than within a virtual machine. VMs introduce the need for PCI pass-through and emulation layers, which makes GPGPUs sticky to use well. CUDA toolkits are tied very tightly to hardware and driver versions, but with a Docker container, knowing your environment, you could actually map in some of that complexity. Possibly that would be why you'd build your own base layer or platform. You'd still have to run privileged to get the device mapped in, but it's much easier to use a CUDA toolkit this way. There are a couple of implementations floating around online that show how to do these sorts of things.

And then building. There were some possibly interesting tips there on build args that you hadn't seen, but that's really not a scalable way of doing things. Manually hacking on Dockerfiles is not all that conducive to long-term support efforts. That said, there are a lot of different ways to build containers. I generally tend to use Ansible for a lot of what I do inside a container. For me it's easier to do things that way and to have consistent content I can run against bare metal, virtual machines, or container targets: one set of content, just changing the target. It's also possible, since we were looking at things like distributed processing and continuous builds, for both security and experimenter updates, that something like the OpenShift platform might be useful here. We get Kube for free, we get the build service for free. So looking at Origin and the OpenShift Container Platform, and figuring out how we might provide some of these build services, would possibly be the next step to make this a little more than just a laptop demo.

So with that: thank you very much, and I appreciate you putting up with the issues with the live demo. Any questions?

"What kind of performance impact do we get when we run inside of Docker?" Well, let's see if I can get one of these two things to run on purpose. So, Docker overhead in and of itself is fairly light. Docker really only comes into play... why is that? Sorry. It really only comes into play on startup and management. Once it's up and running, Docker doesn't really get in the way of the process, so generally we find that process performance is pretty much on par with bare metal.
And once you start getting into things like off-host traffic, that's where you can also start getting into some issues. If you do run RHEL 6 on top of RHEL 7, what happens when you drop it in and you've got brand new network drivers that the base layer may not know about? If you have things that are latency sensitive, when we're talking about communications, it's possibly better to drop into the host network namespace rather than worrying about an overlay network. That's probably the single biggest performance impact you'll have to look at when you do these sorts of workloads.

Sure. Yeah, so the question was: not just network latency, but also things like scheduling impacts. If you're doing things that are NUMA-aware, that's another one where NUMA awareness is something we'd have to start looking at. And if CPU pinning is something you want, that might be outside the range right now; I don't know if we can do that very well with cgroups. Other questions?

Yeah, so the question was: how do you deal with container management? Because yes, I was doing the very, very bad thing of just running containers, and if you take a look, there are probably 40 or 50 or 60 exited containers sitting on my machine right now. That's another one where having a layer above plain docker run is probably necessary for these sorts of things; it gets a little into that scheduling exercise left to the reader. Because these jobs exit, and they exit cleanly every single time, we'd have to look at something else. It could be Kube, it could be Docker Swarm, it could be Compose, it could be a lot of different things. But it's definitely a good catch, because yeah, this is demo purposes only.
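For reference, the sort of thing plain docker can already do about that pile, using the same illustrative names as before:

    # Sweep up the exited demo containers
    docker rm $(docker ps -aq -f status=exited)

    # Or never leave them behind: --rm deletes a container when it exits
    docker run --rm -v "$PWD":/tmp:z science-scikit run_analysis.py

    # And for the earlier latency question: skip the overlay network entirely
    docker run --rm --net=host science-scikit run_analysis.py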
So the question is: reproducibility is nice, but is tying experiments to particular versions of math libraries necessary, or does it hurt reproducibility and performance? Is that a fair restatement? Sure. Sorry, so it was reproducibility and verification with different math libraries. Usually the difference between something like ATLAS, OpenBLAS, LAPACK, or something along those lines lives a layer higher, at the library layer. Linear algebra should be linear algebra should be linear algebra. So between these two runs here, say the OpenBLAS experiment container and the scikit-learn container, those various dot-product executions should simply be faster because of the implementation. And if the math in and of itself is incorrect, then that's a bad library you shouldn't be using, and people in the community should already know about it. And yes, it might be a bad experiment. Yes, you could pull down a version of OpenBLAS from GitHub that has a bug in it, and yes, that could throw off your entire experiment. But that could happen whether you're in a container or not, if you happen to grab a linear algebra package that isn't generally considered useful by the community.

Yes, so the question was: have I, or other people, thought about putting IPython in the framework? Yes. The Jupyter folks themselves actually publish their own versions, and they use sort of the same base pattern: they've got a base notebook, and then they publish different kinds of notebooks with other things in them. As you can see, they're not small. Kaggle, the competition site I talked about, does the same sort of thing. They produce these big giant images of everything you might use in Python for any one of their competitions, and they're open source, so if you find a new library, PR on GitHub. But that image is also like five gig. In fact, this isn't just an issue for those folks. The R platform, where all I'm doing is pulling things down from the open source EPEL repository, is two gig on its own, and adding more libraries makes it bigger. So size is an issue with all of these. But yes, there are already folks in the community doing this: the Jupyter folks, for IPython and the notebook system, will actually run a local web-based environment, and the Kaggle folks distribute various images for theirs so you can do the same sorts of things. In fact, they're the ones that keyed me into the alias version, as opposed to the docker run version, of running it as a kind of local interpreter. And if you google, especially in the data science community, for data science containers, there are a lot of people talking about publishing them and how to do it. Any other questions?

Okay. Pardon? Um, I don't know. Yeah, I'm now recognized by it, so I don't think I'm allowed to. So yeah, we'll find out; hopefully it won't be anything horrible. Cool. Well, great, thank you very much. I appreciate your time and attention.