Awesome, I can hear myself, and you can hear me. Welcome to 2022: chat messages don't work, but microphones do. And we lost our moderator, so I will act as the moderator and introduce us, introduce the talk.

All right, welcome everyone. Welcome to the talk on unlimited data science libraries using one container image and no installation. Let's start by introducing myself. I'm the guy on the right, an HPC system administrator at Ghent University in Belgium. You can find me on GitHub, on Twitter, and via my email. I was supposed to give this talk together with Guillaume, but he can't make it. But we found a very good replacement.

Yes, so today I jump into Guillaume's suit. My name is Marcel Hild. I work for the same company as Guillaume, Red Hat, and luckily I've been working on the same topic for the past three years as well, inside the Office of the CTO of Red Hat, on the topics of AIOps and machine learning.

So this is our story arc today, the usual suspense curve: we start with some context and some background, then go into the problem and wonder what might be the solution to that problem. Then, if the demo gods are with us, we will have a wonderful presentation and a demo of how we solve the problem, and then some links. So spare your photos.
I mean, you can take photos, but all the links will be on the last slide.

The context, as you might have guessed since we're on the AI/ML track, is data science. One might ask: what is Red Hat, an infrastructure company, doing with data science? That's the question the Office of the CTO also asked itself about three years ago. Somehow all the infrastructure, all the Linux boxes, all the Kubernetes clusters out there will face different workloads, mainly AI/ML workloads, and these will become a major part of the workloads that run on our infrastructure. So you need a slightly different approach to this kind of workload, and in order to make sure that it works well in this cloud-native context and on top of our products, we started out on this project called Open Data Hub. Not just to make sure that the workloads run, but also to use these data science tools ourselves, in order to make our products and our projects better, and also to experience what it means to work as a data scientist in the age of clouds and cloud-native tools.

Hence, a show of hands: how many of you consider yourselves data scientists? There's a few. And how many of you consider yourselves DevOps or SRE people, supporting stuff? That's the majority of folks.

The typical data scientist works on their laptop, because that's a controlled environment. You do pip install, you tweak your installation. Sometimes you do a sudo pip install because you don't care whether this is installed as root; your libraries just should work, right? And the cloud-native approach to data science is a little bit different.
So you go to your browser and you do your stuff. At that point we figured: let's create an operator, a distribution, a best-of-breed distribution of the most common data science tools, move them into a cloud-native context, which is OpenShift or Kubernetes, use them, and improve the user experience for using these tools, which is different, right?

Open Data Hub itself is a sort of meta-operator: it pulls in different other tooling and installs it. So if you see Kafka or Kubeflow or TensorFlow or Spark there, we're not reinventing the wheel and creating operators for Spark; we're pulling in these operators from other communities, so that it's all very well integrated. You can just start out by installing Open Data Hub and boom, you have an environment to try out data science inside your browser, on top of your Kubernetes cluster.

This is the shameless plug that I managed to sneak in as a return for stepping in as the replacement: you can try out this Open Data Hub installation on the Operate First cloud, which is my next project, where we deploy tools and clusters into a community cloud where it's easy to use. With your GitHub handle, with your GitHub account, you can log into the Operate First cloud and try out Open Data Hub there, just as we speak. Obviously, this also flows into Red Hat products.
So Open Data Hub itself was productized some months ago as Red Hat OpenShift Data Science, or RHODS for short. You can consume it as a service on cloud.redhat.com, which runs on OpenShift Dedicated, our managed OpenShift environment. And you can not only use it and try it out via a link that we'll have at the end of the presentation; you can also integrate with ISVs, which are sometimes not part of the ODH distribution. So you don't just get the Open Data Hub bits and pieces, but also integrations with some other vendors, integrated into this cloud environment. So much for the cloud-native pathway of a data scientist. Now, the data scientist faces even more problems, which Kenneth will talk about.

Okay, thanks a lot for covering that. I'm very happy I didn't have to do that myself. My background is very different. As I mentioned, I'm an HPC system administrator. HPC is short for high-performance computing, also known as supercomputing, and that's really large-scale infrastructure dedicated to heavy-duty computing, with lots of special stuff around it: fast networks, GPUs, large shared file systems. We're talking petabytes of storage. What's also very typical is that you have multiple, typically hundreds or even thousands of, users that have access to such a system.

HPC clusters nowadays are definitely used by way more scientists than was the case before. The traditional workloads are simulations, like weather, climate, physics simulations, things like this, but this is really opening up now to a larger variety of workloads: artificial intelligence, machine learning, data science.
So it's really opening up, and I'm actually seeing a big influx of additional users. Of course, there's a strong focus in high-performance computing on the performance of the workloads that are being run. You want to get the best possible use of the compute power that you have available, not only to make a single simulation faster, but also to get more simulations done in the same amount of time.

Maybe you know this already, but basically every supercomputer in the world is running Linux. A decade or so ago that was a little bit different, but everything has basically converged on Linux; that's the only thing that matters. And ideally these things are easy to use for scientists to do their research. That's not always the case today, because these systems are still quite complex. And we're going way beyond classic HPC; we're now mostly talking about scientific computing in general.

HPC has been around for a very long time, since the 80s. The picture you can see on the bottom left doesn't come from a Star Trek movie. This was an actual supercomputer in 1985, a Cray-2. The picture doesn't show it, but when this thing was running there was actually smoke coming out of the top. So they liked making this thing look sexy, making it look very good. Things have evolved a lot since then. What's nice to mention here is that we're seeing a thousand-fold increase in compute power roughly every dozen or so years. The smartphone you have in your pocket now used to be a supercomputer the size of a building 20 years ago, and it's important to realize that this is just moving really, really fast. We're now getting very close to what's called the exascale era, where we're able to do 10 to the power of 18 floating-point operations per second. It actually takes longer to pronounce that than to do that amount of calculations in a single second, so it's really insane. Traditional workloads are simulations; modern workloads are way more diverse.
And for some parts of it, we're still stuck in the 80s, or maybe a bit further: roughly half of the time spent on supercomputers is on code written in Fortran, a programming language that was invented in the late 50s and has evolved a bit since then, but it's really still the same thing.

Then a bit about the HPC user experience. Maybe this is not something to be proud of, but the way you typically use a supercomputer is: you SSH into it, you get a terminal window, and that's where you do your work. That can be very challenging if you're a bioinformatician and all you get is a black window with letters and you can't use your mouse. People are not very happy with that, and that's slowly starting to change. We're starting to see web portals, through Open OnDemand for example, to have a bit more modern interface, maybe going a bit closer to what's happening in the cloud community. So things are definitely influencing each other there.

Now, we're at KubeCon, so I have to mention containers. Containers are definitely also finding their way into HPC systems. HPC is typically very slow in adopting new types of technologies, but containers are definitely there, and that's mostly happening through the Singularity project, which is now being renamed to Apptainer, because essentially a fork happened and they're rebranding the community version of Singularity into Apptainer. That works for a lot of people, but it still remains a challenge. So containers do work on HPC systems, but we can't really use Docker; that's already one problem, so you have to use a different runtime like Apptainer. And there's also leveraging all the special hardware that we have: the fast network, the GPUs, the shared file systems.
So all of this can be a bit of a challenge when using containers. This is a bit of a personal thing as well, but one thing I personally don't like about using containers on HPC systems is that you're in some sense sacrificing performance to get mobility of compute. What you typically do when you build a container image is build it once and then run it everywhere; that's the point of using containers. That doesn't really fit well on an HPC system, because what you really want to do is optimize the software for the system on which it will be run, and that clashes pretty hard with how containers are typically used.

What also doesn't help is that the hardware in HPC systems is getting more and more diverse, pretty quickly. We're coming from an era where everything was basically Intel and AMD CPUs. Then NVIDIA GPUs came, and that already added some complexity. Now we also see Arm CPUs, and if you have a container image for x86 and you try to run it on Arm, you're not going to get very far. Also for GPUs: there are AMD GPUs coming, there are Intel GPUs coming, and this space is basically exploding, becoming way more diverse, and that's definitely an issue as well.

The explosion of workloads and different user profiles (bioinformaticians, physicists, all types of scientists) is definitely an issue too. They're not very familiar with containers at all, and it takes them a while to find their way around them. And related to that last point: somebody will need to build the container image that the scientist needs, and it's quite possible that they're not able to do that, not able to figure it out, so somebody else has to do it for them. I'll pass the word back to Marcel.

Yes. So you might think that the container is the solution to this problem.
That's what we know as people that breathe containers: it's an immutable thing, you can't change it, and if it dies you throw it away and then it's the same again. So you would start with a container image which has Python. You add another layer on top of it which has NumPy, pandas, scikit-learn, whatever you need for your data science tools. And then you just add another layer which has the Jupyter notebook image, which is like the IDE, the interactive editor for the data scientist. That's what they use in their browser, which we will see in the demo later on.

And you can pin the versions, because if you want to repeat an experiment, or if you want to repeat a demo, the usual mistake is that nobody added a lock file, and you're running into different versions and you can't repeat it, because something changed, or because you have a different processor architecture, or whatnot. That's the beauty of containers: usually you can just take them, freeze them, and then go wherever you want and you get the same environment. So it seems like a pretty good solution, except for one thing:
So seems like a pretty good solution except for The user Doesn't want just one version, but he wants multiple versions So maybe your image recognition stack works with tensorflow version What is it 2.5 or 2.6 because you have some dependencies so suddenly you have two container images Now factor in more libraries One uses tensorflow the other one uses Pytorch Keras, whatever you name it and then you get to this combinatorial explosion, which you suddenly cannot handle anymore Right, so is the container still the right solution for this or do I go back to my laptop and create Not pet environments, but well great pet environments not cattle environments maybe we want to create a Container That has it all one container to rule them all which can get very bloated And you might run into issues with different versions So you you still you might solve the multiple tools problem, but you certainly don't solve the multiple versions problem unless you're Doing some sort of hex Now this is the current state of the art right you're using a container which is a base image Which has Python installed. Okay. I'm fine with Python 3x most of the tools work there So I still type in my pip install because that's what the data scientist knows Fails then maybe exclamation mark pip install and suddenly it works because it's like a shell environment I do my stuff and then out of memory error container restarts boom what happens You start from scratch because the container just reboots and it it's not persisted in there right, so it And not just the non-persistent problem, but also it's just slower because you're installing stuff and sometimes the libraries are not just interpreted, but you don't find a wheel file a A pre-compiled version and you don't have a development Built environment inside your container right so you cannot even install the stuff because it doesn't work So the data science Scientist experience is pretty much like okay. I'm going back to my local machine. 
It works there, because I'm not interested in operating any environments or administrating environments; I'm interested in getting my job done. So this leads to the pointy-haired-manager management nightmare. But luckily we have a solution to it, which Kenneth will talk about.

Yeah, okay. So this issue of having lots of software that you need, and people with different backgrounds who have different needs in terms of software: this is not new on an HPC system. This has been there for decades. The solution we typically use is: we install whatever software people ask for, but we install it in a non-standard location, for example /apps, something outside of the regular Linux file system hierarchy. Typically when we do this we end up with hundreds or even thousands of different software installations: different versions of TensorFlow, different software packages, TensorFlow and PyTorch all next to each other, sometimes even built with different compilers, because you may get different performance just by using a different compiler to install your software stack. All these installations are nicely separated; they each have their own unique directory under /apps. So it's a bit of a mess, and it's not easy to find your way around.

What's also important is that when we install software, we usually optimize it for the CPUs that we have in our supercomputer. We don't just do a yum install tensorflow, say, because we know we'd get a binary that's a lot slower than it could be if it was properly compiled from source. So that's what we try to do: we build it from source, and we try to ensure we get good performance by using the CPU features that are available on the system. And whenever people want to use additional software, they send us a request and we just install something else in a separate directory; it doesn't affect anything else. So you basically end up in a situation where you have your operating system,
you have a shared file system, /apps, which is mounted on all the nodes of the supercomputer, all the worker nodes, and in that shared file system you have a whole bunch of software installations, everything next to each other, nicely separated.

Now that's good, but then how do you get your scientists to actually use that easily? Are you expecting them to find their way around in this mess? No. There's an easy way to expose the software stack to your end users, and that's through a tool called environment modules. I'm missing some pictures here. Okay, this is the traditional way of giving users access to all that software you install centrally on the system. What users are basically doing is loading module files that we create, and a module file expresses what has to change in the session environment to start using that software; I'll give an example. So they're basically playing with a module command: they do module load, module avail to check what's available, and that's all they really care about. They don't care how the software was installed; they just talk to the module command, and that's enough for them. They don't care what's going on in the background.

Today there are two main implementations of this mechanism. The original Environment Modules project is implemented in Tcl, a pretty old scripting language; that was the original implementation. It was a bit unmaintained for several years: around 2010 to 2015 it wasn't getting a lot of love because it was essentially abandoned by the maintainer. But somebody new stepped in, and he kept evolving it, made a new logo, so it's now an active project again.
So it's now an active project again While the original project was dormant another project emerged elmod Which is basically the same concept which is implemented differently in Lua It has some some features that the original version doesn't have and there's a mix and match going on There are lots of of cross-pollination between these projects as well And the concept of environment modules was actually created in the 90s So this is a screenshot of the original paper and you can see the date here June 1991 so this idea this this Way of giving people access to software that's installed in a weird place has been there for 30 years So this is very briefly how it works And the screenshot of the terminal you can see the example command is not there if we try to run it It just fails We can through the module command we can activate a new set of modules that are available So the module use on some location that we know where stuff is available And then when we check for available modules, we can see there's a version of the example software available We load this module which just changes stuff in the background in the environment And then we're ready to start using this example program And all the user has to care is do module avail module load and tada magic happens and the software is suddenly working On the right you see what's actually in the module file that you're that you're loading This is a written in Lua, so it's written in a shell-economic way It's just expressing what has to change in the environment for the software to become available and now Back to my co-speaker to explain you why this is relevant so what Guillaume did pull out his Highlander sword to the rescue and fought the battle of Getting these modules into a single container image without bloating the image, right? 
So we could get away with just stuffing all this environment that Kenneth just showed into the image, but that's no fun, for obvious reasons. Instead of doing this, we have a read-only, ReadOnlyMany volume configured in your cluster, which can be mounted by many pods, many containers being spun up. Then you start your environment as a user: user A starts his notebook image, user B starts his notebook image, and they pull in the same tooling that Kenneth just showed. And now, if the demo gods, the Highlanders, are with us, we'll see how this works.

Good. So this is the web page you would be confronted with if you log into your Open Data Hub environment, or if you go to the Operate First cloud's Open Data Hub; it looks similar. Then you click on "launch your JupyterHub environment," which I pre-cooked for us, so you'll be in a web-based IDE, basically. Our task as the data scientist is, okay, it loads, so I think we're good: identifying dog breeds. This is a dog, it's Peter Snow, but which breed is it? So I would start my notebook, which a colleague provided to me. He's sharing this notebook with me; it's just a file that I can upload into my Jupyter environment. And I say, okay, let's run all the cells, run all these blocks here. The first one works, the second doesn't work. Man, what's that? import torchvision. Seems that we don't have torchvision installed. So I go to this left-hand tab here; it's a little bit like plugins for this IDE. This is the module plugin, and I can see if my friendly administrator provided me with a torchvision module. So I type in "torch," and there seems to be a torchvision. Great, let's load it. Whoops, thank you, Apple. Let's load it. You can also click on these tools and see whether it's the right one; it gives you some information about the module itself. Seems to be loaded: in the loaded modules list, torchvision is there.
So let's rerun all the cells. The kernel needs to be restarted; it's basically restarting the Python interpreter. And, oh, we get asterisks all the way down. Get the classes from torchvision: works. Import warnings: boom, it's a Leonberger, with 98 percent probability. Of course we trust the AI. Good, I think that's the demo, right?

You could start RStudio as well. Ah, the RStudio stuff. We can go even further. Let's go back to the launcher here. Voilà, this is the environment that I can start with: I have a classical notebook, which we just saw, we have a terminal console, and we have Elyra. But I can also load stuff like RStudio, which is an environment for R, which is different from a notebook thing, and you might have an R background as a data scientist. So I just loaded the RStudio module, and voilà, RStudio is available here, and I can click it like it's hot, and RStudio spins up, if you're on a wired connection and if your phone tethers well. And you're presented with a different work environment called RStudio. So without a bloated container, with a small container, you can extend your environment pretty easily. Good, back to the slideshow.

So we've shown you that we basically have an app store of pre-installed software that we can just dynamically load into our notebook and start playing with. How did we actually install this big set of software applications? For this we used EasyBuild, which is a tool that's really focused on HPC systems. It was implemented for Linux, with special attention to performance and lots of details that are specific to supercomputers: installing software from source and so on. It's an open source tool, implemented in Python.
It's been around for 10 years, but it's probably not really well known in the cloud community, I guess because it was written for HPC. Now, EasyBuild not only installs the software, it also generates environment modules for you. So when you do an EasyBuild installation of TensorFlow, for example, you run the eb command, it starts installing, and it will also create the module file for you. And then you're ready to go: load TensorFlow and start playing with it, figuring out whether something is a cat or a dog. This is really what was used to build this ReadOnlyMany volume that is being mounted in the pod.

So we have another short demo, to give you an idea of how we add additional software to this volume that we're mounting. We basically go to our, let's say, build environment. This is the OpenShift web console. You have your pods running, and there we are on the ODH EasyBuild pod. I see my details, I can click on Terminal, and voilà, I'm presented with a terminal, which is, oops, a little bit small, like this. All right. Here I just have to make sure I start a bash login session, so my module command is available. Like this. So in here I have my module command, and I also have EasyBuild. With the module command I can check if I have anything TensorFlow-related installed; there should be a module for TensorFlow available in here. There are actually multiple, so that's okay; these are ready to load. If I want to install something else, I can ask EasyBuild, for example: do you know how to install scikit-image? It's probably going to come back with a long answer. It knows how to install scikit-image, a whole bunch of different versions, so you can pick one, install it in this build environment, and then the module appears and it's ready to go in the notebook environment. So it's pretty easy to install additional stuff dynamically, and it will be available straight away in the notebook.
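The module files EasyBuild generates are what tie the two demos together: a fresh eb install shows up in the notebook because its module file appears on the module path. Here is a sketch of the kind of Lua module file it writes; the install prefix and version are made up for illustration, while `prepend_path` is the standard modulefile function for adjusting the environment on load:

```shell
# Create a minimal Lua modulefile of the sort EasyBuild generates
# (hypothetical prefix; real generated files also pin every dependency).
mkdir -p modulefiles/scikit-image
cat > modulefiles/scikit-image/0.19.1.lua <<'EOF'
help([[scikit-image: image processing routines for Python]])
local root = "/apps/software/scikit-image/0.19.1"
prepend_path("PATH", root .. "/bin")
prepend_path("PYTHONPATH", root .. "/lib/python3.9/site-packages")
EOF

# Two environment changes are applied when this module is loaded
grep -c '^prepend_path' modulefiles/scikit-image/0.19.1.lua   # prints "2"
```

Pointing the module tool at this directory (module use .../modulefiles) would make scikit-image appear under module avail, with no further steps on the notebook side.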
You don't have to do anything else on that end. And then, I think we're running out of time, so we should start to wrap up.

We also have some future work planned on this. With EasyBuild it's pretty easy to build your own software stack, whatever you want in there. It's building from source, so that's going to take a bit of time, and it's optimizing for the hardware on which you're building, so you need to be a bit careful that you're also starting your notebook on the same kind of hardware, or things may become problematic. But we think we can do a lot better than that. Rather than giving people a tool to install their own software stack, we think we can make a community project where we're essentially building a central software stack that works anywhere. Regardless of the operating system that you're using, regardless of the type of CPUs that you're using, or even the specific kind of CPU: Intel Haswell or Skylake, AMD Rome, whatever. We know how to do this, and this is a new project: the European Environment for Scientific Software Installations, EESSI. We basically want to give you something like that ReadOnlyMany volume, which you can just mount anywhere: on your laptop, on an HPC system, in the cloud, in OpenShift, it doesn't really matter. And we give you a bunch of modules that you can load, and you can start playing straight away. So no installation time, everything properly optimized for the hardware on which you're going to run it, and so on.

Now, I don't have the time here to really explain this in depth, but we have a layered approach. The file system layer is what's responsible for distributing the software, and it does this in a very smart way; it's basically Netflix for software. As soon as you load a module and start actively using the software, it will actually be downloaded to the mount in the background for you, and as an end user you don't really notice, so you don't have to install anything.
It's just being pulled in automatically. We have a compatibility layer in between; that's where we make sure that we can run on basically any Linux operating system. That could be bare-metal Linux, or it could be Windows Subsystem for Linux, and we're also looking at macOS, making sure it can work there. And then on top, in the software layer, we can install anything that you want: TensorFlow, PyTorch, the whole world. EasyBuild currently supports two and a half thousand different software packages, not counting versions. We can install all of that in there, ready for you to start playing with it. Now, this is a very ambitious project, and we have some links at the end if you want to learn more about it. Okay.

Yeah, and really quick: how does this translate into the container world? There are still some challenges and open questions. Do we need such a compatibility layer, or do we get away with multiple base images? How do we address the file system layer: read-only, execute-only images distributed to all the clusters, is that namespaced, how does RBAC work there? So there are some open questions, and there are some solutions to them, but it's not solved yet.

And this is the moment where you can take something home to try something out. For OpenShift Data Science, this short URL, red.ht slash data science, will take you to a free tryout environment of the RHODS product.
So you can play around with this Open Data Hub as a service, or go to opendatahub.io, or to the Operate First cloud. And do you want to highlight some of the EasyBuild links?

Yeah, so the links to EasyBuild and to the environment modules tools are there, and the EESSI project, same thing. One link there is a pretty detailed paper that explains our idea and what we have in mind. And the last link is something I didn't mention, but the reason we really started the EasyBuild project is that some scientific software is really messy to install. It's not like a pip install tensorflow and then it magically works. There are things that we have to build from source where we literally spend days just getting them to compile and install. If you want to learn more about the type of problems we run into when doing that, you should check that YouTube link: a recording of a talk I gave at FOSDEM a couple of years ago.

Good, I think that's it. Do we still have time for questions? Do we have any questions? The AV guys are already out, so we don't know. We have one question. There's also one question. Oh, there.

Is your build environment the same container you use for the base Jupyter environment? You basically spawn the same one and build on top, but save the installation outside the container?

Yeah, indeed. So the build environment we have... please repeat the question. Oh yeah, so the question was: is the build environment the same as the runtime environment where the notebook is running? Yes, it's the same operating system, the same container essentially, and then we're using that same container to also run the software. And we cannot switch to a different container, because then stuff may not work anymore. That's where the idea of the EESSI project also comes in: to give you that flexibility to just jump to a different OS. It doesn't really matter anymore
what your build environment was; we have ways of abstracting that away, so the software can run anywhere. That's the next big step in this whole idea.

And we had one question over here, and I think then we're also out of time, but we can take more questions here. Sorry.

First of all, thank you, very nice talk, very interesting. I wanted to ask: how do you keep track of which version of the dependencies people are using? In that case you were importing TensorFlow and there was just one version, but I can imagine that you have multiple versions, and very often the problem we have is that people don't keep track of which version they're running.

Yes, very good question. So the question was: how do I keep track of which version I am using in my notebook? And I get it: okay, it doesn't work, so I just install the latest version from that drop-down, right? And then I'm running into version hell rather than dependency hell. So: you can export the combination of modules that you used into a, what does the scheme call it, a collection, I think it's a collection, and then you can also copy and paste a little snippet of metadata into your notebook. So the next time you execute all the cells, it will talk to the module environment and load all those versions. Think of it as a Pipfile.lock for this EasyBuild module environment. You can embed that into your notebook, so that if you go back to the same environment, you can click "run all cells" and voilà, it should work.

And another aspect: when you're loading a TensorFlow module, the way we install it, it has all the dependencies hard-coded in the module file. So it's always going to load the same version of Python, the same version of whatever other dependencies you have. If you're loading the same module, you're getting the same software. Good. Now we still have time, so...

Are you also planning to support Spack?
Planning to support Spack? Well, you could do exactly the same with Spack. We're using EasyBuild now because Guillaume was already familiar with it from his previous life, pre-Red Hat. But you can do the same thing with any other tool, and you can combine them; it doesn't really matter. You can install modules with EasyBuild, install modules with Spack, use pip; there's even a project that uses modules as a front end to containers. You could throw all of that in there and just combine stuff. Now, you probably have to be a bit careful about what you combine, and there may be some issues there, but you can have multiple collections installed with different tools; that's absolutely fine. It's the same idea in the EESSI project: we're focusing on EasyBuild because we know it really well, but in the software layer you could use any tool; you could use a bash script if you want to, it doesn't matter. All that matters is that the installations you get in the end are nicely separated, and you need a module file to go along with them, but how that happens is irrelevant.

Okay, we have another question.

Yeah, now that you mention the versions in Jupyter: are there plans to connect it to virtualenv? That sounds like a cool solution, combining it.

That would make it specific. So the question was whether to combine it with virtualenv, which is another environment and dependency management solution in the Python universe.

Yeah, so that would be specific to Python, and maybe it wasn't clear in the demo, but when loading the torchvision module, other stuff was being loaded that's not Python at all. FFmpeg, for instance, is I guess an optional dependency of torchvision, and I'm not sure how you would bake that into a virtualenv. The beauty of environment modules is that it could be anything: it could be C or C++ software with no Python front end.
Yeah, it doesn't matter. But that would make it just so sexy, right? That I don't have a virtual environment where I have to install all my stuff, but it would point to EasyBuild, with all the goodies installed there, and I would only have it once on my system, or remote, instead of 20 different virtualenvs. Yeah. So it's a more generic solution. It's basically the same idea, but environment modules probably predate virtualenv, and as far as I know virtualenvs are specific to Python. So, yeah.

Good. I think I see people already leaving, so people are hungry. Thank you for your attention and for being here, and we're still around to talk. Right, thank you guys.