Okay, I think it's 9 a.m., so we should start. We are streaming, so everyone can watch the recording afterwards. So welcome to the audience, and thank you very much for getting up so early on a Saturday. Also, I think Dan Walsh is giving a presentation at the same time, so I understand that everyone is there.

My name is Vašek Pavlín, I have worked at Red Hat for six years, and I would like to tell you about what we are doing in the AI CoE with JupyterHub on OpenShift. When I saw my talk on the schedule, I had called it "Data exploration with JupyterHub on OpenShift", and then I realized I'm not going to do any data exploration, so I fixed it: I will call the talk "Enabling data exploration with JupyterHub on OpenShift". What I would like to talk about is how we deploy JupyterHub, what the components are, the technicalities of the actual platform and tooling, and how we integrated it with other parts of the platform.

So why do we explore data, and why do we do what we do in the AI Center of Excellence? A smart person, Clive Humby from the UK, said that data is the new oil. It is valuable if you collect a lot of it, but it needs to be refined, the same way oil needs to be refined into plastics and gas and chemicals, because without cleaning and transformation it's just a pile of bits which you can't really make sense of in most cases, because there is too much of it. The same gentleman also came up with the Tesco Clubcard, which might not sound like a lot, but it is a great source of information about customers, what they buy, and how they behave. So he probably knows his stuff.

To be able to work with JupyterHub on OpenShift there are some prerequisites, so let me quickly go through them. One of them is OpenShift: you need to have OpenShift running, which is kind of obvious if you
want to deploy there. What is OpenShift? Who has never heard about OpenShift, or doesn't know anything about it? Good. But quickly, for the big audience on YouTube: it's an enterprise distribution of Kubernetes. It is built on top of Kubernetes, so it's a scalable container orchestrator. If you have anything to do with containers and you want to run them in production, you want to use something like Kubernetes or OpenShift. It has all the basic concepts of Kubernetes, things like pods, services, deployments, and persistent volumes, but it adds more: it adds a development workflow with builds, image streams, and things like that. You can go to okd.io, which is the new place to find information about the upstream version of OpenShift, which was called OpenShift Origin in the past.

The second thing that we need for the work we are doing with JupyterHub is some object storage. You are probably familiar with AWS S3. We use Ceph, and Ceph implements the S3 API, so you can use the same libraries, like Boto3 in Python or the Hadoop S3 library in Spark, to access your data in Ceph through the S3 API.

Luckily, I didn't have to set up either OpenShift or Ceph. Everything that I will use here is deployed on MOC, the Mass Open Cloud. If you saw Stephen's or Sherard's talks yesterday about MOC, we are working with them on something called Open Data Hub; this is part of the Open Data Hub, and we are deployed there.

So what are the tools that we will be using for the data exploration that we are not going to do? First, and the core part, is Jupyter.
You can run a Jupyter notebook server on your laptop. It looks like this, basically: a notebook is split into cells, and a cell can be either markdown, or code, or the output of the code. It's a web application with a back end: you type something into your web browser, the code is sent to a Jupyter kernel (there are plenty of kernels on GitHub: Python, Scala, I even saw C#, and R is pretty popular), and the kernel sends the output back to the Jupyter UI, where you see it in your browser.

The good thing is that the actual file is just JSON. So if you want to do something fancy with the content, you can take the JSON, parse it, and work with it, or you can view it in the Jupyter notebook UI and actually run the code.

What builds on top of that is something called JupyterHub. The change is that Jupyter notebooks themselves are single-user: you run them on your laptop, write your code, and see your results. But if you want to provide that capability to a team of people in your company, or at a school (I think for universities this might be super interesting, or even high schools: if you want to teach coding in Python, you can provide JupyterHub to the class, and the students can just log in), JupyterHub will automatically spawn a Jupyter notebook server for each user when they log in. It spawns and manages those notebook servers, so that a user always gets back to their own persistent version of the notebooks they were working with.

The last part of the system is Apache Spark.
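Before moving on to Spark, a quick aside on the point above that a notebook file is just JSON: inspecting one takes nothing but the standard library. A minimal sketch (the structure shown follows the standard nbformat 4 layout):

```python
import json

# A minimal notebook document in the nbformat 4 JSON layout.
nb_source = """
{
  "nbformat": 4,
  "nbformat_minor": 2,
  "metadata": {},
  "cells": [
    {"cell_type": "markdown", "metadata": {}, "source": ["# Demo"]},
    {"cell_type": "code", "execution_count": 1, "metadata": {},
     "outputs": [], "source": ["print('hello')"]}
  ]
}
"""

nb = json.loads(nb_source)

def code_cells(notebook):
    """Return the source of each code cell as one string."""
    return ["".join(cell["source"]) for cell in notebook["cells"]
            if cell["cell_type"] == "code"]

print(code_cells(nb))  # ["print('hello')"]
```

This is exactly why tools like nbviewer, diffing scripts, and notebook linters are easy to build: the document is plain data.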
I assume you are all probably at least slightly familiar with Apache Spark. As the website says, it's a unified analytics engine for large-scale data processing. That means that if you want to process some data, do some cleaning, and do some model training, you would use Spark. It provides APIs and libraries like Spark SQL and MLlib, the machine learning library, which implements plenty of algorithms. It works in a cluster mode, so you have a master and workers: the workers do the work and the master orchestrates them. We will use a Jupyter notebook to connect to Spark and do the processing in Spark, so that the notebook, the Jupyter server, doesn't have to be that beefy and work with so much data itself.

So I have a quick demo. It is nothing fancy, basically just a walkthrough of how JupyterHub works. I have OpenShift, and I have JupyterHub deployed. I go to the URL that OpenShift generated for me, and I sign in with my OpenShift credentials. That is quite important, because I don't want to remember yet another set of credentials; I want to use something I already know. I'll get back later to how that is solved in JupyterHub.

Now I select from the list of images. I want to use Spark, so I will select this Spark image. These images are basically what you would get if you installed a Jupyter notebook server on your local machine: they represent your laptop with the dependencies installed. If you wanted to use Spark and PySpark locally, you would need some configuration, PySpark installed, and Java installed on your laptop; it works the same way with these notebooks, in that the image contains all the dependencies that are needed.

I have these two notebooks. One I called Boto, because I use the Boto3 library, which is a library that implements the S3 API. I have my credentials in environment variables.
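The Boto part of the demo boils down to pointing a Boto3 client at the Ceph endpoint instead of AWS. A minimal sketch, assuming the credentials sit in AWS-style environment variables plus an endpoint variable (the exact variable names used in the deployment are an assumption here), with boto3 installed on the image:

```python
import os

def s3_settings():
    # Variable names are assumptions mirroring the AWS conventions;
    # in the demo they are injected into the notebook environment.
    return {
        "endpoint_url": os.environ.get("S3_ENDPOINT_URL", "https://s3.example.com"),
        "aws_access_key_id": os.environ.get("AWS_ACCESS_KEY_ID", ""),
        "aws_secret_access_key": os.environ.get("AWS_SECRET_ACCESS_KEY", ""),
    }

def list_buckets():
    import boto3  # imported lazily so the sketch loads without S3 access
    s3 = boto3.client("s3", **s3_settings())
    return [bucket["Name"] for bucket in s3.list_buckets()["Buckets"]]
```

Because Ceph's RADOS Gateway speaks the S3 API, nothing in the client code cares whether the endpoint is AWS or Ceph; only `endpoint_url` changes.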
I'll show you how I got them there. I connect to some endpoint, and I can run this. It installs the dependency if it's missing, I connect, and then I can list buckets. If you saw the Open Data Hub presentation yesterday, you saw Stephen create his bucket; we are on the same endpoint, so I see his bucket, and I created mine here, so you can see my data there.

Then I have this other one, which I actually just downloaded from the internet. I searched for "PySpark Jupyter notebook" and got to a repository which someone created, I don't know the gentleman, and he has a couple of notebooks. I just took the last one, because I thought it was going to be the coolest one. I had to fix some stuff, because he wrote it for Python 2 and we are running Python 3, but it was mostly just syntax fixing.

What it does is that it again connects to S3 and connects to Spark; you can see I have this Spark cluster URL in my environment. It downloads some data from the object storage, which I pre-uploaded there, and then it does some decision tree training. Let me run that: "Run All Below". It trains a decision tree and then validates whether it was a good one. It uses the data set from the KDD Cup, which was a network intrusion detection competition: build a classifier for network intrusion detection.

So it's now connecting to Spark, and on the Spark side we can look here into OpenShift and see that it's running as part of my namespace, the JupyterHub namespace. I have two workers, and I have created a route so that we can look into the Spark UI. We can't look into that because it doesn't have... let me fix that quickly. Okay, let's not fix that. Let me try a different thing, let me start Firefox. What happened?
I'm sorry. So what we see here is that we have two worker nodes. Each executor gets 20 gigabytes of RAM, and there are eight cores together, so each executor gets four. That's where we are actually running the notebook code. It downloaded the data and now it's processing it, splitting the CSV into multiple parts, and then it will be training the decision tree.

What do we have in OpenShift? As I mentioned, we have JupyterHub, we now have my own Jupyter server, which JupyterHub is routing to, and we have the Spark cluster. I'll let it run; the training takes some time, like eight minutes, so I'll go back to the presentation and then we can revisit it.

About the architecture of JupyterHub: the entry point you access as a user is the Jupyter proxy, which routes either to the JupyterHub API or to the server that you started. Then there is something called a spawner, which takes care of spawning those notebook servers, one per user. We use KubeSpawner, which, as the name suggests, is a spawner that works with Kubernetes: it generates the pod definition and submits it to OpenShift. There is also a database, which is just for tracking users, started notebooks, proxy routes, and things like that, so that state doesn't disappear when there is a restart.

How it works: as you saw, a user comes to JupyterHub and is redirected to some authentication. There are multiple implementations of authentication for JupyterHub: you can have GitHub authentication, you can have Kerberos, or you can have a pre-generated set of users and passwords.
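Wired together, the pieces above live in `jupyterhub_config.py`. A minimal sketch of a KubeSpawner-based setup, not the actual Open Data Hub configuration: the image name is a placeholder, and the dummy authenticator stands in for the OpenShift OAuth integration used in the real deployment:

```python
# jupyterhub_config.py -- minimal sketch, not the Open Data Hub configuration

# The spawner generates a pod definition per user and submits it to the cluster.
c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"
c.KubeSpawner.image = "example/s2i-minimal-notebook:latest"  # placeholder image
c.KubeSpawner.start_timeout = 300

# Stand-in authenticator; the deployment described here signs users in with
# their existing OpenShift credentials instead.
c.JupyterHub.authenticator_class = "jupyterhub.auth.DummyAuthenticator"

# The hub database tracks users, running servers, and proxy routes so that
# state survives a hub restart.
c.JupyterHub.db_url = "sqlite:///jupyterhub.sqlite"
```

Swapping the authenticator or spawner is just a matter of pointing these two traitlets at a different class, which is why the GitHub, Kerberos, and pre-generated-user options mentioned above all coexist.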
When we were doing some demos, we just generated 20 users and gave the workshop attendees those usernames and passwords.

Then, when the user requests a server to be spawned, JupyterHub generates the artifacts for OpenShift, and if it finds that I want to start a Spark notebook, it also generates a ConfigMap for the Spark operator. I will explain what the Spark operator is later, but basically it takes care of the Spark clusters. OpenShift starts my Jupyter server and notifies the operator about the requested Spark cluster, the Spark cluster is started, and then the user accesses their notebook and connects to Spark. When the user stops the server, it also kills the cluster.

We use an APB, an Ansible Playbook Bundle. You can learn about those from the OpenShift documentation, but basically it's just a set of artifacts for OpenShift describing how to deploy each service and how they should work together. You can have that in a catalog in OpenShift and nicely deploy it with three clicks or so. The APB source code can be found at opendatahub.io, and we have it built in Quay under the opendatahub organization, so you can go there, download the image, deploy it to your OpenShift, and try it yourself.

So what is special about our JupyterHub? I built my work on top of the work of Graham Dumpleton (I have a link at the end), who has JupyterHub quickstarts for OpenShift. I just took that and built something on top of it. The differences are mainly these four things: image auto-discovery, single-user profiles, ephemeral clusters, and publish and share. What does that mean?
You saw that select box for the images; that is automatically generated from the notebook images that are built in OpenShift. It is not very nice right now, the user experience is not very good, but I'm planning on improving that with descriptions of the installed dependencies and things like that, so it provides more information to the user. But it's already helpful: you don't have to know or remember anything, you just pick from the select box.

As for the single-user profiles: quite quickly after we started to use JupyterHub, we realized that every sub-team in our team, and every image, needs a different configuration. If you are working with some Parquet file that you download from object storage and you don't use Spark, because you just want to process it directly in the notebook, you might need more memory for the notebook. If you are working with Spark, you don't need that much memory, but you need Spark deployed. If you are working with some specific object storage endpoint or bucket, you might need to have that in your environment variables. So we built this library, which right now is configured with a ConfigMap in OpenShift, and it lets you mix and match the images and the users: what should happen when a user selects an image, and how it should be configured.

You saw that I had that Spark cluster inside my namespace. That works through what is called the Spark operator. Operators are a concept in Kubernetes and OpenShift: there is a service, the operator, which listens for events, and when it finds a specific event it reacts to it. So here the user comes and says, "please, operator, can I get a Spark cluster?", the operator says "yes, sure, you can", and it deploys Spark based on the configuration. And when the user leaves and says "I don't need Spark anymore", it removes the ConfigMap or the custom resource.
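The request the hub hands to the operator is just a Kubernetes object. A sketch of what such a per-user ConfigMap might look like; the label and the fields under `data` are illustrative assumptions, not the exact schema the operator consumes:

```python
def spark_cluster_configmap(username, workers=2, worker_memory="1Gi"):
    """Build a ConfigMap-shaped dict announcing a desired Spark cluster.

    The label and the keys under `data` are illustrative assumptions,
    not the exact schema the Spark operator consumes.
    """
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": f"spark-cluster-{username}",
            # The operator watches for ConfigMaps carrying a label like this.
            "labels": {"radanalytics.io/kind": "SparkCluster"},
        },
        "data": {
            "config": f"worker:\n  instances: {workers}\n  memory: {worker_memory}\n",
        },
    }

cm = spark_cluster_configmap("alice")
print(cm["metadata"]["name"])  # spark-cluster-alice
```

Creating the object expresses "I want a cluster"; deleting it expresses "I'm done", and the operator reconciles the real pods to match.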
The operator then deletes the Spark cluster again. We have that wired into the profiles: we say that if you select the Spark image, we want to instruct the Spark operator with the configuration of that image, saying "please deploy two workers and one master with these resource limits for us".

The last bit, which we also hit quite early, is the workflow for how you share your notebooks. With JupyterHub, if I want to share my notebook with, say, Sherard, I have to download it, send it over email or push it to Git, and then he needs to download it and upload it to his Jupyter server. That is not very nice if I just want to show him a simple change, like "in line 24 I changed this letter and now it works". So I built a plugin for JupyterHub where you click a button, you give it some name, you hit publish, and you get a URL backed by nbviewer. nbviewer is a tool that lets you view notebooks without being able to execute anything, so it's read-only, but it renders the notebook basically the same way as JupyterHub. This URL you can share; it's public, it's not behind the authentication, so you can share it with anyone, and they can view it or download the notebook. That, I think, helped us a lot in speeding up the process.

So how is our training going? We can see that the decision tree classifier got trained; this is the decision tree, so there are a lot of if and else statements. Now it's doing something else. I didn't really dig deep into this notebook.
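Roughly, what a notebook like this does can be sketched as below. This is a hedged sketch, not the notebook from the demo: the `SPARK_CLUSTER` variable name and the CSV column layout are assumptions, port 7077 is the Spark standalone default, and the MLlib part needs pyspark installed on the image:

```python
import os

def spark_master_url(default="local[*]"):
    # SPARK_CLUSTER (assumed name) holds the master's host name in the demo;
    # 7077 is the default port of a standalone Spark master.
    host = os.environ.get("SPARK_CLUSTER")
    return f"spark://{host}:7077" if host else default

def train_decision_tree(csv_path):
    # Requires pyspark on the notebook image; column names are assumptions.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import DecisionTreeClassifier

    spark = (SparkSession.builder
             .master(spark_master_url())
             .appName("kdd-intrusion-demo")
             .getOrCreate())
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    features = [c for c in df.columns if c != "label"]
    assembled = VectorAssembler(inputCols=features,
                                outputCol="features").transform(df)
    return DecisionTreeClassifier(labelCol="label",
                                  featuresCol="features").fit(assembled)
```

The point of the setup is that the notebook only holds the driver logic; the heavy lifting happens on the executors the operator provisioned.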
I just wanted to show that with our deployment we can directly use Spark and the ML libraries without having to make many changes to a notebook that I found randomly on the internet; the integration is really good.

I also wanted to go over a couple of ideas I have about next steps for the JupyterHub. We right now have this Spark operator integrated, but there is also Dask; I don't know if you have heard about Dask. It's a Python-based distributed analytics engine, or whatever you would call it: it "provides advanced parallelism for analytics, enabling performance at scale for the tools you love". It basically seems like Spark implemented in Python, maybe supporting Python better than Spark. So we are thinking about adding that next to the Spark operator: having a Dask operator which would spawn a Dask cluster if the user wants one.

If you noticed, in my notebook I have these environment variables with credentials. They are not there automatically; I have to add them in the single-user-profiles ConfigMap. I would like to have them populated for users automatically, based on some secrets from some source of truth, and pushed into the server automatically, so that users don't have to care about that.

I would like to work on Git and GitHub integration, so you can have a button, same as the publish one: "push this to my Git repo", or "create a Git repo for this notebook", something like that. I've seen some attempts on the internet where people were doing that, but it never really worked in a user-friendly way.

You saw that select box for the images, which was pretty ugly, so I'd like to make that fancier, more user-friendly, and more useful for users.

Also, JupyterHub exposes metrics. So if you want to know how many
requests there are, how many users, and things like that, you could maybe build some alerting on top of that, like "my cluster is getting full because I have too many users on JupyterHub at the same time". I think the metrics are enabled, but I'm not sure what exactly is in there, and we don't have a Prometheus watching it, so we need to set that up for the JupyterHub APB. We probably also need to extend the metrics, because as we start using Spark and the connection between JupyterHub and Spark, we need to be able to map them together in the metrics.

That's basically everything I had. These are some useful links. This is the APB; this is the link for the single-user profiles, which is a quite simple library just for that one use case; and here is the OpenShift configuration for the JupyterHub, which is then used in the APB. As for the Spark operator, a colleague from the radanalytics.io team at Red Hat was working on it, so I just used it and it worked perfectly. And this is where we came from: the jupyterhub-on-openshift work that Graham put together. You can go there and try JupyterHub without all the Spark and things like that, just on OpenShift in its simplest form.

So that's basically it. Any questions? Yes, sure. The question is whether with the Spark operator we have one shared cluster, or a cluster per user. I didn't mention that, so: we were basically deciding whether we should deploy one big beefy Spark cluster and let everyone connect to it. But that has its issues and limitations, such as having to reserve that capacity on your OpenShift cluster: if you really want to have 100 users and you want to allow all of them to go to that cluster at the same time, then you need to have hundreds of gigabytes of RAM reserved for those workers. Or you can have ephemeral clusters, so when
the user comes, logs in, and starts the server, it will start a Spark cluster for them, and when they go away, it will kill their Spark cluster. We are doing the ephemeral clusters right now. So when the user comes, they get their own fresh, clean Spark cluster with some resource limits, which are obviously tighter than if it were one big cluster and you were the only one there. But we need to do some performance testing, get some more data about how that actually works, and see whether it's useful for people.

Startup times? For the Spark cluster, it's quite fast. I can probably show you: I'll kill my cluster by stopping my Jupyter server. I think it's basically a couple of seconds, or maybe a couple of tens of seconds. Once the Jupyter notebook server disappears, which should be any second now... it has to wait for the timeout, because there are no shutdown scripts in that image; that is also one thing that we need to fix. When it goes down, the Spark cluster will disappear as well, and then I can start again. In the meantime, we can take another question or two if there are any.

Where is your data kept? The question was: "I have an HDFS cluster; where is your data kept?" We have the Ceph cluster backing the OpenShift cluster, and the Ceph cluster is basically where we push and pull data from. So yes, you could just connect your HDFS cluster instead; it doesn't really matter what technology you choose for that.

So the server is gone. I'll just go here, I click "Start My Server", I pick the Spark image, and I'll go back here, and you'll see that my Jupyter notebook is starting. Basically immediately I got the two workers running, and the master node takes a bit of time because it needs to connect to the workers and figure it out.
It depends on whether you are starting it for the first time: there is some time that the container images take to download onto the node. But on a second start, since the images are the same for every user, once they are downloaded on the node it is basically an instantaneous start. I don't know why the master takes so long now, but I think it's running, it's fine, it just didn't update the UI. So it was basically an instantaneous start.

Next question: the notebook spawner, does it scale up to multiple nodes, or do you have to configure that on the JupyterHub side? So, the spawner is not doing anything smart. It just generates the pod definition and pushes it to OpenShift, and OpenShift schedules the pod. That basically means it's up to the OpenShift scheduler to schedule these, so it would distribute them across the cluster; it wouldn't put them all on a single node, depending on the size of and load on the cluster. I don't know the implementation details of the OpenShift scheduler, but it is based on OpenShift scheduling, so it would be distributed. And the same goes for Spark. We could configure it in a way that it would have some affinity, like "put my Jupyter server close to the Spark workers", but that doesn't really bring anything, because we are pulling the data from Ceph over S3 or similar, not sending it from the notebook server.

I think there was another question: how mature is the Spark operator? It is quite new; I think it's a couple of weeks old. But honestly, there are things missing, and I still miss some of them.
I've already filed a couple of feature requests. I was missing some configuration options: at the beginning there was no way to set resource limits for the workers and for the master and things like that. That's there now, and I have a couple more feature requests in the queue, like being able to force-update the images and being able to configure these values. But generally the core behavior, spawning and killing the cluster, is working very well; I haven't had an issue with that. The guy who works on the Spark operator actually built a library in Java, I think he calls it jvm-operators, which is a library that you could use to build other operators. He's trying to get that very stable, and then the Spark operator would be lifted onto that.

Okay, I think we are out of time anyway. We have one more minute, so if there is a question, let's hear it first. Thank you very much, and enjoy the rest of the conference.