talking to us about building AI with Ceph and OpenShift. Thank you, thanks everyone. So, Saturday morning: thanks for not watching cartoons or Netflix and coming out here to see us speak. Today I'm going to talk to you about what we've been doing in the AI Center of Excellence at Red Hat. You've seen a number of us talk before about our strategies with AI and machine learning. You've seen Vašek and Steven talk about JupyterHub, and about Ceph and Spark as well. I'm going to expand on that and introduce OpenWhisk, which gives us serverless actions, so you'll see a little bit about how we chain all of those technologies together to get a nice machine learning pipeline. First off, to do some level setting and ground rules: everyone is familiar with the concept of machine learning and AI, so this is just to show an example of a typical workflow. It starts off, of course, with the data; Daniel mentioned this before. In this case, we have a library of data sets, and from there we want to start to develop a model. That's where a lot of the brains and the intelligence come in: analyzing the data and understanding what you want to do, trial and error. From developing the model, you then go into the exercise of tuning it and using the data that you have to train it. Once you've trained and developed your model, the big excitement is that you get to deploy it and actually see some data coming in. So for today's session, I'll walk through an example of all of those steps in a typical process, so you can see how we can use tools in an OpenShift world to achieve that workflow. The first part of this, as we mentioned before, is the OpenShift framework. How do we do this in a containerized world? Well, in this case, we're using OpenShift.
OpenShift, as you all know by now, is a container platform built on certified, enterprise Kubernetes, and it also allows you to do hybrid cloud. You may have some of your infrastructure in Google, AWS, or Azure, and you may also have it on-prem. OpenShift gives you the ability to seamlessly manage all of that in one ecosystem. Once we have our containerized world, we need a place to store the data. For this example, we'll be using Ceph. You can also use other technologies; the reasoning behind Ceph is really the growing need to separate your data from where your compute actually happens. That's one of those architectures that came out of Amazon's infrastructure, which a lot of you may be familiar with. You had a lot of EMR-type scenarios with Elastic MapReduce and a Hadoop ecosystem. That environment allows you to spin up a Hadoop cluster, process your data really quickly, and then tear down the cluster and not have to worry about maintaining it. Obviously, if you tear down the cluster, you have to make sure the storage is still there, and that's how object stores like S3 came about. We use Ceph because it has S3 capabilities, and we're able to leverage a lot of the technologies that are already built on top of S3. Of course, it has a RESTful gateway, which is nice to integrate with, and it's a distributed system. Some of the other software I'll be using in this demonstration is Spark. You've heard a lot about Spark in a lot of the AI talks and a lot of the container talks as well. Spark is a great engine for processing data, allows you to do batch and streaming, and it also runs on Kubernetes. In this case, we're using the radanalytics.io work that's been done to move Spark into a Kubernetes framework, with the Oshinko and radanalytics Spark engines that we'll be using.
JupyterHub, we've seen several examples of that. It gives us a multi-user console to manage Jupyter notebooks. Users can have many different notebooks, you can have many different users on there, and that lets you do your data science work. It's designed for data science and research and is a great tool for that. That'll be running on Kubernetes and OpenShift as well. And then the last little piece here, which may be a little bit new to you: once you've actually done the data and model work, you want somewhere to deploy it. In this case, we'll be using OpenWhisk. OpenWhisk gives you serverless actions. If you've ever used AWS Lambda, it's a similar concept: it allows you to focus on delivering the real code and not worry about the architecture. So I'll show a quick example of deploying OpenWhisk, then taking the model and the code that uses it and deploying that to OpenWhisk as well, so that you have a nice REST API wrapper around execution of that model. To start off, one of the first things you want to do is make sure you're collecting your data. This is just a little diagram that shows a typical workflow you might have to ingest your data. Internally at Red Hat, we have a lot of different systems that send us data; these are just some examples. We retrieve data from Git, and we also have build logs and CI logs, all of that information coming in to us. We also have a warehouse that houses IT data, customer data, customer feedback, and support tickets. We leverage all of that, send that data into Ceph, and then use machine learning on top of Ceph to process it. We have a bunch of different mechanisms for getting that data into Ceph. Some of it is through Jenkins, some of it is through other types of workflow managers, and it just seamlessly moves that data into the object store. I'll show some examples of that as well.
To start out with Ceph, the first thing you'll do is configure it, from an object storage perspective, to be ready to ingest your data. I'm not going to go through the exercise of installing Ceph; I think some of you have seen that before, but I do have a link on the slides if you want more information about installing it. I'll just step into the part about setting up a user and then, once your user is set up, getting your data loaded into the system. Once you have Ceph object storage installed, you have to make sure the object gateway is installed. Then you want to set up one of your users with S3 access. In this case, it's just a quick command you might run to create your user in the Ceph S3 environment. When you do that, what you get back is an access key and a secret key, just as if you were working in any other S3 environment, AWS or anything else. That access key and secret key are the important parts that let you actually read and write data in the Ceph environment. From there, what I'm going to show really quickly, let's do a clear here, and hopefully you can see this okay. Yeah, it looks pretty good. All right. So I have a bucket in the Massachusetts Open Cloud that we just called Open Data Hub. And since I'm using Ceph instead of AWS, what you'll see here is that I'm actually using the AWS command line interface. You would go through the typical steps of setting up and configuring the AWS command line interface.
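The user setup step above can be sketched like this; the uid and display name are placeholders for whatever you choose:

```shell
# Create an S3-capable user on the Ceph object gateway
# (run where radosgw-admin can reach the cluster).
radosgw-admin user create --uid=datahub --display-name="Data Hub User"

# The JSON output includes the credentials any S3 client will need, e.g.:
#   "keys": [{ "user": "datahub", "access_key": "...", "secret_key": "..." }]
```

Those two keys are what you feed into the AWS CLI (or any other S3 client) in the next step.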
Actually, I'll show that step real quick; Steven touched on this before. You would typically do an aws configure, where you take the secret key and the access key that I showed you in the previous step and populate that information. I'm not going to show my secret keys here, but you get the idea. It has awesome data in it though, trust me. Once I have it all set up, I do an aws s3 ls, and you'll see that my training subdirectory has no data in it. So what I want to do really quickly is upload some training data that I have here, and I'm going to run this command. What it does is upload a tab-separated-value table called training data. This is actually some data we're using for sentiment analysis, and you'll see that as I work through the example. However, there are all kinds of formats for data. I'm just using tab-separated values right here; you could use Snappy compression with Parquet, it can be JSON, it could be CSV files, anything that's typically supported by the Hadoop ecosystem. Now that I've uploaded that, you'll see I actually have my training data there. Great, awesome. I uploaded data, now what? Well, here's where you start to analyze that data. As Vašek and Steven have shown before, you then start to use JupyterHub. JupyterHub is cool. It allows us to integrate with Ceph, actually query the data, and use tools like Spark, TensorFlow, and scikit-learn, all of those frameworks, to process the data. For this example, I'm going to stick with Spark, and you'll see I'm actually accessing the data in Ceph by using the s3n libraries and jars to get access to it. So, a very simple concept there. And what I'm going to do is show you really quickly what we've done as part of the work for the MOC deployment.
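The CLI steps just described look roughly like this; the endpoint URL, bucket, and file names are placeholders for whatever your Ceph gateway exposes:

```shell
# Point the AWS CLI at the access key and secret key from the radosgw user.
aws configure   # prompts for the access key, secret key, and region

# List the (currently empty) training prefix in the bucket.
# --endpoint-url directs the CLI at the Ceph gateway instead of AWS.
aws --endpoint-url http://ceph-rgw.example.com:8080 \
    s3 ls s3://open-data-hub/training/

# Upload the tab-separated training data.
aws --endpoint-url http://ceph-rgw.example.com:8080 \
    s3 cp training-data.tsv s3://open-data-hub/training/training-data.tsv
```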
So here I'm logged into an OpenShift instance. This wouldn't necessarily be a data science role; it'd be more of a DevOps, systems engineer type of role. We've actually created an APB that allows us to go through the service catalog and find JupyterHub down here. I'm going to create a quick project and just name it jupyterhub, and you'll see that project there. Then I'm going to click on this service broker, select my project, and it will start to deploy JupyterHub. Now, there are a couple of options we have here: the database memory, the JupyterHub memory, the notebook memory. I'm going to increase the notebook memory to two gigs and then create this. Cool. Now it's actually starting to deploy my JupyterHub. You'll see it's pending, and then you'll start to see some really interesting things happen once the pods start kicking off here in a second. So the pod will start to initialize. If you saw Vašek's example earlier today, you'll know there are a number of different images that you could have with JupyterHub: we have a TensorFlow image, we have a scikit-learn image, we have a Spark image. So this is actually building all of those images behind the scenes, preparing JupyterHub for you, and also creating a Spark operator. Once we actually start a notebook that has Spark in it, each individual user will have their own Spark cluster that spins up behind the scenes. As that's running, I'm going to shoot over to a different instance of OpenShift that I have, since this one is still starting up. I want to show you exactly what it looks like when you get into JupyterHub, so I'll come back here in a second.
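From the CLI, the equivalent of those console clicks would look roughly like this. The template file name and parameter name are assumptions, since the actual deployment in the demo goes through the service catalog APB:

```shell
# Create a project to hold JupyterHub.
oc new-project jupyterhub

# The demo provisions JupyterHub through the service catalog (an APB);
# a comparable CLI flow would instantiate a template with the same knobs,
# e.g. bumping notebook memory to 2 Gi as in the demo:
oc process -f jupyterhub-template.yaml -p NOTEBOOK_MEMORY=2Gi | oc create -f -

# Watch the pods initialize (image builds, Spark operator, etc.).
oc get pods -w
```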
Once it starts up, it gives you the option to select whichever image. I'm going to skip that step because, for the sake of time, I've already selected a Spark image. In this Spark image, I've created a notebook called Sentiment Analysis Training. I would love to take credit for all the intelligence in here, but actually Subagit on our team was the data scientist who helped us create this. What this code is doing, I have no idea; all I know is the end, where it actually trains a model. So I'm going to run this really quickly, step through the code, and show you a little bit of the magic behind the scenes. First thing: you do have the ability with JupyterHub to add a few more libraries that may not be on the image you're using. In this case, we do have Spark and scikit-learn installed, but we don't have TensorFlow, Keras, and some of the others. So what I've done here is go through and install those on the system, and then a little farther down it starts to import some of those libraries. Here's where a lot of the magic happens with Ceph. You'll see I'm instantiating a PySpark session, and it's actually leveraging the Spark that's already attached to my JupyterHub. If I go to my JupyterHub instance, you'll see there's a Spark cluster for shgriffy; that's me. I've got my Spark cluster and I'm submitting my job to it, but you'll also see there are some other people: shuels, that's Steve Huels, and a couple of other people in the system here. Now, I said s3n earlier, but what I've actually used to connect to Ceph is s3a, the newer connector. In this case, I'm reading a CSV file and doing some basic printing, some validation: hey, here's what the CSV file looks like.
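The Ceph access from the notebook looks roughly like this; the endpoint, credentials, and bucket paths are placeholders, and the hadoop-aws and AWS SDK jars are assumed to be on the Spark image already:

```python
from pyspark.sql import SparkSession

# Build a Spark session pointed at the Ceph RADOS Gateway via the s3a
# connector. Endpoint, keys, and bucket/key names here are placeholders.
spark = (
    SparkSession.builder
    .appName("sentiment-analysis-training")
    .config("spark.hadoop.fs.s3a.endpoint", "http://ceph-rgw.example.com:8080")
    .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read the tab-separated training data straight out of the object store.
df = (
    spark.read
    .option("sep", "\t")
    .option("header", "true")
    .csv("s3a://open-data-hub/training/training-data.tsv")
)
df.printSchema()
df.show(5)  # quick validation: here's what the file looks like
```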
Farther down, I'm using a little bit of pandas, and it shows you what some of the data looks like. That data is stored in Ceph, but now I'm reading it and you can see some of its contents. Then it does all the data-science-y stuff and shows some graphs. I like graphs, so I was like, oh, that looks cool. I don't know what it tells me, but it looks cool. And it continues down; we've got some more graphs here. The interesting thing is, when we're doing the sentiment analysis training, we have to understand whether the accuracy is good. Is this a good model that we're working towards, or are we having trouble with the model and need to refine it? In this case, in the example I'm running through, we don't have much data in the system, so I'm not going to get too much into it, but it shows you some of the cool things we can do with the data. We're showing a little word cloud, where we're actually detecting what's being talked about in the training data. As it goes a little farther down, you'll see it starts to build out the models, and that happens somewhere around here, where it's actually building the model and going to run it. This is where, going back to that diagram, you would do some tuning and manipulation to make sure the model is accurate. Then at the bottom here, once it's actually done, it's almost there, it's at step 83, it's finished. You'll see the accuracy printed out. An accuracy of 69: maybe that's good, maybe that's not. I would say it's probably not good, but I'm not a data scientist. As you get more data in there and train the model, you can do some cool things. So now I've got a great model and I need to do something with it. What's the next step? Well, in this environment, since we already have Ceph, we're actually going to store the model in Ceph as well. That allows us to get access to the model from any number of places.
We can use OpenWhisk, we can use basic Python code, we can use Java code, whatever we want; it's agnostic at that point. To store the model in Ceph, what we're doing here is pickling the tokenizers and the models and storing them there. To see the outcome of this, I'll go back to my AWS CLI and do an ls on the model folder. Let me just shrink this so you can see a little better. Okay, cool. I'll do a list on this, and you'll see I've got a number of folders: data sets, metrics, models, and training. Now I'll look at my models folder, and I see that I have some sentiment data there; then I'll go to the sentiment folder. Sorry, I left out a slash. And now you see, okay, great: I have my model, I have my dimensions, I have my tokenizer. So what do we do with that? We go back to here. Now we want to talk about how to deploy the model and make it useful. In this use case, I'm using OpenWhisk, the serverless framework I mentioned earlier. But again, you can use any number of things. If you wanted to use Argo, or NiFi, or any of those technologies, as long as you can get access to the data in Ceph, that's great. Sometimes you may even want to cache the model, maybe put it into some kind of caching layer, and you can do that as well. The first thing I'm going to do with OpenWhisk is show you a quick and easy way to deploy it. I'll go back to my OpenShift here and create a new project called shgriffy-openwhisk, and you can see that project has been created right there. Then I'll go back to the command line. I'm just going to copy this command really quickly, and I'll explain what it's going to do.
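The model-storage step can be sketched like this; the endpoint, bucket, and object keys are placeholders, and boto3 is just one convenient S3-compatible client:

```python
import pickle

import boto3  # any S3-compatible client works against the Ceph gateway

# Placeholder endpoint and credentials from the radosgw user created earlier.
s3 = boto3.client(
    "s3",
    endpoint_url="http://ceph-rgw.example.com:8080",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

def store_artifact(obj, key, bucket="open-data-hub"):
    """Pickle a Python object (model, tokenizer, ...) into the object store."""
    s3.put_object(Bucket=bucket, Key=key, Body=pickle.dumps(obj))

# e.g. after training:
# store_artifact(tokenizer, "models/sentiment/tokenizer.pkl")
# store_artifact(model, "models/sentiment/model.pkl")
```

Anything that can speak S3, OpenWhisk actions included, can then pull those objects back down and unpickle them.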
And I am just going to use the OpenShift command line interface to actually deploy this. Recently, the OpenWhisk-on-OpenShift work has moved into the incubator at Apache, so if you search for OpenWhisk with OpenShift on Apache, it'll come up. What I'm doing right here is taking that master template and deploying OpenWhisk on top of OpenShift. And what you'll see here is that it instantly started creating a lot of different deployments. You have Nginx; Strimzi, which, if you're not familiar with it, is Kafka on Kubernetes, also worked on by a lot of the Red Hat folks; CouchDB; and a number of other things. What you see in the background here is a bunch of different pods starting to spin up. But again, for the sake of time, I won't wait for this to be done; I have another instance where this is already up and running. Once you have the OpenWhisk environment all set up, you use the OpenWhisk command line interface to deploy onto the system. What I have here is some code that's going to consume that model and run a sentiment analysis on top of it using the model. Up here you'll see some code where it says analyze sentiment; this is where it loads the model and starts to feed in the text that I pass it. Then I have some additional information here where it's just taking some command line arguments. Nothing fancy, just a quick example to show you. To deploy the OpenWhisk Python code I just showed you, you run a couple of commands; I'm going to create an action, and I'll step through exactly what this is doing. In OpenWhisk, everything is an action. So I showed you the Python code, very simple code.
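An OpenWhisk Python action is just a module with a `main` function that takes a dictionary of parameters and returns a dictionary. This is a minimal sketch of the shape of the sentiment action's entry point; the model loading is elided, and the keyword check here is purely a stand-in for the real classifier:

```python
def main(params):
    """Entry point for an OpenWhisk Python action: dict in, dict out."""
    text = params.get("text", "")

    # In the real action, the pickled tokenizer and model would be loaded
    # from Ceph here and used for inference. This stand-in just flags a
    # couple of hard-coded words so the shape of the response is visible.
    positive_words = {"great", "awesome", "love"}
    words = set(text.lower().split())
    sentiment = "positive" if words & positive_words else "neutral"

    return {"text": text, "sentiment": sentiment}
```

Whatever `main` returns becomes the JSON result of the activation, which is exactly what comes back from the REST API later.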
I don't want to have to worry about spinning up my own Nginx, my own web interface, and all of the REST API that has to go along with it. So I can take that Python code, and you'll see what I'm doing here: I have the name of the action, called sentiment-as-a-service, then I have main.py, and I already have a Docker image that has some of the model libraries, TensorFlow and Keras, installed on it. So I'm going to do a quick deployment of that. Boom, there you go, it's already deployed. Now I want to test it, and because OpenWhisk comes with a REST API, I'm going to use Postman. Hopefully you can see that okay; man, that's a really big Postman. I have two different Postman requests here, and all I'm doing in Postman is a POST to a REST endpoint. In this REST endpoint, you see I'm now pointing to the sentiment service and giving it some credentials; these are my OpenWhisk credentials. The body is just some text that I want to pass in and run the sentiment analysis on. I'll send that along, and down here I get an activation ID. Now, you may not have seen it, but really quickly, if I go over here to the OpenWhisk project, it spins up a pod automatically, and there's my sentiment service running in OpenWhisk, doing some analysis right here. It's loading up the models, you can see it loaded the model, and it's actually using TensorFlow to start processing the data. As that's running, I can poll to see what's going on with the action. I'm going to take that activation ID, which is basically like an execution ID, replace it with the one I just copied, and do a GET. When I do a GET to OpenWhisk to see what's going on with that service, it says it doesn't exist until it's actually done, so I'll just keep sending it until it completes.
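Those steps map onto the OpenWhisk CLI and REST API roughly like this; the action name, image name, API host, and credentials are placeholders:

```shell
# Create the action from the Python file, backed by a custom image that
# already has TensorFlow and Keras installed (image name is a placeholder).
wsk action create sentiment-as-a-service main.py \
    --docker example/sentiment-action-python

# Invoke it over the REST API (what the Postman POST is doing).
# AUTH_KEY holds the OpenWhisk credentials shown in the demo.
curl -u "$AUTH_KEY" -X POST \
    -H "Content-Type: application/json" \
    -d '{"text": "What do people think about DevConf?"}' \
    "https://openwhisk.example.com/api/v1/namespaces/_/actions/sentiment-as-a-service"

# The response contains an activationId; poll it until the result is ready.
curl -u "$AUTH_KEY" \
    "https://openwhisk.example.com/api/v1/namespaces/_/activations/<activationId>"
```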
All right, we got a 200 there, and now what you'll see is the result of that action. There's a little bit of metadata that OpenWhisk provides, but at the end of the day, you see here my result: sentiment equals positive. Now, the way we've used this at Red Hat is that we have a sentiment analysis service sitting out there in OpenWhisk, and as people submit their requests, they're actually calling our REST API in OpenWhisk to get results back. We do a lot more than just the sentiment analysis I've shown here; we do entity detection as well, so that we understand, for a given text, what people are talking about. Is there a good feeling about Red Hat? A great feeling about OpenShift? What do people think about DevConf? It gives you a great way to analyze the data. But to bring it all back home: we now have an entire process where we've uploaded the data, built the model, and deployed it. Going back to the original slide, it gives you the pipeline. What we've done here: for storing the data and the models, we're using Ceph; to develop, tune, and train the models, we're using a combination of Spark and JupyterHub; and to deploy the model into the rest of the environment and actually make it usable, OpenWhisk. Obviously, for this use case there are a number of different users you might have in the system. Your engineers and data engineers may be more focused on the Ceph side, your data scientists will be on the Spark and JupyterHub side, and the DevOps folks and data engineers will also be involved on the OpenWhisk side. So you'll have a number of different people involved, but that's teamwork, right? So that's all I have.
If you want more information, we're starting to publish a lot of what we're doing inside the Data Hub to the Open Data Hub project, and just to let you know, a lot of what I walked through here is part of the Open Data Hub project. Stephen mentioned it before with the MOC, and we've seen a lot of it over the past couple of days, so you'll see more and more information and more code being submitted into that repo. Of course, if anyone wants to contribute, feel free to join us. We also have a lot of information about Spark on Kubernetes and OpenShift; you can see that on the radanalytics.io page. You can also take a look at the now-incubating OpenWhisk on OpenShift project and contribute that way. At the end of the day, you can always just contact me: shgriffy at redhat.com. Any questions? That was awesome. Live demo on the Mass Open Cloud, that's the coolest thing ever. Boris, I did that for you. So we're running Data Hub on the Mass Open Cloud, Sherard, and I know the infrastructure we're using right now on the Mass Open Cloud isn't necessarily ideal for machine learning, training models, that kind of thing. If you could have whatever you wanted up there, within reason, what kind of architecture would you like to see? Is it storage? Is it vector processing? Where do we need to go now? Yeah, I think that's an interesting one, and the reason I'm so excited about the MOC is that I'd actually like the users to tell us that. We'll certainly be doing some performance evaluation on the workloads that are happening on the MOC. I have a feeling we're going to explore more about Ceph and the storage on Ceph. There's going to be some knowledge sharing that has to happen, where we have to understand the best way to store this data. I mentioned before, you could have a tab-separated-value table or a CSV file.
But there are other technologies that will let you process the data better, with more columnar-formatted data, like Snappy compression with Parquet, things like that. From the physical hardware perspective, we'll certainly want to move more into GPU enablement and FPGA-type enablement of the system, so it'd be great if we start to leverage some of those technologies as well. That will allow everyone to work faster. I think the nice blend is going to be having the right storage for your data, but then also having the right compute horsepower to make it happen. Any other questions? My question is mostly around data acquisition. In your example, we're using training data. What kind of data normalization or cleansing components can be used in the OpenShift enablement? Good question. The simplest answer to that: Spark is great for ETL. We've used it internally at Red Hat for a number of our products, and I think in this use case it's great as well. The nice thing being, again, you've separated the storage from the compute. So you can have Spark, you can do your cleansing, you can do your manipulation of the data, and then what you'd want on top of that is more of a workflow manager to manage those Spark jobs as the data flows through the system. Some of the other technology we're looking into is the ability to do Hive on Spark, so you can take a Hive job in Kubernetes but use Spark as the framework to execute it and manipulate the data from there; a lot of people are just more comfortable using Hive than Spark. There are some other technologies we're looking at, but that gives you a good baseline to go off of. So you mentioned OpenWhisk on OpenShift as kind of an alpha project. Yes. So what's missing there? What major problems am I going to run into if I try to use it for something I depend on? The major issue I've seen, honestly, is how fast it's being developed.
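A Spark cleansing job of the kind described above could be sketched like this; the paths and column names are placeholders, and the session is assumed to be configured for s3a access as shown earlier:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Minimal ETL sketch: read raw TSV from the object store, drop bad rows,
# normalize the text, and write the cleansed data back as Parquet
# (columnar, Snappy-compressed by default).
spark = SparkSession.builder.appName("cleanse-training-data").getOrCreate()

raw = (
    spark.read
    .option("sep", "\t")
    .option("header", "true")
    .csv("s3a://open-data-hub/training/training-data.tsv")
)

cleansed = (
    raw.dropna(subset=["text", "label"])            # drop incomplete rows
       .withColumn("text", F.lower(F.trim("text")))  # normalize case/whitespace
       .dropDuplicates(["text"])                     # remove duplicate samples
)

cleansed.write.mode("overwrite").parquet("s3a://open-data-hub/datasets/cleansed/")
```

A workflow manager would then sequence jobs like this one ahead of the training notebook.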
I'm actually excited by the fact that it's moved to the incubator, because before that, if I deployed it today versus next week, things might break and there could be some inconsistencies. So I think the incubator will give us a better chance to version things off. It has support for multiple languages: Java, Python, Node.js, and some others. The missing thing may just be putting it through the wringer of a real use case and seeing how it scales. We're also working through the exercise of hooking up Prometheus to a lot of this, to make sure we can monitor the system. So I think it's finally in more of a stable state; the development has been going on for a year now, so it's starting to level off and you get a little more stability there. Okay, thank you, Sherard. Amazing talk. Thank you, thank you. A reminder to everyone: the party is tonight. It is in the Ziscon Lounge at 7 p.m. It is going to be the funnest party ever, full stop. So please do show up; we actually have lots of fun things, I'm not making that up. Chloe is on the fun committee; she can tell you more if you want to ask her. Anyway, please do turn up for that, and also don't forget the keynote tomorrow morning with Chris Wright and Saran, what is her last name? My mind is jelly. Anyway, it's going to be really good, so turn up for that as well. Thanks.