Hello and welcome to the first big data SIG on OpenShift Commons. I'm Diane Mueller. This is the first big data SIG of 2017, I should say. We had a couple of them last year and the recordings of them are up on our YouTube channel, so we'll post those links in the chat and you can find the past ones as well, along with a few briefings on big data. Big data has been a very hot topic, and Apache Spark on OpenShift has been a very hot topic. So today we are lucky to have with us Michael McCune, who is going to talk about building cloud-native applications using Apache Spark on OpenShift and all that goodness. And so without any further ado, I'm going to let him get started. You can ask questions in the chat. The session should take about 20 to 30 minutes of his talking, depending on how long-winded he is, and then we'll have Q&A and an open conversation afterwards. So, Mike, without any further ado, go ahead.

Thanks, Diane. Yeah, so I'm Michael McCune and I'm going to talk about building cloud-native Apache Spark applications on OpenShift today. A little bit about me for background: I've been in software development for a couple of decades now. I've worked just about all ends of the stack, from embedded programming up to virtual machine orchestration and whatnot. I joined Red Hat's emerging technologies group about three years ago, and I started working on the OpenStack Sahara project, which is a data processing service for OpenStack. Then I slowly migrated over to working on the Oshinko project, which is development we're doing around deploying Apache Spark on OpenShift.

So here's a forecast for the talk we're going to go through today. I'm going to start by talking about how we build application pipelines and what goes into that. Then I'll move into a case study of an application we built called Ophicleide. After we talk about that, I'll get into a demonstration of the application, and then we'll talk about the lessons we learned and, finally, where we might take that application as our next steps.

So the inspiration for this talk, and a lot of the development we're doing, touches on some larger themes for me, and these fold well into the OpenShift paradigm. The first of those is developer empowerment. Giving the developer the ability to deploy their applications into this platform and make them available for others is a really strong theme for me. Along with that, and with OpenShift, comes improved collaborative effort as well. As we get into the demo, you'll see how we're able to push code changes directly from our GitHub repositories into deployments, and that really helps with the way we collaborate with others. And then finally, operational freedom. A lot of times it's difficult as a developer to get changes pushed up into a live deployment, and in this respect OpenShift gives the developer a lot of freedom to do those things within their own contained project.

So, cloud applications. What are we talking about here? A lot of times there are a bunch of different services that you want to integrate, and you need to bring them together in a way that they really work. You can see here I've got a slew of things that you might want in one of your big data applications: Spark, MySQL, Kafka, all these different kinds of things.
And you oftentimes need a lot of deployment flexibility in the way you put these things together. You need to be able to bring pieces in and take them out; one day you might be using Kafka and the next day you might want to swap that out for ActiveMQ. So you need to be able to work with that. They can also oftentimes be really challenging to debug. That's something that just comes with working in the cloud. We're used to applications where we can maybe SSH into a server, check out its logs, and do all these other things, and as we move to a container-based, orchestrated environment, those debugging pathways need to change a little bit.

And so before I start engineering an application, before we start writing all the code, what I like to do is plan out what's going to happen. In some respects this almost looks like a creative effort, like you're making a movie or something like that, but I like to storyboard out what's going to happen and identify the moving pieces. In this case, you can see an example of what we might be talking about here: we might want Node.js, we might want Python, Spark, MongoDB, and finally some HTTP. And we start to storyboard out how the data is going to flow through our application and what's going to happen. During this process, what I like to do is visualize not only the success, what happens when your application succeeds, but also what will happen if your application fails. That's really important to the debugging effort and to figuring out how your users and your developers will interact with it.

Getting deeper into the planning part, we start to think about what insightful analytics means and how we will create a data processing application. A colleague of mine suggested this ingest-process-publish flow, and I really like that. What I try to show here is that as you look at your application and the pieces we just put together, you can see that one part of this is the ingestion part, then we're processing it, and finally we're publishing it. This is really how your applications could be built. As we're doing that, we need to think about: what is the data set we're operating on? How will that be brought in? What does that look like? Then, how will we process that data? What operations will we do on top of it? And finally, where will the results go? How will we take whatever processing we've done and bring it back to an end user? Because it does no good if it never gets out of the processing stage.

So after we've planned the application, we start to think about how we're going to build it. What I like to do is decompose the application components: really break down the different pieces of the application and how we can start to separate them. This really helps feed the collaborative efforts that will naturally occur. As we go back and look at the ingest-process-publish model, we start to ask, where are the natural break points of our application? These different pieces that we have in place, how can we break them apart? And then also building for modularity. Like I said before, you might want to replace a component at some point. We're using Node.js as the front end now, but at a later date we might want to swap that for a different piece of technology.
And so building each of these pieces in a modular way really helps to create a more complete application in the end, something that you can really iterate on as you build it. Then finally, for each one of these pieces, we want to start to think about stateless versus stateful applications. This is probably something you've heard about if you've looked into the container world, the idea of microservices and how you build these things. The idea of a stateless application is one that doesn't necessarily need a large amount of configuration or setup to get it running, or doesn't need to reach a certain state to be useful. Where possible, I like to try and build these applications stateless, because they're much easier to restart and redeploy; there's no complicated setup that needs to occur whenever you're deploying one of them. You can't always do it, but it's what I strive for.

And then really, as you start to think about collaboration and how to put together these multi-component applications, focusing on the communication points between the applications is where you can really start to build strength. So you want to coordinate on the middle points. How do the two pieces talk to each other? In this case, the Python and the Spark application, how do they integrate with each other? Into that, you want to build the idea of network resiliency, and this goes back to the idea of building modularly. Network resiliency means: can one of these pieces drop out and will the application still stay up, or will the whole thing come down? As we can see here, we've got five different components identified, and if one of them disappears, we don't want the whole application to go away; we want to be able to recover gracefully from that. If your Mongo database disappears, it's not expected the application will work the way it should, but it shouldn't all fall down. And then finally, the Kubernetes DNS is very helpful in this respect, because it provides the ability to use regular names to access different parts of your application. If your application is built as multiple components that are each different containers or pods, then using Kubernetes DNS gives us a really repeatable name to refer to those services, and we don't have to worry about the IP addresses associated with any specific application or component.
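As a rough illustration of that point, here is a minimal sketch of one component reaching another through its Kubernetes service name; the service name and port used here are assumptions for the example, not names taken from the actual project.

```python
# Minimal sketch: the front end reaches the REST server through its Kubernetes
# service name rather than a pod IP, so kube-dns keeps the address stable even
# as pods come and go. The service name and port here are illustrative.
import requests

response = requests.get("http://ophicleide-rest:8080/")
print(response.json())
```

If the pod behind that service gets rescheduled or replaced, nothing in this code has to change.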
So how do we collaborate, then, if we're doing this? If we've got all these separate applications and they all need to fit together to form a greater whole, what do we want to think about? We want to think about the right tools for doing this. What I mean by that is: make sure you're using version control, make sure you're using a continuous integration testing framework, make sure you've got some way to coordinate between the different players involved in the application development. Going back to the idea of building modularly, you want to make sure each project can be built in a way where it can be tested separately, and is flexible enough that, like I said before, you can drop pieces out and bring them back in. By really thinking about each component of an application as being modular, you help the collaborative effort. And this may seem axiomatic, but you want to do iterative improvements as well. Don't always try to bite off the biggest chunk possible; make small, iterative changes so that each of the pieces can advance their feature set and still stay in step with each other, so they're not getting too far apart by dropping some huge change that might break the modular approach you're using. And lastly, and this feeds back into the first point, coordinate the actions between the different developers who are involved. This goes back to what tools you have that will help you do that, and how you make sure that everybody is moving in the same direction. Focusing on the middle points between the different applications really helps with that, and this gets you to the point of building these modular applications that can come together.

So now we'll get into a case study of Ophicleide, which is an application that a colleague and I built. The instrument you're looking at here is an ophicleide, and it was the inspiration for the naming of this project. Now, during this case study I'm going to dive into some code samples, and I just want to lay out ahead of time that my intention with the code samples is not necessarily to show you the most innovative and new code out there. What I want to show you is how easy it is to get these patterns working in your own applications. We'll see those as we get to them.

So what does Ophicleide do? At a very high level, it creates Word2Vec models based on text corpora. Word2Vec is a natural language processing toolkit that takes volumes of text and determines a vector for each word, such that you can use those vectors to find similar meanings and do some interesting mathematics to figure out contextual information about the text you're looking at. In the case of Ophicleide, we ingest those text corpora from any HTTP-accessible data source. You can see from the diagram here at the bottom: we've got a browser, that's the user, outside of Kubernetes, and they're talking to a Node.js application inside of Kubernetes. That Node.js application talks to a Python REST server, which is communicating with the Mongo database and the Spark back end, and it is bringing the text data in. Once all that's put together, the user, through the browser, is able to make similarity queries against the text corpora they've specified. This is a laundry list of some of the top technologies we used while building this: Apache Spark, Word2Vec, Kubernetes, OpenShift, Node.js, Flask, Mongo, OpenAPI. These were all technologies that helped us build what we've created, and a big shout-out to all the open source communities putting these projects together; it certainly made our job a lot easier.

So, let's get into the deep dive now and the crunchy bits. At the beginning of the talk, I was talking about focusing on the middle points, and that's where we actually started this process. We talked about creating a front end and a REST server that could do the work for us, and we used OpenAPI, which was formerly known as Swagger, to create an API definition. That allowed us to figure out: okay, if I'm writing the front end and Will is writing the back end, then we can at least stick to our OpenAPI definition to make sure we're on the same page with respect to how we're communicating with each other. And so OpenAPI is a schema for creating REST APIs.
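To make that concrete, here is a minimal sketch of a Python REST server that ingests an OpenAPI (Swagger) definition and builds its routing from it at runtime. This uses the connexion library on top of Flask; the spec file name and port are assumptions for the example and not necessarily what the Ophicleide server itself does.

```python
# Minimal sketch: build REST routing from an OpenAPI/Swagger definition at
# runtime using the connexion library (which wraps Flask). The spec file name
# and port are illustrative assumptions.
import connexion

# swagger.yaml maps each path's operationId to a Python handler function,
# so the routing follows the spec rather than being hard-coded here.
app = connexion.App(__name__, specification_dir=".")
app.add_api("swagger.yaml")

if __name__ == "__main__":
    app.run(port=8080)
```

The useful property is the one described above: the front-end and back-end developers can iterate on the shared definition, and the server picks up routing changes without the surrounding code having to change.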
There is a great amount of tooling out there for it, anything from creating the schemas, to displaying them, even to auto-generating code that will create servers and clients based around your schema. And like I said, this was our central discussion point; this is where we began the design process once we figured out what we wanted to do. What we're looking at here, at the top, is a piece of the OpenAPI definition. This is from a YAML file; you could also specify it in JSON. You can see how it gives us an easy way to define the paths that exist in our REST server and the definitions of the data that goes back and forth. We just see a root path here that's going to return information about the server. It's nothing too complicated, but you get a feel for how easy it is to build up these interactions. And then at the bottom is a piece of Python code from our REST server that actually ingests the Swagger definition, the OpenAPI definition, and uses it to build the routing at runtime. So this is a way we could just work on the OpenAPI definition, and the code for the back end could stay the same in terms of bringing that in.

So, as we built these applications, there's a lot of configuration data that needs to go back and forth. I talked earlier about stateless versus stateful, and even in a stateless application you still have to inject some information to get it running the way you'd like. So when we think about what we need and how we build these applications, we start to figure out what configuration data each application needs to start. What you can see from the diagram here is that our Python application needed to know how to get to the Mongo database, and our Node application needed to know how it will speak to the REST server and what address and port that is at. What I'll show here is how we took that information and used OpenShift and Kubernetes to inject it into our applications.

What we're looking at here is a Kubernetes template. OpenShift uses Kubernetes as its platform layer, and so these templates get passed through into that. You can see we're defining the container, in this case for the web front-end application, the Node.js application, and we're passing in environment variables that contain these values. This gets used in turn by OpenShift to create some really nice application launch pages where you can fill these values in if you need to. And on the back end, looking at the code from the server that uses that information, you can see in the top two lines that these just appear as environment variables coming into our application. So in this way we can inject a lot of information if we need to, to inform our application of where it needs to talk and how it might do that.

Now, given that I just showed accessing a database and whatnot, I wanted to bring up the idea of secrets, which are a primitive that exists in Kubernetes and is extremely useful if you need to pass sensitive information around. As we just showed, those were environment variables coming into our container, and you don't necessarily want to put something like database credentials directly into an environment variable; that could be a security risk. So you want to use secrets to do that kind of work. I want to illustrate this because I think it's important if you're thinking about building these applications: you might need to inject credentials into your application, and this is the facility Kubernetes provides for doing that. It's pretty easy. You can see here we've got a volume mount that will create a secret volume within your container, and then we have the container definition that's telling us where we're going to mount it. If you look at the bottom there, you see /etc/mongo-secret is where the Mongo secret will be mounted. And then within the container, within the application that runs in there, those secrets just appear as files, so you can simply read the values out of them and pull them into your application directly. In this case, we pull in the username, pull in the password, and we build up our connection string to the database.
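Here is a minimal sketch of what that looks like from the Python side: plain configuration arrives as environment variables, and credentials arrive as files under the secret's mount point. The variable names, the /etc/mongo-secret path, and the key file names are assumptions for illustration rather than the exact values the Ophicleide templates use.

```python
# Minimal sketch: read injected configuration from environment variables and
# credentials from a mounted Kubernetes secret, then build the Mongo
# connection string. Names and paths here are illustrative assumptions.
import os

# Non-sensitive settings injected by the template as environment variables.
mongo_host = os.environ.get("MONGODB_HOST", "mongodb")
mongo_port = os.environ.get("MONGODB_PORT", "27017")

SECRET_DIR = "/etc/mongo-secret"  # where the secret volume is mounted

def read_secret(key, default=""):
    # Each key in the secret shows up as a separate file in the mount.
    try:
        with open(os.path.join(SECRET_DIR, key)) as f:
            return f.read().strip()
    except IOError:
        return default

username = read_secret("username", "admin")
password = read_secret("password", "admin")

mongo_url = "mongodb://{}:{}@{}:{}/".format(username, password, mongo_host, mongo_port)
```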
So now we'll get into the Spark processing part of it. At a high level, what we basically do is: we have a bunch of URLs, we read the text from those URLs, we split the text into an array of words, and then those words are passed into Word2Vec to create the vectors. I'm going to show you how easy it was for us to set this up and how it really didn't take a lot of code to get it going.

On the Spark processing end, we have this work loop that's happening. It runs within our REST server and looks for requests coming in on its input queue. You can see that the top couple of lines are just some normal PySpark where we're setting up the context for the work we'd like to do. Then we check to make sure we've got a database connection, we do a little setup on our output queue to make sure it's ready, and then we start processing. For each job that comes in, whenever we get something off the input queue, we pull the URLs out and then we run this train function, where we pass in the URLs and our Spark context, and this is where the training really gets done. Once that's finished, we pull out the items, take the words and the vectors, and store them for later usage.

As we drill down into the train function, what you see is that we create a Word2Vec context, and then we want to load each one of those URLs so we can break the text apart. You can see this "RDDs equals reduce" line, where we're applying this URL-to-RDD function for each URL that we've been given. The code sample below is the actual URL-to-RDD function: we open the URL, read the text out of it, and do a little formatting to take out the newlines. Then, finally, we pass the RDD back to the previous function, the train function. And then we just do this Word2Vec.fit, which takes each RDD, and at this point each RDD contains an array of words for each URL that was passed in, and it will do the work.
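Here is a minimal sketch of that train step in PySpark. The helper names, the URL handling, and the findSynonyms call at the end are illustrative assumptions (and, as described next, the actual application answers similarity queries from its own stored word-to-vector mapping rather than from the live model).

```python
# Minimal sketch of the training flow described above: fetch each URL, split
# the text into word lists, combine everything into one RDD, and fit Word2Vec.
# Function names, the sample URL, and the query at the end are illustrative.
from functools import reduce
try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib2 import urlopen          # Python 2

from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec

def url_to_rdd(sc, url):
    # Read the raw text, drop blank lines and newlines, and split into words.
    text = urlopen(url).read().decode("utf-8")
    word_lists = [line.split() for line in text.splitlines() if line.strip()]
    return sc.parallelize(word_lists)

def train(sc, urls):
    # Combine the corpus from every URL into a single RDD and fit the model.
    rdds = [url_to_rdd(sc, u) for u in urls]
    corpus = reduce(lambda a, b: a.union(b), rdds)
    return Word2Vec().fit(corpus)

if __name__ == "__main__":
    sc = SparkContext(appName="ophicleide-sketch")
    model = train(sc, ["http://example.org/austen-complete.txt"])  # placeholder URL
    # For illustration only: query the fitted model directly.
    for word, score in model.findSynonyms("Darcy", 5):
        print(word, score)
```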
Now, on the other end, when the user wants to make a query to find similarities, we have this function here. At this point we're not using Spark to do the synonym finding; what we've done is create an internal mapping of words to vectors. So as a query comes in, we pull out the word we're looking for, we make sure there's a model we can pull against, and then we basically just look through the model, using it like a key-value store, to pull out the values, and we return them to the user once we've found them.

All right, so now I'm going to demonstrate how this works. I'm going to drop out of presentation mode here. What we're looking at is an OpenShift console; I'm using OpenShift 3.4, which I believe will be released relatively shortly. And what I've already set up in it is this thing, Oshinko, which is a project my group has been working on, and it helps us deploy Spark clusters into our OpenShift projects. To start with, in order to deploy Ophicleide, as you saw from the code samples, we need a Mongo database and we need a Spark cluster. So I'll start by adding a Mongo database. We're looking at the application catalog for OpenShift here, so I'll pick a data store, and I'm just going to use the ephemeral MongoDB template. For the most part, I'm going to leave these form entries alone. The MongoDB service name is fine. I will add a database for Ophicleide so it knows where to talk, and I'm just going to give it a simple admin password to make my life a little easier here. If you were doing this for real, you would probably want to use different credentials. So we'll create that, and you can see that with the OpenShift template for Mongo, it actually gives you a random username and password if you didn't supply one, so you could get a nice randomized password set up there. I'll go back to my project here. Okay, that's already running.

What I'll do now is set up the Spark cluster. If you notice what I did there, I'm using the link from our route to the Oshinko web UI, and I go into Oshinko. This is an example of how we use the Oshinko web UI to create these clusters. We've got no clusters present in our project right now, and what I'd like to do is deploy one. So I'll give it a name; I'm going to call it oph. I'll just give it one worker for now, because we don't need too large a setup for what we're doing. If we switch back to our project, we can see that we now have some Spark pods beginning to deploy. We have a Spark master pod up here and the UI associated with that master pod; I haven't exposed a route to the UI, but you could do that. And then we also have a worker pod as well.

Now, at this point, what I'll need to do is start the Ophicleide application. So I go down here, under Uncategorized, and you can see Ophicleide has appeared. As we talked about before, the parameters that I specified in my template get reflected into the OpenShift console here. So you can see I can tell it where I want to go for Spark, and I'll tell it I'd like to use oph:7077, which is the standard Spark connection string. Then for the Mongo database, I'll also give it a connection string here as well, say admin:admin@mongodb. You can see that I'm using the service names that I've applied to my different pods so that I can access them; I don't need to know the IP addresses, the Kubernetes DNS will handle all this for me. This web route hostname, if I needed to modify how I reach this from an external route, I could change to make it look a little different, but I can use the default for now. So I'll create that and go back to the overview.
All right, so we can see that our application is now starting to run, and it's running. Previously, what I had done is set up a couple of builds to actually make the containers that go into this project. If we look in the builds, you can see I had two pods, or two containers, that go into our application, and I allowed OpenShift to build them for me. I set these up ahead of time because they can take a little bit of time, a minute or two. But using these builds is how you can get OpenShift to update your project for you when you make changes external to the project, and we'll get into that after we see how it works here.

So we'll go back to the overview. All right, you see I've got a route here associated; I'll click on that. And sometimes this happens: you run into a problem where it's not loading, and it looks like that's what's happening for me now. All right. That's because it's a live demo. Yeah, I didn't make the proper obeisances to the demo gods, apparently. There we go. All right, so what you saw is that it took about a minute for the DNS routing to go through and everything. Some of this could have been because my application had reported ready to Kubernetes but hadn't quite come up to speed; it may also have been because the edge router hadn't quite put all the pieces together yet to allow me to route through.

But what we're looking at is Ophicleide, and we've got two main views here: the trained models, which is per text corpus that we want to analyze, and then the queries. We haven't run any queries yet, and we haven't made any models. So the first thing I'll do is train a model. The standard example we use is the works of Jane Austen. We've gone to Project Gutenberg, and they've made available the works of Jane Austen, all in one nice text file. So I'll just copy the URL from that, put the URL here, and tell it to train. We can see something's happening now; we've got something that's training. If I go back to OpenShift, I can look into the pod that has our application and go to the logs. We've got our web and our training application, and if I look in the training application, what I'll hopefully see is no errors. And I don't. You can see here it's actually started to run a task against Spark, so this is a way I can confirm that it's actually doing something. You can see that it's run some tasks, and we'll see if it's completed.

So we go back here. This page updates a little slowly sometimes, so I'll just give it a moment. Okay, so our model is ready, and now we can write queries against it. So I'll start a query. I know from prior experience that there are many characters in Jane Austen's books, and I'm going to give it the name of one of the characters, Mr. Darcy, and we'll see what it can tell us about the text. If I run a query, boom, it brings us back this result. We can see that, given the word Darcy, what are some other words that are similar to it in vector space? And we have Wickham, Collins, Bingley, Denny, Willoughby. These are other male characters in the works of Jane Austen, and Word2Vec has created associations based on the placement of the word and its position in the text that give its similarity to this word. Now, I'm not extremely well versed in her works.
But from what I know, these other characters are male protagonists who sit in a similar position as Mr. Darcy. So what you can see is that we could use this to explore different types of data that are available to us online.

Now, for the next part of this, unfortunately I don't have a code change prepared, but if we go to our GitHub project for this and go into the web module: if I wanted to make a change to this project, what I could do is push a code change into my GitHub repository, and then I would go back into the builds I have here. In OpenShift, you can configure these builds to take a webhook from something like GitHub or Bitbucket or wherever your code is stored. You can set those up so that when they receive a new commit, that can hit back to your OpenShift project and trigger a new build. Now, at this point my cluster is hidden behind some firewalls and whatnot; it's not publicly accessible. But what would happen is, if we made a change to the web application, I could come in here and manually start a new build, and by this configuration it would look to my upstream GitHub project, pull the new information, build a new image for me, and then create that.

Now, I didn't set up the code change ahead of time, but what I'll try to do is just run through it quickly. So let's increase the text size here and see if I can do a quick update. I'll go into the directory where the Ophicleide web application is. Okay. And let's make a simple change that will alter the background color on one of these queries. So I'm just going to see if I can pull this off quickly, and hopefully it'll work. Go to the queries HTML, look down to each one of these cards. Now we're really digging here, looking for the background. All right, so these divs right here represent the cards, and what I'm going to do is try to add a blue background, and that's something that should appear for us. Let me just make sure I'm doing the correct thing here. All right. Then we can see what our status is. All right, let me commit that and push it up to GitHub. And now we go back and check our project; all right, you can see my change has come in here as the commit.

What I'll do is go back in and rebuild the web container. So let's start the build. Thankfully, this doesn't take too long, and we can watch the logs as it builds. You can see it's just pulled the change from my repository; you can see "change card to blue". And again, the only reason I'm doing this manually is because my cluster isn't publicly available, so I can't let the webhooks hit my cluster. But in your installations, you could set this up to work internally, or you can have it set up to pull from outside. And you can see it's building the container here, and then shortly it will start to push these layers into the local image registry.
And so now, if we go back to the overview, what we will see is that OpenShift has detected this change and is now starting to redeploy the application for us. This will take a couple of seconds; it'll stop the other one and switch over to the new one. And it looks like, well, it thought it failed, but it actually brought it up. So let's see if our change has actually taken now. I can go back here, and if I go to queries, I think I'll need to reload this page. There we go. We can see that it's quite ugly, and I've chosen a poor color, but you can see how the change that I made was really quick to go through and was automatically redeployed by OpenShift for me.

And so this really is where the whole collaborative effort and the DevOps kind of thing comes into it. You can imagine that, okay, this was a pretty bad choice and we picked the wrong color here. But in a normal application, you would have some sort of continuous integration and deployment going, where you would be running tests and making sure these things are good, and then finally they would get pushed to a repository, that would update the image, and OpenShift would automatically redeploy it for you. So you can really see the power of how this can change just by pushing changes to your code repositories. And I could have done the same thing with the training module if I needed to alter the way the models were behaving, so the analytics component could be changed out as well; it doesn't just have to be the web component.

So let's go back to our slides now. Okay. All that being said, what did we learn from this? One of the things that went really smoothly was the use of OpenAPI. When we used that as our defining middle point, Will and I were able to take it and just run off in our own directions and do what we needed to, and as long as we stayed true to the definition, it made it really easy to integrate changes. He would make changes to the training module, I would make changes to the web module, and it was really easy to just grab a new version and integrate it together, because we had already agreed on how we would talk to each other. Another thing that went really smoothly for us was using Dockerfiles. It's no secret at this point that Docker and containers are extremely useful to the development effort, and by both of us creating Dockerfiles and quick paths to create containers, it made it really easy to prototype this application and work with someone who wasn't necessarily sitting right next to me. I could grab their latest version, build a new container, and test with what they had. And then lastly, the Kubernetes templates as well. Much like the Dockerfiles, the Kubernetes templates made this really easy. If there was a question about how to deploy the application, we could just share a template between us, and that template would effectively put all the objects you needed into OpenShift and give you this rapid uptake.

So, things that required more coordination. I hesitate to say these things went poorly, because they didn't, they went well, but they're things you have to think about. API construction.
While OpenAPI did help us a lot, it also exposed the fact that as you're building APIs, you really need domain experts to be involved so that you're not going in the wrong direction. There were some choices I made in the API that later on didn't make sense; as I learned more about how the analytics component worked, we needed to adjust that API, and that cost us some time later on. Compute resources. Now, for this project the compute requirements were pretty low; we're just doing some simple word calculations. But you can imagine that with a larger application you would be much more concerned about how your compute resources were being used and whether they were being used efficiently. I've got another slide here where we'll talk about that. Persistent storage is another issue that comes up, especially when you're working in a containerized environment: where do we store our databases or our files? How do we make sure they're backed up properly? This is something that's going to be different for every application you create, and you'll have to assess what works for you, whether that's an object store or some sort of network-attached file storage. These are concerns that definitely require more thinking as you build your applications. And then finally, Spark configurations: what do you need for your Spark applications? As you deploy Spark applications to a cluster, oftentimes there are third-party packages you might need, or just extra bits of configuration that need to go to the Spark server. These are things you'll have to coordinate on, and you'll need to figure out how you will deploy your Spark images such that they'll have what you need.

Now, on the compute resources, what I'm really getting at here is that as Kubernetes schedules pods for you, it's going to look out across its cluster and the nodes of its cluster and try to find places where the pods fit, and it will use its scheduler to try to put them in the right places. And for Spark, because it uses the JVM, if you don't specify the memory constraints and the CPU constraints, then one of these pods could just try to hog all the memory on a server. For example, I've seen situations where I would deploy a Spark pod, then look to see what it's doing, and wonder: why is this thing trying to grab 10 cores and 256 gigs of RAM? Because it doesn't know there's a limitation it should be following. And that's really more of a Kubernetes scheduler thing: you want to make sure you mark up your pods to let Kubernetes know what the constraints on those things should be. Another way to do this is to use label selectors. Labels can be used to make sure the pods get scheduled to specific nodes. So if you have a node that has some special features, or you set it up to do all your work, maybe it has really fast drives, or really big hard drives, or a lot of memory, then you could use labels to help tell Kubernetes where you want each one of your pods to land.
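The Kubernetes side of this is setting resource requests and limits (and, for node placement, a node selector) on the pod spec, as described above. On the Spark side, a minimal sketch of reining in what the application itself asks for might look like the following; the master URL echoes the demo, and the specific values are illustrative assumptions.

```python
# Minimal sketch: explicitly cap what a Spark application asks for, instead of
# letting it default to whatever the node appears to have. Values are
# illustrative; the Kubernetes resource limits on the pod are the other half.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("ophicleide-training")       # illustrative app name
        .setMaster("spark://oph:7077")           # cluster service name from the demo
        .set("spark.executor.memory", "1g")      # per-executor JVM heap
        .set("spark.cores.max", "2"))            # total cores this app may claim

sc = SparkContext(conf=conf)
```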
These are just things you'll have to take into consideration as you build your applications. So where can we take this project next? We could certainly add more Spark to the mix. You saw that the queries weren't using Spark to pull information out; they were just looking into a hash table, basically. We could add Spark there to help do the queries in a different manner. Adding onto that, we could separate the query part into a different service: our front end could use one REST service for doing training and another REST service for doing querying. This might fit a more microservice approach, because if you noticed that you were getting way more queries than training requests, it might make sense to split the query service out so you could put it behind a load balancer and scale it in a different manner than you would the training server. And then finally, this is really a development project, and to take it to the next level we need to put some production controls in here. We would need to make sure there was CI/CD happening before anything got deployed; we would want to make sure that everything was in place before our development branches were being pushed right up to the OpenShift server and redeployed for us. There are a lot of different ways you could do that. Like I mentioned, you want to use continuous integration to make sure your testing is happening on a regular basis, and you might use development projects and production projects within OpenShift, so you know that all the development work is happening in one project and production work in a different project. OpenShift gives you the ability to segregate those two projects into different namespaces so that the two won't mix with each other.

Finally, these are some links to the different projects we used, at least some of the top ones: the Ophicleide application, Apache Spark, Kubernetes, OpenShift, the Oshinko project. There are several pieces of the Oshinko project within our radanalytics organization on GitHub. And then finally, we've just put up a developer portal that you can explore, which gives a little information about what we're doing with the radanalytics effort. So thanks for your time, and I'll look at the questions now. What I'll do is actually switch over to the radanalytics site so you can see what we're working on here.

So this has been really great. There actually haven't been too many questions, so I'm just going to ask out loud: if people have questions, type them in the chat or raise your hand in the chat and I'll open up the mics for you. It strikes me that we need even more sample applications, and that this might be a good time to maybe do a hackathon on Apache Spark and OpenShift. We did a hack-for-health event a while ago, but some sort of thing that encourages people to write more sample applications. This has been a great example. Yeah, that's a great point, Diane. One of the things the radanalytics team is working towards is in our repository for the radanalytics website, which is up on GitHub.
We're going to be adding information about how more applications can be added, and the hope is that we'll be able to create a pattern whereby you could look at Ophicleide and a few of the other apps we have in the works right now and use those as templates for bringing more applications into the fold, because I think we would all like to make this collaborative and get more people involved in what's happening.

And I'm always asking people to do interesting things with some data sets I've worked with in the past, financial data sets and things like that. So I actually think there's something to it. If you get a couple more examples up there, we ought to do some sort of hackathon, hosted maybe by Commons, in the not too distant future, because I think that would really help get more eyeballs on Apache Spark and using it on OpenShift, because you've made it look ridiculously easy. And I know it's not, because I've been trying. No, no, it's easy. It's easy. The works of Jane Austen, and not a complex XML data source, which is what I've been trying to hack on. Typically I pick something that's tougher than it should be. But yeah, I think this is a great example. You could do lots of interesting analysis just with the Word2Vec work here. That might be of interest and would raise the visibility of this work.

There is one question that just came up: oh, which day are you giving your workshop? It's on Saturday. It will be the 28th, I think. Yeah, the 28th, right about one o'clock. If you're in Brno, Czech Republic, come see us. You just have to happen to be in Brno, in the Czech Republic; you just have to be hanging around Moravia.

So I see Christian's question here: did we consider alternatives to the Mongo database? We did, but really, for this application, because we were trying to rapidly develop it, and I think it took us about a month to put this whole thing together, we just decided to use Mongo, because it was really easy to store whole data structures in there and we already had a Mongo template available to us in OpenShift. I know recently there has been more work going on bringing JBoss Data Grid into OpenShift, and I think what you're going to see, probably within the next month or two, is another example application coming up that's a movie-recommender-type service, and that uses JBoss Data Grid on the back end as its storage platform.

Awesome. Maybe when you get that done, we can have some of the JBoss Data Grid folks come on and talk a little bit deeper about what that is, too, and do a combo talk. That would be really nice, because I don't think I've done a JBoss one in a while. Yeah, we have some really cool integrations happening now between the Spark work we're doing and the JBoss middleware stuff. We're trying to look into how Data Grid, especially, can be integrated right now, and I think Zak Hassan has been doing some excellent prototyping in that area. He's got it working to where he has a Spark application that exists in the same realm as Ophicleide and is using Data Grid for all its storage. It's really cool. Cool. Well, this has been very enlightening. Any other questions out there?
Because we're definitely going to get you back, and get a few more of the big data folks from Red Hat. And for folks out there who are listening to this, either the live one or the recorded one: if you have an example of using Apache Spark or other big data tools on OpenShift and want to give a presentation, show it off, or get feedback on it, please reach out to me at OpenShift Commons on Twitter or via the OpenShift Commons mailing list, and I will happily give you a podium to pontificate on, as long as it's educational and of interest to the big data SIG.

The other thing that Mike mentioned: DevConf is coming up in Brno. It's a big open source conference, mostly around Red Hat technologies and the upstream technologies that feed into them. That's coming up at the end of January-ish, in just about two weeks. So I'll be there for that. And then in Berlin on March 28th, the day before KubeCon, we are hosting another OpenShift Commons gathering, and the big data SIG will be having face-to-face meetings there as well. So if you're coming to KubeCon, which is co-located at the Berlin Congress Center on the 29th and 30th, we're going to be there the day before, in the same facilities and the same hotels and everything. So just add a day on and come talk to us, and bring your questions and answers and solutions there as well. It would be great if you could join us for that.

And as always, there are the OpenShift Commons mailing lists. If you're not on them yet, just go to commons.openshift.org, scroll down to the interest sections, and pick the topic of your interest, and we will add you to that topic's mailing list. So again, Michael, thank you very much for taking the time today. Oh, yes, and there's one other thing: Diane Feddema is going to be giving a talk in Victoria, BC. So if you're up in my neck of the woods on the 23rd of February, there will be another interesting cloud-native big data talk in Victoria, British Columbia, which is the seat of our provincial government. So please join us for that. These things will be on the calendar as well, on the OpenShift Commons page. So thanks again, and we'll see you all soon. Thanks, Diane.