Thanks for joining me today. My name is Michael McEwen, I work for Red Hat, and I'm going to be talking about building Apache Spark cloud services in Python.

To start with, let's talk about what a cloud service is, or at least what I mean by a cloud service for the purposes of this talk. I'm relying heavily on definitions put out by the Cloud Native Computing Foundation, the governance body that helps steward projects like Kubernetes and Prometheus. They say a cloud native application should be containerized: we're talking about building for containers and deploying to Kubernetes, so that's obviously a strong part of the story. They also say a cloud native application or service should be dynamically orchestrated, meaning a container platform can manage it and migrate it between instances; it should be generically useful in those situations. Finally, they say it should be microservice oriented. That's a nebulous, even philosophical point, but the way I take it goes back to the old Unix philosophy: a microservice is an application that does one thing and does it well. So really we're talking about purpose-built applications that you put into a container and deploy to the cloud. Here's a link to their FAQ; it has some great language about what these platforms and applications mean, and I highly recommend checking it out.

Since we're talking about cloud native, the platform I'm using is OpenShift, which is built on Kubernetes. How many people here are familiar with Kubernetes or OpenShift? Okay, a pretty good audience. This is the generic diagram many of you have seen, with the container network and all the infrastructure pieces. But what I'm most interested in is the developer-facing part over here: how it connects to my source code repositories and how I can get automated build features out of it. The infrastructure side is really interesting, but the developer workflow is what I want to talk about today.

(Looks like the images aren't loading; let me switch browsers quickly. Sorry about this.)

Okay. So: Apache Spark. How many people here are familiar with Apache Spark? Again, a good number. Then this diagram might look familiar. It's the general outline of what an Apache Spark application looks like. There's the driver process, where all your user code lives; a number of executors that help perform the distributed work; and underneath it all, a cluster manager that controls how these pieces interact with each other.
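In code terms, the driver is simply your Python process: the one that creates a SparkContext and hands work out to the executors. A minimal sketch of that setup, assuming only that the pyspark package is installed (the app name is arbitrary, and "local[*]" stands in for a real cluster manager by running executors as local threads):

```python
from pyspark import SparkConf, SparkContext

# The driver process: creating a SparkContext connects us to a cluster manager.
# "local[*]" runs the executors as local threads, one per core.
conf = SparkConf().setAppName("example-driver").setMaster("local[*]")
sc = SparkContext(conf=conf)
```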
I'll go through some of this a little quickly, since it seems like a lot of you already know it. The fundamental abstraction at the base of Spark is the resilient distributed dataset, or RDD. This is the primary, base-level abstraction for all the data, and it's a partitioned, lazy, immutable, homogeneous collection. So what does that mean for the resiliency and the distributed nature of these things? They're partitioned, meaning that when data comes in, Spark splits your dataset into partitions, which makes the data easy to distribute. They're lazy, meaning that calculations on those datasets don't happen until a result absolutely has to be returned. And they're immutable, meaning you can't change the dataset; you can make a new one, but you can't modify the one being distributed. Together, these properties build resiliency into the system. If one piece falls out, it's very easy to recalculate that partition and send it back out to do work, and that's a big part of what makes Spark a stable platform for these kinds of calculations.

So what does this look like in action? Say you have an array of numbers and you want to find out how many of them are even. First, I tell Spark to parallelize the dataset: it distributes the numbers and creates the RDD for me, with each number in its own partition. Then I say, perform this filter operation on my dataset, which gives me a new dataset. In this case I keep any number with modulo 2 equal to 0, meaning it's even, so I end up with 2 and 4. Then I count it, and now I know how many even numbers are in my dataset. This is a very simplified look at the operations you might do, but it's how RDDs are built up and how they're distributed to do work.

When we take OpenShift and the Spark model together, this is what it starts to look like at the infrastructure layer. We have the physical nodes, where the kubelets live, and inside those nodes we have our container pods. Over here is a Python application, maybe with some Spark containers; these are the executors, and this is our master. And maybe there are more applications, a Mongo database. You can start to see how we can use the platform to distribute the different processes that are occurring for us.

So what is a Spark application? This is how I generally reduce these things: source data comes in, you perform some sort of processing on it, and you return results. Now, source data and results can both be very nebulous. Source data might come from a database, a file on a filesystem, a stream, or even an API call where data is pushed in; results can mean all the same things. Some of these are abstractions you'll have to deal with. What we're looking at here is a very simple Python Spark application that does roughly what we just walked through, something like the sketch below.
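Here's a minimal sketch of that evens application. The variable names are mine, and it assumes a SparkContext named sc like the one above:

```python
# Build an RDD from a plain Python list, filter it, and count the result.
numbers = [1, 2, 3, 4, 5]
rdd = sc.parallelize(numbers)               # partitioned, distributed dataset
evens = rdd.filter(lambda n: n % 2 == 0)    # lazy: no work happens yet
print(evens.count())                        # count() is an action: work runs, prints 2
```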
It takes in a dataset, parallelizes it, runs our little evens-counting function on it, counts the result, and returns it. If I run this on my desktop, this is roughly what it looks like. I use the spark-submit command, a tool that ships with Spark, and tell it to take my application and, in this case, report how many of the numbers from zero to a thousand are even. It's a very simple process: it starts running, the JVM spins up, and it spits out a pile of logs. As an application, this probably isn't that useful to me yet. What just happened? What do I have as output? How many numbers were even? Well, way up in the logs it printed a little line saying 500 of them are even.

Okay, so how do I use this in a cloud situation? How do I go from running it locally to taking it to the cloud? I start by thinking about how to design my application so it can become a microservice, and I come back to this pattern: ingest data, process it, publish results. Ingest and publish can be very nebulous, and they'll depend on the systems you're designing for. You might read from a database to ingest data, and publishing might mean sending a call to an API that another service exposes. Likewise, that could be inverted: another service might call your microservice on an API, and you might publish data to a database. These will change depending on what you're doing, but in general you'll be operating on this type of model.

As you think about building these applications, you also need to consider the structural needs of what you're building. Depending on what cloud you're using (it may be Kubernetes, it may be Mesos, it may be something else), you have to know: how will I deploy my application? How will I command that application from the outside, and how will I control it? What tools does the platform provide for that? Likewise, where is my input and output data going to come from and go to? These are dictated by the systems you're working in. There's really no good way to cover them generically, except to say that you're going to have to consider these pieces.

What I'd like to talk about are three common architectures that I've come across, and I have a feeling many of you will come across them as well: on-demand batch processing, continuous batch processing, and stream processing.

First, on-demand batch processing. Generally this means some event occurs which kicks off processing; I take in the data I want to process and I create results. This happens every time it's triggered, and the trigger could be a cron job, an API call, or a user visiting a website. So when might you want this pattern? These are the top situations that occurred to me as I was thinking about it. First, when you have non-deterministic request windows. Think about a user who visits a website like Amazon.
They're going to be clicking through products, just looking for something they might like to buy, and every time they click on a product, some sort of rating comes back: a number of stars, or a "we suggest this product for you" or not. This is the type of area where an on-demand call is what you want. You may not have these things pre-calculated; you may have a model that contains the data, but you'll need to filter it at the moment a user actually hits it, and you won't necessarily know ahead of time that the user is going to do that. Second, when you have quick results to calculate, something that can be returned very quickly, because if a user is visiting a website, they're not going to wait a minute for some long processing cycle before results come back. And third, when you have a lot of situational dependencies. Again, think of a user visiting a website. If your processing depends on the user entering information before the request can be performed, that's a situational dependency, something you can't pre-evaluate, or that's very difficult to pre-evaluate. That's another scenario where on-demand might be what you want.

So what does this look like? I'm using an example that I call the hello world of Spark, because it's been in the Spark code base forever. I looked just the other day at something like Spark 0.1 alpha, and this was one of the only examples in there. It's a Monte Carlo method for estimating pi: you throw darts at a dartboard, count how many land inside the circle versus how many land outside, and that ratio approximates pi for you. This is just a function that does that, and you can see we're doing the same kind of operation we did before. We're parallelizing a range of numbers, in this case the number of random points I'd like to use to calculate pi, then mapping a function onto it and reducing that. So now I've got a function that can give me an estimate of pi every time I call it.

What I might do is embed that into an HTTP server. If you're familiar with Flask, the Python HTTP framework, this is what it looks like to embed the function in a response. Now I've got a REST-based service that can take a request and give back pi whenever I need it. That's just one way to do it; you might use gRPC or some other remote procedure call mechanism. A sketch of the Flask version follows.
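Here's a rough sketch of that on-demand pi service. It's an approximation of what was on the slide, not the exact code; the route and names are mine, and it assumes pyspark and flask are installed:

```python
import random
from operator import add

from flask import Flask
from pyspark import SparkContext

app = Flask(__name__)
sc = SparkContext(appName="sparkpi")

def estimate_pi(samples):
    """Monte Carlo pi: throw darts at the unit square, count hits inside the circle."""
    def in_circle(_):
        x, y = random.random(), random.random()
        return 1 if x * x + y * y <= 1.0 else 0
    # Parallelize the sample indices, map the dart throw, reduce to a hit count.
    hits = sc.parallelize(range(samples)).map(in_circle).reduce(add)
    return 4.0 * hits / samples

@app.route("/pi")
def pi():
    # On-demand: every HTTP request kicks off a Spark job.
    return str(estimate_pi(100000))
```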
So let's move on from that and talk about continuous batch processing. I look at this as the scenario where you have data sitting in some sort of data source, and your processing is always occurring. There's no need to trigger it with an event, because you want this process to always be running and always producing results to wherever your store is.

When might you use this pattern? First, if you have data that updates very frequently. Say you have a database of users who are always giving ratings, and you want to be creating recommendations off those ratings, continually, because that information is always arriving; there's probably never a time I don't want that to be running. One way you might use this is when you're creating machine learning models for evaluation. Think about a recommendation engine where users across all your products, like Amazon, are always rating what they like and what they don't like. You can continually create models and evaluate how they perform against what you have in production. As users add more data, are my models getting better? Are they getting fresher? For a recommendation system, you probably always want the model to be fresh, because users keep adding information. Another situation is what I'm calling lifecycle management processes. A lot of the work that gets done in distributed computing is data engineering: transforming one schema into another, or taking one format and turning it into another. This is something you might just want continually running. Say users are putting in text data nonstop, and you want to make sure no one is using words on your banned list. You want that always running, always producing results.

Now, what might this look like? This is a piece of code from a recommendation engine that one of my colleagues, Ruri, sitting in the back there, helped to write. It's part of a model generation service. At the top there's a while-true loop. It does a database select to pull the ratings out of our database, and there's a bunch going on here, but the part I want to highlight is where we take the new ratings from the database: since the last time I created a model, I want all the ratings that have been added. I create an RDD out of those, then do some processing to create a model from them. It's always running, and if it sees changes to the source data, it creates a new model and puts that model into the database where I'm storing models for later use. A stripped-down sketch of that loop follows.
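This sketch just shows the shape of such a loop. The database helpers (fetch_ratings_since, store_model) are hypothetical stand-ins, and I've used Spark's ALS recommender as a plausible model builder; the real service's code differs:

```python
import time

from pyspark.mllib.recommendation import ALS

last_seen = 0  # id of the newest rating already folded into a model

while True:
    # Hypothetical helper: rows of (rating_id, user, product, rating)
    # added to the database since last_seen.
    rows = fetch_ratings_since(last_seen)
    if rows:
        ratings = sc.parallelize([(u, p, r) for (_id, u, p, r) in rows])
        model = ALS.train(ratings, rank=10, iterations=5)  # build a fresh model
        store_model(model)   # hypothetical: persist the model for the serving side
        last_seen = max(_id for (_id, u, p, r) in rows)
    time.sleep(30)           # poll interval; tune to your data velocity
```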
The last pattern I want to talk about is stream processing, and this is something I find very intriguing; I think it's a really cool way to work with data. Here you have a stream of information that's always flowing. Think about Kafka or AMQ or those message-bus type applications: data is always coming in. And this is a little different from continuous batch, because my processing reacts every time data comes in on that bus, within some sort of window, and then the results are stored somewhere else.

When would you use this? One case is real-time event processing. Think about an IoT situation where you have sensors on public transportation: you want to follow where all the buses in your system are going, whether any of them are running late, those types of things. You have a continuous stream of information coming in, and you always want to be updating on what's happening. Another case is when you're working with systems built on a broadcast messaging system. I like to think about the Fedora message bus. How many Fedora users do we have in the audience here? Okay, a couple. Fedora, since it's a large community system, has a federated message bus where messages come in from the build systems for all the different packages being created, from mailing lists, all sorts of information being aggregated. In that situation I would obviously want to build against the message bus, because that's the architecture of the system I'm working on. And another level of this is what people are calling Kappa-style architectures. This is a way of looking at stream processing where your input is a stream, processing happens, and your output always goes to another stream. You end up creating message-bus scenarios where one topic might be your clean data, then a process runs, maybe changes the schema or pulls out some specific information, and relays it to another stream. That allows you to build up really complex hierarchies of applications that don't need to depend on each other; all they really depend on is the message bus.

Spark has a really cool API for this called structured streaming. There are multiple ways to do stream processing in Spark, but I think structured streaming is the really interesting one. You build up a set of instructions. You tell Spark where the information comes from (here, a Kafka broker) and what to do with it. In this case it's really simple: take every value that comes in on the stream and cast it to a string. Then everything that comes in gets grouped by value and counted. So this application is doing a word count on a stream: it looks at every string that arrives, groups the identical ones, and counts them up. Imagine a stream of words going by; this just counts them. At the bottom, I tell it where to put the output of the stream, in this case Spark's in-memory storage, give it a name so I can query it, and tell it to start.

At that point it's doing a bunch of work, but how do I get results out of it? I might have a function that looks at Spark's internal in-memory SQL representation and runs a query on it. In this case, the query asks for the top 10 entries of everything it's counted, the 10 most frequent things, and I can call that function on demand whenever I want information out. A rough reconstruction of all this follows.
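This is reconstructed from the description, not copied from the slide. It assumes a SparkSession named spark, Spark 2.x with the Kafka source package available, and a local broker; the topic and table names are mine:

```python
# Read a stream of messages from a Kafka topic and keep only the payload as a string.
lines = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("subscribe", "words")
         .load()
         .selectExpr("CAST(value AS STRING)"))

# Group identical values and keep a running count: a word count over the stream.
counts = lines.groupBy("value").count()

# Sink the running counts into Spark's in-memory table, named so we can query it.
# (For the Kappa variant described next, the sink becomes .format("kafka") with
# an output "topic" option and a checkpoint location instead of the memory sink.)
query = (counts.writeStream
         .outputMode("complete")
         .format("memory")
         .queryName("word_counts")
         .start())

def top_ten():
    # On-demand query against the in-memory sink.
    return spark.sql(
        "SELECT * FROM word_counts ORDER BY `count` DESC LIMIT 10").collect()
```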
That brings me to the next way you could do this, and it's an example of what the Kappa-style architecture looks like: you have a stream coming in, you have processing happening, and then you have a stream that it writes out to. If we look at a function similar to the one we just saw, the top part is all the same; it's doing the same work. But at the bottom, where it writes the output of the stream, I've now told it to put the results onto another Kafka topic. What this means is that I don't have to worry about a routine that pulls the information out with a SQL query. I could have another microservice that just listens on that second topic, automatically receiving all those counts, and I could do whatever I needed there: aggregate them, or do something different at that point.

Okay, so we've talked about some different patterns you might use. How do I take these from that desktop example and bring them into the cloud? I want to take it from source code, turn it into a container, and push it into my orchestration platform, and at the same time I need a way for users, or myself, to still get in and out of it. The group I work with at Red Hat has a community project called radanalytics.io, and we've created some tooling there, a project we call Oshinko. It allows us to use the source-to-image workflows in OpenShift to say: take my source code from a Git repository, use source-to-image to build it into a runnable image, and when it gets deployed, a Spark cluster goes with it, bound to my application's lifecycle. In this way, I don't even have to manage Spark anymore. I can use the workflow I'm used to in OpenShift, going right from my code: pushes to my code might go through a CI testing framework, and when that's successful, it gets deployed onto OpenShift and a Spark cluster appears, bound to it.

So, assuming nothing else goes wrong, I'm going to try and demonstrate this real quickly. I've got a small GitHub repository here; it's a tutorial you can find on radanalytics.io, and it's a web microservice version of that SparkPi example, an HTTP service I can query on demand to get a Spark calculation. You can see my repository is pretty simple: there's a README, my app file, which is not overly long, just a little Flask application, and then the requirements file, like any Python application might have. What we're looking at now is OpenShift. This is my project, and I've already taken the liberty of loading the radanalytics templates I'm going to use. So I select from my project the template I'd like: launch Apache Spark Python. I click through the description, and now it's asking me for some information. Maybe I'll call my service sparkpi, and it wants the URL of my GitHub repository, so I'll copy that in. There's a bunch of other options I could use here, if I wanted to build from a branch, say, or from a subdirectory; these options help you control how the application gets deployed.
But my application is written in a very simple manner, so I don't need to fill in most of these. Likewise, at the bottom, I could adjust how the Spark cluster gets deployed and change the options that go to it. So I'll click create. What we see now is the build running on OpenShift, and if I look at the logs, this is a pretty standard Python build process, and then it pushes the image to the internal registry. At this point my pod is up and running, and what's happening is that the Spark cluster is being deployed with my pod. It's automatically bound to my application, with a randomly generated name. The last thing I need to do is expose a route so I can get to it. Now if I click on this, hopefully it'll work. Okay, so I hit the root endpoint, and it tells me the Python Flask SparkPi server is running and that I need to add a path to get more information. If I add that path, you can see there's a wait going on now. This goes back to the quick-results point from earlier, right? What's happening? All I can see is a little spinner, but eventually it comes back and gives me a really bad estimate of pi. So don't navigate to the moon with this or anything, but it's fun to play with.

And something I want to point out here that I talked about before: now that my application is linked to this GitHub repository, you can see I've got this thing called a build here. I don't have my webhook set up, but if I did, I could push a change directly to my Git repository, the webhook would hit OpenShift, and this would rebuild automatically for me. Even without the webhook, if I made a change to the repository, I could hit "start build" and it would run, deploy again, and attach to the Spark cluster again. As a developer this is really nice, because I can very easily test my changes out, and I can do it even in a private project.

Let me switch back here. Okay. So you saw that when I made that web request do the work, it took a little while to come back, right? This is one of the problems you're going to run into when designing these types of services: the synchronicity issue. I make a request for pi, and now the service is off doing something, and if I tell it to use too large a dataset, that could take minutes to come back. You don't want that result arriving after the user has walked away from the terminal. To mitigate this in our designs, we like to separate the API concerns from the actual processing concerns. A common way to look at it: the main process is the API, and whenever I make a request for a new pi estimate, instead of giving me back the estimate, it immediately gives me back an ID, and I can use that ID to query for the results as they happen. Or perhaps the application on the other end uses a WebSocket: I make the request, and the main processing loop pushes the information back to me when it's ready, while in the meantime my application displays some message saying work is happening. Depending on what the API is, you'll have different ways to mitigate this; a sketch of the ID approach follows.
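A minimal sketch of that split, assuming Flask for the API side and Python's multiprocessing for the worker loop. All the names here are mine, not the code from the slides, and estimate_pi is the hypothetical Spark-backed function from the earlier sketch:

```python
import multiprocessing
import uuid

from flask import Flask, jsonify

app = Flask(__name__)
requests_q = multiprocessing.Queue()   # API process -> worker process
manager = multiprocessing.Manager()
results = manager.dict()               # shared job-id -> result table

def worker(q, results):
    # The processing loop: owns the Spark work, pulls jobs off the queue.
    while True:
        job_id, samples = q.get()
        results[job_id] = estimate_pi(samples)   # hypothetical Spark call

@app.route("/pi", methods=["POST"])
def submit():
    job_id = str(uuid.uuid4())
    results[job_id] = None                 # mark the job as pending
    requests_q.put((job_id, 100000))
    return jsonify(id=job_id)              # return immediately with an id

@app.route("/pi/<job_id>")
def status(job_id):
    result = results.get(job_id)
    return jsonify(ready=result is not None, result=result)

if __name__ == "__main__":
    multiprocessing.Process(
        target=worker, args=(requests_q, results), daemon=True).start()
    app.run()
```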
But in general, what you want to start doing with microservices is pulling these concerns apart, to make it easier to deal with each end of them and to address issues like this synchronicity issue. So what might this look like in Python? This is the main process: we've got some code here that sets up queues for the inter-process communication, starts the other process, and sets everything going. It's pretty compact and I don't want to go into every line of it, but what I want you to take away is the top line, which says import multiprocessing. The Python multiprocessing package is really powerful, and I'd say that if you're going to start doing these types of things, read the docs on that package, because the primitives in there are very easy to use. And if that's the main-process side, this is what our processing loop might look like. This comes from a service that responds to incoming requests for recommendations: a user wants a recommendation, and the service produces it for them. It's a big piece of code and I've taken pieces out of it, but the main things to look at are these areas that are probably hard to see, because they're in red and there's a lot of light washing them out: the request and response primitives, the queues I use to communicate back with the main process. Whatever I'm doing in here, I'm using those primitives, and again, they come from the multiprocessing library, which I really recommend checking out.

Another big thing you're going to run into, especially if you're doing Python programming for Spark, is dependency management. Right now, Spark has some really good features for JVM languages that need to distribute dependencies: the spark-submit command has a --packages option you can give Maven targets, and it will pull all those packages in and ship them out to the entire cluster so your applications can use them. With Python, the tooling is a little behind the times. Let's say we've got this application, and we're building another service to tell us how many evens exist in the dataset, our classic hard problem. At the bottom here is where I'm actually doing the work: I tell it to parallelize the data, filter it, and count it. But this filter-evens function is now doing some sort of database action. What that means is that the filter-evens function is going to be distributed to the Spark cluster, so each executor is going to need PyMongo in place, because something like this happens: my main application wants to talk to Mongo, but now I've distributed code that also talks to Mongo, and so that Mongo library needs to be on every one of the executor nodes. This is a situation where you might have to manage these dependencies yourself until the Spark community catches up and gives us better tooling for it. One stopgap is sketched below.
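The stopgap is to ship the code yourself. spark-submit has a --py-files flag, and SparkContext has an addPyFile method, both of which distribute a .py, .zip, or .egg file to every executor. Pure-Python dependencies can be bundled that way; the paths below are hypothetical, and packages with compiled C extensions still need to be installed in the executor images themselves:

```python
# At submit time, ship a bundle of pure-Python dependencies with the job:
#
#   zip -r deps.zip pymongo/ bson/          # hypothetical bundle
#   spark-submit --py-files deps.zip app.py
#
# Or, equivalently, from inside the driver:
sc.addPyFile("deps.zip")   # shipped to every executor and put on its import path
```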
So, let's recap a little bit here. We talked about the design pattern I really like to use: ingest, process, publish. That's the general pattern I like to get into. We talked about some different types of architecture patterns you might get into when you're designing your applications: on-demand batch, continuous batch, and stream processing. And then we also talked about the Oshinko source-to-image project. This QR code up here is a link to this slide deck. Please download it; it's open source, just a reveal.js project, and you're welcome to use what's there. Here's my email and my blog, and please check out radanalytics.io. We've got a bunch of tutorials there, and a lot of the material you've seen here is on that site. So at this point I'll take any questions, and thank you for your time. Any questions? Barma, don't hit me up too hard here, okay?

[Audience] In the first pattern, the client calls a REST service and the response comes back immediately. With the multiprocessing setup you showed, does the client need to stay connected to the same invocation of that request for the processing to happen? Or can there be a separate status function, independent of the first call, that goes back and gets the data directly from Spark?

So, the first example I showed is a very simple one, where you're just hitting it and it makes the request, and if you double up on requests at that point, you're going to break it, because that application is really simple. But once we start to separate the API, the first way I think about it for a REST application is: I make a request to the REST server to start doing work, and the REST server gives me back an ID, and that ID number is the ID of the work being done, right? My processing loop, the second process, may have a queue of work requests that have come in, and it knows about the IDs, so as it does each bit of work, it can take that ID and update the status for what's being done, and the main process will always be able to see the statuses for those IDs. So my client can keep requesting: I make the first request to do work and get an ID back, and now I just query that ID, and the main process can tell me, no, it's not ready yet; no, it's not ready yet; okay, now it's ready, here are the results. That's really what I mean about separating the concerns. Your architecture will dictate how that processing loop queues up information; maybe I have several processing loops, maybe the work gets distributed in a better way. But what I really want is to hide those concerns from what the user is seeing, so they can keep querying what's happening without overloading the Spark work that's being done. Does that kind of make sense?

[Audience] Thank you; we're solving the same problem.

Cool. Any other questions?

[Audience] Can you tell us more about the Oshinko source-to-image builder process?

Sure. What you saw in the demonstration was me exercising the Oshinko source-to-image tooling. What the tooling does is first pull the source code from your Git repository, and then create a container inside OpenShift that runs the build process. That's the log I showed: it was pulling the code and starting to build it.
Once it builds that code successfully, it deploys the built container to OpenShift. The Oshinko tooling is actually inside your built container, and what it does is check: did you request to use a Spark cluster that already exists? Because you can do that; it's one of the options I went by. If you didn't request that, it will spawn a Spark cluster for you, with whatever configuration you specified, and then your application runs. When your application exits, the Oshinko tooling catches the exit and deletes the cluster that goes with it, and then your application may go away as well. So that's a general look at the steps that happen in that deployment. It also associates a service with it and exposes ports, the same behavior you'd expect from the other source-to-image tooling, whether you're using JavaScript or Java or Python or something like that.

[Audience] You described Oshinko a whole bunch without mentioning CI/CD, continuous integration or deployment, or anything like that, but the way you describe it (and I don't know that much about it) it's a very similar workflow. We do something similar with a Jenkins stack that uses Helm charts to deploy things to Kubernetes clusters. Why didn't you work this into a more classic CI/CD platform, and how is it better?

The reason I didn't work it in is that I'm not really diving into the pipelines features that exist within OpenShift. What you've seen here is kind of the lowest level of application development. The next level, if I were going to take this into a more production situation: first, I would put testing in between, so when my application gets checked in to Git, a test could run there, and if it gets rejected, the commit doesn't get merged and doesn't kick off a new build. Another way to look at it is that I could use the pipeline functionality that exists within OpenShift to create a pipeline that says: first check the code out and build it here, then run the tests here in OpenShift, and if all that works, deploy it, maybe to another project. Part of the reason I didn't get into that today is that those primitives start to exist at a higher level; but once you've created your applications in this manner, they become very easy to mix and match. If you're using Jenkins with Helm and Kubernetes, you're doing something very similar. What OpenShift adds is a UX around building the pipeline for you: if you're not a Jenkins expert, or even familiar with Jenkins, the pipeline tooling lets you specify those things using a language that's a little easier than diving into a Jenkins configuration or a Travis configuration or something like that. So it really depends on what you're doing with your application design, and it starts to exist at a higher level than just creating the applications. And the Oshinko tooling helps because, when we're automating these things, we don't need to automate the Spark cluster creation; we can just let the tooling take care of it for us.
So when the pipeline runs the tests, it can spawn a cluster dynamically, run the full integration tests, report on whether they succeeded, and then delete the Spark cluster after it's all done, without me having to build any of that into my Jenkinsfile or something like that. I think we're running low on time, so: thank you. If you've got more questions, come bug me afterwards.

[Host] Thank you, Michael.