So I've been given the thumbs up, so I'm going to just proceed. Thank you for joining us after lunch here, I know it's tough. My name is Michael McKeown. I'm an engineer at Red Hat, and I'm going to talk to you today about streaming functions as a service with Apache Spark and Kubernetes.

A little forecast: I'm going to start by talking about some of the fundamental technologies that went into making this. Then I'll talk a little bit about Spark Streaming, about encapsulating functionality using Python, and about deploying this on Kubernetes. And then lastly we'll talk about looking forward.

So first of all, show of hands, how many people here are familiar with Kubernetes? All right, so most people in the room are pretty familiar. Without going too deep, Kubernetes is a container orchestration system. I took this picture from the Kubernetes docs; it basically shows you what a single host might look like. It's a very abstract diagram: you've got your hardware, your operating system, some sort of container runtime, and then — the important thing — all the containers are running in separate namespaces, doing their own thing. In practice it looks more like this, which is what an OpenShift deployment architecture looks like. You can see there are all sorts of different nodes doing the work, running different types of applications. There's a control plane handling scheduling and the API, and on the outside there are CI/CD mechanisms and source control attached to it. In reality, this is more what Kubernetes looks like when you deploy it.

Now let's get a little deeper. How many people here are familiar with Apache Kafka? Okay, so about half the room. Apache Kafka is a distributed streaming platform. What that means, at a very high level, is that you have something called a Kafka cluster, and you have the ability to have message producers and message consumers, as well as stream processors that can read from the streams and write back to them. What a stream really gives you is a discrete way to pass individual messages through a large architecture, and the connectors are how you might add persistent storage into that mix.

Now let's go one layer deeper. How many people here are familiar with Apache Spark? Okay, still about half the room — we're doing pretty good. Apache Spark is a distributed compute engine. It runs everything in memory, it's very fast, it scales very well, and it's resilient to failures. At a low level, this is what a Spark application looks like. We call this the driver process, and I'll use that language throughout: the driver is where my user code lives — the things I'd actually like to do, the business logic. The Spark session is a special type of object that lets me connect to the cluster, and the cluster manager is the piece of Spark that manages the different executors, which are the actual functional units that run things. There's a lot of depth we could go into here, but basically it's a way to distribute compute tasks. So Spark Streaming is one of the APIs you can use to interact with Spark.
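As a quick aside — and this is just a minimal sketch I'm adding for reference, not code from the slides — the driver-side setup in PySpark looks roughly like this; the application name is a placeholder:

    # A minimal sketch of the driver-side setup: the SparkSession is the
    # driver's handle to the cluster. The app name is just a placeholder.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName("streaming-function-demo") \
        .getOrCreate()

    # From here, the cluster manager schedules work out onto the executors;
    # everything else in this talk hangs off this session object.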
Spark Streaming has two main interfaces. One of them is DStreams — the discretized stream; I'm having trouble with that word today. It's a very low-level interface, I'm not going to talk about it, so let's not worry about it. The other one is structured streaming. Structured streaming is a fluent style of programming where I can set up a path, or a workflow, that will run on every piece of information that comes through the stream. And although I'm going to demonstrate Kafka, there are a lot of different sources and sinks that can be used with this — Kafka, Flume, Kinesis, you could even use a raw TCP socket to connect to these things. There are a lot of other connectors, but those are some of the main streaming ones.

So there are a lot of different ways to work with the streaming interface; this is the pattern I'd like to talk about today. What I'm showing here is the type of work I'm doing: I'm reading information that comes in on a topic, say topic one. Then I want to use my Spark application to somehow filter, transform, or mutate that data, and then I'll replay it onto a second topic. The reason I'm showing the executors here is just to show you how this architecture looks when you actually deploy it.

The reason I like this pattern, and why I want to use it, is that it makes it very easy to start building composable applications — what we like to call pipelines, basically. You could imagine each of the pieces in the middle is a different application doing something to the data stream. Perhaps the first one is just doing some filtering, the second one is taking the data and making a write to an API somewhere, another service might be storing things to a database. You can see how I could abstract this out pretty far, and you could even start to make branching topics, so you have different topics for different types of information. This pattern is very flexible, and it's made even more flexible by Kubernetes, because the container orchestration makes each of those pieces very easy to deploy, to redeploy, to put into our CI/CD pipelines, et cetera.

This is a little bit of Python code. How many people here are familiar with Python? Just a quick show of hands. Okay, so a lot of people. This is Python code showing the very basic way to set up a structured streaming interface. All this is doing — the records part — is reading from a stream. In the top couple of lines I'm setting up my connection to Kafka, and then this select line is where I'm actually selecting the data that's coming across on that stream. The second piece, the writer, is how I'm going to write those records out to a second stream. So at this level, this is the basic pattern I work with: reading from one topic, writing to the next — the sketch below shows roughly what it looks like end to end. The question is, how do I make this more useful for myself?

Now, you remember this line — select, functions.col("value"), and so on. What this is doing is telling the structured streaming interface that I would like to pull something off the stream that's called value, that it's a string type, and that I'm just going to alias it to value, because that's the default name Kafka uses to associate with the data. So that's where I start. But because this is a fluent style interface, I can start to add more instructions onto the end of that chain.
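To give a rough idea of what that boilerplate looks like end to end — this is a minimal sketch, with placeholder broker address, topic names, and checkpoint path, and it assumes the Spark Kafka connector package is on the classpath:

    # Read strings off one Kafka topic and replay them onto another.
    from pyspark.sql import SparkSession, functions

    spark = SparkSession.builder.appName("kafka-passthrough").getOrCreate()

    records = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "kafka:9092") \
        .option("subscribe", "topic1") \
        .load() \
        .select(functions.col("value").cast("string").alias("value"))

    writer = records \
        .writeStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "kafka:9092") \
        .option("topic", "topic2") \
        .option("checkpointLocation", "/tmp/checkpoints") \
        .start()

    writer.awaitTermination()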
So you can imagine, if my data were coming across in a JSON format, the first thing I might want to do is convert that string into JSON data. This functions.from_json is something that comes with Spark, and you can see I've got a little magic in here: I'm going to alias the result to json. So in the first select I'm pulling out what's called value, which is the key I'm using to associate with the data; once I've turned it into a JSON structure, I pass it along as json to the next function that might run. And just for completeness, this message struct is a way to add types in here, so you know exactly what's coming off the stream — you can have a schema that goes along with the message type. In general, Python doesn't have a typing system you really need to use that much — it does, but you don't have to use it. But because all of this is backed by JVM code — Spark is written in Scala, and Kafka is written in Java, I believe — you have to have types; you have to tell it what's coming off of there. And if you're going to look at one thing from this, I would look at pyspark.sql.functions. That package has an armload of things you can use out of the box — from_json was one of them, but there are all sorts of converters and filters and just really useful things you can reuse in there.

How many people are familiar with DRY? And I don't mean it's not raining right now — I mean DRY. Landon, I know you are, man. I know you don't like to do this. Who else? DRY is one of these acronyms that's been in computing for a long time — and probably not just in computing — and it stands for "don't repeat yourself." One of the things I do in my day job is write a lot of these applications to show people how you can do these types of interactions, so I'm rewriting that piece of code I showed you at the beginning over and over and over again. And I started to think to myself: how can I make this easier, so I'm not always writing that boilerplate?

So what I did was I started to think: well, in my Python applications, what if I put an extra module inside the repository I'm building from, and then at runtime I say, I'll try to import it, and if it exists I'll set it to this thing called user_function and add it into my record stream. Whatever that function does — and it could be anything at this point — I'm going to call it, pass it the original string value, and expect it to return a string value back to me. I'm not going to call it if there's no value that came across the stream; if a record came in but it didn't have any value associated with it, I'm not going to do anything. In this way I can inject something at runtime, or at deploy time, that adds a little more functionality, and I don't have to keep rewriting that loop over and over again.

And this is what my module.py might look like. We've got a user_function here; it takes in a value. I check to make sure it's not empty. If it is empty — meaning I got an empty string, a non-null value — I wrap it, mark it as empty, and return it. Otherwise I load the JSON data, mark it as wrapped, and return a dictionary as a JSON-dumped string, and that can go on to the next stage in my processing stream.
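As a sketch — the exact key names here are my own placeholders, but this is the shape of the thing I just described:

    # module.py -- a sketch of the user_function idea described above:
    # string in, string out, with empty inputs handled explicitly.
    import json


    def user_function(value):
        # An empty (but non-null) string just gets marked as empty.
        if not value:
            return json.dumps({"empty": True})
        # Otherwise parse the JSON, wrap it, and return a JSON string for
        # the next stage of the pipeline.
        data = json.loads(value)
        return json.dumps({"wrapped": data})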
All right, great. So I've got this way that I can now make modules and inject them, and I'm thinking about where I can take this — what can I do with it? I still have to make an individual repository for each application I'm building, and I'm building from those, so I'm trying to figure out where to take this next. And Python's great — I mean, some people say it's terrible, but it's also great — in that you can do a lot of things dynamically.

So I came up with this big blob of code here. What it does is load a value at runtime based on a URL that you give it, assume that's some sort of string, attempt to turn that string into a Python module, load that module, and set it up as what's called a user-defined function to inject later into my stream. (There's a rough sketch of what that loader might look like below.) The really important part here is this functions.udf — remember I said there's that pyspark.sql.functions module; that's where this udf comes from. If you're familiar with databases and whatnot, this idea of user-defined functions is really common in the old SQL world. It's a way to encapsulate functionality, have some typing associated with it, and inform the system that I'd like to inject a function here. By default the function takes a string, and all I'm telling it here is that I'm going to return a string as well. So you can see I'm starting to build an API around these functions that I'm injecting.

This now allows me to have something like this: I've got my Spark driver, I've got my stream going, and at deployment time I can pass in a URL and it will load that file and actually set it up on all the executors. So now I'm running that as my function, right? And this gets really interesting, because on Kubernetes it's very quick to redeploy containers once they're running. All I have to do is change an input parameter to my application and restart the container, and now it's got a new function running in the middle of it. So now you can imagine: I no longer have to rewrite that boilerplate; all I have to be concerned with are these little wrapper functions that I can store and load at runtime.

Okay, so demo time. I was going to do this live, but the VPN here is kind of crazy, so I'm going to use a recording. What I'm going to show you is a really simple application where I've got a topic that I'm going to inject random numbers into, and I've got a file called odds.py that's going to parse out all the odd values and propagate those onto a second topic. Then I'll redeploy it with something called evens.py, and that way I'll switch to just filtering evens. Part of what I'm going to use to do this is another piece of technology from one of the communities I work with, the radanalytics.io community. We've created some tooling that will automate the deployment and creation of a Spark cluster to go with my application. One of the things I love about OpenShift is this idea of what we call source-to-image. What this allows you to do is say: I've got a repository somewhere, I know it's a Python repository, and I can instruct OpenShift to grab that repository and automatically build me a Python application. The Oshinko tooling will build the Python application and then also deploy a Spark cluster that attaches to that application.
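Here's a rough sketch of that loader idea — the URL, the module name, and the constant name are my own placeholders, this isn't the exact code from the repository, and obviously you'd want to think hard before exec'ing whatever a URL serves up:

    # Fetch Python source from a URL, turn it into a module at runtime,
    # and wrap its user_function as a string-in/string-out Spark UDF.
    import types
    import urllib.request

    from pyspark.sql import functions
    from pyspark.sql.types import StringType

    FUNCTION_URL = "https://example.com/evens.py"  # placeholder


    def load_user_function(url):
        source = urllib.request.urlopen(url).read().decode("utf-8")
        module = types.ModuleType("user_module")
        # Execute the downloaded source inside the new module's namespace.
        exec(compile(source, url, "exec"), module.__dict__)
        return module.user_function


    user_udf = functions.udf(load_user_function(FUNCTION_URL), StringType())

    # Injected into the fluent chain, it looks something like:
    # records = records.select(user_udf(functions.col("value")).alias("value"))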
So now, every time I start this thing, I can have it start up a cluster with my application. Let's see if this works — I mean, we know it will, because I have a video of it, but I have to say that to stay honest.

What we're looking at here is the OpenShift console, which is an interface into Kubernetes, basically. How many people here have used OpenShift or are familiar with it? Okay, a good number of people. What we're looking at is pods, and these pods represent containers. I'm going to start by spinning up the generator application — right now it's not running, you can see it says zero pods. The video is a little slow here; I was scrolling down to show that I've also got Kafka deployed inside my project, and that's the infrastructure I'm going to use. So I start scaling this up, and since the container image is already there, you can see it's already running. I've also got this listener application — all it does is attach to a Kafka stream and show you the data coming across. You can see there's just a series of random numbers coming through here.

The next thing I want to do is deploy the filtering application using source-to-image. You've probably seen a lot of demos where someone's typing out a huge oc command; I've put that oc command into the repository I'll link at the end, so I can just copy this big blob. What it's doing is creating a new application: I'm telling it where to get the application source, the name of the application, where the Kafka server is that I should connect to, and I'm passing in some specific Spark options. There's a lot going on here, a lot to unpack. On the last line you can see I'm passing in a URL to the function that I want to use as my user-defined function. So at this point I copy this stuff, dump it into a terminal, and it kicks off the build and starts everything up. I took this from the practice talk I gave last week, so I don't remember exactly what I was saying here, but I was just gabbing on about the different options. See, this is where I'm grabbing that evens.py. And evens.py and odds.py are also in the repository I have linked for this presentation, as well as a PDF and some other stuff.

Here's a quick example, just so you can see what's inside that thing — the user-defined function I created (there's a sketch of it below). I know I'm getting a string, so I turn it into an int value, then I check whether that int modulo 2 equals zero — basic stuff — and if it does, it should be even. I wrap it in a message that I can send back onto the queue, and that will be my number, and you'll see it come through the listener.

So I copy this and paste it into a command — you can see I'm copying it here and putting it in the terminal. You could do all of this through the OpenShift console as well, but I find it more convenient to use the terminal for a lot of what I do. You can see this just created all the resources: it talked to OpenShift and started to spawn everything. What you'll see first is that it builds my application — this is the output from the build, and it builds pretty quickly because it's a very small application. I'll scroll down to the bottom just to show you that there is a Spark cluster deploying along with it. I don't have to do that ephemerally every time; I could just do it once and then reuse the cluster.
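For reference, here's roughly what that evens.py might look like, reconstructed from the description above — the exact shape of the output message is an assumption on my part:

    # evens.py -- sketch of the user-defined filter from the demo: keep even
    # numbers, drop the rest. The output message shape here is a guess.
    import json


    def user_function(value):
        number = int(value)
        if number % 2 == 0:
            # Wrap the even number in a small JSON message for the output topic.
            return json.dumps({"even": number})
        # Anything odd produces nothing useful to pass along.
        return None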
But I'm just showing how all the automation works here. So this is starting up the filtering application. You can see it's waiting for the Spark cluster to become active; once it does, it will start processing the numbers coming across that stream. I'll go back into the listener at this point. And this is another thing I really like about OpenShift. If you remember, originally the listener was watching all the numbers coming across — that was my topic one. My topic two is actually this other topic called output. Based on the way OpenShift is set up, if I change an environment variable, and I've got my deployment set up properly, then when I save this, OpenShift will automatically redeploy the container for me with the changed environment variables inside. And what you can see is that it's not rebuilding anything — it's just redeploying the container and changing the environment variable. And we'll see here — okay, so now we're looking at the number stream that should just be showing even numbers at this point. And in fact, that's what we're seeing, because we're now looking at the output topic.

So the next step: I'm going to go back to the deployment for my filter application and change the URL to use the odds function instead. Just like the listener we were just looking at, OpenShift is now going to redeploy my application, and when it does, it doesn't have to rebuild it. Because I didn't reuse the same Spark cluster, it will redeploy a Spark cluster as well, so that's kind of a slowdown. But what you'll see as this log continues is that, after the redeploy, we should just be looking at all the odd values. And you see there's a little wait here, because I'm watching the listener at this point, right? I'm waiting for the new application to spin up, waiting for it to start processing records from Kafka, and then it will start emitting them. And remember this point, because this waiting time is something I'm thinking about as a developer as I watch this: we obviously missed a bunch of data. It took 30 seconds or whatever for this thing to restart, and now you can see we're getting the odd values. But if this were a production deployment, I would have missed some segment of the data coming across. So just keep that in mind. All right, that's pretty much the end of the demo.

So, like I said, the timing. This was the next thing I started to think about as I was working on this: is there a way I could make that even better? Is there a way to make this faster, so you don't have to wait for this redeployment to happen? Because when you're doing these things in a production environment — imagine someone like Twitter, or a banking institution — they've got thousands of messages a second coming across these streams, and you can't afford to miss 30 seconds of time. So I started to think: well, I can put strings into a topic. A Python application is basically just a series of strings, and you already saw the code for how I loaded it from a URL. What if I inject that code directly onto the topic and allow my Spark driver to pick it up and swap in the new function without having to redeploy the application — just change that one piece? And there's a problem with this. You need to understand — not a lot, but enough — about Kafka and Spark to see it.
So, can someone tell me what the problem with this is — aside from Eric and Jason, maybe Landon? I've got a little prize. Can anybody tell me what the problem is? Speak up, Zach, yell it out. So he's asking: am I just taking that string and putting it in a message? Yes, that's exactly what I want to do — encapsulate that string in a message, and have some logic in my Spark driver that reads it. What's the problem with this? Okay — it might be that people can reverse engineer your code and stick any arbitrary piece of code into your software. Security might be a problem, someone could certainly see that, but that's not actually the real technical issue here.

The problem is the way Kafka and Spark work together: each message is seen by only one of the executors; it won't be sent to the others. So if I send a message across that has the code in it, only one of my executors will switch over, even if I have the logic in there to do the switch — not every executor will get that message. This is something I've been thinking about trying to solve, and there are some ways to do it. It really gets into how you configure Kafka and how you configure Spark, and there are a lot of options for determining how data is replicated across what are called partitions in Kafka, so I'm still trying to figure out how this might work.

So how would we handle these singleton messages? One thought was: what if we added another topic to this, a control topic, that we could use to send updates to the system? That might work, but now my Spark application has to be listening to multiple topics, and I have to synchronize how I shut the executors down and how I restart them. There's also a concept in Spark called a broadcast variable, which is something the driver can create that all the executors can see. So I figured, well, I could just make a broadcast string, and every executor could look at that string and execute it. The problem is that broadcast variables are read-only from the executors' point of view, and my driver isn't actually processing what's going on on the stream — only the executors are. There's another option, Kafka consumer groups, which is a way to segregate everything that listens to Kafka into different groups, and each group is then guaranteed to get every message. So you could set up consumer groups and have each executor be a node in a different consumer group. It gets really, really complicated, and maybe there's yet another option with partitioning the data properly, but you can see this is not an easy problem to solve, and I'm not even convinced it's something that needs to be solved. Maybe the way we're doing it now, just using the URL, is good enough.
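Just to make that consumer-group idea a little more concrete — this is purely an illustrative sketch, not something in the demo repository, and it glosses over all the shutdown-and-restart coordination I just mentioned — the trick is that every listener gets its own group id, so each one sees every control message:

    # Sketch of a control-topic listener using the kafka-python client.
    # Broker address and topic name are placeholders.
    import uuid

    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "control-topic",
        bootstrap_servers="kafka:9092",
        # A unique group id per listener means every listener receives
        # every message, instead of the messages being split across a group.
        group_id="control-listener-" + uuid.uuid4().hex,
        auto_offset_reset="latest",
    )

    for message in consumer:
        new_source = message.value.decode("utf-8")
        # ...and here is where you would have to coordinate swapping in the
        # new function, which is exactly the hard part.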
So this is pretty much working for me — I've used it in a lot of demonstrations and workshops — and what might I like to do with it next? One thing would be: could I figure out this wire-message thing? Or maybe one of you could figure it out and share it back, and we can all go out for dinner or something. I'm not sure it's worth solving; I've had some pushback about the security of putting these things into topics, so maybe that's not a good way to go. Maybe using the URLs is right, because we can version that stuff and have it checked in somewhere.

I'd also like to produce some really good metrics to show exactly what the switching time looks like — how long does it take to roll over using different methodologies? I think that would be important for anyone who wanted to try this technique; you'd want to know what you're getting into and how long that switching will take. I'd also like to write a little more documentation for the repositories I've created around this. And lastly, I think it would be interesting, especially on Kubernetes, to create some sort of application that lives inside Kubernetes and serves up those filters. So instead of calling out to GitHub to pull in the filter code, I could have a container sitting next to my other containers and use that as my warehouse for pulling out different filtering functions. I think that might be an interesting way to look at it.

So that's pretty much where I'm at. This QR code will take you to the repository I've set up for this code: the slide deck is there, the filtering functions are there, and there are instructions on how to run this demonstration for yourself on OpenShift. Bones Brigade is a community I contribute to where we're collecting application skeletons that we like to reuse on Kubernetes, and just locally, and whatnot — we have a bunch of stuff there about setting up a streaming service and various other cases where you'd otherwise rewrite boilerplate over and over again. (And I'll get to you in just a second, okay?) And radanalytics.io is where a lot of the code for the Kafka and Spark stuff I've been doing comes from. So I'd recommend checking out those projects. If you want to stay in touch, please hit me up — send me an email or toot at me on Mastodon or something — and we can carry this on.

So, yeah — question. Just a thought: for the previous problem, is there some way to use an operator to propagate the function out to the Spark executors? Yeah, that's a great idea — so the idea is we could use an operator to control what the function is that's running inside there. You know what, you're absolutely right. I've done a little bit of work with creating application and job custom resources that would go along with this, and you could most certainly create a custom resource definition that also had the link to your filter in it. That would be a great way to do it — if you were starting to build an architecture around this as a service inside your business, that might be a really great way to do it. Good suggestion.

Any other questions? So, actually, I think it's a really cool idea, injecting code there — I always thought library-loading code was pretty cool. I think you could solve most of the security problems by just digitally signing the messages; I'd be okay with that in a stream I was consuming. But a tougher problem, from my perspective, would be handling the user's version of Python, or if they import numpy — if they import something that needs to be compiled. That would be my fear: we solve security, but then they try to do something in Python that's not supported on the executors. I don't know if you've thought of that. Those are some great thoughts, actually. I think you're right: on the security side there's a lot you could do, and that would depend on your organization.
If your organization is okay with using, say, GPG-signed blobs of code that go onto the message stream, maybe that would be good enough for you — or you're confident that the networking inside OpenShift, or inside Kubernetes, is secure enough. This is part of the reason I'm a little hesitant to go too deep on it: I think every organization is going to want to do this a little bit differently. So yeah, that would be one way to do it.

Another thing you hit on is: how do we keep track of the version of Python, and the versions of all the modules you've got in there? The examples I had were just using standard library packages, so I could import from the Python standard library; if I wanted to import something like a Kafka producer library in there, yeah, I would have to build an image that had it inside. And to that point, this is the great thing about using Kubernetes: if I were an administrator, I could version all the images, and we could know, all right, you need Python 3.6, here's the Python 3.6 source-to-image builder, this is the one you want to use. At an organizational level you could say, here are all the controlled images you can use, and if you want to request that a special module be loaded into them, you could go through a whole process for that. And the way the source-to-image builder images are built now, we have them broken out by language: there's Python 2.7, Python 3.6, Java, Scala — I think we even have an R one. So we've started to do some of that, and the platform really helps you out here, because you can build this deep versioning into every object that gets deployed into the system. Until the rogue hacker in your group comes along and says, I've got my own images and I'm doing my own thing — you know how it goes. Thank you, that's a good question.

Thanks, Mike. I was wondering: before you actually kick off the Spark streaming job, could you maybe set up a separate thread that watches the URL, and if it detects a new one — maybe via the operator idea — it could just kill the stream and restart it with the new value? Yeah, that's a good idea — could you have something, like a separate thread of execution, watching for updates? I think the natural way to do this, if you were using Kubernetes Knative or OpenShift, is to set up a webhook coming from your Git server, so that if a new check-in occurred — maybe you have a configuration file that says where to get the function, or you're updating a custom resource — that could kick off some sort of new operation, or a redeployment of just the application. Or were you thinking more about how I could do it without restarting the application? Yeah, maybe there would be a way to use webhooks coming out of your Git server, or you could send a different message there. Maybe that's a good solution. Thank you, that's a great idea.

Any other questions? I know it's after lunch, we're all getting sleepy, and there are only one or two more sessions left to go. I appreciate that nobody went to the hot dog talk and came here instead, so thank you. Any other questions? All right — well, thank you very much for your time.