Good afternoon, everyone. Thanks for spending some time with us. I'm Will Benton, and this is my colleague Mike McCune. We both work at Red Hat on machine learning and OpenShift sorts of things. And today we're going to talk about adding intelligence to stream processing applications. So if you have an app that's processing data in motion, how do you turn it into an intelligent application that's processing data in motion? We'll talk about what intelligent applications are. But basically our idea is you shouldn't have to become an expert in machine learning to benefit from some of these techniques. And we'll give you some techniques you can use to incorporate them in your app without becoming an expert first. So we'll start off by introducing the concept of intelligent applications. We'll have a whirlwind overview of Apache Kafka and Apache Spark. We will oversimplify everything, and we will not cover these in detail. If you're looking for detailed material on Kafka and Spark, you're in the wrong session. But if you're looking for how to have these already set up for you and use them to put intelligence in your apps, that's what we're going to do at the end of the talk. We will call out some useful streaming algorithms, which are the topic of a lab that Erik Erlandson and I are giving in this room next. So all right. So intelligent applications. Think about the apps on your phone. Does anyone here not have a smartphone? I would commend anyone who doesn't have a smartphone in 2019. It seems like a wise decision. But if you have a smartphone, the apps on the home screen of your phone, what are they? What's an app you use a lot? What's a website you go to a lot? Google? Yeah. So Google is a great example. An intelligent application is just an application that collects and learns from data so that it works better the more you use it and the more people use it.
So I mean, I think that almost all of the apps that I care about have at least some aspect of this to them. And it's hard to imagine that this will become less common in the future. So when we talk about intelligent applications, we think of things that look a lot like conventional applications, except they're processing a lot of data. So in a lot of applications we're dealing with data at rest, whether that's structured data from databases or unstructured data from file and object storage. And we're dealing with data in motion: events that are arriving either from a conventional stream or from a message log like Kafka. And what we're going to do in our intelligent application is transform that data so that it's all in a useful format, and federate it so that we can act on data from these different sources at once, and ideally train a predictive model to add some value from these data that we're observing. When we train a predictive model, we're essentially learning a function from examples. And we're going to take that function, which we call a model, and put it into production by deploying it as a service that the rest of our application can use. And this will feed back into these other steps. We probably want to interact with external APIs or publish some of these things externally via gateways. And obviously, you can't have an application if you can't interact with it, so we need a UI. And in fact, for intelligent applications, there turn out to be a lot of UIs. How many people have used one of these? I mean, Django has one of these, Ruby on Rails has one of these: if you're developing a web application, you have a console environment where you get a REPL or a shell inside the environment of your application, and can, say, destroy your database with an ill-advised command.
So you want a developer UI, where developers and data scientists can interact with the internals of this application. You want the actual end-user UI, and then you want a way to manage this and get reports on how well the application is running. Now, I think it's probably no surprise, given who I am and where I work and what I do, that I think Kubernetes and OpenShift are a great way to develop applications. But it hasn't always been obvious that you could do all of these things in containers. And in the past, people have said, well, I really want to do my applications in containers, but if I'm going to do machine learning, I need a big fancy machine learning cluster for that. I need a Hadoop cluster, or I need a Mesos cluster, or I need a supercomputer to do my machine learning. So they'd say, well, I'm going to run my applications in Kubernetes, because that's what my developers and my SREs want to use, but I have this other cluster for the machine learning nerds. And the problem is you wind up making scheduling decisions in two different places. You have application concerns, but the applications actually depend on these things. And having a separate compute cluster really made sense when machine learning was a separate workload. But you can't run Google search without running PageRank. You can't run amazon.com retail without updating recommendation data quickly enough to exploit it when it changes. So in an intelligent application, these machine learning workloads are really integrated into the application, because they provide essential functionality. It's not just something we run off to the side. So we found that actually having compute running inside Kubernetes, inside OpenShift, where you have a logical compute cluster for every application and you manage everything in one place, makes a lot of sense. So we make the scheduling decisions in one place. My application components get scheduled in.
And so do the compute resources that they depend on. And Mike is going to add one thing. You might not have seen the animation on the previous slide, but one of the disadvantages to having a separate cluster is that the cluster then is determining the scheduling of which applications can run at any time, and it may be doing that based on the resources available to it. Whereas in this configuration, we're letting Kubernetes be the ultimate scheduler of which applications can run, and they have the compute inside their namespaces. So there is no waiting to use the compute. If the cluster is available for general usage, then your applications will run. So this is kind of a subtle point, but it changes the way these workloads get processed. And Mike is right that this is a subtle point, so I want to say it again in a different way, just to really drive it home. Kubernetes is flexible enough to run anything you want to run in it. You can train distributed machine learning models. You can do scale-up machine learning with GPUs. And you can even run services like Kafka, thanks to the work of some extremely clever people at Red Hat, that people thought wouldn't be possible to run in Kubernetes. And because you can manage all of these things in one place, you no longer have to worry about, oh, my data science team is using this compute cluster for production, and we need it for the application. You just have a single place to manage everything. So, in this room, how many people really want to be developing applications? Yeah? How many people are also interested in managing the infrastructure, as an SRE? Right. These are great skill sets. But if you want to develop an application, you want to be able to do it in a way that your SREs are not going to be frustrated with you.
If you're going to be an SRE, you want a way to manage these applications so that you're not having to deal with a bunch of competing demands from different teams. So, any questions so far? You mentioned that this is flexible enough to do anything; is it really anything? I don't want to oversell it. Maybe anything is a stretch. There are some things that you can do in Kubernetes but you probably shouldn't. But yeah. What if communication is the bottleneck, then? I mean, you're limited by the infrastructure you're running on, right? If I have Kubernetes on my notebook, I can't do magic, right? I can't treat my notebook like a supercomputer. I can't solve the halting problem. So I'm just saying that any workload you want to schedule in a cluster, Kubernetes is flexible enough to do that. The distinction I want to make is between something like Kubernetes, where you have a pretty general concept of containerized services and connections between those services, and something like Hadoop, for example, or YARN, where you have a way to run Java code in a certain compute model, right? Or something like Mesos, where you have an option to run anything that satisfies a certain API. Or something like Heroku, right? A classic platform as a service, where you have a limited model and you can put a certain kind of application in there and schedule it. Kubernetes started off meeting those platform-as-a-service use cases, and they quickly built abstractions so that you could make it more general and extend it with custom resources and really run things that people thought you couldn't do in containers. I don't know if that's a satisfactory answer or not. You buy it? All of those you mentioned still communicate by TCP, which is a lot of overhead, and communication is often the bottleneck.
Yeah, so are you asking, what if I run Kubernetes on a supercomputer where I have some special interconnect, so that I don't have to rely on TCP for communication? Is that the question? Yes. OK, so no. Not yet, right? That's not something that I see the community focusing on. Yeah, no. Unfortunately, I'm giving another lab after this one, but we could talk about HPC offline; that's something some of us have some experience with in a past life. Well, I think too, when we talk about deploying something like Kubernetes into a data center, and if communication is the bottleneck, the kind of anecdotal stories that I've heard from people who do this is that it starts to get into how much you want to spend on your networking, right? So if you can afford 40 gigabit ethernet, or you can afford 100 gigabit ethernet, as you get further up the chain, the overhead of something like TCP communication starts to come down, right? To the extent that I've heard people say, if I could have 100 gigabit ethernet, it almost starts to perform better than having spinning disks attached to the machine. So then the question is, what would be the next step, some sort of multi-machine in-memory grid or something? Like, how much further could you take it? And that gets into how big is your wallet at that point. I mean, this is another example. I don't want to get into horror stories, but some of us on the team that we have now used to work on a product that was a cluster scheduler that predated Kubernetes. And this was very, very flexible. It was almost programmable, right, in the way that Kubernetes is. It was targeted at research environments, where people said, I want to run a bunch of experiments, and I want to manage them and scale them. And we said, well, you can really do anything you want with this.
And I had a good friend of mine from way back who is a professor specializing in high-performance computing. And he said, well, actually, I couldn't do anything I want with this, because what if I need to say to my supercomputer, I'm going to check out 100,000 nodes at once? And I said, OK, so we can't do that. Or, we could do that, but we probably can't do it as well as a tool that's designed to do just that. But the other side of that question is that for most applications, it's a good set of trade-offs. It's not going to be the best set of trade-offs for really specialized applications, necessarily. Other questions? All right, this is going to be the oversimplifying overview of Spark, Kafka, and OpenShift. Basically, just to explain how these things fit into a... if I say containerized microservices, do people know what I'm talking about? Anyone willing to admit that I should take a minute to explain what this is? It's your chance. OK, so people have thought that these kinds of things don't work in microservices. But really, the point of a microservice is you have some stateless thing, or you have a way to isolate the state from a service, and it has well-defined interfaces, and that's how you interact with it. You're not depending on shared state. You're not depending on access to a shared file system. The assumption of no state isn't always exactly true, whether for performance or correctness reasons, but the idea is that you isolate the state as much as possible, or delegate the state to another service. I'm only going to give the briefest introduction to Kafka here. Basically, the idea is you're going to process a log of messages. Imagine that you have a log that you can always write to the end of. Well, if you have one thing that you can always write to the end of, it's a bottleneck. So Kafka handles this bottleneck by scaling out.
You have partitions, and you don't have a total ordering across partitions, because that would just turn your bottleneck into a really serious communication problem. You have total ordering within a partition, and then you have a way to reconstruct a total ordering by replaying the partitions in order. And that's basically all I'm going to say about Kafka. So I wasn't lying when I said that if you wanted to learn a lot about Kafka, this was not the right talk. Apache Spark is a distributed compute model. And how many people here have done parallel programming of some kind before? What have you used? I mean, I know what you used. Python? What libraries? What environments? Multilib. Multilib. OK. Other environments? Spark. Hadoop. Hadoop? Yeah. MPI, another thing that's not great for Kubernetes. Yeah, PVM. P-threads. So I mean, if we think about MPI, MPI is basically: if you have a really fast network and you can write things that are tightly coupled and communicate via messages, you can execute them efficiently. P-threads: if you think that the PDP-11 is a good compute model and you want to impose some constraints on top of how you can run that in parallel, P-threads is a great way to start. Hadoop: commutative and associative binary operations. Well, hey, you can run those in parallel. Let's do that. These things all start with the assumption of, hey, let's think about what's easy to execute in parallel and come up with a programming model from there. Spark sort of does this too, but it doesn't admit it. Spark starts with a programming model that looks like something people know. And if you've done functional programming, you know that Spark has basically list comprehension operations like filter and map, and flatMap, where you take an element, turn it into multiple elements, and then concatenate those lists together. And so we build up this sequence of operations and then we can execute them in parallel, and that animation was more exciting than it needed to be.
But we basically have a way to run these operations in parallel on a distributed collection. And the way Spark does this is it builds up a graph of the things that you're going to do. This dependency graph can be scheduled across a cluster efficiently, so that you only communicate as much as you have to. And you're not thinking about where you're moving data when you're writing a program like this; you're just thinking about what you want to do to the data. It's a declarative model rather than an operational view of low-level compute. Now, the interesting thing is that these operations are all lazy. So I haven't had roommates who aren't related to me in a long time. But think back: who has had roommates that were not picking up their fair share of work around the apartment? I guess some of the roommates related to me, the younger ones, have this problem too. But I think of a roommate I had in college who would leave a big pile of dishes in the sink. And the pile would grow and grow, and all of the housemates would say, you've got to do these dishes, man. Please, please clean your dishes. And he would say, I'll do that. And then finally, when it got to the point that people were getting ready to carry him out of the house and leave him on the curb, he would wash all the dishes. So that roommate was lazy. And Spark, similarly, is lazy. Spark will only compute things when it absolutely has to. So in this case, we've built up a bunch of operations, but we haven't actually computed anything until we've told Spark, yeah, I need that value back in my program, just like my roommate didn't do the dishes until he was on his way to the curb. Now, the interesting thing about laziness is that it means you can end up repeating work, right? Similarly, if you do dishes as they pile up in the sink rather than when there's a big pile of them, it's usually faster.
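Since this lab is in Python, here is a plain-Python analogy (not actual Spark code) for that laziness: generator expressions, like Spark transformations, build up a pipeline of map and filter steps without doing any work until a result is demanded.

```python
# A toy illustration of lazy evaluation, in the spirit of Spark's
# transformations-versus-actions split. Nothing here uses Spark.

def trace(x, log):
    log.append(x)          # record that this element was actually processed
    return x

log = []
numbers = range(10)

# Build the pipeline: like RDD.map and RDD.filter, these lines do no work yet.
doubled = (trace(n, log) * 2 for n in numbers)
big = (n for n in doubled if n > 10)

assert log == []           # nothing has been computed so far (laziness)

result = list(big)         # an "action": forces the whole pipeline to run
assert result == [12, 14, 16, 18]
assert len(log) == 10      # only now were the elements actually touched
```

Just as with Spark, forcing the pipeline twice would process every element twice, which is why caching an intermediate result can pay off.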
But we might want to say that we're gonna do multiple things with this result from Spark, and we don't want to compute it twice. So Spark will let us cache these intermediate results and say, save these in memory, because we're gonna use them again. And those of you who've done work with Hadoop know what a word count application in Hadoop looks like. Diane, how many pages of code is it? It's like 400 lines or something, yeah, right? So you couldn't put it on a slide. I mean, you could put it on a slide, but no one would be able to read it, right? So yeah, word count is sort of the hello world of distributed computing; it's almost a joke to point out that word count is the hello world of distributed computing. But in Spark, it's a very simple program, this is in Python, that fits on a slide. And the compute graph that we build up as we go through this: we say, I wanna build a distributed collection backed by the lines in some text file, and this could be a distributed text file, where different parts of the file are on different machines, even. It's not that interesting to write a distributed program to analyze a file that you could fit on one machine, right? And then we're gonna build up a compute graph by splitting this file on spaces, turning each word that we have into a word occurrence, which is a tuple of the word itself and the number of times we've seen it, which is one. Then we can aggregate those counts together by adding the counts for each word, so that we get a bunch of pairs of words and counts. And when we get to this point in the program, we haven't actually done anything yet. Again, laziness. So we're gonna save the result to another file, and that's actually when this computation is gonna happen. Spark is gonna look at the compute graph and say, well, I can do most of that without any communication.
So it's gonna coalesce those first few stages of the computation together, so that you don't have to checkpoint or introduce any unnecessary synchronization in those first few steps. So this is pretty simple, right? Are people willing to write programs like this? Well, the good news is that you don't have to. This is sort of the core of Spark; it's actually the lowest-level interface Spark gives you, called resilient distributed datasets. Spark has a bunch of high-level libraries on top of this for solving problems you actually wanna solve, instead of counting the number of words in a file. There's a graph processing library. There's structured query processing with SQL. You can train machine learning models, and you can do stream processing as well. And in this talk, we're gonna focus on the intersection of the query processing and stream processing aspects. The cool thing about Spark is that you can deploy it in a bunch of environments, too. You can deploy it on Kubernetes as a self-scheduled cluster, there's a Kubernetes-native scheduler for Spark, or you can use one of these other cluster managers if you want. So one of the big problems that people have had in the past with streaming data is that you have to write special streaming programs and use special streaming algorithms. The goal that the Spark team had is they said, I wanna be able to use the same abstraction for data at rest and data in motion. Well, how do you treat data in motion as if it's data at rest? You look at it as a bunch of tiny slices of data at rest. So what Spark does is it takes the streaming data and turns it into windows. And then you can process each of these windows as a little tiny batch, called a micro-batch. And once you've done your processing, you have the output data, where these are in batches again, and you can combine them together, or you have an algorithm that will handle that.
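To make the micro-batch idea concrete, here's a toy sketch in plain Python (all the names here are made up for illustration): chop a stream of events into small fixed-size windows, run the same batch function on each window, and combine the per-window results.

```python
# Toy micro-batching: the same "data at rest" logic runs on each small
# window of the stream, and the per-window outputs combine afterwards.

def batch_count(events):
    """Batch logic we could equally run on a whole file."""
    return len(events)

def micro_batches(stream, window_size):
    window = []
    for event in stream:
        window.append(event)
        if len(window) == window_size:
            yield window
            window = []
    if window:                     # flush the final partial window
        yield window

stream = iter(range(10))           # stand-in for an unbounded event source
per_window = [batch_count(w) for w in micro_batches(stream, window_size=4)]
assert per_window == [4, 4, 2]
assert sum(per_window) == 10       # per-window results combine into a total
```

Real structured streaming engines add triggers, watermarks, and fault tolerance on top, but the batch-function-per-window shape is the same.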
So ideally, the goal here is that you can use the same kind of code to operate on data that's backed by a file or data that's backed by a stream. The Spark structured query support is really interesting for a couple of reasons. The real strength of that RDD model we saw is that you can say, I wanna run any function on a distributed collection. The real downside of that RDD model we saw is that you can say, I can run any function on a distributed collection. I mean, there are a bunch of ancient myths, right, about people who get exactly what they wish for. It's always a blessing and a curse, right? The RDD is sort of like that. It will do exactly what you tell it to, even if it's a bad idea. So if you think about writing a program to process a lot of data versus writing a database query, hopefully, if it's something you can do as a database query, your instinct is: yeah, let the database figure out how to do this efficiently for me. And Spark has a similar option, where you can write programs in a higher-level library that lets Spark figure out what you're trying to do and, ideally, do something more efficient than what you would have done. So there are a few interfaces for doing this in Spark. You can use a string of SQL, if you like to be surprised by runtime errors. You can use the so-called data frame interface, which is a DSL encoded as a library, and in this case you get fewer runtime errors. And if you use the Scala language, which some of us are very enthusiastic about and some of us are much less enthusiastic about, you actually have a way to get almost everything checked at compile time, which I prefer, because I make a lot of mistakes. So to see what this does for you, let's think about what it looks like to execute a database query.
So is everyone sort of more or less familiar with SQL or willing to believe me when I tell you what this does? One or the other? Okay, so what we're doing here is we have two relations, we have A and B, and we're gonna select all of the fields from those two relations, where they have some value in common and where we satisfy two predicates. And I'm gonna take these predicates at face value and say that this is gonna filter out most of the rows of A because it's uncommon, and I'm gonna take this one at face value and say that it's gonna filter out almost all of the rows of B because that's extremely rare. Make sense? So if we just execute this in order, what does it look like? Well, we have our two relations and we're gonna join them together. So we're gonna take the Cartesian product of everything in A and everything in B, which is gonna get us a lot of things. Then we're gonna go through and say, well, I'm gonna filter out the things that don't satisfy my predicates, which ultimately is gonna leave me with very few things. And then finally return those results. We can be more clever about this by filtering out the things that don't satisfy the predicates before we do the join, right? If we're operating on a smaller set, it's not only gonna be faster, but it's gonna be friendlier to our cache, it's gonna be friendlier to our disk, it's gonna be friendlier to our network, really gonna be better in every possible way. So we're gonna start by filtering out things that don't satisfy the predicates, and then from there, we can take the Cartesian product of those things and do the filter on the join, right? And that leaves us essentially with just the things we care about without wasting a lot of work. So the cool thing that Spark offers is this structured streaming capability, which really lets us write these kinds of programs that look like database programs on a stream of data. And that's what we're gonna be focusing on in this lab. 
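Here's a toy plain-Python simulation of that rewrite, the kind of predicate pushdown a query optimizer does for you automatically. The relations and predicates are invented for illustration; the point is that both plans give the same answer, while the pushed-down plan examines far fewer pairs.

```python
from itertools import product

# Toy relations: (id, payload) rows; the join matches rows on id.
A = [(i, "a") for i in range(100)]
B = [(i, "b") for i in range(100)]

def uncommon(row):        # filters out most rows of A
    return row[0] % 10 == 0

def extremely_rare(row):  # filters out almost all rows of B
    return row[0] % 50 == 0

# Naive plan: join first (Cartesian product plus match), then filter.
naive_work = len(A) * len(B)                      # 10,000 pairs examined
naive = [(a, b) for a, b in product(A, B)
         if a[0] == b[0] and uncommon(a) and extremely_rare(b)]

# Optimized plan: push the filters below the join.
A2 = [a for a in A if uncommon(a)]                # 10 rows survive
B2 = [b for b in B if extremely_rare(b)]          # 2 rows survive
pushed_work = len(A2) * len(B2)                   # 20 pairs examined
optimized = [(a, b) for a, b in product(A2, B2) if a[0] == b[0]]

assert naive == optimized                          # same answer
assert pushed_work < naive_work                    # far less work
```

The smaller intermediate sets are exactly what makes the optimized plan friendlier to cache, disk, and network.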
But we can do a lot of really cool data engineering things with very little code and without worrying about our compute model. I just wanna call out some community work we've done at Red Hat. Radanalytics.io is an open source community effort that has focused on how we can enable these sorts of intelligent applications on OpenShift. We've done a lot of work with Apache Spark, Jupyter notebooks, and TensorFlow on OpenShift, and there are a lot of example applications on radanalytics.io. I don't know if we have physical... we have physical laptop stickers. We have physical laptop stickers, but the laptop stickers that you would be interested in for this talk are OpenShift, which I don't think we have OpenShift stickers. Downstairs, yeah, downstairs. That's the radanalytics.io sticker. And then on the left we have Strimzi, which is the tooling that makes it possible to run Kafka on Kubernetes and OpenShift, and which is a fantastic project. We've loved using it. It makes it just painless to get Kafka running. So there are a couple of things we wanna call out, and then we're gonna get into the interactive part of the lab. The first one is, when you're doing these sorts of intelligent streaming applications, it's useful to know about algorithms that can operate on any amount of data while only observing each element once, right? You only wanna examine something once. You want something that can scale. You want something that's incremental. You want something that's parallel. There are a lot of cool streaming algorithms. I'm not gonna talk about any of them in this lab, but I am gonna talk about them in about an hour and a half with Erik, who's in the back row right now. And if you wanna check out that notebook, please don't do it in this lab, but you can just scan that QR code and get to the notebooks to experiment with some streaming algorithms, if you don't have time to make it to the next lab.
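As a taste of what those properties look like in code, here is about the simplest possible one-pass, incremental, mergeable streaming summary, a running mean. (This is just a sketch to illustrate the properties; the streaming-algorithms lab covers far more interesting summaries.)

```python
# A streaming summary with the three properties we care about:
# each element is observed exactly once, state is O(1) (incremental),
# and partial summaries from parallel workers can be merged.

class RunningMean:
    def __init__(self):
        self.n = 0
        self.total = 0.0

    def observe(self, x):          # one pass, O(1) work per element
        self.n += 1
        self.total += x

    def merge(self, other):        # combine partial results from workers
        merged = RunningMean()
        merged.n = self.n + other.n
        merged.total = self.total + other.total
        return merged

    @property
    def mean(self):
        return self.total / self.n if self.n else 0.0

left, right = RunningMean(), RunningMean()
for x in [1, 2, 3]:
    left.observe(x)                # one worker's share of the stream
for x in [4, 5]:
    right.observe(x)               # another worker's share

assert left.merge(right).mean == 3.0   # same as the mean of [1, 2, 3, 4, 5]
```

Fancier streaming summaries (sketches, digests, samplers) follow the same observe/merge shape, just with cleverer state.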
So now I'm gonna ask everyone who wants to do interactive work to get ready to type in a URL. And we're gonna go to an OpenShift console here. And once everyone has this up on their screen, we're gonna have a counting exercise. Not a counting-at-scale exercise, just a counting exercise. So while people are connecting to this: this will ask you to accept a certificate. We're not doing anything nefarious; we just don't have a real certificate for this infrastructure that we just spun up for this lab. So don't type your credit card into the OpenShift console, but I would hope you wouldn't do that anyway, unless you're an OpenShift customer purchasing OpenShift Dedicated. While people are getting connected still, does anyone have any questions at this point in the lab? Is anyone not connected? I'll leave that up for another minute, then. Okay, you have the URL. Okay, great. All right, so whether I am connected is really the question for me at this point. This image is a travel keyboard from the Musical Instrument Museum in Brussels. If you wanted to know about the keyboard, I think this might be like a virginal or a harpsichord; it's a plucked string rather than a hammered string. It predates the piano, but if you were traveling and you needed to practice something that required no more than, like, two octaves of notes, you would bring one of these. The image is quite cool, yeah. Mike, are you connected at all? I'm just getting a loading screen. I can mirror my display and show everyone what it looks like. I guess I'll try the other Wi-Fi. Yeah, I'm getting a pretty decent Wi-Fi signal. Yeah, it should be connected. It's not letting me... yeah, maybe it just won't. So, at the risk of... all right, we're gonna try something really exciting. I think we're gonna try and plug Mike's computer into the projector. We'll see. I think this adapter was...
It wasn't doing the widescreen or whatever, so it was cutting off the sides. Should we try this? Yeah, let's try it. I was tethered to my phone earlier and my OpenShift console never loaded, so I tried getting on the Wi-Fi here. This is being recorded, and I am not gonna do anything that's not responsible information security, which includes having any colleague type in their password on my machine. Can we just not go full screen? Yeah. I mean, that's sort of clunky, but... of course that's what it does. Okay, so let's do the counting exercise while Mike and I figure out our A/V. In order, I'm gonna point to people, and you're gonna say a number, and the next person only has to add one to it. You really can do anything with these kinds of numerals; they're really useful. One? Two. No, no, you can be one, then. Two. Three, four, five. I'll say that you also need to remember your number. Okay, so your username on this OpenShift console is going to be lowercase user, followed by the numeral that you said out loud. Your password is going to be lowercase user, followed by... no, no, the password is openshift. The password is openshift. Okay, so please don't log into someone else's account. That will just confuse and annoy them. I don't know how you know my password, Mike. That doesn't seem like good information security. Yeah. It's not. So everyone should be at a screen that looks like this now, right? Yeah. It looks like that. Mine looks a little different right now, since I've been poking at it. Okay, it's one click. I'll try not to right-click. Yeah, it won't right-click. Okay. So this is the OpenShift console. Is this new to anyone? Have people seen OpenShift before? It's DevConf, you may have seen it, yes. Oh, good call, Sophie, thanks. Yeah, so if your screen looks like this, you're gonna click on the project that has your name.
All right, so if we're all on a screen that looks like this, I'm gonna click Add to Project, and then I'm gonna click Select from Project. Okay, so Add to Project is in the upper right-hand corner, and then Select from Project. And then I'm gonna select this thing that says streaming lab notebook. What this is: we've installed a template in OpenShift that basically says, here's how I'm gonna set up an environment so that I can interact with some streaming algorithms and some stream processing stuff in an interactive environment. And this will set up everything we need. We're gonna step through it and make sure this is what we wanna do. You'll have to trust me, because we didn't put a description in the template. And there's actually some other configuration we could do, but we don't need to do any of it now. So we'll just click Create. And once we click Create, this is fine, it's gonna be starting this up. Is anybody having difficulty with this? Just raise your hand. And what we're getting here is OpenShift telling us that we have a deployment of this notebook server and a single pod running our application. And once this is blue and, if you mouse over it, it says running, you can click on this link up here and actually get to the actual notebook, okay? And your password here, again, extremely secure, is gonna be developer. I'm not gonna save that, Mike, sorry. Okay. So everyone should have something that looks like this. We've been giving this lab for a long time, as you can see. And what we have here is a Jupyter notebook environment. Has anyone here used Jupyter notebooks before? Anyone here not used Jupyter notebooks before? Okay, so we will explain what's going on here. We're gonna go through some interactive notebooks, and the first one we're gonna click on is called Social Firehose.
So it's this one right here, and it's gonna load up basically this sort of literate programming environment where we have both text and code and we can execute both. So I'm gonna go up to this Cell menu and I'm gonna select All Output, Clear, because I'm gonna run this. It had some pre-rendered output in it, but we wanna run it ourselves so we can experiment with it. And there are two things you need to know about notebooks. Only two things you need to know. One is Shift-Enter. Shift-Enter will execute the cell that you're on. So we see we have a blue line to the left of the cell we're on; executing a cell that has text in it doesn't do anything except move to the next one. You have to do the work yourself, by reading it. But executing a cell that has code in it will actually run the code in a Python interpreter and leave the results there. So what we're doing here is we're just setting an environment variable before we start up Spark so that Spark is able to load the Kafka connector and talk to Kafka. And then we're gonna connect to Spark. This is just some code that sets up a Spark session, basically initializing Spark so that we can do some streaming data processing. And that's gonna take a second. We're basically just saying that we're gonna use two cores on our local machine without connecting to a cluster. We're going to set our app name, and we're gonna create the session. So once the asterisk next to the cell turns into a number, you know the cell has finished executing. So: Shift-Enter, asterisk turns into a number. The other thing you need to know is that if you get really stuck, go up to the Kernel menu and select restart and clear output, and then go back to the top. That will reset you and get you back to the state you started in. Okay, so what we're doing next is we're gonna create a Spark data frame backed by the contents of a Kafka topic.
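A minimal sketch of what those two setup cells amount to. The connector version, app name, and environment variable value here are assumptions for illustration, not necessarily the lab's exact values:

```python
import os
from pyspark.sql import SparkSession

# Assumption: package coordinates are illustrative; the lab notebook pins
# whatever Spark/Kafka connector version its image ships with.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 pyspark-shell"
)

spark = (
    SparkSession.builder
    .master("local[2]")            # two cores on the local machine, no cluster
    .appName("social-firehose")    # hypothetical app name
    .getOrCreate()
)
```

Setting the environment variable before `getOrCreate()` matters, because the Kafka connector is resolved when the underlying JVM is launched.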
So we're gonna say: I have some messages coming in on Kafka and I want to treat those messages like a database table. This is gonna return pretty quickly because Spark is lazy and it's not actually doing anything here. It's just connecting to Kafka, verifying that there's a topic, and maybe figuring out what the schema of the data on that topic is. So we're gonna load that. We can tell that Spark is lazy because we can get a count of the number of messages on the topic, and then if we wait for 10 seconds and get the count again, those counts will be different. It's like a database table that's always changing. This may bring up bad memories of writing web applications for some of you. So this is gonna run; it's gonna take more than 10 seconds to run this cell because we're sleeping for 10 seconds in the middle of the cell, but we should get two different counts, and we'll see how many messages are arriving in that time. Okay, so in 10 seconds we had 90 messages arrive. So this is not a high-scale application; this is an application that we're running on temporary infrastructure to demonstrate these techniques. And in fact I did set up the generator at 10 hertz before the session started, so this confirms that it's actually doing roughly 10 hertz. It's good to check, yeah. Okay, so we can also do an action and just take the first three rows of this data frame and see what they look like; see what this data that we're getting off of Kafka looks like. So this is a Spark data structure called a row. We have a lot of stuff here. We have a null key, and we have a value, which is an array of bytes holding a JSON-encoded object. And we have these three values. So, I mean, I may be getting too old for this, but this doesn't look like any sort of data I wanna deal with, so we're gonna figure out how to clean it up. So I'm gonna import a bunch of Spark functions here.
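The Kafka-backed data frame and the two counts might look something like this. The broker address and topic name are placeholders, and `spark` is assumed to be the session created in the previous step:

```python
import time

# Assumptions: "my-kafka:9092" and "social-firehose" stand in for whatever
# the lab template actually wired up; `spark` is the session from above.
records = (
    spark.read
    .format("kafka")
    .option("kafka.bootstrap.servers", "my-kafka:9092")
    .option("subscribe", "social-firehose")
    .load()
)

# Spark is lazy: each count() goes back to the topic, so the second count
# reflects messages that arrived during the sleep.
before = records.count()
time.sleep(10)
after = records.count()
print("messages that arrived in ~10 seconds:", after - before)
```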
The narration explains what's going on as well, but basically what we're gonna do is we're gonna take the JSON data we have in these messages on Kafka and we're gonna turn it into objects, so that we can do something useful with those objects in Spark. So the first thing I'm gonna do is just take out the value, right? We're not using the keys that Kafka gives us. We have keys and values, and we're just gonna look at the values. So if I just take the values and show the values, well, I can see that I get a JSON string, and I guess we'll take Spark's word for it that there's more of the string in that ellipsis. I could do values.take if I wanted to see the actual results there too. So now you see I have these row objects, and I just have one field called value that has that JSON string in it. The next thing we're gonna do is we're gonna actually turn that serialized JSON object into an actual record that we can do something useful with. And basically what this looks like in Spark is: we're gonna declare a type. We're gonna say this is what the structure is gonna look like; we're gonna have these fields: a text field, a user ID field and an update ID field. And then we're gonna convert the JSON strings to structures; there's a function in Spark that does this if you give it a schema. And then finally we're gonna take the fields of that object out of the object and flatten them, so that we have something that looks like a database table. So we're gonna go from a stream of JSON objects to something that looks like a database table, but is backed by a stream and has three fields. Make sense? And please interrupt with questions anytime they occur to you. Is there a direct data frame way to do this as well? So it is being read as a data frame, and we're just using functions to convert the data that's in the data frame to data that's in a data frame but is more useful. That's a really great question.
So what we had above, this is a Spark data frame. It's just that you have, like, a database table that has one field, which is called value, and has that string, right? I don't wanna deal with that string. I wanna deal with a structure; I wanna deal with three different fields. So what we're gonna do is we're basically just gonna decode this JSON-serialized object into three fields that we can access and use. So we'll see what that looks like, and we see that we actually get what we expect, right? We get three fields: an update ID, a user ID and a text. So this is sort of cool, right? And if we look at these things, they look a little bit like social media. And as in the real world, in this synthetic data that we're operating on, some users are chattier than others, right? Everyone knows someone who you have to mute on Twitter because they're talking too much. And we can say, well, let's look at the users who are writing the most status updates. And this is just basically a database aggregation. So we say we're gonna take our records and we're gonna group them by user ID and get the count of all the records that we have with a given user ID. And then we're gonna sort them in descending order by count. And so we'll look at the top 20 users, and out of the 17,000-however-many updates we had, the top users are responsible for 28 updates each. How long has this been running? It's been running at one hertz overnight and then at 10 hertz since this morning. So if you think about someone you know who's tweeted 28 times since this morning, you're probably muting them. So this is a useful query already, right? And if you run that query a few times, you might get different results, because these users are probably still tweeting even though we've muted them, and the data frame is gonna reflect newly arriving messages. So we can also count the number of distinct users we've seen.
And because of the way that we're generating these updates, there is an upper bound on that number, which we're probably not that close to yet. Yeah, so we have 3,563 different users generating updates. That probably won't change if I run it again, but we could try — oh no, there we go. So we had a few more users join. And you know when Twitter went from 140 to 280 characters — do people use Twitter? People are aware of Twitter, people avoid Twitter. These are all reasonable approaches, I think. But Twitter used to enforce a 140 character limit, and then they went to a 280 character limit, and I feel like I know people, both in real life and only through the internet, who were just delighted at the opportunity to say twice as much. And so you probably know people who are always bumping up against the character limit on whatever they can type in a form on the web. And we can write a function to say who has the longest average updates. This is a sort of more interesting aggregation. We're creating a data frame called user loquacity, where we're selecting user ID and the length of the update text. And then, for each user ID, we're gonna get the mean update text length and sort by descending mean update text length. So this will take a little longer, but we're gonna see that, yeah, this person is using 275 characters on average, which is quite a few. They must have a really interesting lunch or just went for a really incredible run or something. We can also do some even more interesting things. Like, say, if a user is posting with hashtags — these words that start with an octothorpe — we wanna see what the most popular hashtags in our social media stream are, and we'll start, just like we do with word count, by turning each update into an array of words, and then we'll explode each of those arrays into multiple rows, so that each row has a single element.
So we're gonna go from a row that looks like (1, 2, [foo, bar, blah]) to three rows: (1, 2, foo), (1, 2, bar), (1, 2, blah), right? And then we're gonna filter out things that aren't hashtags, so that we can identify the hashtags that are the most popular. Sort of a bunch of steps there to solve a problem that no one in here cared about two hours ago, right? All right, well, let's pretend, just to show you the kind of thing you can do. And so we have a data frame called words where we've split these things up into words and exploded them into different rows. And then we have this data frame called hashtags where we've filtered so that we only keep the ones that start with the octothorpe. And I wanna just show the top 20 of the words, and these are just taken in arbitrary order. You can see that they actually look like sentences; they're coming from a message, and that's just the first few words that we ran into in our data frame. If I wanted to look at the top hashtags, those would be different, right? There are other hashtags there. So if I wanted to aggregate over those hashtags, I could do something similar to what I've done before, where I just find the most popular, grouped by the hashtag. And as you can see, a lot of people are very excited about commenting on something before anyone else has. So hashtag first and hashtag one are very popular. There are a bunch of proper names and a bunch of other sorts of nouns in there that are popular hashtags. But out of 17,000-some tweets, we have 371 of them that include first, which is sort of an interesting distribution. So, any questions so far? So we are operating on synthetic social media data, because if you've ever seen a demo involving real social media data, it's truly horrific. It's what people put on the internet — well, what people put on the internet is awful.
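The split-explode-filter-aggregate pipeline has a simple pure-Python analogue, which may help if the Spark version is hard to picture. The sample updates here are made up:

```python
from collections import Counter

# Pure-Python sketch of what the Spark pipeline does: split each update into
# words ("explode" one row into many), keep only the hashtags, then count.
updates = [
    "what a #FirstImpressions morning #first",
    "buy these biscuits #first #snacks",
]

words = [w for text in updates for w in text.split()]    # split + explode
hashtags = [w for w in words if w.startswith("#")]       # filter
top = Counter(hashtags).most_common(3)                   # group, count, sort
print(top)
```

In Spark the same three steps are `split`, `explode`, a `startswith`/`filter`, and a `groupBy(...).count().orderBy(...)` aggregation over the stream-backed frame.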
It's like, yeah, no, I mean, you'll just be either sad or scandalized or both. We might have to censor some of the data. Yes. So we are using synthetic social media data. I don't know, Mike, do you think we have time — is anyone interested in seeing how we get synthetic social media data? Is that a yes or a question? Yes, okay. We have time to go over this quickly, right? Yeah, okay. So all right, we have another notebook that basically replicates the script we're running, the one we're using to talk to Strimzi to generate this stream of updates. And if you just go to this generate notebook, click on it with the correct mouse button, unlike the way that I just did. And basically what we're doing here — you don't need to actually run this; there's output in this notebook. I think you can run it if you want, but it's not necessary. What we've done here is we've taken the complete works of Jane Austen, the English novelist, and we've generated a Markov chain, which basically captures, in the works of Jane Austen, how likely it is, given a sequence of words, that some word will come next. And we're using that to generate random text in the style of Jane Austen. Now, Jane Austen is an amazing, world-class, all-time author. Most people who write on social media do not write as well or as cleverly as Jane Austen. So we can't just use Jane Austen as our source, or it won't look like social media. We also want hashtags, because Jane Austen didn't use hashtags, but people on social media do. So we're using a natural language processing library called spaCy, and we're gonna use spaCy to identify named entities in our text. And basically — not to dive in too much to natural language processing yet — this basically just means we're gonna say: is it a noun? Is it a noun phrase? How is this thing functioning in this sentence? And we're gonna turn nouns and noun phrases into hashtags. And so let's see what this looks like.
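The Markov-chain idea can be sketched in a few lines of Python. This toy version is first-order (one word of context), which is simpler than whatever order the lab's actual generator uses, and trains on a single famous sentence rather than the complete works:

```python
import random
from collections import defaultdict

# Toy corpus standing in for the complete works of Jane Austen.
corpus = ("it is a truth universally acknowledged that a single man "
          "in possession of a good fortune must be in want of a wife").split()

# Learn, for each word, which words can follow it.
chain = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    chain[prev].append(nxt)

def generate(start, n, rng):
    """Walk the chain for up to n steps, starting from a seed word."""
    word, out = start, [start]
    for _ in range(n):
        if word not in chain:
            break
        word = rng.choice(chain[word])
        out.append(word)
    return " ".join(out)

print(generate("a", 8, random.Random(17)))
```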
So we have some example sentences up here with no hashtags. And then we're gonna see what it looks like when we turn them into hashtags. So you can see we have a sentence with no hashtags, and the same sentence where we've turned the things that spaCy identifies as named entities into hashtags. So unfortunately, we have a lot of these chapter 14s, which is what I get for taking the complete works of Jane Austen from Project Gutenberg without cutting out chapter headings. But you see Charles and Mary become hashtags. You see here, we have a noun phrase, Fanny Price. "Fanny Price was at this time of year" — another noun phrase becomes a hashtag. So even if it's more than one word, spaCy is clever enough — or has a good enough model — to identify what these things are. Hashtag a very few weeks, right? Like, you probably know someone who makes up sentence-length hashtags like this. I know a few people like that. Hashtag this time of year. Okay, so now, because the level of discourse in Jane Austen novels is dramatically higher than that on social media, we're gonna have some other sources too that we're gonna incorporate into our stream. And the other sources we're gonna use are public domain Amazon product reviews of fine foods, taken from a Kaggle competition. And we're gonna use some of the positive reviews and some of the negative reviews and incorporate those into our social media stream. So — I didn't need to do that. This is what happens when you run a cell in the middle of the notebook: you haven't imported the libraries you're gonna use, and now you have a problem. So this is just some code to read from gzipped text files that have these reviews, and we build a model of negative reviews and a model of positive reviews.
And then we can see that if we make short sentences based on the positive and negative models, we actually get a negative sentence from the negative model and a positive sentence from the positive model, which — it's nice that it works out that way. These are short sentences. But we can actually combine all of these models together and say, let's build a Markov model of all of these things combined, and we might just mid-sentence go from being Jane Austen to being, like, why did I buy these cookies, right? And that gives us some really, really strange results. But I've included it in the notebook because it's sort of amusingly strange, if you're easily amused. And as we saw with the queries, we're assuming a distribution of users where some users are really chatty and most users are not, right? So this is just what the distribution of user updates looks like. We have sort of a histogram, for each user, of how many times we're gonna see tweets from that user when we use this generator code. A few users are responsible for almost all of the updates, and most users are responsible for very few. Stop lurking and engage, people. So we put all this together and we have a function that generates synthetic social media updates. We did add some hashtags that don't appear in either Jane Austen or product reviews but that appear a lot in social media, like Follow Friday or YOLO or retweet and so on. And then we have basically a Python generator that we can use to generate a pair of user ID and tweet. So you can see what the output from that looks like, and if you imagine what it would look like to put that output onto a stream instead of just into a notebook cell, then you basically know what we're running this lab on. Thank you. We'll be here all week. Any questions? Yes, absolutely. If you go to github.com/radanalyticsio/streaming-lab, yeah.
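A hedged sketch of such a generator: a heavy-tailed draw over user IDs (a few chatty users, many quiet ones) paired with synthetic text. `make_text` is a stand-in for the Markov-model sampler the lab actually uses, and the Pareto shape parameter is made up:

```python
import itertools
import random

def make_text(rng):
    # Stand-in for sampling from the combined Markov models.
    return rng.choice(["Delightful walk today #YOLO",
                       "Why did I buy these cookies"])

def updates(num_users=3563, rng=None):
    """Yield (user_id, text) pairs forever, with a long-tailed user distribution."""
    rng = rng or random.Random()
    while True:
        # paretovariate gives values >= 1 with a heavy tail, so low-numbered
        # users show up constantly and most users barely appear.
        uid = min(int(rng.paretovariate(1.2)), num_users)
        yield "user%d" % uid, make_text(rng)

stream = updates(rng=random.Random(2019))
for user, text in itertools.islice(stream, 3):
    print(user, text)
```

Feeding each yielded pair to a Kafka producer instead of printing it is essentially what the lab's generator service does.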
Everything's there, and there are instructions on how to deploy everything that you see here on your own OpenShift infrastructure, and you can play with all the code. And there's actually some bonus material in these notebooks that we're not gonna have time to cover today, with computer vision. So if you wanna dive deep, it's available. All right, and you can obviously see how efficient this is, to make sure you can run at a given rate and so on. But that's just a quick sidebar on actually generating this stuff. So the last thing I wanna do before handing it over to Mike to talk about applications is look at how we would actually do an intelligent application with this data, right? We've looked at how to process streaming data at scale. We've looked at how to do these database-style queries. And we wanna do something useful with this data. So let's look at the problem of sentiment analysis. Basically, sentiment analysis is a common natural language processing problem, and the idea is: looking at a sentence, can you say whether it's positive, negative, neutral, what? This is useful for a lot of applications, but I think a lot of them have to do with actually interacting with humans. If you have an automated chat service for support for your product and someone is becoming increasingly enraged with the answers that your automated chat bot is providing, you wanna maybe escalate them and hand them over to an actual human. If someone is tweeting about your product, you could write a very simple script that just automatically retweets, or puts on a video screen somewhere, everything where someone is using your hashtag, but that is almost certainly gonna lead to hilarious results. Maybe you want an automatic way to filter out things that are negative before you do that.
There are, I think, a lot of cases of people broadcasting social media updates that mention their product not having the results they'd hoped for, which is why we don't use live real data — we use live synthetic data. We had a long discussion about what data sources we would use: how could we make it appropriate for everyone? And not taking anything off the public internet is really the only answer. So again, the focus of this workshop is really: don't become a machine learning expert just to add intelligent features to your app. Learn enough about these techniques so that you can figure out whether or not they're working, so you can evaluate whether or not they make sense, and learn enough about them so that you can integrate hard work that other people have done. This is really the standing-on-the-shoulders-of-giants lab, in a bunch of ways. And we're gonna delegate most of our work to two different libraries. spaCy is a really cool natural language processing library for Python. It's really fun to interact with. It has pre-trained models for a bunch of different natural languages. So we're gonna use English, but if you'd prefer to analyze text in a different language, spaCy probably has a model for the language you're interested in working with. And VADER is a library that can actually look at an individual sentence and tell you whether it's positive, negative or neutral. So we're gonna use these libraries in conjunction, because actually dividing a social media update into sentences is sort of an interesting problem. If you think about it, how would you divide something up into sentences? Well, you could write a regular expression, but I don't even need to finish the joke: now you have two problems. You can't just split on periods, because periods appear in the middle of sentences. It's more complicated than it seems.
You really want a model that accounts for the fact that people don't punctuate properly, right? You want a model that says: how is this grammatically fitting together, and where are the sentence boundaries likely to be, even if we can't detect them syntactically? Does that make sense? So we're gonna use spaCy to split tweets up into sentences, and we're gonna use VADER to identify the sentiment of each. And we're gonna do it all in a streaming database query. All right, so we're gonna start by — again, we can go up to the cell menu and select all output clear if we wanna play along. If we just wanna look at the pre-rendered output, we can do that too, but that's less fun, and we're paying for this infrastructure, so you might as well use it. So we're gonna start by importing the spaCy library and loading a model of English. And then we're gonna import the sentiment analyzer from the VADER library. So we're gonna start with an example, just to play with spaCy and see what it looks like. And this is the first two sentences of Pride and Prejudice — we're staying with Jane Austen for this entire lab. And we're gonna have spaCy use the English model to parse this and make sense of it; we're gonna use it to analyze this text. So we have a result from using spaCy, and spaCy has turned this into a spaCy document. So we can look at the document and we can do various things with it. So how many people here are really comfortable with Python? I should have asked this earlier. How many people here are somewhat comfortable with Python? Okay. How many people here have not used a lot of Python before this lab? Okay. So a really cool thing about Python is that most code is self-documenting, right? So if I don't know what this spacy.tokens.doc.Doc thing does, I could search the internet for API documentation, or I could just use the built-in help function.
And the built-in help function shows me that there are a bunch of different things I can do with one of these spacy.tokens.doc.Doc objects, right? So there are a lot of really cool things I can do here. There are a lot of fun things you can do with spaCy, and just to give you an idea of how powerful it is, we can use it to identify the parts of speech in this natural language text. Those of you who did not enjoy studying grammar in school probably wish that this had been available then, right? But we see that we have "it," pronoun; "is," verb; "truth," noun; "universally," adverb; "acknowledged," verb. So it really does — I mean, it's not perfect, but it does a pretty reasonable job of figuring out what's going on in this text. And this is, again, what we used to identify things that should become hashtags earlier; we just looked for the named entities, right? So this is pretty cool. There are a lot of other things you can do with this document, and spaCy is a really cool library to know about even if you never use it with streaming. So I would say this is enough to get started; read the docs and think about it if you need something for natural language processing. So again, for this notebook, we're just gonna use spaCy to say where the sentences are and then feed those sentences into VADER, which has a model for English, to analyze sentiment. So we're gonna set up one of these VADER analyzers, and let's get the sentiments of the sentences in that Jane Austen excerpt. So we're gonna get the sentiment scores for every sentence in our document. And we could just use this sents — s-e-n-t-s, it's hard to say that without sounding like "sense" — the sentences from this result, and analyze the sentiments of the sentences. As we can see, we have two sentences, and they're both sort of generally positive, sort of neutral.
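Playing along with spaCy outside the lab image might look like this. It assumes you've installed a small English model yourself (for example via `python -m spacy download en_core_web_sm`; the lab's notebook image ships with a model preloaded), and the exact part-of-speech labels vary between spaCy versions:

```python
import spacy

# Assumption: en_core_web_sm is installed; the lab may use a different model.
nlp = spacy.load("en_core_web_sm")
doc = nlp("It is a truth universally acknowledged, that a single man in "
          "possession of a good fortune, must be in want of a wife. "
          "However little known the feelings or views of such a man may be.")

# Part-of-speech tags for the first few tokens, e.g. "It" PRON, "is" AUX/VERB.
for token in doc[:6]:
    print(token.text, token.pos_)

# The sentence boundaries we'll rely on later come from doc.sents.
for sent in doc.sents:
    print(sent.text)
```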
Again, if you're analyzing Jane Austen, you really want something like "is this also laugh-out-loud funny," but we don't have one of those. We just have positive, neutral and negative. So if we wanna see some really negative prose, we can take some raw text from the negative product reviews. So let's look at these two reviews. The first one: this oatmeal is not good. It's mushy, soft, I don't like it. We have a few sentences there. The second review: seriously, this product was as tasteless as they come. There are much better tasting products out there. So we'll see that these models aren't perfect, but we'll look at the sentiment scores for each sentence here. "This oatmeal is not good." Negative is high, neutral is sort of there. Compound is the number you really wanna look at, and that's less than zero, so it means negative. "It's mushy, soft, I don't like it." Negative. "Quaker Oats is the way to go." Neutral, I guess. I mean, it sounds kind of positive to me, but I don't know what we're using the oats for in this case, so out of context, maybe. "Seriously, this product was as tasteless as they come." Again, that's a negative score. This one is interesting, because this next sentence is clearly not that positive, right? This is sort of damning with faint praise. Is that an idiom that's familiar to everyone? It's where you say something that's, like, wow, it could be worse. Although it's really not saying anything positive about the product, it has "much better" in it, right? It has a bunch of words that are associated with positive sentiment, so it winds up getting a high positive sentiment score. So again, these models aren't gonna be perfect, and it's something that you need to sanity check if you're gonna use it. But we see how this looks, and it's a simple way to get reasonable results, right? Even if they aren't perfect.
So the talk is called adding intelligence to stream processing applications — not adding intelligence to individual sentences that we've taken from data sources and put in a notebook, right? So we wanna do this in a stream. All right, so we'll show you how to actually hook this stuff up to Spark. As before, we're gonna create a new Spark session and create a data frame backed by the Kafka topic that has our synthetic social media updates on it. And what we're gonna do to save us some time is we're gonna do all of that destructuring that we did in the last example right away. So instead of doing all of the from_json stuff step by step, we're just gonna go straight from the Kafka topic to a table that has three columns. So, the thing we're gonna see next: we looked at a Spark user-defined function two notebooks ago. I know that was a long time ago, and we may not remember all the details, but basically you declare a type, you declare a function, you register it with Spark, and you can use it in your database queries. We wanna use both spaCy and VADER in these kinds of queries as well. And we have to do a little bit of magic to make that work, because when you do these distributed computations with Spark, you need to distribute everything that your function depends on, either by serializing it so that you can send it to different cores on the same machine, or across the network if you're running on a cluster. And the models that spaCy uses don't serialize very well, so we need a sneaky way to get those spaCy models onto the workers from within our Spark jobs. And it winds up — for those of you who are familiar with pthreads, it all comes back to pthreads. Actually, it doesn't really all come back to pthreads. We wind up simulating something like thread-local storage.
I'm not gonna explain exactly what's going on here; I'm just telling you why we have this cell that you need to execute, or else your notebook will not work. Okay, so now we can use the spaCy model to make a user-defined function that splits an update into sentences. It's the same thing we were doing earlier on the Jane Austen text and on the product reviews — creating a spaCy document and splitting it by sentence — but now we have a user-defined function to do it on a data frame, and on a data frame backed by a Kafka topic. Makes sense? So we'll run that, and let's see what it looks like just running it on a few rows of the data frame. This will take a little bit longer, but we see sentences — there's one sentence in this update, and so on. We'll see that sometimes these strings of hashtags that people add to their updates to make them more discoverable and increase engagement wind up looking like a separate sentence. But you get the basic idea. Hashtag blog post, hashtag help. Social media, YOLO. And you see that this person, who really loves ellipses so much that they've invented new ellipses that science has not yet discovered, is not confusing spaCy: we're not identifying a sentence ending here, for example. And we see that these updates have a few sentences each that we've split up. So — or we have a few updates with the same ID, because we have run the generator twice. Yeah, okay, so we've run the generator twice. All right. So we're gonna, again as before, split these things up into different rows and see that we have a different row for each sentence. So the next thing we're gonna do is actually create a user-defined function that we can use to annotate each sentence with a sentiment score.
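One way to sketch that cell: since spaCy models don't serialize well, each worker process lazily loads its own copy inside the UDF, which is the thread-local-storage-like trick mentioned above. The `records` frame, the model name, and the column names are assumptions here:

```python
from pyspark.sql.functions import explode, udf
from pyspark.sql.types import ArrayType, StringType

# Each Python worker process loads its own spaCy model on first use,
# instead of receiving a (badly) serialized copy from the driver.
_nlp = None

def _get_nlp():
    global _nlp
    if _nlp is None:
        import spacy
        _nlp = spacy.load("en_core_web_sm")  # assumed model name
    return _nlp

@udf(ArrayType(StringType()))
def to_sentences(text):
    return [s.text for s in _get_nlp()(text).sents]

# Assumes `records` is the flattened (update_id, user_id, text) frame;
# explode gives one output row per sentence.
sentences = (
    records.withColumn("sentences", to_sentences("text"))
           .select("update_id", "user_id", explode("sentences").alias("sentence"))
)
```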
So the idea is I'm gonna use VADER, and I'm gonna use what's called a broadcast variable in Spark, which is an efficient way to communicate this sentiment model to all of the workers that might be working in my distributed system. So I take the VADER analyzer here, I tell Spark to broadcast it, and then I have basically a handle for it. This is also kind of sort of like thread-local storage. It all comes back to pthreads, if you take away anything else from this lab. You mocked me for this. I didn't mock you at all. So what we're doing here is: we have a handle to a value, and then inside our actual user-defined function we need to dereference it, right? We just need to say .value to make sure that we can access it. What this does in a distributed context is, instead of serializing the sentiment analyzer along with the function when we ship it off to the other nodes in the cluster, we can rely on a more efficient way to communicate this larger and more complicated data. And for reasons that I do not wanna get into, and actually will insist on taking offline, this works for the VADER model but it doesn't work for the spaCy model. Thank you for not asking follow-up questions and respecting my wishes there. All right, so we're creating a user-defined function to add sentiment scores to sentences, and then we have a way to write the thing we wanna do — annotating sentences with sentiment scores — in a query. Okay, so: Will, you say this lab is entitled adding intelligence to stream processing applications. You don't have an application. You have this interactive notebook. I pressed shift-enter many times, but I don't have an application. I will address your concern in just a second. First, does anyone have any questions? All right, I will address your concern by handing the microphone to Michael McEwen, who is gonna show us how to turn this into an application.
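A sketch of the broadcast approach; `spark` is the active session and `sentences` is assumed to be the exploded frame from the previous step:

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Broadcast the analyzer once; workers get it via an efficient mechanism
# instead of having it serialized into every shipped closure.
analyzer_bv = spark.sparkContext.broadcast(SentimentIntensityAnalyzer())

@udf(FloatType())
def sentiment(sentence):
    # Dereference the broadcast handle with .value inside the UDF.
    return float(analyzer_bv.value.polarity_scores(sentence)["compound"])

# Annotate each sentence with its compound sentiment score, in a query.
scored = sentences.withColumn("sentiment", sentiment("sentence"))
```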
All right, so we saw how easy it was to take this notebook that Will's been working on, and each one of us has loaded it up now, right? So we've all gone through the process that Will went through in learning about the data and in creating this kind of enrichment of the data. At this point, though, like Will said, this isn't anything we can hand to users or customers or whoever — it doesn't really do anything aside from explore the data and show us what's going on. So we might come together — Will as the data scientist and myself as an intrepid web engineer — and say, how do we turn this into something that we can put in front of our users? And the first step would be Will shares this notebook with me, and I start to look at the code and say, all right, I see how these things work. I see how it comes together. I'll start to think about what we can do with this. Now I'm gonna open our GitHub repo here, because we have some diagrams, and I'll show you the process of what we're gonna create, and then I'll deploy it and we'll look through what's happening. So this is the repo we mentioned before, and you can see we've got some instructions here on what's gonna happen. This is basically a diagram of what we have been doing. You're not seeing Kafka because we've deployed it in a different part of OpenShift, but we had our generator, it's sending messages, and the Jupyter notebook is picking them up. The screen's a little small here, okay.
So what I would like to do now: those updates are coming directly from whatever our social media platform is, and they're getting put onto a topic in Kafka. What I would like to do is create a Spark application that will live as a microservice. It will read from that first topic, apply the sentiment analysis, and then put the results onto a second topic, so the second topic contains all the data with the sentiment analysis on it, and then we can have other applications that read from that sentiment-enriched data and do fun things with it. Now I need to go back to full size so that I can read this. The way to deploy these things from the command line is to use these oc commands — for those of you who are familiar with OpenShift, oc is the OpenShift client — and everything that you're seeing here I could also do through the web console. It's a little more verbose because there are a lot of options we have to fill out, telling it where the repo is and so on. What we're gonna do is use this command to create a new application, and I'm gonna use a template that we've created as part of the radanalytics project. We've created some templates that allow you to bind Spark clusters, kind of ephemerally, at launch time with the application that you're creating. I'm gonna tell it where to get the source code from, which is this repository, and the code for this is in the update-transformer subdirectory. I'm gonna tell it where the Kafka broker is, and I'm gonna give it an input topic and an output topic to do the transformation we talked about. Now before I launch it, though, let's take a look at the source code, because we can see how the work that Will did has come directly into the code that I created to wrap those algorithms.
So we'll go into the update transformer. You can see this application is actually very simple — it's just a single Python file, and it's only 134 lines. Again, here's this magic code that we talked about before that has to be included, but we'll discuss it no further. Exactly, yeah. Let us not speak of the shortcut again. And then this gets into the main part of our application. Now this looks a little bit different from what we did in the notebook, but essentially it's very similar. I'm getting a connection to Spark at the top. I'm setting up the same message structure that we used to decompose the messages coming off the stream — this is the schema it's gonna use. I'm setting up the sentiment analyzer and this broadcast variable so that when the Spark workers are doing this work they can use the proper context for it. And then this here might look very similar to some of the things we did in the notebook, because I took it right out of the notebook: how do we pull out the various pieces? How do I get the sentences, and then how do I turn those into sentiment? This is the user-defined function that I'll use to add the enrichment to the data, and this code came directly from the notebooks. I wrapped it up a little bit to fit in here, but it's essentially the same code. We've got the JSON converter to pull the message out of JSON and into this structure, this part is gonna send it out to the output, and we're setting up the UDFs. Then finally we get down to something that looks very similar to what we did when we connected to Kafka. This long piece here is how I'm telling Spark to set up the structured streaming. What I'm instructing it to do is put together a series of operations that will happen on the stream. And at the top, this looks very similar to what we did before.
We're connecting to the broker and we're connecting to a topic, and in this case the values are not hard-coded, because we want this application to be useful beyond just our one specific setup. This is part of the work of the application engineer — making this more generally useful. And you can see here we're selecting the value that comes off of Kafka. We're turning it into JSON. We're pulling out the various fields — or columns — that we want, and then we're putting it into our sentiment generator. This is gonna return an object to the next operation, and that next operation will convert it: it's gonna take the user ID and the update ID and the text and the result of the sentiment analysis, and turn all of that into a value that will get passed down into the output. Is this all making sense? Am I going too fast? Okay. At the bottom here, we're setting up what we want to do with this stream. Now in the case of the notebook, we created everything above this and then just kind of took values one at a time to see them. But here I'd like to tell Spark: you know what, I don't need to sit here and look at them one by one — I want you to write this stream out to another Kafka topic. So I'm telling it which broker to go to, I'm telling it the output topic I'd like the data to go on, and doing a little setup here. Then I tell it to start, and my application will just run until something interrupts it. It's gonna continuously read messages from the input topic, apply sentiment to them, and put them on the output topic. There's a little more at the bottom here that's, for lack of a better word, boilerplate to help the application handle its input and output, so that when it runs it can read environment variables that say where the broker is, what the topics are, and whatnot.
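The record-level work that streaming query does — parse the Kafka value as JSON, pull out fields, enrich, re-serialize — can be sketched as a plain Python pipeline. This is a stand-in for Spark's structured streaming, not the lab's actual code; the field names match the updates we've seen, and `analyze` stands in for the broadcast VADER scorer:

```python
import json

def enrich(raw_value, analyze):
    # One record's trip through the query: parse the Kafka value as
    # JSON, keep the fields we care about, attach a sentiment score,
    # and re-serialize it as the value for the output topic.
    update = json.loads(raw_value)
    return json.dumps({
        "user_id": update["user_id"],
        "update_id": update["update_id"],
        "text": update["text"],
        "sentiment": analyze(update["text"]),
    })

def run_stream(input_values, analyze):
    # Stand-in for the continuously running query: map every record on
    # the input topic through enrich() and emit it to the output topic.
    return [enrich(v, analyze) for v in input_values]
```

The real query does exactly this per micro-batch, except Spark handles the Kafka reads and writes, the parallelism, and running forever until interrupted.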
So, what I'll do now is go back and just copy this command, and hopefully not screw it up too badly, because I have a habit of doing that. And my colleagues here find it extremely hilarious when I do. That is actually totally false. I've seen Mike be completely unflappable, debugging my catastrophic infrastructure failures in front of hundreds of people, while I'm tapping over to a pre-rendered notebook and Mike is like, well, let's figure out what's going on here. So, okay, this might be a little tough to see. Let's see if I can get a better view — it's better here. Okay, is that big enough, can everyone read that? So I'm just making sure I'm logged into the cluster, seeing who I am — I'd better check what project I'm in. That's a good call there. Okay, so I'm in the right project, the lab project for myself. What I'm gonna do, though, is log out of Will's account and log into mine, and you can see a little bit behind the scenes of how we're setting this up. All right, so this looks kind of similar to before. We're looking at OpenShift here, and these are different namespaces inside of Kubernetes. Just so you see what's going on: I've got a namespace here called kafka, and I've actually deployed the Kafka brokers in there using the Strimzi tooling — it's just a couple of simple commands and Kafka's running for me. I've made this available to everyone's project so that we can all access that same Kafka. And then inside the lab, right now what I've got running is just the generator application — the one generating the updates that we're seeing. What I'm gonna do now is hopefully launch this other application. So let's paste that in. I'm telling it the application name, where the Git repository is, what directory to look in, and I'll tell it where my Kafka broker is. And I see there's a mistake in the readme file — if anybody can beat me to a pull request, that's cool.
So I'm gonna tell it: read from this social firehose topic, write out to a topic called sentiments. This is telling — oh, you know what, I did the wrong thing. So, okay. The cluster that we're running on here is a version of OpenShift that came out last year — we're doing that because of some infrastructure issues — and I've got a different version of the OpenShift client running here. If I check the versions — yep, you can see the client that I'm running is actually a 3.11 client, but I'm talking to a 3.7 cluster. So it doesn't recognize some of the objects — these are Kubernetes objects — that I'm trying to tell it to create. What I need to do is use a previous version of the client so the compatibility is a little better. So, back up to my command here. You see what I did, right? What I should see is — okay, now it's created. It's telling me it failed because previously, as it was telling OpenShift to create the objects in my application, it made the service, which is one type of object that handles networking, but then it failed to make the other types. So it didn't deploy the application, but it did deploy the service. This time it's telling me, oh, the service is already there — but since the other ones can be created, it created them. And if we go to OpenShift, what we see now — I should have switched over faster, because it's already done — is the transformer application running, and a Spark cluster that was deployed with the application as I started it. This is the Spark master and this is the Spark worker, and the application is probably already using them. We'll just look at the logs quickly to see what it's doing. It should be, at this point, chugging along, reading from the topic. So these are the Spark logs — it's reading information and broadcasting it out.
It's actually a little quieter than I might have expected. So at this point, I've got this information coming in and going out — what do I really wanna do with it? I have some simple applications I can use to monitor the stream and see what the raw data on it is. One of the other projects we have provides some microservice skeletons that are really easy to use. For something like this, I've got a microservice that all it does is attach to a Kafka topic, read from it, and print the messages out into its logs. So I can make it really easy to attach a simple microservice and see what's going on in Kafka. There are other ways to do this — there are tools in Kafka that you could use to just read the stream — but since this is all happening in a cloud environment, I might as well use the microservice, because it's just as easy, I know it'll connect, and I don't have to worry about shells inside of containers and whatnot. So I'm just gonna grab this — I don't think I need the older client for this, but we'll find out in a second. And I'm gonna start by telling it to listen on the source topic that we're reading from, the one we saw, and then I'll switch it to the other one. All right, so this is a much simpler application to create — I don't have a complex template in place, so it can use a newer version of the client. You'll see it starting to deploy, and it's reading a bunch of information that's going by very quickly. If I stop following for a second — let me close this down — we can see this is the information we just looked at previously, right? We've got our update ID, some number, the update and the text. Okay, great. So this is pretty much what we expected to be there.
Now one of the things that I find kind of fun and interesting is that once these microservices are deployed, we're using environment variables inside the container to inject configuration parameters. So in this case, the listener application had a couple of environment variables: I told it where the Kafka broker was, and I told it what topic I wanted it to follow. Now I wanna see the output topic, and I can actually just change that environment variable — and OpenShift will start to deploy the new one. It's already deployed it. If we look at the logs now — stop following — what we see is that we've got our text, and we've also got a new sentiment entry now, right? So we're adding the sentiment information to what we had previously — the update, the user ID. Okay, great. So now we've transformed the data from the first stream onto the second stream. What's something we might do with it? We might make some sort of dashboard so that we could watch the values going by. We might wanna look for the most positive and the most negative, and what we have in this repo is something we call the visualizer — that's the last piece here that we're gonna deploy. We're running a little short on time, so I'm just gonna run through and deploy these, but you can follow the instructions in here at home and all these things will work the same way — aside from the typo that I'll fix later. So okay, let's grab this command to start the visualizer. Just gonna close this down so we can see — okay, so it's building that now. At this point, this is actually an application that exposes a REST interface, so this is more like something I might imagine as a developer — an API that I can start to interact with. If I look at the logs, I'm not really gonna see anything; I'm not gonna see the data moving through. So what I need to do is one more thing here.
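The environment-variable injection described above might look like this in a service's startup code. The variable names and defaults here are illustrative assumptions, not necessarily what the lab's templates actually use:

```python
import os

def read_config():
    # The platform injects these into the container at deploy time;
    # changing an environment variable on the deployment triggers a
    # rollout with the new value. Names and defaults are hypothetical.
    return {
        "brokers": os.environ.get("KAFKA_BROKERS", "localhost:9092"),
        "topic": os.environ.get("KAFKA_TOPIC", "social-firehose"),
    }
```

This is why retargeting the listener from the input topic to the output topic is a one-variable change rather than a code change: the code reads its topic at startup, and OpenShift redeploys the pod when the variable changes.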
I'm gonna tell OpenShift to expose a route to this service that we called visualizer. And you can see now I've got a URL that it's created for me — it's created a DNS entry in its own edge router so that we can route into the application. And if I visit that — what you're seeing is a JSON plugin I'm using in Firefox to pretty-print the data. The top entry says "last seen": this was the last message that came across the bus, and these are its sentiments. Then we've also broken out the most negative that it's seen. We had something here that was a minus 0.87. Who knows what that is? Maybe someone's trashing our website or something. "Her representation of her cousin's state at this time or the evil would have been desperate awkwardness but their straightforward emotions left no room for the little zigzags of embarrassment." I mean, I don't know why VADER is doing what it's doing, but it thinks this is really bad for some reason. Maybe it doesn't like zigzags, I don't know. And likewise, we've pulled out the most positive. This 0.916 is a pretty positive result: "chapter hashtag 60 ... Elizabeth's spirits soon rising to playfulness again. She wanted more clearly to understand what Mr. and hashtag Miss Crawford, the children of" — I mean, this doesn't even really make sense, but like Will was saying, maybe it saw "love" and something else and it liked that. But you can see we've started to create something that has value outside of just this one area, because now I could turn this application over, expose the interface to other developers, and say, all right, now you've got an interface for finding the most positive — or you could create other applications that do this. And if I reload the page, we'll see at the top that the last seen is different now. And the most negative has also changed.
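The state the visualizer serves — last seen, most positive, most negative — is small enough to sketch as a little tracker class. This is an illustrative stand-in for whatever the visualizer actually does internally, assuming each enriched update carries a `sentiment` field:

```python
class SentimentTracker:
    # Keeps the last update seen plus the running most-positive and
    # most-negative updates, like the visualizer's JSON payload.
    def __init__(self):
        self.last_seen = None
        self.most_positive = None
        self.most_negative = None

    def observe(self, update):
        # Called once per record consumed from the sentiments topic.
        self.last_seen = update
        score = update["sentiment"]
        if self.most_positive is None or score > self.most_positive["sentiment"]:
            self.most_positive = update
        if self.most_negative is None or score < self.most_negative["sentiment"]:
            self.most_negative = update
```

A REST handler would then just serialize these three fields, which is why reloading the page shows a new "last seen" on every request while the extremes only move when a record beats them.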
It's seen a new most negative — "this was very bad." That sounds pretty negative. I don't know how much more negative you can get without getting explicit. But the most positive is still our weird sentence that VADER seems to like for whatever reason. So I hope you can see from this how we've built a workflow where data scientists and application developers can exchange information about extremely complicated topics. I mean, I don't understand sentiment analysis — I don't understand how it's doing all this stuff — and really I don't need to, because what helps me out is when Will gives me these notebooks. He's been pretty rigorous about what he's done; he's telling me everything I need to know here. This is like me documenting code. This is him documenting code, right? So from his instructions, and from reading and stepping through the notebook, I've learned something about the process he's going through and the algorithms he's creating. And this becomes a very powerful tool, because we don't have to say: let's have a meeting, let's talk about what you want to do, show me code samples, all this other stuff. I can work with this for an afternoon — play with it, break it, that kind of thing. And then when we get together, I don't have stupid questions or crazy questions; I can just say, well, I don't get this part, can you describe it to me? So, we're getting really close to the end of the time here, but I hope you see how this is the process of how we would add that enrichment to our applications. And this last part is very bespoke, because I just took what he did and made up some phony-baloney application and said this might look like something useful. So I guess — any questions? Yeah — would you just run it in more pods? We don't scale the actual transformer, right? The transformer here is the driver application that's running Spark.
And if we noticed that it was not moving fast enough, there are a couple of ways we could handle it. One way: I could go to these Spark workers and just scale them up. And this can even be done automatically — we have some experimental tooling that watches Spark, sees the load, and increases the number of executors. So I can just keep scaling this up, and because the code is being distributed from the transformer to the Spark workers, by increasing that, we're getting more processing power. I think one thing we'd have to consider in this case is that you're limited on how much parallelism you can get out of Spark by how much parallelism you have in Kafka. So that's a question that we're deliberately not answering. Yeah — it's a good point. And what I was doing while Will was talking is deploying another tool that we've created through radanalytics, the Oshinko web UI. It's a microservice, and if it deploys in time we can visit that website and it'll show us all the Spark clusters in this project, and we could use the scale operation from there to scale up. There's also a command-line tool that does this as well, so you could say: here's my cluster, scale to 10 workers — and it'll scale it for you. Good question, thank you. Yes? That's a good question. So the question is: if we have Spark running in Kubernetes, can we get a driver application that's running outside of Kubernetes to talk to that Spark cluster? Oh — the Spark cluster outside of Kubernetes, with the Spark driver inside? Okay, that direction is easy to do, because you can expose your application and I think you can get the binary traffic back. The other direction — driver outside, cluster inside — not without heroic effort. Okay, so this is something that's very complicated about Kubernetes and about the way Kubernetes has what's called an edge router, right?
The edge router is the edge of Kubernetes exposed to the outside world, and these URLs are what allow us to go through the edge router and route to our applications. I'm pretty sure right now you can do HTTP and HTTPS traffic, and I think there's a way to do gRPC traffic through there, but because of the way Spark communicates back and forth, I think the edge router would probably reject it. We've tried these things where we've moved the Spark cluster in and out, and I think a lot of it comes down to the communication between the workers and the driver: the workers and the driver need direct, point-to-point communication with the application and with each other, right? So in that case you could prototype on YARN outside of Kubernetes, and if you have Kubernetes, you could absolutely run a Spark cluster in Kubernetes that's sized for your application dynamically. Yeah — and there are more complicated primitives in Kubernetes that might allow you to do this. There are ways to expose specific ports and tell Kubernetes "this port on this host goes to this container," but that starts to become brittle, right? Because you need to maintain those mappings, and if you scale things and change things, then the mappings change too. So really — we did a bunch of performance work on this a couple of years ago — you don't pay anything for running in Kubernetes, right? You're running on bare metal to the same extent that you're running on bare metal in Mesos, and to the same extent, you don't have the fine-grained control you have in YARN, right? But this is a question I'd really like to take offline. Basically, the only thing we came up with that might be a problem for running Spark in Kubernetes is that maybe the network overlay introduces some overhead — and we found that if it did, we couldn't identify it.
Running benchmarks on the same hardware under OpenShift and outside of OpenShift with a self-managed Spark cluster, the networking didn't really do anything to impact our performance, even with very network-intensive jobs like machine learning model training, where you're broadcasting a large model relatively frequently for each training iteration. That's a great question, though. And just for the viewers at home, the question was: is there a performance hit for running Spark on Kubernetes? All right, I think we're almost out of time — maybe another question? No? Should we stop? The time is over, so — okay, well, thank you very much then. Thanks so much. Thank you.