Can you hear me? Testing? Awesome. All right. Good morning, everybody. The next talk is on scalable monitoring using Prometheus with Apache Spark, by Diane and Zach from the CDO office at Red Hat. Thank you. Thank you. So we're going to be talking about scalable monitoring with Prometheus and Apache Spark. I'm here with Diane Feddema. My name is Zach, and we work in the CDO office on AI and machine learning. Today we're going to spend some time talking about observability and about performance tuning when you're running machine learning workloads, and I think this is a very important topic. Let's look at some of the use cases. Whenever you're building something, the same problems of developing software exist whether you're doing machine learning, Java programming, or whatever other development you're doing, and performance is a very important thing: if you're shipping something, the quicker you ship it, the better. So let me ask you a question, Diane, in terms of getting something: if I were to order a Tesla today, how long would it take? I think it might take a little too long to get one, and I'm impatient. It's the same with machine learning jobs: sometimes it takes a long time to train a model. Depending on how much data you have and on your hardware setup, it could take months to run an experiment, and I don't think it's good to lose that time. The quicker you run experiments, the quicker you get feedback, and the quicker you can tune and improve. If you take something that runs for a long time, improve its performance, and make it run within a week, that's better than waiting a whole month, and so on.
So one of the things that's important is that in order to improve things, you need to be able to see where the bottleneck is, in terms of performance. I'll tell you a little bit of the history of why we started building all this infrastructure and tooling. We were basically running Apache Spark in Kubernetes pods, and we had data scientists writing notebooks and doing different experiments. But we didn't have visibility into exactly how much memory they were utilizing or what was failing. So we ended up instrumenting a Java agent to run alongside Apache Spark and expose metrics: for example, metrics from the DAG scheduler, the JVM memory pools, the block manager, and other internal JVM metrics from within the container. We later experimented with CRDs and continued doing more experiments, and I got in contact with Diane, because Diane has experience working on high-performance supercomputing, and we did some more work there. So, anybody here know Prometheus? Okay, about 40%. For the folks who haven't heard of Prometheus, I'll give a brief explanation. Prometheus is a time-series database system that collects metrics. In order for it to collect metrics, you have to tell it where to collect metrics from — the location — and how often you want those metrics collected. You have a couple of choices when you're designing a system like this. If you have access to the code base, you can instrument your code: import some libraries — there are client libraries for Go, Java, and Python — and expose specific metrics that way. Another option is to use exporters. The nice thing is that Kubernetes already comes out of the box with instrumentation that exposes metrics. The problem was that Spark doesn't expose its metrics in Prometheus format, so we had to do some experimentation there.
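To make the instrumentation option concrete, here is a minimal sketch, using only the Python standard library, of a service exposing one counter and one gauge in the Prometheus text exposition format. In practice you would use the official client libraries mentioned above rather than hand-rolling this; the metric names and port here are made up for illustration.

```python
# Minimal sketch of a /metrics endpoint in Prometheus text format,
# stdlib only. Metric names and the port are illustrative.
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS_TOTAL = 0  # a counter: a value that only ever increases


def render_metrics(queue_depth):
    """Render one counter and one gauge in Prometheus text format."""
    return (
        "# HELP app_requests_total Requests served.\n"
        "# TYPE app_requests_total counter\n"
        f"app_requests_total {REQUESTS_TOTAL}\n"
        "# HELP app_queue_depth Current queue depth.\n"
        "# TYPE app_queue_depth gauge\n"
        f"app_queue_depth {queue_depth}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        global REQUESTS_TOTAL
        REQUESTS_TOTAL += 1
        body = render_metrics(queue_depth=3).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    # Prometheus would then scrape http://<host>:8000/ on its interval.
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Prometheus scrapes whatever this endpoint returns at the configured interval; the exporter approach works the same way, except a sidecar process renders the metrics on the application's behalf.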
Anybody here know Apache Spark? Okay, a lot more people. So we won't go into the basics, but at a high level, here's what we used it for. The nice thing we liked about Apache Spark was that you can do both batch and streaming: you can consume messages from a Kafka topic, or you can just pull from a file somewhere. It has machine learning libraries, it's distributed, you can do graph processing, and it has a nice SQL API as well. There's some interesting news, too. Does anybody know about the Spark Kubernetes scheduler? Okay, one person. There's actually work being done upstream at Apache, where folks at Red Hat and other companies got together to create a Kubernetes scheduler backend, so that instead of using Hadoop YARN, Mesos, or standalone mode, Spark uses Kubernetes as the scheduler. There are also a lot of programming language bindings: you can write your program in Java, Scala, or R and run your workload on Spark. And then obviously there are the data access options and the different data formats. So, at a high level, what does an application look like when I'm using Spark? Over here you have a data source. That could be a stream, or a file that sits on S3, HDFS, or some other file system you have locally. In the middle, you have some processing happening: you can do ad hoc processing, or you can use machine learning and produce a model, and you'd store that model somewhere like S3 or another system. Here's the high-level architecture of how we have this system in place. We use Spark from the radanalytics.io project, where we have pods: we have a driver, and that driver is the application that's running against the cluster. And when we submit this application, we actually do a few things.
For example, we have a Java agent running within these pods, and that Java agent runs with the Spark master, the workers, and the driver application. At a particular interval, Prometheus will scrape each endpoint and collect those metrics. And one very interesting feature we found in Prometheus: if there's a problem — for example, if you want Prometheus to send you an email — you can set up something called Alertmanager. You set up particular rules, and I'll show you an example of a rule in a later slide. Anybody here work with Java agents? Okay, one person. So, Spark is a JVM application, and it has metrics that you can expose through different options. One of those options is JMX. I don't want to go into too technical a deep dive into JMX and MBeans, but basically, by adding this agent you get your metrics exposed in a format that Prometheus can understand, and then Prometheus will scrape those metrics. So we're getting all the metrics — from Kubernetes, from the network, from Spark — all in one place. What does the configuration file look like? You're telling Prometheus: how often do you want me to collect your metrics? If there's a problem, where do I send notifications? (Alertmanager.) Where are the rules you want notifications for? And what's the URL or location you want me to collect metrics from? For that last part there are a couple of options: you can use static targets, or you can use auto-discovery. For example, in Kubernetes we use auto-discovery to discover the metrics endpoints within Kubernetes. So this is an example of a rule, and this rule is pretty simple.
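A Prometheus configuration covering the pieces just described — the scrape interval, the Alertmanager address, the rule files, and both static and auto-discovered targets — looks roughly like this. Job names, paths, and addresses here are illustrative, not the exact ones from the demo cluster:

```yaml
# prometheus.yml — sketch; names and addresses are illustrative.
global:
  scrape_interval: 10s          # how often to collect metrics
  evaluation_interval: 10s      # how often to evaluate alerting rules

alerting:
  alertmanagers:                # where to send firing alerts
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:                     # where the alerting rules live
  - /etc/prometheus/rules/*.rules.yml

scrape_configs:
  # Static target: e.g. the Java-agent endpoint on a Spark pod.
  - job_name: spark
    static_configs:
      - targets: ["spark-master:8080"]
  # Auto-discovery: let Prometheus find targets inside Kubernetes.
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
```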
If there's a particular issue on your cluster, you can say: if this expression evaluates to true for five seconds, send a critical alert saying that this cluster is down. You can set different types of alerts: for example, alerts that tell you the health of your cluster, or alerts for when too much CPU or memory is being utilized, so that somebody can go in, look at what the problem is, and do some more analysis. We use PromQL to get the metrics out and then put them in graphical form in Grafana. There are a couple of different metric types: gauges and counters. If you only care about the latest value of a metric, you use a gauge. If you want to accumulate incremental counts, you use a counter. So I'm going to pass it on to my teammate Diane, who's going to take it from here. Okay, great. I think I'm tethered to this spot. Can you hear me? Okay, great. In the second half of the presentation, I'm going to talk about how we can use Prometheus to do performance analysis, which is my background: I did performance analysis for years in HPC, on supercomputers, with parallel climate models that used MPI. So it's a little bit different, but the same general idea. When you've got a cluster running like this, you want visibility into everything that's going on. You need to know how much memory you're using in real time; if there's a problem, you'll see it scrolling right in front of you as the job is running, and that's what we're going to show you in a demo. If you hang on for just a minute, I'll give you a little background before the demo. In our example application, we're going to show you a snippet of code and explain how we're going to do an optimization, and we'll see the before and after, running in real time.
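The "evaluates to true for five seconds, then fire a critical alert" rule Zach described earlier looks roughly like this in Prometheus's rule-file format (the rule name, metric, and labels are illustrative, not the exact rule from the slide):

```yaml
# Sketch of an alerting rule file; names and labels are illustrative.
groups:
  - name: spark-cluster
    rules:
      - alert: SparkClusterDown
        expr: up{job="spark"} == 0   # the expression to evaluate
        for: 5s                      # must hold true this long before firing
        labels:
          severity: critical
        annotations:
          summary: "Spark cluster is down"
```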
And first I should explain, for those of you who don't know: Spark is a very memory-intensive framework. You can sometimes run into 100-gigabyte JVMs, which is pretty unusual in the Java world, but it happens fairly often with Spark. So it's worth optimizing your memory usage with Spark; it's worth paying attention to this, and it's a good place to start. In our example, we're going to run a Cartesian product, and we're going to show you before and after caching an RDD that we're going to reuse. Even with Spark's in-memory framework, and even when you have nodes with, say, 200 gigabytes each, it's worth paying attention to the memory you're using. Before I show you the code we're going to run, I'll explain how the Spark memory management model works. There are two memory-use categories in Spark: execution and storage. They share the JVM space, and the boundary between them that you see there is movable. Spark allocates memory in blocks, like this. On the left are execution blocks: in our example code, we do a group by, and Spark builds a hash table to perform that group by; that hash table gets built in those execution blocks on the left. Then, if we cache the result, it's stored in the storage blocks over on the right. The empty space in the middle is shared memory, and we're okay as long as there's no contention. But once there is contention, a block has to be evicted — and execution memory can never be evicted; it takes precedence. So in that case, a storage block gets spilled to disk and will have to be recomputed. That's okay, because we only cache things we're going to reuse. And of course we always want to avoid spilling to disk if we can, because you pay a performance penalty for that.
Now, as execution requires more memory, it's allowed to evict blocks from the storage area up to a point, and that point is something the user can set. You can figure out how big your RDDs are and set that threshold so they won't be evicted and can be reused. Past that point, the execution blocks will actually be spilled to disk. So execution memory takes precedence, and storage has an unevictable amount that we can tune. That's another thing we can look at in our tuning: we can watch the Prometheus and Grafana dashboards and see how the different settings work out for us. One more thing I want to explain before we show you the code and run the demo, and that's Spark SQL. If you like working with relational databases, you can interact with Spark through the Spark SQL API. You get SQL syntax, and you get DataFrames, which we're going to use in our example. DataFrames are like tables in a relational database: they're strongly typed, they have named columns, and you interact with them in a declarative way, using very SQL-like commands. I just want to show you where Spark SQL sits in the Spark stack: you can think of it as a library that sits on top of Spark Core. Our program is like the application up there on the right: it's written in Python, and it uses both the Spark SQL API and the Spark Core API — you can intermix those any way you want. As for the benefits of using Spark SQL — and Spark SQL is sort of the future of Spark — you get a lot of optimizations with it. It has the Catalyst optimizer, which optimizes your queries behind the scenes, and it also has a more memory-efficient representation than plain JVM objects. So, we're going to show you a cached and a non-cached example.
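The movable boundary and the unevictable storage amount described above correspond to two Spark configuration properties. A sketch in spark-defaults.conf syntax (the values shown are Spark's defaults, a starting point for tuning rather than a recommendation):

```
# Fraction of the JVM heap (minus a reserve) shared by execution
# and storage — the unified region with the movable boundary.
spark.memory.fraction          0.6

# Fraction of that unified region that storage keeps even under
# pressure from execution — the "unevictable" storage amount.
spark.memory.storageFraction   0.5
```

Raising `spark.memory.storageFraction` protects more cached RDDs from eviction, at the cost of forcing execution to spill sooner; the dashboards are what let you see which way to move it.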
One of the things you want to know, before you decide how much memory to set aside for storage, is how big your RDDs are. You can just cache your DataFrames, which get turned into RDDs; if you go to the Spark UI and click the Storage tab, you can see right there that we have a 6-megabyte RDD. So we want to make sure that unevictable storage is at least that big, so it will hold our RDDs. This is our code example, non-cached: we generate random RDDs, convert them into DataFrames, do a cross join — which is a Cartesian product — and then do a group by. One of the difficult things about a demo like this is that you have to make a small example that runs quickly; our code actually runs in under a minute. The cached version is exactly the same, except we're caching the RDD that we know we're going to reuse. This is what the Catalyst optimizer does for us: in the Spark UI, you can see on the right how it has optimized your code, and in our cached version — I'm not sure if you can read that, but those are in-memory tables. This is what the dashboards we're going to show live look like, the non-cached and cached versions. Because that's hard to read, I'll expand it a bit and show that by caching this one RDD, we reduce the memory usage by about half. Those are the workers, if you look at the first four entries. Comparing non-cached versus cached on four nodes, we get a 65% reduction in memory and a reduction in our timings. If we run it on eight nodes, we also get over a 50% reduction in memory and in timing. The thing we do leave on the table in this example, though: you can see it's a little bit load-imbalanced. You'd probably want to go back, run this again through Prometheus and Grafana, and maybe increase the number of partitions from 200 to maybe 400, to see if we can load-balance the cached version.
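The payoff from caching comes from avoiding recomputation of an intermediate result that more than one downstream action consumes. As a plain-Python analogy for what Spark's `.cache()` buys you (this is not Spark code, just the recomputation pattern; the names are illustrative):

```python
# Plain-Python analogy for Spark's .cache(): an intermediate result
# consumed by two downstream "actions" is otherwise recomputed from
# scratch for each action.

calls = {"count": 0}


def build_rdd():
    """Stands in for a lineage (e.g. crossJoin + groupBy) that is
    costly to recompute."""
    calls["count"] += 1
    return [x * x for x in range(1000)]


def recompute_counts():
    # Uncached: every action re-runs the whole lineage.
    calls["count"] = 0
    total, biggest = sum(build_rdd()), max(build_rdd())
    uncached = calls["count"]        # the lineage ran twice

    # "Cached": materialize once, reuse for both actions.
    calls["count"] = 0
    cached = build_rdd()
    total, biggest = sum(cached), max(cached)
    return uncached, calls["count"]  # (2, 1)
```

In Spark the trade is the same, except the materialized copy occupies storage memory, which is why the size of the RDD in the Storage tab matters when setting the unevictable storage amount.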
So now we're going to run the demo. I'll mirror it. We start here at the OpenShift console — we've got to refresh that. One second. We're going to connect to the VPN, just to show you that this is live. Definitely a live demo. It's a lot more fun to see a live demo than something with a safety net, like a video. It's still connecting; it'll take a second. We're on one of the student accounts, so it should work. I think we're on now. I'm afraid it may have dropped the student account. It's way more fun to do live demos. Oh, we're back on guest — that's the problem. We were on a student account and now we're back on guest, so it's not going to connect. One second. We're still good on time, so I'll explain some of it while Diane's setting everything up. What Diane's going to show you is a Spark job running against a cluster, with the sample code she showed you. She has the two different versions of the code from the slide deck, and once she deploys them to the cluster, we're going to see live metrics, graphs, and charts. And our demo is okay — yay, it's back. Perfect. Thanks for being patient. Okay, so first thing: we can see that our targets are up here. This is a view in Prometheus. If you've used Prometheus much, you know that one of the first things you want to do is check the status of your targets, to make sure they're actually being scraped and have a status of up, so that they're sending you data. Our targets are up. Now I'm going to go to the deployment config that we have for the non-cached version. Perhaps I should explain what deployment configs are.
Diane already wrote her Python code, and she wants the platform to handle creating a Docker image and all of that, and this makes it a little easier to rerun your Spark job: you can just rerun the deployment config by clicking Deploy. To create that deployment config, you just use a template, and then S2I goes and creates that deployment for you. Here's our job — it just showed up. This is the non-cached version. I'll prove to you that it's not cached: in the Storage tab, like I told you earlier, we have nothing, because nothing's cached. You can see that the job is running, and these are the tasks that have succeeded so far. Then we go to Grafana, and everything in the system is pretty quiet, but the CPU usage is starting to pick up, and we're going to see in a second how the CPUs pick up. We've set the scrape interval to a relatively large number, 10 seconds or so; you can actually set the scrape interval to scrape more frequently if you want snappier graphs. Here we can see the workers coming up in the Spark cluster at the top, and they're using about five gigabytes of memory each. That's what we want to save on: by caching, we want to see how much we can improve on that. We have all these different dashboards that we created, and — especially if you love to measure things like I do — it's a lot of fun to make these dashboards; there are kind of infinite things you can do with this. Down here on the bottom: say you're doing a lot of garbage collection. We don't happen to be in this case, but if you are, you can go down and see how the JVM is being used. And this area at the bottom is the storage that's cached.
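Dashboard panels like these are each backed by a PromQL query. A few sketches of the kind of queries involved — the metric and label names depend on your exporters and cluster, so treat these as illustrative, not the exact queries from the demo:

```
# Per-pod memory usage of the Spark workers (cAdvisor metric):
sum by (pod_name) (container_memory_usage_bytes{pod_name=~"spark-worker.*"})

# Cluster-wide CPU usage rate over the last minute:
sum(rate(container_cpu_usage_seconds_total[1m]))

# JVM garbage-collection time, as exposed by the JMX exporter agent:
rate(jvm_gc_collection_seconds_sum[5m])
```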
So if you had a problem — if you were garbage collecting too much — you could see it here. Here we see that the Cartesian product is happening now, because we're up over 19%, 20%; it's a CPU-intensive operation. We have one container per pod, so these values are the same — that's from the diagram Zach showed you earlier, with the workers each having their own pod. That job is over now, so we can go back here. I need to look at the Spark UI: our job finished in 1.2 minutes. That's what's going to improve with the cached version. We have the deployment config for the cached version right here; I push the Deploy button to rerun it. That shows how easy it is to re-drive your job once it's all set up, and we're going to see this cached version of the job come in here. The tools I used in the past for viewing this kind of thing were colmux and collectl, and it was all ASCII-based. You could kind of see what was happening across the whole cluster, but you didn't get these nice graphs, and you didn't have this control over building your own graphs so easily. The latest version of Grafana lets you drag and drop your graphs wherever you want; it's just a really nice interface. There we go — there's the cached version. Let's prove to ourselves that it's really cached: look at the Storage tab — there's the cached RDD. Now we'll go to Grafana. You see that the cluster CPU usage and memory usage are down again between the jobs, and you can see we're scraping every 10 seconds, up there on the right. This is an 8-node cluster, and you can actually look at these statistics per node if you like, or per application.
I'm going to look at all of them instead. The new job is coming in now, and memory usage is about half what it was, at 2.42 gigabytes, so we have the benefit of caching shown here. Like I said, there are all kinds of knobs in Spark that you can change, and then you can quickly see your improvement or degradation here through Prometheus. And when you're running things on cloud workloads, where you're paying by the minute for particular hardware resources, I think it's important to know exactly what's being utilized, to fix the bottlenecks, and to be able to optimize your application a little better. I remember back when I was in college, I did a startup where we built a search engine. We did a lot of experimentation, managed to optimize our code to run more efficiently, and ended up cutting a big chunk off our AWS costs. So it's always good to optimize for performance — you save money, too. The timing on this run was 44 seconds. Any time you're doing optimization, you just don't know where to start unless you have some visualization of how your resources are being utilized. This works for GPUs as well — you can put GPU information into Prometheus too. It just allows you to get the most out of your hardware; without it, you have no idea what's really going on. You have to have some visualization of what's happening across the cluster. So, does anyone have any questions? So: we started with the Python code that we showed, and then we ran it using Oshinko, from radanalytics.io. They have S2I templates that you can run on Kubernetes; my code is sitting out on GitHub, I tell the template that my code is on GitHub, and it creates a Docker image for me.
So, I think I understand what you're asking: you have Python code — let's say it's not Spark, just Python code. You'd import the prometheus_client pip library, put all your metrics inside your code, and then you get an endpoint; you add that endpoint into the config map here, and then Prometheus will be able to go and scrape it. I can't hear your question — I'll repeat the question: it's about how you measure metrics and the different things happening in your application. So there are some interesting projects, and different ways to measure things. For example, there's OpenTracing, which the Jaeger project implements. There are the metrics we just want to get out of the application — we can instrument the application for those — and then there's the view that I don't want my application to be too tightly tied to the external monitors that monitor it. That's one of my concerns: I don't want the code of my application to be tightly coupled to its external monitoring, and if I need to import and write Prometheus-specific code in my application, I think that's kind of an anti-pattern. But you can instrument your code — that's an option — and output metrics in Prometheus format, if you want to do that. Can we take another question in the back? Yeah, we have a question in the back. Thanks. So, I can see where this would be very beneficial if I'm running a consistent workload, to optimize for that workload. Are there any best practices for applications that are processing dynamic data, where you don't know what the exact volume of the data is going to be — so the distribution of what should be storage and what should be execution might change over time? Are there rules of thumb to start with, if that's your use case?
So, for a general Spark application, are there rules of thumb — is that what you're asking? Yes. Okay, yeah, there are a lot of rules of thumb for many of these settings with Spark. For instance, for the number of partitions, you want every one of your cores to get at least one partition — that's kind of a load-balancing thing. And Spark is very memory-intensive, so in general I start very high, if possible, with my memory settings, just to get something running first. I'll start with, say, five gigabytes per worker, that sort of thing; if that doesn't work, I'll double it, and then I'll back down from there. So yes, there are rules of thumb, and there are a lot of knobs in Spark to set, but if you start with the right number of partitions and plenty of memory, you've got a starting point: you'll be able to look at it like this and then tune it down from there. So, I think we've run out of time, but if anybody has more questions, we'll be in the hallway and can have discussions out there. Thanks, everybody, for coming.