Welcome, everyone. I'd like to thank everyone for joining us today. This is the CNCF webinar, Integrating Multi-Location ADC with Prometheus and Grafana. I'm Kaslin Fields, a developer advocate with Google and a CNCF ambassador, and I'll be moderating today's webinar. And we would like to welcome our presenter today, Dave Blakey, the CEO of Snapt. Before we get started today, I have a few housekeeping items to go over. During the webinar, you will not be able to talk as an attendee. There is a Q&A box at the bottom of your screen. Please feel free to drop any of your questions in there and we'll get to as many as we can by the end of the session. This is an official webinar of the CNCF and as such, it's subject to the CNCF Code of Conduct. Please do not add anything to the chat or questions that would be in violation of that Code of Conduct. Basically, please be respectful of all of your fellow participants and presenters. Please also note that the recording and slides will be posted later today to the CNCF webinar page at cncf.io/webinars. And with that, I'll hand it over to Dave to kick off today's presentation. Thank you, and thank you for the introduction. Good morning, afternoon, and evening, everyone. As mentioned, my name is Dave Blakey and I work at Snapt. Snapt is an ADC company, which I will briefly explain in a sec, don't worry. But today, as mentioned, we're going to be looking at a fresh setup of Prometheus and Grafana, specifically to measure metrics and performance information from our ADCs. But really, it's quite transferable to any kind of monitoring or graphing or dashboarding that you might want to do. So we're going to be using our product Nova as the ADC to deploy, which does have a community edition. So you're welcome to sign up, play with it, and replicate this stuff yourselves, as well as Prometheus, a popular CNCF project.
I'm sure you guys are familiar with it, and another one, Grafana, of course, a dashboarding and monitoring solution that most people here will likely know. So everything you'll see us do today is available open source, free of charge, etc., and you can play with it yourselves. And like I mentioned, I think really the foundations of, you know, collecting time series data in Prometheus and representing that in Grafana are very transferable, and a great way of monitoring the performance of especially web-based or microservice applications, etc. So one of the hard parts about talking about metrics from ADCs is making sure that everyone understands what an ADC is, right? And where it fits in. And more specifically, in east-west deployments, and the complexities of that, and the metrics and the monitoring in modern kinds of deployments around load balancing and acceleration and so on. So I'm going to spend about five minutes discussing, you know, just what that is, where it fits in, everything like that. And then we will set up an ADC, deploy it, and then get into Prometheus and Grafana, start getting data from it, start drawing up some dashboards, stuff like that. And as mentioned, you know, please pop in any questions. I'll try and answer them all at the end. But if there's something along the way, you know, maybe I can answer it as it comes in, if it's on sort of the page we're on. So quickly, let's get into what an ADC is, right? A load balancer. This is what you all kind of know, right? So a load balancer takes the traffic in, historically very north-south. So it would sit at the top of your stack, accepting your traffic, passing it on to all of your servers, and making sure that it goes to the online ones and so on. In today's world, it's often kind of, I suppose, the mesh between various microservices, multi-cloud locations, multi-location deployments, things like that. And an ADC is really a load balancer plus, right?
So it's kind of a stack of features largely focused on HTTP, so APIs, websites, e-commerce, things like that, that add value to that HTTP traffic that's being load balanced. Specifically, web acceleration, web app firewalling, and then today's focus, telemetry, tracing, monitoring, things like that; that's what we're going to be using the integration with Prometheus and Grafana for. So an ADC is essentially that stack, right? But it provides, I suppose, a lot of visibility, etc., which is what we'll look at. So Nova is the product we're going to be using today, the community edition of Nova. Nova allows you to deploy, scale, manage, and observe any number of ADCs; you can see I put a little note in there, five on community, across multiple locations and environments. So in this kind of space where you've got, you know, load balancers and ADCs all over the show between different applications and different services, you can imagine how telemetry and observability and monitoring and alerting and so on are so critical, hence the reason why we both use Prometheus and Grafana internally, as well as, you know, provide integration for them in the product. Now, like I mentioned, an ADC is a load balancer, a web accelerator, an application firewall, and a platform for telemetry. And really, in cloud native, what those are actually delivering is availability, performance, security, and observability. So you'll see, just by virtue of us setting this up and actually exporting the stats to Prometheus, kind of the setup of an ADC. But obviously, specifically, we're focusing on observability and how we can monitor our application. In order to do that, we're going to be load balancing a couple of web servers, right? Very basic example; obviously in production, this is often deployed across many different containers or many different cloud environments or, you know, on-premise plus cloud, etc.
But obviously, the bigger the deployments, the bigger the problem when it comes to observability and monitoring. So for us, we're going to set up a website, load balance that, and then start to really get into the data of it and how you monitor the performance of it and the reliability of it and so on. So on today's demo: a quick setup of an ADC; enabling Prometheus on Nova nodes, so enabling, you know, the nodes to be able to export to Prometheus; and then collecting that data, how do you get it into Prometheus, how do you store it, etc. And then how do you get that data into Grafana and create reports around it, what to look at, you know, what are the main things we focus on, tips around, you know, kind of predicting problems before they happen, etc. And then Q&A at the end. So I'm going to switch my screen share now to the first window I will be sharing, which is my SSH window. So what we've got here is just a stock-standard, kind of basic Linux box running Ubuntu. And we have installed Prometheus on it, just an apt-get basically to install it. And here we have got a very simple config structure, right, pretty much the basics. So if we have a look at my Prometheus config, you can see here, nothing out of the ordinary; these are the defaults. I've changed the scrape interval to 60 seconds. And then we will be putting our various endpoints into this Nova job once we configure it, right. So this is ready to go, but not really configured, not running or doing anything. And this is where we'll actually collect that data. And this system is also running the Grafana server for us, which we will then report on, right. So we've installed them both on one system, and we'll use that. Now the first step is going to be to set up our ADCs, right, our load balancers. So I'm on our Nova site here. And I'm just going to log in; the login button also lets you register a community account. And here you can see my account.
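To make the config being described concrete, a minimal prometheus.yml along these lines might look like the following sketch; the `nova` job name and 60-second scrape interval come from the demo, but the target addresses and exporter port shown are illustrative assumptions, not values from the webinar:

```yaml
global:
  scrape_interval: 60s   # changed from the default, as in the demo

scrape_configs:
  # The two jobs present by default in a stock install
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
  - job_name: node
    static_configs:
      - targets: ['localhost:9100']

  # The Nova job -- node IPs get added here once they're provisioned
  - job_name: nova
    static_configs:
      - targets: []   # e.g. ['203.0.113.10:9100'] (illustrative address and port)
```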
So I'm not going to spend too much time going over kind of the setting up of an ADC and all of that sort of thing. I'm just going to provision one that we can actually use to get the statistics from. But one of the important things to know, I suppose, is that Nova is obviously a system for managing many ADCs, and the systems that actually run those are called nodes, right. So we will call a node a container that is running our client, a VM, you know, a cloud instance, whatever it might be. Now to do this, what we'll do is spin up a new Nova node in a cloud. So I'm going to use our test DigitalOcean account here, which also, by the way, has the web servers we want to use. So you can see at the bottom here our endpoints. We have two in San Francisco and two in New York. And if I go to this IP address we get just a Snapt hello page. And this is going to be the mock website that we're going to load balance and then collect metrics and statistics for and monitor, etc., right; just this kind of simple page, but it will be good enough for us. So what I'll do is deploy into that same environment some demo nodes, right; let's just say prom demo, and we can just deploy one. So this will set up a Nova node for us, where we can deploy our load balancer, and we'll be able to use that in a second. And while we're doing that I'll set up the backends as well. So a backend is where you're sending the traffic to, and for us we can just create one; we'll say prom backend, and we'll do cloud API. And then we'll specify to send the traffic to port 443. And we are going to use DigitalOcean, and we'll send it to our New York tag on DigitalOcean. So this will just automatically pick up the servers and IPs we want to send the traffic to, right, those sites that I showed you. So if you remember, just the Snapt hello page. Now the next thing to do is to create an ADC. So here I'll say, okay, let's create one, we'll do SSL termination, and we'll say prom ADC, right, and submit here.
So this is just setting up the load balancer. This is where we say, okay, let's use our Snapt wildcard certificate. Let's send the traffic to the prom backend that we set up. And let's turn on some things that we can generate stats for. Like, for example, let's do some HTTP redirects, and we'll turn on the WAF maybe, so that we get some firewall blocks and stuff like that. And then we can save that. So obviously you can configure a whole bunch of stuff here, and you're welcome to play with that yourselves, but it's not really the focus here. And then I will deploy this ADC, right. So when you deploy an ADC, you're linking that ADC to one of your nodes, right. And you can link it to many nodes. And this is where kind of one of the big observability challenges comes in as people scale up, right. So to give you kind of an idea of the space, a bit more of an idea at least, we have clients that have got 200, 300, 400, 500 ADCs in production. So they will have 500 endpoints that their traffic is going to, you know, and they need to monitor the statistics on that. So while this is kind of really applicable to, I think, a single installation, you can imagine, you know, as you kind of scale out and grow, it becomes more and more pressing to be able to do that, right. So you can see here, our node is online, this new one that we've deployed; that's what we're going to monitor. We've also set up monitoring for these three, so that we'll be able to see some more stats, because they actually have traffic, as you can see. And then as soon as this is done provisioning, we can attach the ADC to it. But in the meantime, while that's running, what we can do is copy its IP address. And if I share this window again, you'll see here, we can tell Prometheus to collect stats from that IP, right. So you can see I've got a job name here. I've got a couple of jobs. The first two are defaults, right. So by default, the prometheus and node jobs will exist in your configuration always.
And then the last job here, you can see I've added a Nova job where I've put in my target. So you can use service discovery and all sorts of means to do this, but I'm just putting in a list of IP addresses to keep things easy. And we can put in this new IP address that we got, right, from this new node that we've launched. So we paste that in there, correct the tab indentation, and then we're going to save this. So I can save that. And that will then have our new configuration, right. So we've now got four targets: number one, number two, number three, and our new node specifically there. And we can then restart Prometheus to apply that config, right. So this will tell Prometheus to collect metric information from these four Nova nodes. And that's literally all the configuration. Obviously, if you want to do it automatically and, you know, use service discovery to discover them, like in something like Kubernetes, etc., you can. But for this, we can just use the IP, right. So let's go back here. Let's see if our node is online yet. It is. So we've got this new system online now that we can use. And that means that we can attach an ADC to it. So if we link this ADC of ours to that system, we should then be able to go there, and we'll get an SSL error because I'm using an IP and not the host name. But once we accept that, we should see the Snapt logo and the hello message. And then we'll know that we can generate some traffic to get some stats, right, to actually report on those. And this should take probably about 15 seconds. It should be online in a second. There we go. And now we can actually go there and see that. So let's just copy that IP here again. We take this and we'll go there, and then we'll get our SSL error. Okay, we accept that. Then we see the page. Great, working. So we've got an ADC running. And you can see some of the basic stats here.
We should be able to see some traffic, and then we know that it's going through here, which we do, right? We had five connections, probably a 404 for the favicon. Yeah, we did. Great. That's cool. So we'll be able to report on that stuff. So now we're going to turn on Prometheus for Nova nodes, right? So this will deploy to all of our Nova nodes. It will enable Prometheus exporting on all of them, right? So that'll be rolling out now. Now the next thing to do is to go to the Prometheus user interface, which by default runs on port 9090. So you can see here I am on it, right? This is the default that you'll see after you start Prometheus. So you'll be familiar with this. And if you go to the status configuration section, you'll be able to see our config, right? Here's that config we set up with our new node IP, the one we just tested there. So we know that it's set up and running and it's applied our config after the reload. And then we can check targets, right? So we've got some that are down, some that are up. They'll all be up soon; it checks every 60 seconds. So if one was checked before I had reloaded it, then it will still show as down. But as soon as this gets to 60 seconds, it should pick it up as up. But what we're really seeing here is that it's got the IP addresses here and that it's scraping them. So this status of up means that every 60 seconds, based on our configuration, it's going to go there and get the latest metrics. And then it'll collect everything that it can, right? So it's a standard format. So what we export from that node, for example, is this, right? Like here you can see things like Nova backend bytes in total, and the backend names, right? And here are some timings. There we go. Connect time, right? HTTP connect time. How long is it taking us to connect to the backend? How long is it taking us to get a response? And you can see it per backend, right?
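For reference, what the nodes expose is the standard Prometheus text exposition format; an illustrative sample might look like the below. The metric names here are approximations of what's read out on screen in the demo, not exact Nova identifiers, and the values are made up:

```
# TYPE nova_backend_bytes_in_total counter
nova_backend_bytes_in_total{backend="prom-backend"} 48213
# TYPE nova_backend_http_connect_time_average_seconds gauge
nova_backend_http_connect_time_average_seconds{backend="prom-backend"} 0.00086
```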
So a backend being one of these per-backend entries that we set up here. So what we're going to look at is our prom backend, right? Which is actually sending traffic to our web servers. And so those are the stats that we want to look at. So we can start to do that already. So what we can do here is we can say, okay, nova, and Prometheus autocompletes for us, right? With whatever data it has in it. So maybe let's do something simple like the response time one I was mentioning, right? Like how long does it take to get an HTTP reply? So when I refresh this page, you know, what's the delay from the web server generating the page? Often a great metric to look at, right? And we can execute that, and then we'll get all of those. So if I was to look for, what do we call it? We called it prom backend. So let's look here, prom backend. There we go. We see that there. And we see it's currently zero, right? So either it hasn't collected any information yet from my various page refreshes, or the response time is actually zero seconds, you know, if it's very small. But it won't stay small for long. So we'll start to get more and more data there. But you can see I've also got some other nodes here, like I mentioned, some production ones where we can actually get some real stats from, and we'll have a look at those soon. So in total, we've got four nodes deployed, each running at least one ADC. And we'll be able to look at that. And so we can graph stuff like that. And you can interact directly with Prometheus to do things like this. So we can say, okay, what about bytes of traffic, right? How much traffic is actually going through? And so you can see here, we can say, okay, front end bytes out. Whoops. Front end, front end bytes. Oh, could you make the font a bit bigger? Yeah, absolutely. How is that? Much better. Thank you. Great. Cool. So front end bytes out total, right? We're going to execute. And then we can see all of this. And we can switch to graph.
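As a sketch, the kinds of expressions being typed into the Prometheus expression browser here look roughly like the following; again, the metric and label names are approximations of what appears on screen rather than exact Nova identifiers:

```
# Average HTTP response time for the demo backend
nova_backend_http_response_time_average_seconds{backend="prom-backend"}

# Raw front-end traffic counter, across all nodes
nova_frontend_bytes_out_total
```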
So we can see here, it's starting to store all of our statistics, right? Obviously, we're not going to actually graph all of this stuff in Prometheus itself. But this is great for testing, right? You can see which systems it is, etc., and what's going on. And the other thing you should notice is that this is a total counter, which means that eventually we want to look at this as a rate. Because, you know, it's going to say, well, we've had 10, now we've had 20, now we've had 30. That means 10 per interval, right? Not 30. And so we'll account for all of that kind of stuff soon. So this is really just for your debugging, your monitoring and so on, and checking that your nodes are online, right? So you can see, okay, cool. We're collecting all of that information. We're getting it all from those nodes. And when we send traffic there, like for example, let's generate a 404 page not found. And let's generate a block, like if I put something like /etc/passwd in a test parameter. Then we get blocked, right? So we'll have a block in the logs as well. Now the next step, now that we've got Prometheus set up, it's storing all of this information, right? So for anyone who's not that familiar, it's a time series database. So that means that, you know, it's going to store this according to time, basically, and we'll be able to do historical reports on it as well as look at the live data from it. And to do that, we're going to use Grafana. And this is my Grafana dashboard. I wonder what zoom I should run this at. Probably that's pretty good. So this is a blank one. Again, this is what you get if you just install it; go to the website, see the installation instructions, and, you know, on Ubuntu, I used apt to install it. But you can create a dashboard now. And your dashboard is going to be where you put your graphs and your tables and your stats that you want to monitor for this. So you can have many dashboards; you can have, you know, one with more detail, one with less detail.
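The counter-versus-rate point above is worth making concrete: a counter only ever accumulates, so the useful number is the per-second change between scrapes, which is what PromQL's rate() computes. A tiny sketch of the idea in plain Python (not tied to any real Nova metric; counter resets are ignored for simplicity):

```python
def counter_rate(samples):
    """Per-second rate from a monotonically increasing counter.

    samples: list of (timestamp_seconds, counter_value) pairs,
    oldest first. Mirrors the idea behind PromQL's rate().
    """
    t0, v0 = samples[0]
    t1, v1 = samples[-1]
    if t1 <= t0:
        raise ValueError("need samples spanning a positive interval")
    # Delta of the counter divided by the elapsed time
    return (v1 - v0) / (t1 - t0)

# Scraped every 60s, the counter reads 10, then 20, then 30:
# the meaningful figure is (30 - 10) / 120 seconds, not "30".
print(counter_rate([(0, 10), (60, 20), (120, 30)]))
```

This is why the Grafana panels later in the demo wrap the raw byte and response counters in a rate over a five-minute window.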
Maybe you've got ones for ADCs and, you know, one to monitor the web service. In our example, though, we're going to be looking at the ADCs specifically, right? So we created a dashboard. And maybe we can say, when I save it, we'll say demo prom dash. And we save, and that gives it a name. And then we can create, whoops, we can create a new panel here and say, cool, add a new panel. Now this is where we select the data we want and then create the graph, the visualization, right? So we'll create, you know, some graphs, maybe some gauges; you can do tables and things like that. So this is where you start to say, okay, well, you know, when you think of an ADC, especially a web-based one, what information is really important, right? So obviously the first thing that comes to mind is, well, are all my web servers actually online? Are they running? You know, when were they last down, things like that. And so that's quite easy to do, and it's quite a nice way of showing some of the different features of Grafana. So if I was to say nova underscore, you can see you can scroll kind of through all of the various options that it has, right? So it lets you, you know, search with autocomplete and stuff like that. And you can say, you know, backend, and then you can see, you know, what the backend options are and so on. And what I will do, sorry, I think it's better if I zoom out here. What I can do is say, okay, nova status, let's just say status. No, nova, where is what I'm looking for? Here we go, server. That's what it is. Server up, right, is our query. And you can see here, we get a whole bunch of data, right? We're looking at too much information. But what we can do is say, okay, let's look at that as a table and instant. And then for our visualization, let's change to a table as well.
So you can see here now, these are all the records that we have from the various Nova servers; like I mentioned, there's a few, and you can see all the IPs here, and the server name and the value, right, one being up, obviously. Now for us, we want to look at our prom backend, right? And we know that that backend is called that. So we can say, okay, well, let's filter this; well, let me actually show you the autocomplete. So it pops down, right? And you can say, okay, I want to look by server, by job, by instance, by backend. So obviously, Nova is putting in the backend label. And we can come in there and we can say, okay, for that backend, let's see the data. And there we see now, okay, we've got two of them, right? So the reason we've got two is because there's two servers, prom backend zero and prom backend one. And those tie back here to these two systems. So if on our ADC management, I come and have a look here, let me zoom this one in a bit as well. You can see here on node prom demo zero, which we created, we've got two upstreams. And we can come in and see those here, right, prom backend zero, prom backend one. Of course, you can name these and so on, but we can see that coming through now from Prometheus into Grafana here. And what we can then do is start to filter the status and start to apply effects. So if you remember, our original goal was to say, okay, well, let's just make sure they're online, right? Is the ADC saying that these services are functioning, which is on port 443, right? And you can see it returns the value one. Now that's not that useful. Let's rather say, okay, let's add a value mapping, right? And if the value is one, let's make that text say up. At the same time, if the value is zero, we can make the text say down. And then we can say, okay, if the value is zero, let's make it red. And if the value is one, let's make it green. And the way we will color them is with a background, right?
Now we've got up or down, and it's green or red. A great way of simply seeing the status of the backends, right? These two web servers that we've got. And obviously, you know, if you had 200, of course, it would be the same thing. There's a bunch of options here, of course, right? You can color just the text, you can color the backgrounds, you can get really fancy. But this is quite a nice way of showing you something that's not just a graph, right? It's, you know, actually the status. So let's say, you know, web server status, right, is what we'll call it here. And we can then apply this; we can save it, actually, and apply. And here we have our little kind of web server status output. And if you had multiple ADCs, you would see them for each ADC, right? So the next thing we say is, okay, well, what about the performance? You know, what's the throughput? How many requests per second are we getting? Things like that. We want actual graphs, metrics and stuff like that. So you can add another panel now, right? And obviously, you get to know the various features, the things that get reported, but you can also kind of just explore, right? So we could say, okay, well, we want to look at throughput. So let's just see what bytes values there are. And it will give you everything that matches bytes, you know, so you can see all of the various Nova ones, for example. And we can say, okay, Nova front end bytes in total. So let's say again, okay, well, what fields can we filter on here? Because we want to look specifically at this ADC of ours, right? This one that we set up here, which we know is called prom ADC. And we can then filter on that, right? So we can go through, you know, what things we've got set up and which ones we want to manage, and we can see the configurations, etc. And I happen to know that it's this one, but you know, you can filter on that. And we can get that data in here. Okay. So we see here, on this backend, we've got our bytes in.
And we can call that bytes in. Now you'll see, correctly, it's saying this seems like a counter, right? So what Nova gives you is your total number of bytes, and Grafana handles that for you; well, in Grafana, you can say rate, five minutes. So that's going to give you the rate of traffic, right? This is bytes, like the bytes per second, you know, sampled over the last five minutes. So we can have a look here now and we see, you know, this starts to make more sense, right? We're starting to see the data there. And then we can say, okay, well, that's bytes in; let's duplicate that and say bytes out. And we'll name that bytes out. And then we can say, okay, this is looking pretty good already. But I want to know total throughput. So let's stack them. And we can come here and we can say, okay, we want to stack. And this will now say, okay, cool, this is giving us our total bytes in and bytes out, our throughput for this ADC, right, for our website. This is how we know how much traffic is actually going to our websites. And we can monitor that live, right? So, you know, we're saying last six hours now, but we could say, okay, let's drill into the last 15 minutes. And we can then see, you know, some nice data. Remember when I was sending traffic, obviously. And if we send some more data, we will start to get that coming through here. I'm just making a bunch of requests. And then, you know, it scrapes every 60 seconds; that's obviously configurable, you can set it to whatever you want. You'll start to see more and more data there. Right. So now we're monitoring, okay, what's the status of our servers? What's the throughput of the servers? And now we might want to say, okay, well, what's the quality of the service, right? And there's an easy way of doing that as well. Well, there's a few easy ways. So what I like to do here, and again, this is all personal preference, right? But what I like to say is, okay, I want to know two things.
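The bytes-in and bytes-out panels described above boil down to two rate() queries over the front-end counters, stacked in the panel; roughly like this, with the metric and label names again approximated from the demo:

```
# Inbound and outbound throughput for the demo ADC, per second,
# sampled over the last five minutes
rate(nova_frontend_bytes_in_total{adc="prom-adc"}[5m])
rate(nova_frontend_bytes_out_total{adc="prom-adc"}[5m])
```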
Firstly, how long is it taking the ADC to connect to the web server? And how long is it taking the web server to reply? That tells me two things. The first is, do I have a network issue, right? Like if my connect time shoots through the roof, I might have network latency, something weird like that. Or if my response latency shoots through the roof, then I know that the web server is taking a long time to generate my page; maybe the database is slow, maybe it's my authentication microservice, you know, whatever; we start to dig. And you can imagine if you've got an ADC sitting in front of each of your microservices, you can really start to get a nice kind of dashboard built around saying, well, you know, I can see my website's speed or my API speed deteriorated by 50 milliseconds. And I can see that at the same time, the authentication service deteriorated by 50 milliseconds; it's clear what the problem is, etc. Right. So, for this, we want to say, okay, connection; let's see what pops up for connection. So here we can say, wait, I actually want to say Nova backend. What is it called? It's not connection, it's connect time, connect time average seconds. Right. So this is going to tell us what the average time in seconds is. Obviously, it's some fraction of a second in most cases, unless you've got real problems. And it's from the ADC to the backend, right, the network delay. So we can say, okay, we want to see that. And we want to see it for, right, what does it give us, backend? Okay, cool. We've got our prom backend again. So let's see it for that. Right. And we'll call this, you know, network delay. Call it whatever you want, obviously. And for this, because we've just got this one number, and the current status of the number is quite important to us, we can switch to a stat panel, right? Now this is showing us, what is this really, right? That's in seconds.
But I know that it's actually milliseconds that we want to be reading this in. So I could multiply it by 1000. And I could say, okay, cool. This is our value now, right? 0.86 is the average delay in milliseconds connecting to the backends, which is great. If we see that spike, then we'll know we've got a problem. And you can start to, like, you know, format that time. And so I can say unit here, we can say time, and then we can say milliseconds, right? So you can see, you know, how you can really kind of graph everything. And what you can do is you can say, okay, well, if it was ever above one, then that would be a problem. So, you know, for the example here, let's just say 0.5; you can see it would switch to red, right? Because it's too slow, you know. So I might say, okay, well, two is my window where I, you know, need to be worried, and, you know, things like that. And you can have lines across the dashboard so that you can see that, you know, whatever you might want to do. But that's my one metric, you know, so let's call that connect time, that I decide is important for us to, sorry, for us to, you know, measure kind of the health of the system. And I also want to, let's duplicate that, measure the reply time of my backends, like I mentioned, right? So here we can say, okay, HTTP response time average. And then we see the same thing, right? So we get that same data in the same way, and we can monitor the same things. We must change the name of that, of course. And here we'll say response time. Don't worry, I'll show you a server just now that actually has high response time so you can see. But we start to get all of that kind of information here. We can save our dashboard just to keep things fresh. And if we refresh there, you can see a bunch more data coming through and so on.
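The two health stats end up as simple expressions; the times-1000 seconds-to-milliseconds conversion can live in the query itself, along these lines (metric names are approximations of what's shown in the demo):

```
# Connect time ("network delay") from the ADC to the backend, in ms
nova_backend_http_connect_time_average_seconds{backend="prom-backend"} * 1000

# Reply time of the backend web servers, in ms
nova_backend_http_response_time_average_seconds{backend="prom-backend"} * 1000
```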
Now, the next thing, the final thing that I'll show you actually creating and setting up, before I show you a dashboard that, you know, has some real data on it, is the quality of service of that server, right? So one of the great things about HTTP, and you know, this applies from APIs to websites to microservices to whatever you might be doing, is that it's very easy to monitor the telemetry of an HTTP service, because you can monitor the response codes. So the ADCs report on what reply code actually came back from the web server. And that lets you do a whole bunch of stuff, right? So we can say here, nova backend, and let's browse here. And we can say, it's HTTP responses total. Yeah, then we can say, okay, for that backend, right, let's see the total responses, right? So here we see kind of all of our data stacked up, etc. And then we can say, okay, for the code, where the code is 200, right, 2xx, that's how we will show that. Now, this is actually a rate again, you know; we want to be looking at the amount per second. But here we can see, okay, you know, how many 200 replies are we generating per second, right? And you can understand why, you know, monitoring these things and alerting on these things would be great, because what we can do is we can say, okay, well, how many 500 replies are we generating? And we want to color that red, you know, because that's dangerous; a server error is a 500 reply. And we can monitor 400s, you know; you can group all these things however you want. But, you know, let's just say we wanted to monitor these two things. Let's restore my color red there. And we can save that like this, right? And we can put in all sorts of things. And you can create alerts on these as well, right?
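The reply-code panel sketched here is one rate() query per code class; something roughly like the following, with the metric and label names approximated from the demo:

```
# Healthy 2xx replies per second
rate(nova_backend_http_responses_total{backend="prom-backend", code="2xx"}[5m])

# 5xx server errors per second -- the series to color red
rate(nova_backend_http_responses_total{backend="prom-backend", code="5xx"}[5m])
```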
I won't get into all of that now, but you could set it up to fire an alert if, say, the 500 reply rate went over a certain number, or if you were concerned about a spike in traffic, all that kind of stuff. You can stack these, obviously, because they're cumulative, and so here you see, well, now my red line's on top. So you can change and tweak all of those things: we can adjust the fill, because it's completely filled in here, so that now whatever color we see tells us the state. If we see red, we're in trouble; if we see green, we're safe. And this would be reply codes. And you can see, I mean, I'm literally going through the discovery process of finding this data with you now on this demo, and that's one of the great things Prometheus and Grafana enable, that kind of discovery process. I'm saying, okay, I'm interested in the performance of the front ends, the traffic they're getting, so let me see what's happening here. So we can say, let's just show current sessions, we just found it right now, we want to see what the current sessions are on the system, and we can graph that. So you get the idea: this is how you set things up, and this is how you manage them. You can have multiple dashboards, you can get fancy, and you can do things like your percentiles and your historical information; you can compare, you can look for anomalies, all sorts of stuff like that. But we have so little data here to really play with. So what I also have set up is these two ADCs here, which have done a bit more: they've got some traffic going, and their nodes actually have traffic flowing through them, which we can see here, 224 and 226 connections. So we've got some data going through them.
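The alerting idea mentioned here could look something like the following Prometheus rule file. The metric name, label, and threshold are all illustrative; the rule-file structure itself is standard Prometheus.

```yaml
# alert-rules.yml: fire when the 500-reply rate stays above 1/s for 5 minutes
groups:
  - name: nova-adc
    rules:
      - alert: HighServerErrorRate
        expr: rate(nova_backend_http_responses_total{code="500"}[5m]) > 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Backend 5xx reply rate above 1/s on {{ $labels.instance }}"
```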
We're also collecting from them, right? We saw that in Prometheus when we looked at targets: those additional nodes there, that's what these are. So we can have a look at that now. What are our dashboards? Yeah, let's save that one, that's fine, I thought I had. And then we come to this example Nova dashboard. This is a system where you can see there are blanks in it, because I had it turned off earlier when I was testing. But if I just go to the last 15 minutes, we can see more recent stats. This is a system where we actually have some simulated latency and such. Let me bump that up a bit. Yeah, there we go, some spikes in the connections and latency and so on. And so here you can see the various status codes that are coming back. What I've set up is a table on my HTTP replies graph. Let me make that big. Yeah, here we go. On my HTTP replies graph, we can see there are definitely some 404s coming through, because there's a little bit of yellow on the graph. But that's fine, that's pretty much expected. We've had no 500 errors, no server errors, right? That would be a critical thing we'd want to monitor for. And this gives us an idea of how many replies per second we're doing; we're monitoring our throughput here. So we can see all of that: how much traffic in, how much traffic out, in kilobits per second, and it handles the formatting and all that. How many sessions do we have? What's interesting about this example is that for the shop ADC, we've actually got two ADCs deployed, because it's an autoscaler. It will deploy any number of ADCs, but the monitoring will pick that up, and then we can monitor the data from both. So when you look at my shop sessions here, I'm looking at a stacked graph from both of the ADCs.
So in this way it's very easy to see, say I've got five or ten ADCs, perhaps two on the West Coast, two on the East Coast, a backup one in London, my total cumulative amount of traffic going through the system, and my response times from each of them. This is a great example of one where I've set up thresholds: here we see that from this one the average response time is 140 milliseconds, and on this one it's 137. If it goes over 1,500 that would turn orange, and over 2,000 that would turn red, and so on. Our latency on the system, our connect and response latencies, the statuses, etc. And you can see, from what took about half an hour, how much information you can start to collect and report on. And once you get to the level where you've got multiple components behind ADCs within your infrastructure, let me give you a real-world example. You might have a West Coast and an East Coast data center, with replicated environments in both. So you've got your north-south ingress: data coming into your environment from the internet, your users hitting your website or your API. And then that API sends requests off to various services within your stack. Maybe you've got a couple of pods doing your user authentication, maybe you've got some database stuff, somewhere you store certain information, maybe even a Redis cache, things like that. And then you can start to graph and collectively view the latency that each of those components is contributing, because you can ask: what's my total reply latency, and what's the stacked latency of each component in my stack, each microservice that I have?
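To make the multi-location idea concrete, queries along these lines give you the cumulative view and the per-node breakdown at the same time. The metric names are illustrative, assumed for the sake of the sketch rather than taken from Nova's documentation.

```promql
# Total sessions across every ADC node, whichever region it runs in
sum(nova_frontend_current_sessions)

# Average response time per ADC node, for spotting the slow location
avg by (instance) (nova_backend_http_response_time_average) * 1000
```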
And then you can easily see which pieces are contributing the most to that pie, and you can monitor for changes, differences, and variances, right? That's the biggest shift, I think: the world of load balancing used to focus very much on whether something was up or down, but now slow is kind of the new down. And especially in a microservice environment, tracing and monitoring all of these things is very difficult, and I think you can see how exporting that data to something like Prometheus and Grafana allows you to really play with it, manipulate it, look at it, et cetera. And what's nice about this is that most of our, I suppose you might use the word DevOps, DevOps-like clients will probably be running a stack quite similar to Prometheus and Grafana, and they can easily manipulate that data and integrate it right into other monitoring that may sit outside of your ADC, et cetera. But yeah, this is a look at basically how you can configure a bunch of ADCs to export statistics like I showed you here, these pages being what they provide, and then pull that into Prometheus, where you can look at all of the various data, your bytes in and whatever you might want to do. And then you can send that data to Grafana and report on it, do some pretty powerful stuff, get a lot of insights, a lot of monitoring. And it's very easy to customize. What's great is you can duplicate this entire dashboard as well: you might save it as a copy, and then a different operator, a different person, a different team might want a different view of the components and pieces that are important to them. It's very easy to do that kind of thing. It's a very powerful tool, as I hope you will have clearly seen from this.
And yeah, that's a look at what I wanted to show you. The most important things we're trying to look at here are the status of the system, is it up or down; the performance of the network and the server, what's our latency to the servers and what's the latency of the servers responding to us; and the quality of those replies, are they HTTP 200 replies, i.e. successes, or 500 replies, errors? And how do we monitor the total throughput, the variance between those things, the latencies and so on, in order to ensure the system is running well and ideally pick up problems before they happen. So if there are any questions, please send them through the Q&A section. Obviously there are also many ways of contacting us. But feel free to play with it yourselves: you can sign up and deploy these things, and everything I've shown you is available for free. These are open source projects, and you can play, learn, and test with them. And I think that's it from my side. Awesome, and thanks for that awesome presentation, Dave. I did want to ask, you kind of covered it a little bit, but if people want to check this out or learn more, what are the best resources they can go to? Yeah, absolutely. You can go to our website, nova.sample.net, to register. Prometheus and Grafana are also obviously public projects that you can have a look at; you can check them out on GitHub even. But if you come here and click register, you'll be able to sign up for free, and then you can play with it and use everything that I've shown you. There's also a docs section where you can browse the documentation, play with it, and see what's around and all of that. And throughout the site there are ways of contacting us, especially for things like this. Don't be shy about reaching out to us.
We'd love to chat, hear your suggestions, feature requests, or any problems you have with the stuff; we're always excited to deal with people in the CNCF bubble, if you will. There was, yeah, there's one question here about recording. There will be a recording available of this presentation shortly. It'll be on the CNCF's website at cncf.io slash webinars, so you'll be able to find that shortly after this. Can I give an idea of a typical network or node loading, the kind of load that Nova would generate on a system? Yeah. So ADCs are kind of an interesting space. Anyone who's familiar with ADCs will know that they come from a hardware world: ADCs used to be big rack-mounted beasts of servers, and that doesn't really fit in the Kubernetes, cloud native, container world. So we spec our ADCs at one core and two gigs of memory; that's our typical spec. And from that, you can generally get around 100,000 new requests per second; SSL will be about 10,000 requests per second. It's almost entirely CPU bound, so the more CPU you put in, the faster it goes. So at about 100,000 requests per second, you're looking at being busy. But to give you an idea, at probably 10,000 requests per second or less, you're hardly going to notice the load from Nova at all; it's quite efficient. The next one: after I link Prometheus to Grafana data sources, will the metrics automatically show in Grafana? For me, it's not showing when I try my test environment. Did I do anything wrong? Yeah, well, this is where, yes, the metrics won't show in terms of a report, but they will show in your dropdown, which I'll show you in a second. But the first step is going to be to make sure that they're actually there in your targets, right?
So if I come to Status and then Targets in the Prometheus user interface here, I can see that the Nova group is up and that there are four nodes reporting in, and I can actually go to those and see what they are. For example, I can see here that it gives nova backend compressor bytes out total, so I should be able to take that and actually look for it. The next step in debugging is to go to Graph, and in the expression here you can put that in; you can see it's already trying to auto-complete it for me. So I can see that I'm actually getting this data. If you're not getting this, then the scraping is not working properly; it's not collecting the data. If you are getting this, then it should definitely be in your Grafana. What you need to make sure of, let's discard that, is that you've connected Prometheus as a data source. So if you see, I came here, to Configuration and then Data Sources: you just add a data source and say Prometheus. In my example, like I said, I was running both on the same system, so it's just localhost:9090, but however you connect to it. And once you've done that, if you've got the list of data here and your data source configured here, then you should see the auto-complete when you add a panel. And that auto-complete is in here: in your PromQL query, when I typed nova and you saw it popping up all this stuff, that means it's discovering them all. So those would be the steps I would take to find out which piece is missing, basically. The main thing is really: check your scraping, check you're getting the data there in Prometheus, and then make sure you've added Prometheus as a data source in Grafana. How well does it work with an AWS auto-scaling group? Yeah, so the ADC, it's quite interesting.
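For reference, the scraping side being debugged here is controlled by `prometheus.yml`. A minimal job for a set of ADC nodes might look like the following; the job name, hostnames, and port are illustrative, not Nova defaults.

```yaml
# prometheus.yml (fragment): scrape the Nova ADC nodes every 15 seconds
scrape_configs:
  - job_name: "nova"
    scrape_interval: 15s
    static_configs:
      - targets:
          - "adc-node-1:9100"
          - "adc-node-2:9100"
```

If a target shows as down under Status, Targets, fix the scrape first; only then is it worth checking the Grafana data source, which for a local Prometheus points at http://localhost:9090.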
You will have seen that I added my backend as a cloud API backend instead of a simple one. A simple ADC backend is IPs and ports: where you want to send the traffic. Instead, I actually used the cloud connection, and specifically the reason we do that kind of thing is service discovery in clouds. In AWS, for example, if you've got auto-scaling, there's no way of knowing in advance what IPs will be online to send your data to; even if it's two now, it could be ten, and then some could be shut down, some could fail, some could be replaced. So with AWS, what you would do is choose an AWS cloud connection here, and then instead of a tag, you can select one or more AMI IDs. So you can say: these Amazon machine images that I'm auto-scaling, I want to send traffic to however many instances of those images are actually running. It will continuously poll the EC2 API to keep track of the situation. And the ADC itself, if that's what you mean, can also auto-scale. You'll see here, under providers, auto-scalers, you can deploy it as an auto-scaled group that will scale up and down as it determines is required based on load. You can see here it reducing or increasing however many instances it should run, etc. Is there any documentation for Grafana and Prometheus queries that I can follow? There is quite a lot. From Nova's point of view, it's quite a new feature, so there's not that much on it, to be honest with you. We publish what we export, and you can see what you can get from there, just like you see all of the data on this list. But Prometheus and Grafana both have great websites, and there's a lot of online information around them, because they are both, I'm pretty sure, graduated CNCF projects.
So, having gone through that funnel, there's a huge amount of people doing webinars, example setups, and things like that. But to me, the great thing about this is that you can literally just install them on a very cheap cloud instance or a local VM, or stick them in your Kubernetes cluster if you've got one running, and just start to play. Because by default, Prometheus is going to collect information about the system it runs on. You see here node underscore: that's localhost, the data from the local system it's running on, like CPU seconds, disk writes, all of that kind of stuff. So without any configuration, without any external service like Nova or anything like that, you can start to graph this stuff, like your memory usage and things like that. And that's a great way to play with it, basically. Are there a set of dashboard templates or a library which can be readily used, the general metrics that an IT apps team would be interested in? Yes. This is something that Grafana does quite well: you can share panels, and you can also share dashboards. There's a public list, you can see here I can import, where people put them up; you can search for that on their site. And as far as Nova ones that we actually provide: yes, but we would provide those to you directly. You can reach out and we can give you some of our example ones; we're still preparing ones to publish publicly. Our Prometheus integration is about three weeks old or something, so it's quite new. But yeah, you can share all of this stuff, and what you ultimately get at the end of it all is this kind of link where you can embed stuff, but I can't work out where it is now.
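As an example of that zero-configuration data, these are standard node_exporter-style metrics (the `node_` prefix mentioned above) that you can graph immediately, assuming the node metrics are present as in the demo:

```promql
# Memory actually in use on the monitored host
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes

# Per-mode CPU time as a rate, e.g. to chart idle vs. busy
rate(node_cpu_seconds_total[5m])
```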
You can actually get the graph as just a JSON string, so it's very easy to share internally, basically. So, besides application monitoring, can Prometheus also help with OS or system parameter monitoring? Yeah, absolutely. This is where Prometheus is great: the default kind of configuration is that you can put it in something like Kubernetes, use service discovery to monitor all of your active pods and nodes and so on, and collect all of that information. You will have seen, while I was answering another question, that the node underscore stuff by default is the local system. So disk read bytes total is actually the system that this is running on; you can see that here. But however many targets you set up, you can then monitor them centrally, so you get all of that data into Prometheus, and then you can graph anything, like I was showing you obviously in Grafana. Or, an interesting tidbit for you: our graphs, stuff like this that you see on Nova's graphs, are actually powered by Prometheus. So you can also send manual queries directly; it doesn't have to go through Grafana necessarily, and you can integrate that into your own tooling and all that kind of stuff. And you can set up alerts, so you could say CPU is too high, anything like that. And there are Prometheus exporters for basically everything. So yeah, you can definitely do that. And then the last question was: do you have a Grafana dashboard that people can import? Another person has asked the same thing, and absolutely, it's a good point, we should share some example ones. Our internal ones have percentile tracking and quite a lot of cool stuff that takes longer to set up than this kind of quick demo that I did with you guys. So we will definitely publish that on our site.
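The manual queries mentioned here go through Prometheus's HTTP API (the `/api/v1/query` endpoint), which is documented and stable. A minimal Python sketch, assuming a Prometheus reachable on the default port; the helper names are my own:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def prometheus_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for Prometheus's HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def extract_values(response_body: str) -> list:
    """Pull (metric name, value) pairs out of an API response body."""
    data = json.loads(response_body)
    return [
        (r["metric"].get("__name__", ""), float(r["value"][1]))
        for r in data["data"]["result"]
    ]


# Example (requires a running Prometheus; 9090 is the default port):
# body = urlopen(prometheus_query_url("http://localhost:9090", "up")).read()
# print(extract_values(body))
```

This is the same mechanism a Grafana panel uses under the hood, which is why anything you can graph there you can also pull into your own tooling or alerting scripts.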
Maybe in the next couple of days; I think that's clearly a good idea. Yeah, absolutely. And that's, I think, the end of the questions. Awesome, thanks so much for showing off such a new feature. Yeah, I know it can be intimidating to show off something so new. Yeah, well, you never know; it was 50/50 whether it would work or not. Part of the ADC game is making sure that when you release something, it works. Always important. So thanks so much for joining, everyone. Like I mentioned, the recording will be up shortly; you'll be able to find it on the CNCF website, cncf.io slash webinars. If you signed up, you may also receive an email, I'm not 100% sure on that, but it will be up on the website at least. And if there are no more questions, we'll go ahead and close out. Thanks so much for presenting today, Dave; it was great getting to see that. Thank you, thanks for having me on, and thanks to everyone who attended. Thank you.