Hello, everyone. Thank you for joining our talk today, entitled Supercharged Analytics for Prometheus Metrics with Spark, Presto, and Superset. Just a little bit about us before getting into the talk. We have Rob Skillington. He is currently the CTO and co-founder at Chronosphere. Prior to Chronosphere, he was at Uber, where he was one of the creators of M3DB, and he's also a contributor to OpenMetrics. And then there's myself, Gibbs. I'm currently a developer advocate at Chronosphere, and prior to that I was a product manager at AWS. We're going to briefly jump into a preview of what's to come, to set the context. Essentially, we're going to be looking at Apache Superset, which is a UI that connects to arbitrary SQL backends. In this talk, we'll be connecting it to Presto, which in turn talks to a Prometheus backend. Once everything is set up correctly, arbitrary metric names can be queried against, which allows deeply complex analytical queries to be issued against Prometheus metrics over very long time horizons. But for now, I'll throw it back to you, Gibbs. Thanks. We're just going to go through the agenda real quick. We'll start out talking about how and why you would want to use your metrics for analytics. After that, we'll go into why using your metrics data is challenging with current monitoring solutions, followed by a couple of different ways you can solve this problem, so basically looking at what a solution would look like. Then we'll give a little bit of a demo, followed by time for Q&A at the end. So, a quick overview of metrics. Metrics are mostly used for monitoring and alerting purposes. This is a pretty standard way to view your metrics, in a Grafana dashboard like the one here, where you can see your metrics over time in a variety of different views, which you can then set alerts against.
One thing, though, is that metrics are also very valuable because they carry a lot of historical data, so they can be used for analytics as well. Now we're going to go through a few examples of how you would use your metrics for analytics, starting with cost monitoring. The graph here would basically be an example of a cost report that you might get from your cloud provider. Zooming in a little more on the different diagrams, you can see that over time your EC2 usage is increasing month over month. That's a little different from the other services you're using, which are basically staying steady or even decreasing. Given this pattern, you would want to try to understand why your EC2 usage is increasing over time. One way to do this would be to look back in time and see if any events or changes you might have made could have triggered it. For example, going back to June 2020, let's say you changed all of your EC2 instances from SSD to spinning disk in the hope that this would lead to a decrease in overall usage. As you can see, it had the intended behavior in July, with usage going down. However, in August it spiked up, and in September it got even higher. So you'll want to understand why this behavior suddenly changed after one month. A way to look into this further is to use your metrics and do more granular analytics on them. In this example, we've zoomed in on just the EC2 usage component of the graph from the previous slide. You can see here that the EC2 usage amounts are broken out across four applications, and that application C is the main culprit behind the increasing usage over time.
And that's because, when you did switch your EC2 instances over to spinning disk, applications A, B, and D performed well with that change. However, application C performed poorly and needed more EC2 instances, because it just did not perform as well on spinning disk. Now that you have that insight into the main driver of this increase in EC2 usage over time, you can go back, change your cost monitoring strategy, and revert any of the changes that might have led to those increases. In this example, you would change application C back to SSD, since you found out that it performed poorly on spinning disk. By doing that, you're able to get back to the intended behavior and start seeing the intended result, which is a decrease in EC2 usage over time. Thanks a lot for walking us through that, Gibbs. Yeah, I wanted to walk you through another analytics use case on your metrics. Essentially, there's plenty more we can do with metrics when we want to take a more analytical view of the data they represent over time, which is difficult to do with the monitoring and observability tools out there today. We can perform data science operations on metrics to understand, at a very deep granularity, how different applications, infrastructure, or products, how your actual technology system, is performing, and dissect them using more typical data science style queries. We can look at optimizing our systems, similar to the use case Gibbs was looking at, but with a more granular focus on specific systems and deployments, discovering hidden waste, which is harder to do with just a graph. And we can look at how forecasting can be used with metrics in an analytical setting.
So basically we're going to walk through an example here to extract some probabilities and backtest how different models have been performing under the different systems that you're running. In this example, we'll look at a high-volume, real-time ad-bidding platform. We'll look at, say, the number of ads purchased, the number of ads that led to click-throughs, and the latency to respond to an ad request. Those are the key KPIs we'd be tracking with Prometheus metrics. And these are really high volumes, so for things like latency and highly granular counts per second, metrics are what you'd want to use. We're going to label these metrics with things such as the display ad type, the region, the machine learning model used, and the different input configuration parameters for each model. If we look here, we can see a distribution of the click-through rate, which is calculated as the number of ads with click-throughs after purchase divided by the total ad volume. You can see how the different regions are performing with different display ad types. What you really want to be able to do is look at how the different machine learning models that were being considered to serve your traffic, or A/B tested, performed. You could even do shadow A/B testing, where you shadow only a few percent of your requests to these different machine learning models with different parameters tuned.
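As a rough sketch of the click-through-rate calculation being described, here's how you might compute CTR per region and ad-type segment from counter totals. The metric shapes and all the numbers are hypothetical, and plain Python dictionaries stand in for real Prometheus label sets:

```python
from collections import defaultdict

# Hypothetical counter samples, shaped like Prometheus counters with labels,
# e.g. ads_purchased_total{region, ad_type} and ad_clickthroughs_total{region, ad_type}.
purchased = [
    ({"region": "us-east", "ad_type": "banner"}, 10_000),
    ({"region": "us-east", "ad_type": "video"}, 4_000),
    ({"region": "eu-west", "ad_type": "banner"}, 8_000),
]
clickthroughs = [
    ({"region": "us-east", "ad_type": "banner"}, 150),
    ({"region": "us-east", "ad_type": "video"}, 220),
    ({"region": "eu-west", "ad_type": "banner"}, 96),
]

def by_segment(samples):
    """Index counter values by the (region, ad_type) segment."""
    out = defaultdict(float)
    for labels, value in samples:
        out[(labels["region"], labels["ad_type"])] += value
    return out

def click_through_rates(purchased, clickthroughs):
    """CTR per segment: click-throughs divided by ads purchased."""
    bought, clicked = by_segment(purchased), by_segment(clickthroughs)
    return {seg: clicked.get(seg, 0.0) / total for seg, total in bought.items() if total}

rates = click_through_rates(purchased, clickthroughs)
# e.g. rates[("us-east", "video")] == 220 / 4000 == 0.055
```

In practice the same division would be expressed as a SQL query over the two counters, with the model name and its configuration parameters as extra grouping labels.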
And then, using metrics on these high-volume events, and also using metrics to look at the latency and how these parameters are actually performing, you can see what the click-through rates of the different machine learning models you're A/B testing would be versus a control set. That kind of thing is obviously possible with Prometheus metrics today. It's just that over large time horizons, with very high numbers of models and variables (display ad type, region, and the other dimensions that go into the different segments you want to track), analyzing all those segments across all the model parameters you're testing becomes a lot harder to do on a simple graph, and SQL becomes quite useful here. The second use case I want to walk through is optimizing your systems at a more granular level than even the breakdown into microservices. For instance, with a SQL query you can enumerate every single application, grouped by service, that you're running on, say, a Kubernetes cluster, and work out how much of the CPU quota that you've reserved or requested for each application is actually being used. You can see that over very large time horizons, and by using SQL or the other tools at your disposal in a more typical analytical use case, you can get highly granular information on which services are becoming wasteful over time versus others. And of course, we can use these same types of analytical queries to forecast how many resources your system is going to use. Here's an example of an organization that's forecasting 30% growth in requests per second to their system by the end of the year.
However, if you look at this graph, you can see that growth in requests per second does not translate linearly into more CPUs; CPU use actually grows exponentially. And when you break down that average CPU use by service at an even more granular level, you can see which service, which application in your stack, is contributing to that higher growth of CPU use versus the others. So you know which services you're going to have to reserve more resources for, which then lets you make the decision on whether you actually need to make that service perform better, or whether you have enough budget to just purchase the necessary resources and expand the cluster ahead of time to safely handle that traffic. Dan Luu has done a lot of prior art on this, and I highly advise reading his metrics analytics post on his blog, danluu.com. You can find some really amazing insights once you're able to run these arbitrary queries. He describes one case in which he found a service that was wasting enough RAM to pay his entire salary for a decade. And now Gibbs is going to walk through a bit more of the problem of why this can't really be done today. Yeah. So why is metrics analytics hard? At a high level, this mostly comes down to performance constraints in existing monitoring solutions such as Prometheus. With longer-term queries that span a high-cardinality set of metrics, they either just don't perform well or they time out while running. And this is due to a couple of different things, looking at Prometheus in particular. With Prometheus, there is a max samples limit, meaning there is a limit on the number of samples a query can touch. It also has a default timeout of two minutes, which can become a problem with longer-term queries, as they take longer to run.
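To make the forecasting idea concrete, here's a small sketch that fits a power-law growth exponent to some invented monthly observations of requests per second and CPU cores, then extrapolates to 30% more traffic. All the numbers are hypothetical; an exponent above 1 is the faster-than-linear growth being described:

```python
import math

# Hypothetical monthly observations: requests per second vs. average CPU cores.
# In this made-up data, CPU grows faster than linearly with traffic.
rps = [100, 150, 225, 340, 510]
cpus = [8, 14, 26, 47, 86]

# Fit cpu ~ c * rps^k by least squares on the logs; k > 1 means superlinear growth.
xs = [math.log(r) for r in rps]
ys = [math.log(c) for c in cpus]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
k = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
c = math.exp(my - k * mx)

# Forecast CPU need at 30% more traffic than the last observation.
forecast_cores = c * (rps[-1] * 1.3) ** k
print(f"growth exponent k = {k:.2f}")
print(f"forecast cores at +30% traffic = {forecast_cores:.0f}")
```

Breaking the same fit down per service (one regression per `service` label) is what tells you which application is driving the growth.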
So you may end up timing out your query request. Another constraint is that results are typically computed in memory, so if a query takes up a lot of memory, or is a very large query, it can lead to an out-of-memory condition, and at that point your query is not feasible to run. So what would you need in order to resolve these constraints? When looking for a solution, one thing you'd want is tooling that can help with executing large queries, or queries of arbitrary size. Both Spark and Presto would be good tools for this, as they are built to spill data to disk if a request does not completely fit into memory. By doing that, you can make sure you don't use too much memory when fulfilling these requests. Another thing you'd want is a way to get more insight from your data, specifically through machine learning tooling. Spark integrates with MLlib, so by using Spark you're able to use the different machine learning capabilities within MLlib, which can provide even more insight into your metrics data. It also makes things a lot easier for the end users of your data and analytics, such as your data scientists, as they can now query your data through some of their more native languages, such as Java and Scala. Finally, what we'd want in our solution is the ability to join metrics with other data, since not all of your monitoring data lives in the metric store. For example, you might want to join your metrics data with the kind of Kubernetes cluster node information that you might have in Elasticsearch, or even MySQL.
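Here's a sketch of that kind of join, with SQLite standing in for Presto and a made-up schema: per-pod memory samples joined against node metadata that you'd imagine living outside the metric store, in Elasticsearch or MySQL:

```python
import sqlite3

# Hypothetical example: per-pod memory metrics joined against node metadata
# (instance type, zone) that lives outside the metric store, to answer a
# question neither source can answer alone: memory use per instance type.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE container_memory (pod TEXT, node TEXT, rss_bytes INTEGER);
CREATE TABLE node_metadata (node TEXT, instance_type TEXT, zone TEXT);
INSERT INTO container_memory VALUES
  ('api-1', 'node-a', 500000000),
  ('api-2', 'node-b', 520000000),
  ('worker-1', 'node-b', 1200000000);
INSERT INTO node_metadata VALUES
  ('node-a', 'm5.large', 'us-east-1a'),
  ('node-b', 'm5.2xlarge', 'us-east-1b');
""")

# Join metric samples to node metadata and aggregate by instance type.
rows = conn.execute("""
SELECT n.instance_type, SUM(m.rss_bytes) AS total_rss
FROM container_memory m
JOIN node_metadata n ON n.node = m.node
GROUP BY n.instance_type
ORDER BY total_rss DESC
""").fetchall()
```

With Presto, the two sides of the join could literally live in different connectors, one table backed by Prometheus and the other by Elasticsearch or MySQL.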
So that's what we would want to have in a solution for looking at our metrics data. Now I'm going to jump into a demo here. You can find the demo and all the working code in the repository at github.com/chronosphereio/demo-metrics-analytics. Essentially, this repository has a Docker Compose file and a bunch of Docker containers that I'm going to `make start`, which will create a Presto container, an Apache Superset container, and kick off a whole bunch of setup scripts. It's going to create an admin username and an admin password for Apache Superset and run all the default migrations that Superset runs when it sets up. Then we're going to log in and connect to Presto through Apache Superset. Now that Apache Superset has finished its startup, we'll go in and remove the default SQLite data source and create a Presto data source. It will talk to the Presto container on port 8080, using the Prometheus connector with the default schema. Once we've created that, we visit the SQL editor, which has actually restored the old result I previously had here. We're going to deep dive into this use case of looking at memory. I hit the Explore button from the SQL editor, and now I'm in a visualization explorer. I've chosen the line chart visualization type here, and I'm going to edit the data source, which came from the query that I ran in the SQL editor. I'm just going to remove the limit that was put on there originally, so that we can query arbitrary amounts of data. We're going to look at, say, the past three weeks, but you could do three months or whatever; I just don't want the query to run too long, because this is a demo.
But of course, the power is being able to run really long queries. Here we're going to do a sum on the original query, which, as I showed, was the container memory RSS metric. We're going to filter on the labels to make sure they have a container name on them, that is, the element with the container name key must not be nil in the label map. Then we're going to group by the container name label, and we're going to set a really high limit, one billion to be exact, which should make sure nothing gets cut off. And now we're going to run our query. This query is running on Presto, and if you visit the Presto interface, you can see it's running across a few different parallel workers. You can see it's running multiple splits; it has basically split our query up into 70 different splits that run on different workers, though this is local, in Docker containers, so they're just sharing one machine here. You can see it process the results, and we're now looking at the past three weeks of data. This has grouped everything by container name, so even if you had hundreds or thousands of containers, it would group by the actual deployment rather than the individual pods in that deployment. And of course, the great power is that you can really do this over months or years or any arbitrary amount of time, and Presto will distribute that work. If you've only got a few machines and it runs out of memory, it will simply spill to disk, like Gibbs talked about. It really does allow you to run some really large arbitrary queries over a very, very significant amount of data. And that's looking at it over a three-month period. So yeah, I hope this was useful, and I hope you get to try this at home if you're interested.
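For reference, the demo query has roughly this shape. The sketch below emulates it with SQLite over toy data; the real Presto Prometheus connector exposes each metric as rows of a label map, a timestamp, and a value, so flattening the labels into a nullable `container_name` column here is a simplification, and all the data is invented:

```python
import sqlite3

# Toy stand-in for the container_memory_rss metric table. Rows without a
# container_name label (NULL here) model samples where the label map has no
# such key, which the demo query filters out.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE container_memory_rss (container_name TEXT, ts INTEGER, value REAL)"
)
conn.executemany(
    "INSERT INTO container_memory_rss VALUES (?, ?, ?)",
    [
        ("api", 1, 3.0e8), ("api", 2, 3.2e8),
        ("db", 1, 9.0e8), ("db", 2, 9.5e8),
        (None, 1, 1.0e8),  # sample without a container_name label
    ],
)

# Sum memory per container, keeping only samples that carry the label,
# the same shape as the demo's query in Superset's SQL editor.
rows = conn.execute("""
SELECT container_name, SUM(value) AS total_rss
FROM container_memory_rss
WHERE container_name IS NOT NULL
GROUP BY container_name
ORDER BY total_rss DESC
LIMIT 1000000000
""").fetchall()
```

Against the real connector, the label filter would be written against the label map (for example, checking that the container name entry is not null) rather than a flat column.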
But otherwise, I want to say thank you, and I would love to answer some questions if anyone has any. Thanks.