thing for me today. Oh nice, nice. It means it is truly one of the things that people ignore. Possibly. It definitely says that I have ignored it, but I cannot speak for other people. So glad to know that we are able to share something that is not common.

Hey Pranay, thanks for joining me. Oh, we are live. Awesome. Cool. So, hi everyone. Welcome to the June edition of our Observability Meetup. As most folks here might know, we have been doing this for more than one and a half years now; I think we started sometime in June last year, 2020, in the middle of the pandemic. Pranay and I had this idea while walking together that we should do an exclusive observability-centric meetup here in Bangalore, with all the work people are putting in, in and around India, and also from other open source organizations. Initially we had a lot of support from Grafana, so we're still grateful to them for bootstrapping us, and now we are in our second year. So glad to have you all here.

Today we have two talks lined up. First is by Ashok Shastri from Cisco, who will be talking about the entire observability landscape in and around Cisco, what goes on there. And I can guarantee you it's not your standard observability stack. From what I personally heard from him during a banter session about two months ago, during our last meetup, it was quite interesting and I had to invite him over to talk about it. Our second speaker will be Suresh from Grafana. Suresh has been a personal friend for quite a long time now; we have hung out in multiple meetup groups and multiple conferences. He has since joined Grafana after Karecite, and he's been doing some pretty interesting work there. He wants to share one of the core things he has been working on at Grafana, which is called synthetic monitoring. I'm being honest here: this is a term that is also very novel to me, and I'm as interested as any of you to learn about it, because I have absolutely no clue what it is about. So it will be a completely fresh, empty-cup learning experience for me too. Looking forward to both the talks and glad to have all of you here. There are at least 10 of us here and at least four people watching on YouTube, and that's quite a turnout, especially when we are coinciding with the KCD Bangalore meetup, the Kubernetes Community Day. So thanks for turning up, everyone.

Cool. Without further ado, I will hand over the stream to Ashok. Ashok, if you could kindly give us a short intro about yourself, and then we can segue into your talk.

Before we go ahead, how can people ask questions? Do they post in the chat? Absolutely. We are not doing a webinar; we wanted to get away from that format for the attendees here. So you can type questions directly in the chat itself. Pranay, would you suggest bringing up the other question platform that we used to use, the Slido one? I don't think we have that many people, so we can just do chat. So all of the attendees, you can type your questions into the chat and we will unmute you at the end of the talk. After every talk there will be a Q&A session, and the speaker, one of the attendees, or one of the moderators can address your question.
Based on their merit, of course, we will try to refine the questions into addressable ones, and most of those will be answered, I hope. And for the folks attending the meetup through the YouTube stream, I am also there on the chat; feel free to post your questions on the YouTube chat and I will ensure that our speakers today are able to answer them. So yeah, that will be the format today. This is more of a close-knit group where we talk to each other and have much more of a dialogue, and I think both the speakers today have said they want more dialogue with the audience on their topics. Suraj has insisted on it, and I think Ashok has also insisted on this from the get-go. So we'll have a lot of that today. Cool. I will be silent now, handing this off to Ashok. Ashok, it's all yours.

Yeah. Hey, good morning, everyone. Let me know if you're able to see my screen. Am I audible? Yeah, we can hear you and see your screen, Ashok. Yeah, that's good. So this is a talk on observability at Cisco EN. EN is Enterprise Networks; it's a group within Cisco which is responsible for routers, switches, etc. Ashok, do you want to bring up your video as well, maybe? I can, but the camera is not so good, and the network connection seems to be dropping again; so probably I'll bring it up at the end, during the Q&A. Okay, cool. Yeah, Ashok, let's start.

Yeah, so the normal disclaimer: these are my views and Cisco is not liable for them. One more thing is that whatever I have put here is from almost one and a half years back, when we were doing the research and exploration. It might not all be relevant today, because the stats or whatever explorations we did might not hold anymore. But for the most part it should be true. So let me proceed.

Who am I? I'm a software engineer working on tools and workflows. I've been at Cisco for almost 13 years now, working with the developer experience team. This team caters to a developer community of around 10,000 developers here in Cisco, so it's a pretty big population, as you can understand. You guys can stop me at any point and ask me about anything.

Ashok, just curious: so you serve the internal developer teams in Cisco, is it? So Cisco has a lot of lateral groups; I tend to end up calling Cisco a bazaar. Because of acquisitions and so many other things, there are a lot of groups, and one such group is Enterprise Networks. Enterprise Networks has about 10,000 developers, and within it there is a developer experience team. We work through the whole developer workflow, all the way from, say, an IDE to their continuous integration, a lot of internal- and external-facing tools, testing and reporting, and so many other tools. So yeah, observability is important to us. As I said, collectively we have about 30 to 35 tools which we provide across the developer life cycle, and these are polyglot, coming all the way from Java to Python; these days everything is there. Earlier, when we got into this business, it was just: is my application available or not available? But these days the quality of service also matters, because everything is mostly a pipeline, and something slowing down causes a ripple effect elsewhere. So quality of service is really important. Open source has a lot of tooling. Cisco traditionally used to build a lot of tools on its own.
But these days we are exploring and embracing open source, because there is a lot that can be leveraged. So we set up our observability stack around 2019, and these were the higher-level goals: build on some resilient platform, because we did not want to reinvent; at the same time, we wanted something that was rock solid; and then pick up open source software. We wanted to start with telemetry, because the whole observability thing was new to all of us, and if left to themselves, somebody would start with logging, somebody else with something else. So we decided it was better to start with telemetry. And then, because our applications have a lot of history and not all of them are in a CI/CD model, another important requirement was that we wanted some kind of monitoring where deployments and other things could also be tracked. This is where there is a terminology called synthetic monitoring in our systems as well, but I really don't know if that is going to coincide with what the next speaker means by it. That is what I meant here: we wanted to see how the traction is when newer versions are being deployed.

The platform that we have is Mesos based, because it has proven scale, there was a lot of in-house expertise, and there were some systems that had already been built and explored with Mesos. We used Marathon as the scheduler and Metronome as the other, cron-like scheduler. So those are the two schedulers that we had. We have set up a stage and a prod, you could say. The stage is the relatively conservative one; we have deployments across three geographies, and our dev setup has around 7 to 8 hosts. Here we call them UCS, so each is a very capable host: multi-CPU, multi-core, lots of RAM, et cetera. And prod is also across three geographies, but with 25 to 30 hosts.

So yeah, when it comes to telemetry, we wanted to do application monitoring; then we had a separate track of business telemetry; and then the normal infrastructure monitoring. Application in the sense that we instrumented things and got metrics out of them. Business was more along the lines of a higher-level view: how many users are there, what ROI are they getting, at a much higher level, a collective kind of aggregation. Then infra: we also ended up building a stack for monitoring the large number of hosts that we support. As you can see, there are around 10,000 developers, so we have more than 3,000 hosts which we are also monitoring. That was our infra monitoring stack.

So this was the pipeline. Once we have the metrics, they are ingested into Kafka through a small SDK. From there, they are picked up by a Telegraf agent, which also does some processing, and then passed on to Prometheus. Prometheus is what we use heavily; it is our brain, in a sense. Then we again have another Telegraf at the end, which writes into InfluxDB, which is ultimately queried from Grafana. I'm sure you guys will have a lot of questions; I'll probably try to address each of them through individual slides.
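As a rough illustration of the first hop in that pipeline (the small SDK writing metrics to Kafka in a format the downstream Telegraf consumer can parse), here is a minimal sketch in Python. The broker address, topic, and metric names are illustrative assumptions, not Cisco's actual values, and it assumes the kafka-python client rather than whatever internal SDK they built.

```python
# Minimal sketch of an SDK-style metric push to Kafka, assuming kafka-python.
# Broker, topic, and metric names below are illustrative placeholders.
import time
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers=["kafka.internal:9092"])

def push_metric(measurement, fields, tags=None, topic="app-metrics"):
    """Serialize one data point as InfluxDB line protocol and publish it to Kafka."""
    tags = tags or {}
    tag_str = "".join(f",{k}={v}" for k, v in tags.items())
    field_str = ",".join(
        f'{k}="{v}"' if isinstance(v, str) else f"{k}={v}" for k, v in fields.items()
    )
    ts = int(time.time() * 1e9)  # line protocol expects nanosecond timestamps
    producer.send(topic, f"{measurement}{tag_str} {field_str} {ts}".encode("utf-8"))

# Example: an application reporting one request latency sample
push_metric("http_request", {"duration_ms": 42.7}, tags={"tool": "ci", "region": "blr"})
producer.flush()
```

A Telegraf kafka_consumer input with `data_format = "influx"` can then pick these lines up, which matches the flow Ashok describes on the slide.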
So Prometheus is our centerpiece. It acts as a metrics aggregator for us. The input to Prometheus actually comes from Telegraf, which does the work of collecting, filtering, and then passing on. We use Prometheus heavily for our alerts and the custom metrics that are generated. Blackbox exporter is something we use heavily. We also use Prometheus for service discovery; there is a capability where Prometheus can work with Mesos, so that is also being used. And then we use it for federation. Federation in the sense that we have, say, around 35 tools; if we want a collective view of how things are at the org level, that is where we use federation. And again, a word of caution: we are not using either Thanos or Cortex or anything. I'll tell you why: because we don't have object storage yet, or at that moment they were not even ready. If you guys have any questions on this, you can stop me and I'll try to answer.

Ashok, I had a quick question. Why are you first sending to Telegraf and then sending to Prometheus? I think you can directly send to Prometheus. Probably I'll go through the whole stack and come back to that; that's a very interesting challenge that we had.

Telegraf. Telegraf is our collector and filter. It collects all the metrics, does some filtering, and then passes them on. Why we ended up using Telegraf is also because it has tons of plugins, and it supports templating in the sense that you can play around with environment variables. That was one big thing I found very hard to do with Prometheus; I don't know if it has changed, but in Telegraf you can refer to things through environment variables, and that helps us templatize it. And then there is one very useful plugin: the exec plugin, where you can execute some shell script at a given interval. It just needs to produce output in a specific format and then you are good. We use it for getting some specific metrics from the host: NFS, I/O statistics, and other things. Whatever else is not there, if you can write something, you can just use this. So that is Telegraf, a very useful thing that we found.

Then there is Kafka in between. We wanted a reliable transport. Yes, you can always push metrics to a Pushgateway, but we did not want to do that, because we wondered: what happens if the Pushgateway fails? What do we do with the data that is to be sent then? That is where we thought, okay, maybe we will write to Kafka and read it back from Kafka; there is a plugin in Telegraf to read from Kafka. All we need to make sure is that the data is passed in the right format. So yeah, as I said, it is used instead of the gateway. Something that troubled us was that we had to play with partitions and consumer groups, because if we fiddle with those, metrics start coming out of order and that causes all kinds of havoc.

Then the last part of the pipeline is InfluxDB. It is our data store. We don't do any downsampling. I think at the latest we were at 1.8; we still have not moved to version 2, and we had a retention of around three weeks. We chose the enterprise edition because it also allows replication across the sites. One thing we had trouble with, and whose tuning ultimately helped us, was the time series indexing. Once we tuned that, our actual data size shrunk a bit; otherwise it was going haywire. This is something that anybody using InfluxDB can also look into.
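To make the exec plugin idea concrete: Telegraf invokes a script on an interval, and the script only has to print metrics in line protocol to stdout. A hedged sketch of such a script is below; the NFS mount path and metric names are placeholders, not the actual ones Ashok's team uses.

```python
#!/usr/bin/env python3
# Sketch of a script for Telegraf's exec input plugin: print one line-protocol
# sample per metric to stdout and exit. Paths and metric names are illustrative.
import os
import shutil
import socket
import sys

def emit(measurement, fields, tags):
    tag_str = "".join(f",{k}={v}" for k, v in tags.items())
    field_str = ",".join(f"{k}={v}" for k, v in fields.items())
    # Timestamp omitted: Telegraf fills it in at collection time.
    sys.stdout.write(f"{measurement}{tag_str} {field_str}\n")

host_tags = {"host": socket.gethostname()}

# Disk usage for an NFS mount (placeholder path).
usage = shutil.disk_usage("/nfs/builds")
emit("nfs_usage", {"used_bytes": usage.used, "free_bytes": usage.free}, host_tags)

# 1-minute load average, standing in for "whatever else is not there".
load1, _, _ = os.getloadavg()
emit("host_load", {"load1": load1}, host_tags)
```

In telegraf.conf this kind of script would sit under an `[[inputs.exec]]` block with `data_format = "influx"` and a collection interval.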
Grafana is our last piece. It is used for visualizations and alerts. I think we started around version 6.0-something; now it is on version 7. It is heavily customized: we have added some capabilities in terms of PGSQL integration and SSO integration. Some custom visualizations have also been built, because some of the stock visualizations were not scaling to the level that we wanted. As I said, if I have to visualize 1,000 hosts, that could be 1,000 of whatever geometric shapes, each with its own things, and that was not working; that is where the custom visualizations came in. Some newer alerts have been added too. One specific thing, which I think I brought up previously as well, is that if you use template variables, then alerts don't work. That was a problem we were facing, and it was overcome with that customization. I hope these things are getting fixed in the latest versions.

So I'll now go back to that previous slide; I should have put another copy here. What happens is the metrics come in through a custom SDK we have written. The metrics go into Kafka in a specific format which Telegraf can understand, which is nothing but what InfluxDB recommends. Telegraf reads all these metrics; imagine metrics coming from all the tools, and there is a Telegraf agent per tool. It collects, it aggregates, and then does some massaging if required. Massaging mostly in the sense that we generally don't restrict people from adding their own custom attributes, so this is where we filter those off; otherwise the cardinality explodes. Once we get data out of Telegraf, we pass it on to Prometheus. This is where we do further whitelisting of the metrics that we are interested in, write rules for the ones that we really want KPIs out of, and also pass it on to federation. Then there is another Telegraf, which is mostly for conversion of some tags to fields. I presume you guys have used InfluxDB, but even otherwise: tags are what you group by and aggregate on when you use InfluxDB, and fields are just the values. Fields don't really contribute to cardinality, but tags do. So the tag-to-field conversion is for things like a user ID or some specific attribute which could be boundless; such things are converted in this second Telegraf, where they are retained but turned into fields. And then we write to InfluxDB and query in Grafana. Did I answer that question?

Hello, am I audible? Yeah, I got it. I just had another question. Did you also face scaling issues with Kafka? And what is the scale you are handling when you're sending metrics there? Scale is actually variable, because not all applications send at the same scale; it also depends on the time of day, day of the week, and so many other factors. I would say we were getting around 3,000 metrics per minute or something at one point. Again, whatever I'm saying is from about one and a half years back. So the scale did not really hit us that much in the normal application monitoring. But if you look at the host monitoring, the same metrics coming from 3,000 hosts, yeah, the scale was there. Our infra monitoring stack was a bit different. Even though I have not depicted it here, we had Telegraf agents installed on all those 3,000 hosts, and from there we were directly pushing to Prometheus. So the Kafka piece was not there in the infra monitoring: it was directly the Telegraf agent pushing to Prometheus, and then the other Telegraf to convert tags to fields and write to InfluxDB.
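The tag-to-field conversion described above boils down to: keep boundless identifiers such as a user ID as values rather than as indexed series keys, so they stop multiplying cardinality. A small Python sketch of the idea follows; the second Telegraf does this with its own processors, and the tag names here are just examples.

```python
# Sketch of the tag-to-field idea: move unbounded tags (e.g. user_id) out of the
# series key so they stop inflating cardinality, while keeping the value queryable.
UNBOUNDED_TAGS = {"user_id", "request_id"}  # illustrative list

def demote_tags_to_fields(point):
    """point = {'measurement': str, 'tags': dict, 'fields': dict}"""
    tags, fields = dict(point["tags"]), dict(point["fields"])
    for key in list(tags):
        if key in UNBOUNDED_TAGS:
            fields[key] = tags.pop(key)  # retained, but no longer part of the series key
    return {"measurement": point["measurement"], "tags": tags, "fields": fields}

before = {"measurement": "api_latency",
          "tags": {"tool": "ci", "user_id": "u123456"},
          "fields": {"duration_ms": 87.0}}
print(demote_tags_to_fields(before))
# {'measurement': 'api_latency', 'tags': {'tool': 'ci'},
#  'fields': {'duration_ms': 87.0, 'user_id': 'u123456'}}
```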
So you were using Kafka primarily because you wanted to handle higher scale, and also reliability; more than scale, reliability. What if the Pushgateway goes down? So we faced issues there. So the question from my side, Ashok, would be: did you load test the Pushgateway and arrive at the conclusion that it might go down, or was it more that the team already had expertise in maintaining Kafka, so you decided to go with something you knew was stable enough for you? Both. It was both, because we had applications running all across the globe, and we were putting data into the Pushgateway at very frequent intervals. We started seeing anomalies in the data where some things were not captured, and we could not pinpoint whether it was a network issue; at times we also saw that the Pushgateway had gone down and there were some issues with it. So we said, okay, let us do something else there.

Ashok, one question from me also. Why did you go with InfluxDB as the final data store and not Prometheus itself? Retention. Okay. So for retention you did not want to extend Prometheus by using Thanos or Cortex or anything, like you said. At that point of time, yeah, we did not have object storage in our realm, or we were not even experienced with object storage. InfluxDB was available, and whatever experiments we had done with InfluxDB worked well. And how does InfluxDB do the retention? Do you run InfluxDB in-house or do you use a cloud version? We use the enterprise version; it's on premise, but enterprise. Okay. And where does the retention live, on disk? It's on disk. Okay. So if it is on disk, then Prometheus could have also done it on disk, right? Well, InfluxDB we run as an application per se; Prometheus we run inside containers with ephemeral storage. We have a retention policy of around two hours for Prometheus itself, so we are okay if Prometheus goes down and comes back. Okay. So you are relying mostly on InfluxDB for the reliability and the enterprise features? Yeah.

So sorry, Ashok, also to barge in, since you are inviting questions: why have Prometheus at all in the pipeline? Why not just go with pure InfluxDB? Yeah, it's PromQL. PromQL and the rules? Rules and PromQL, yeah. PromQL is powerful. That's true. Okay. And we also use federation significantly. See, one thing, and I should have organized this slide better: if you look here, the data actually reduces as we move through the pipeline. Whatever lands in InfluxDB is actually less, because there is a lot of filtering and whitelisting that happens along the way. Some part of it happens in the first Telegraf, some part in Prometheus, and some transformation happens in the last Telegraf. That is where our pipeline was optimal for our use.

Ashok, I think just one last question. What sort of post-processing are you doing in Telegraf, if you can share a few things? For the first Telegraf: there is a plugin which reads a Kafka topic and gives you the metrics out of it, so that was the initial primary use. Okay, but that's just transferring the data, right? You said you also do some post-processing, et cetera; what is that? That was one. The second is that we rename some metrics there; that is also possible. Okay.
And the third one, as I said, is the exec plugin, with which we capture things you cannot really get otherwise: instead of writing a custom exporter, you just have a script which runs, gets the metrics, and pushes them. Oh yeah, and then there is a file plugin and all that stuff. The second Telegraf we primarily use for tag-to-field conversion; that is to keep cardinality in check. All right. Thanks.

Yeah, so I think I was at this slide, the alternatives we tried initially. We tried Metricbeat; it was very limited, and for us it did not scale. Probably we gave up a bit too early, too. We also tried VictoriaMetrics. It was very good, in the sense that it was one solution which actually worked, but it was in alpha, so as an enterprise we could not opt for it. We did try Thanos, but I think it was the very early days of Thanos; things were not refined, some things were working and some were not, and also due to our own restrictions on the hosts and other things it did not go forward. Also, before settling on InfluxDB as the backend, we thought we should have Elasticsearch as the backend. That is one more thing we tried. We were able to capture data with some work, but then the problem started when we began querying. Forming an optimal schema was an issue, because, as I said, we restrict our developers on the format, but we don't restrict too much; if they want something, they can always send it. So our schemas were not very strict or very optimal. And we also noticed that Grafana would lag heavily, because the queries that came out were very suboptimal at that point of time. That is when we switched to InfluxDB. Any questions there? I can take them.

I mean, Elasticsearch is not really a TSDB by any measure, right? So it's... I would say it's a bit of an interesting use case. Elasticsearch is based on Lucene, which is a document search engine, so it's kind of difficult to pair up with the structure of a TSDB. Yeah. So what happened there was, as I said, there was a third axis, if you noticed, in that diagram: we were also capturing business metrics, and for that Elasticsearch was the right one. So we still have those in Elasticsearch. We thought, okay, let's have one solution and try to put everything in it. But as of today the business metrics are still in Elasticsearch and the front end is still Grafana; that portion has not changed. It's just that the schema there is very restricted, and it works.

So when you say business metrics, what do you mean? Is it the number of visitors to a site, or something deeper like revenue numbers? Can you give me a sense of what business metrics are for you? Business metrics could be, if I take, say, a continuous integration pipeline: how many complete successful runs were there, how many commits, LOC, everything. It could go all the way through. As I said, we support a lot of tools, so for each of them it's different. And also from the perspective of a customer, and for us the customer is the internal development teams, it is whatever the metrics are at their org level, at their aggregation. Got it. So primarily things like the number of pipeline runs would be business metrics.
And Ashok, those business metrics could easily have been sent through the existing metrics pipeline that you showed us. Why choose Elasticsearch for that? Because that part of the solution, the business telemetry, was already there. It was already there. Yeah.

Then comes the logging part. We wanted to correlate metrics with events; it was the first step towards debugging. We also wanted to see if some kind of log analytics could be done, at a very basic level: how many errors per day, or response time crossing some threshold, and whether it is because of warnings, because of retries, etc. Our pipeline is fairly straightforward. We have an SDK for people to push, and then we use Fluentd or Fluent Bit. Then it is mapped onto Logstash, then it goes to Elasticsearch, and then in the Grafana log panels we map it. Any questions on this? I think this is pretty standard for most of us. Fluent Bit is easier, but it is not available on some of the older OS versions that we had; that is where we used Fluentd. Filtering is important. For example, we tried hooking it up to Jenkins and there was a huge overflow within a day, so filtering is very important. Data enrichment is easy; we can add multiple things to the logs that we are sending, and Fluentd was very good in that sense. I am not an expert in this area, but when it came to aggregation, Fluentd can also aggregate; however, there were some specific queries, which I am not very aware of, because of which we ended up choosing Logstash instead of the Fluentd aggregator. Otherwise our initial plan was to have Fluentd through and through: SDK, Fluentd, Fluentd aggregator, Elasticsearch, and then Grafana. But there were some conditional queries, if I remember correctly, where Logstash fared much better than Fluentd at that point of time, which is why Logstash came in as the aggregator.

Then, yeah, we wrote two SDKs. One was for the telemetry, where instead of a push gateway we wrote to Kafka. The other, for logging, was custom log handlers. It's fairly straightforward: you just have another handler and make it write to Kafka. And if that does not work, we can always use Fluentd for logging. That was on that.
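For the custom log handlers Ashok mentions, a minimal Python version might look like the sketch below: a logging.Handler subclass that serializes records and publishes them to a Kafka topic, from where Fluentd, Logstash, or any other consumer can pick them up. The broker and topic names are placeholders, and it again assumes the kafka-python client rather than Cisco's actual SDK.

```python
# Hedged sketch of a custom log handler that writes to Kafka (assuming kafka-python).
# Broker and topic names are illustrative placeholders.
import json
import logging
from kafka import KafkaProducer

class KafkaLogHandler(logging.Handler):
    def __init__(self, brokers, topic):
        super().__init__()
        self.topic = topic
        self.producer = KafkaProducer(bootstrap_servers=brokers)

    def emit(self, record):
        try:
            payload = {
                "ts": record.created,
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            }
            self.producer.send(self.topic, json.dumps(payload).encode("utf-8"))
        except Exception:
            self.handleError(record)  # never let logging take the application down

logger = logging.getLogger("ci-tool")
logger.addHandler(KafkaLogHandler(["kafka.internal:9092"], "app-logs"))
logger.setLevel(logging.INFO)
logger.info("pipeline run finished")
```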
I think that is all I have on that. Tracing: we are experimenting with OpenTracing and OpenTelemetry. As I was telling Joy some time back, it's moving very rapidly, which makes it hard for us to put down a milestone and adopt it; that is what is delaying us. We are also looking at the ROI of not instrumenting by hand, because OTel also supports auto-instrumentation and some debugging out of that. I would be delighted to hear from you guys if you have tried any of those. And this is a later realization: if we can just hook Fluentd up to our container engine and take those logs, that's the best log you can get. Also, of late I'm seeing that if we have a service mesh kind of scenario, and I am just exploring Linkerd, you already might get all the metrics, in the standard USE and RED formats. So those are some things we are trying to look at going forward.

Some things that have troubled us in the past, and that we should always be looking at: how do we monitor the monitor? We really don't have a standard answer for that. Then alert fatigue. I mean, yeah, you can build the best solution, but once people's interest is out of it, getting that interest back is very difficult, because people have their own ways of finding out if their application is down or not performing, ways which have developed through the years. I'm not sure if others have faced this, but being in an enterprise company for decades, this is how things are: anybody who owns an application will have his or her own way of detecting issues; it could be a simple cron or anything. So it is one thing to promise something good and fancy, but if it becomes a burden, then people are just out of it. That's another problem. Then capacity planning is something... of course there's no formula for that; I would say it's something we've been learning, but it sometimes keeps hitting us.

Then upgrade loops. To quote one small example: as you saw in our pipelines, there was some issue in one of the Prometheus instances. We discovered that the issue was fixed in some later version of Prometheus, and then we had to upgrade all of them. So over time, we have learned to... Hi, Ashok, we lost the audio. ...wherever graceful restarts... at the end of the day it is supposed to be always up, performing, and correct. So those are the issues. Hi, Ashok, can you repeat the upgrade loops point? We lost your voice in between. Yeah, upgrade loops as in: in our pipeline we have on the order of hundreds of Prometheus instances. Let's say they are at some version, say 2.x-something. If an issue is discovered and we find it is fixed in some future version, we have to do a rolling upgrade of all of them. During that time, if something goes wrong, the whole pipeline is stuck, or even in the best case we end up losing some metrics, which start showing up as blanks in Grafana. And that again raises the question of whether the monitoring system was down or the application was down, because we have people all over the globe, 24x7, and there is a lot of back and forth that happens. So that is the first point I was trying to make: how do we monitor the monitor? That's the biggest challenge we have. I guess that's it. That's all I had.

Thanks, Ashok. That was a great, insightful talk. We don't usually get to see the inner workings of massive enterprise orgs, so I'm glad you could be here. A couple of questions from the attendees. I think one person asks here, Sudhir asks... Sudhir, I will unmute you, why don't you ask the question yourself? Hey, hello. This is my second observability meetup. I am from Facebook. I just want to understand: what is the data volume that you have for logs? And do you have any numbers you can share regarding the metrics, the cardinality or the number of samples you ingest? Okay, here is what I have. My colleagues are actually here, Srinivas and Raghavendra, both from our team. Give us a couple of minutes; I guess they'll look it up and give you whatever we have as the latest. It's hard because things keep rolling. Raghavendra, Srinivas, hey guys, good to see you. Can you share some numbers on the volume of data and metrics that we have?
I think a ballpark figure would help, because sometimes when we look at certain stacks that work in a certain context, the scale and the volume of ingestion are tied into that context, right? So having a ballpark number helps us understand why this particular thing works. I think the last I remember was around 180 GB in InfluxDB for the metrics alone. For Elasticsearch, for the logs, I'm not really sure, because it was a rolling index which keeps changing every day; I would say we were getting about 40 GB every day, and three or four indices of that sort.

Hi, Ashok. This is Raghavendra. Hello everyone. So for the InfluxDB data, there are two kinds of telemetry that we do. One is the host monitoring, monitoring all the machines, and the other is the monitoring of applications. For the monitoring of applications, the application telemetry, the average number of points per second that we get into our InfluxDB setup is 18k, so around 18,000 data points per second from all the different applications we have. For the other one, the host monitoring, the monitoring of the infra and lab machines, there are around 6,000 of them that we monitor, and the average number of points per second we get for them is around 670k. So around 670k points per second is the volume of data going into InfluxDB for the host monitoring part. And for that host monitoring part, because the volume is so huge, we only have a retention policy of three days, which takes up around 200 GB of space. After that, we downsample and transform it, roll the data up hour-wise and then day-wise, and put it into another metric. Cool. Sudhir, I think that answers your question. Yep, thanks.

So Ashok, one request from my side: when we publish your PPT for the wider audience after the meetup, it would be really nice to have another slide at the end with some of these ballpark numbers. That would help readers of the PPT contextualize your whole pipeline whenever they browse it. That's a small request from my side. I'll do that and share.

I think at different points Suraj was coming in and talking about a couple of things: about the Grafana Agent, where it could be useful for you, and also the bug that you mentioned that got fixed later on. Suraj, do you want to talk about that for a couple of minutes? Yeah, sure. Can you guys hear me? Okay, cool. So Grafana now has the Grafana Agent. Think of it as a Prometheus without the storage. It's mainly meant for use cases where you have centralized long-term storage, something like Cortex, Thanos, or any other cloud service that offers Prometheus-compatible remote write, running on one side, and you have applications that expose Prometheus endpoints. You can use the Grafana Agent to scrape your applications and then remote-write to your long-term storage. If you didn't use the Grafana Agent, you would have to use Prometheus to scrape and remote-write, which would use more resources and is not as well suited, because the Grafana Agent is built mainly for this remote-write purpose. Also, there is a Prometheus agent that is in the works.
Parts of the Grafana Agent will be donated and upstreamed there, and then we'll build on top of that. So is the Grafana Agent now part of only Grafana's enterprise stack, or is it already available out there? It's open source. You can go to github.com/grafana/agent and run it. If you have Cortex running locally, you can use the agent to remote-write to it. So I don't have to go through the entire Prometheus plus remote-write configuration; from the host itself I can directly push the data? You still have to configure the Grafana Agent, and the configuration is pretty much the same, but yeah. So now we have a push-based model in the pipeline as well; once the Prometheus agent you mentioned comes along, we'll have a proper push-based pipeline alongside the pull-based ones. The agent also has logging support, so you can push logs, and you can also push traces. So it's basically an OTel-agent-plus-plus, sort of, because it's metrics, logs, traces, everything taken together. Yeah, you can think of it that way. And would I be able to use this agent to also write in the OTel format if I want to, or does it only support Cortex as a backend? I think any Prometheus remote-write-compatible backend is fine. It's not picky about that; as long as you accept remote write, it will happily send data to you. Also, I think we use OTel for the tracing part, so we have not written our own tracing.

Suraj, why not just use OTel, or why not just support the OTel agent? Why introduce a new Grafana Agent? So the Grafana Agent also has a bunch of things that are meant for Grafana Cloud. Grafana Cloud has something called integrations, similar to integrations on other platforms. Let's say you have NGINX running: you can enable the NGINX integration in Grafana Cloud, it gives you a configuration you can apply, and then we would pull all your NGINX logs and metrics and ship them to Grafana Cloud. Okay, but if somebody wants to run it themselves, is there any advantage to the Grafana Agent? The Grafana Agent has a couple of exporters built in; for example, you don't have to run node_exporter on your own, it has that inside it, and a bunch more. So it's a wrapper over the existing Prometheus pieces, plus the push model, the remote-write path, and the enterprise facilities of Grafana Cloud; sort of a wrapper on top of multiple things taken together, right? Makes sense. Yeah, it's also lightweight, because the query and storage parts are stripped out, and if you have tens of services then that overhead adds up.

Okay, I think... should we give some more time for other questions, or should we just segue into the next talk? What do you say, Suraj? Should we take a short break, or just a conversation break? Yeah, also I wanted to mention one more thing. The metamonitoring topic came up, how do we monitor the monitor? We actually face that problem ourselves, and we think we have built a reasonable solution for it. We wrote a blog post about it; not sure how to share it on screen, so I sent the link in the chat. It's on our blog, and what we do is we have a different Prometheus that is not related to our Grafana Cloud Prometheus, and that Prometheus monitors this Prometheus. We also use something called Dead Man's Snitch, and there is an always-firing alert to test the whole alerting pipeline, so if that alert stops coming in... yes, that's the post. This post has details on how we do metamonitoring. Thanks.
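To make the always-firing-alert idea concrete: the separate meta-Prometheus carries an alert that is designed to fire forever, and something outside the main pipeline pages you if it ever stops arriving. Below is a hedged sketch of such an outside checker that queries the Prometheus HTTP API; the URL and alert name are assumptions for illustration, and hosted services like Dead Man's Snitch implement the same idea as a receiver for the alert's notifications.

```python
# Sketch of a dead man's switch checker, run from outside the main pipeline
# (e.g. via cron). URL and alert name below are assumptions for illustration.
import sys
import requests

META_PROM = "http://meta-prometheus.internal:9090"
QUERY = 'ALERTS{alertname="DeadMansSwitch", alertstate="firing"}'

def heartbeat_alive():
    resp = requests.get(f"{META_PROM}/api/v1/query", params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    return len(resp.json()["data"]["result"]) > 0

try:
    alive = heartbeat_alive()
except Exception as exc:
    print(f"PAGE: cannot reach meta-Prometheus at all: {exc}")
    sys.exit(1)

if not alive:
    # The always-firing alert vanished: the alerting pipeline itself is likely broken.
    print("PAGE: DeadMansSwitch alert is not firing; check the monitoring stack")
    sys.exit(1)

print("alerting pipeline heartbeat OK")
```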
I think I first encountered DMS while working at Torxstar with Thanos and stuff. We finally did not use it, because what we did instead was deploy another single Prometheus node which would scrape data from only the observability stack components. Otherwise it just turtles all the way down and you cannot stop; you have to keep going up the chain, who monitors the metamonitoring's monitoring, and so on. So I think one or two levels, depending on how distributed one's infra is and how multi-tenant it is. You have to stop at some level. You can have one metamonitoring per tenant if you have a separate stack for each tenant, or if, like me, you are only responsible for a single org, you can stop at a single level of metamonitoring, where you have a completely isolated Prometheus or something running which monitors only the components of the observability stack, and that should be good enough, I guess. Yeah, for smaller, contained workloads; not for massive scale like Grafana's, obviously. So yeah, cool. I will stop here. Any thoughts on alert fatigue? Are there any creative ways you guys are handling it?

If I may pick up that question: the best way I have personally been able to handle it is a bi-weekly infra review, a sort of metric review cadence, where you solve the top K, basically. You decide what K is for your org, and by either volume or criticality you analyze the top K alerts that your team wants to solve, and every cycle you ensure that those top K items, those bugs and those issues, are solved. That way, the next cycle you can pick up the next chunk of alerts. So instead of being reactive, an alert comes in and should I solve it right now, you batch the problems together and solve them as proper sprint tasks in every sprint. That has personally helped me solve a bunch of these issues and reduce alert fatigue, by telling people: ignore the volume, ignore this alert right now; if the volume exceeds a certain threshold we will take a look, or we will look at the top 10, unless there is a P1 or P0, of course, and if there is a P1 or P0, that's an immediate response. I'm talking about the alerts which are, as you said, Ashok, sort of ignored by most of the devs; they have learned to live with those alerts even though they're silently impacting your org, so it's like, yeah, let them be for a bit. In that case the top-K analysis has really helped, along with a weekly or bi-weekly cadence, whatever your sprint cycle is, where you do one multi-stakeholder meeting and show them the impact of the alerts that are causing havoc across the infra. That process really helps, I think. Yeah, and I guess metamonitoring is part of the answer too, because there should be trust in the monitoring system; alerts should not get ignored because the monitoring system itself had issues. Yeah, go ahead, somebody was saying something.
Yeah, I can take a stab at that. In the team I work in, we have dedicated on-call engineers, and part of the on-call duties is monitoring. We have three separate tiers of alerts: sev 3, sev 2, and sev 1. Sev 3 is like, hey, you can take a look at this later today, tomorrow, the next business day, a couple of business days; we can live with it. Sev 2 is: things are going to blow up, look at it right away. And sev 1 is: there is an active incident, people are shouting about it, people cannot do anything. Sev 2 and sev 1 we consider the strictest tiers, so we don't promote an alert to those levels unless it is genuinely that critical. The way to rein things in at those levels has been that our developers are also part of the on-call schedule, not as primaries but as secondaries, so in case the primaries are busy with something, it goes to a secondary. That has brought in a sense of discipline, knowing that they can't just create alerts willy-nilly and say, oh look, it's not my responsibility, the so-called DevOps team is going to look at it and they will call me if it is that urgent. That sense of discipline has helped.

The biggest problem has been the sev 3 ones, because sev 3 is "we'll take a look at this later," but that later never comes; for whatever reason you schedule some tasks, some other priority comes in, and it keeps going to the backlog. For that, we decided to have one engineer working throughout a week dedicated to on-call, and not just on-call for incidents: as part of the monitoring task, we are constantly looking at which alerts are coming up frequently. If something is coming up frequently, it should not be, so we either need to retune or retweak it, or we need to fix it at its root cause; for example, if it's a warning, we need to figure out whether it is because of the application or because we are running into infra constraints, whatever it is. This has helped reduce the alert fatigue quite a lot. We are still not quite there, but maybe soon enough we'll get there. I mean, we've gone from hundreds per week to, I don't know, the last time I was on call I didn't get a single PagerDuty call, which is a fantastic thing. We always say, if you go on call and get a big-bang outage, then you're set for the rest of the week because nothing else is going to happen. Unfortunately that hasn't always proven true, because if you're unlucky enough, your on-call week is going to be a living nightmare, and I've been there. But the point is, like I said: bring in a sense of responsibility by making the people who set up the alerts also responsible for handling them, and have one person dedicated to constantly looking at them. And I realize that not every team will have the resources or the people to handle this, but it's something you need to consciously think about.

Yeah, one more point here, and this actually ties back to one of the older talks we have had here by Gaurav, who works at Red Hat: the SRE practice itself and error budgeting. With too many alerts and too many P1s and P0s, you need to spend your error budget in a proper, structured fashion, so that once you have a bulk load of alerts taking up a chunk of your error budget, you
need to prioritize sorting that out, and you need to de-prioritize feature work for a certain duration until your error budget has recovered. So you have to couple observability with SRE practices for the engineering org to actually correct itself; otherwise there will be alert fatigue and that will cause issues to pile up. And just putting it out there: Satya is a core engineer on Adobe's API gateway. Satya, could you please introduce yourself to the new folks here, so they get some context on what you're talking about?

I've been off video because it's sort of early morning; I'm in Bucharest, Romania, and it's 9:25, so I consider that an early one. But I've been listening in the background, and I have a Discord stream also happening, so I've been quiet and listening. Anyway, hi, I'm Satya. I work at Adobe as one of the SREs on the API platform. Basically, any of the cloud services from Adobe that you use, you're most likely reaching our infra; we are the API platform for all of Adobe. So most likely any traffic from any Adobe cloud service that you're using, say XD, Spark, whatever it is, flows through our infra, and we handle quite a lot of traffic. Ever since the pandemic, things have blown up, which we did not expect, and we have an active project right now where we are trying to improve our performance even more so we can scale up better. Joy, if you want, I can talk about how we do things on logging, on metrics, and whatnot, at the next meetup. Oh, awesome, absolutely, we'd be happy. I think Pranay and I were already saying we should get you here for a session. Should we do a banter session next time on metamonitoring and how various orgs sort of... I think let's do Satya's talk, at least we have one talk confirmed, and then do some sort of open discussion on the metamonitoring one. I think there would be some good... I'm sold on that, man. Yeah, awesome. I think that would be great.

I think we are running just about on time, so let's start with Suraj's talk; yeah, just shy of the next talk's slot. So unless there are any pressing questions from the rest of the audience, we should be starting. If someone has a pressing question for the banter session, also feel free to use the last three minutes here to ask, and someone here will try to answer, or we can just discuss; but otherwise we'll start the talk right now.

Oh, and just a quick shout-out to our Telegram group. We also run a Telegram group where we have around 100 members now, mostly people who have attended this meetup or have an interest in observability. If any of you are not there already, we'd love to have you join; we share lots of interesting stuff, links, etc., and you can take many of the conversations offline there too. I posted a link in the chat, and the link is also available on our Hasgeek page if you head there. Thanks, by the way, to Hasgeek for hosting the platform and hosting the meetup for the last couple of editions, I think all the 2021 editions, so you can get the link there as well. Other than that, you can DM any one of us, Pranay, me, Ankit, a bunch of people here, if you want to join; you can tag us on Twitter, message us on Twitter or Telegram, and we can add you to the group. We have really nice conversations there. Yeah, and also, if anybody wants to speak in
the upcoming meetups, just feel free to ping any of us, because we are always looking for new speakers, and most of the time it becomes a last-minute chase, as it did this time. If any of you are interested, or have any topic around observability; I think Joy has a set of topics we can cover, but really anything about observability, SRE practices, how you monitor systems, even just sharing experience. I think there were some folks from Freshworks joining the call, so if you guys want to share something in the next talks, like Ashok did this time, that would be great. I think that's all from us. Let's have Suresh take over, and we'll learn more about synthetic monitoring. And thanks, Ashok, once again for sharing your journey and experience with Cisco EN; really glad to have you.

So, can you see my screen? Can you hear us? Okay, cool. Yeah, so today I'll be talking about synthetic monitoring: what it is, why you need it, and how to do it, from the basics through to getting it implemented at your organization. So who am I? I'm Suresh. I've been working on Grafana Cloud for the past year-ish; I've been busy building the synthetic monitoring product for Grafana Cloud. Before that I was at an early-stage startup and built a bunch of core things, from databases to ETL pipelines and whatnot, including monitoring.

So let's define what synthetic monitoring is. We all have monitoring on our systems: we monitor what our servers are doing, what our databases are doing, how much CPU we have, how much CPU is being used, and whatever other resources our services have, and we alert on that. But most organizations don't really know how their applications are doing from their users' point of view. Say your users in certain locations are seeing degraded performance: your internal monitoring will say everything is fine, things are running as expected. You would not know these things unless you monitor from your users' point of view. So that's synthetic monitoring: you test externally visible behavior as your users would see it, and if you have users all over the globe, you would need to monitor from all over the globe. Synthetic monitoring is part of the broader black-box monitoring category, so sometimes we also refer to it as black-box monitoring, but in this talk we will just call it synthetic monitoring.

A quick question I wanted to ask everyone: how many folks here have used synthetic monitoring or are familiar with it? Feel free to unmute yourself and let me know if you use synthetic monitoring, and if yes, which product you use. I heard Cisco uses the blackbox exporter, from the last talk. We mostly use... it used to be Runscope; we are migrating towards New Relic. I see, nice. Any other folks here who already use synthetic monitoring? Okay, so I guess it's not as common as other monitoring practices.

So why do we need synthetic monitoring? As we said, we want to find out how our application is doing from our users' point of view. To truly find that out, we actually need to pretend we are users and test our applications as if our users are using them. And if our users are all over the globe, we would
use it from all over the globe. So I want to share a personal story of mine: I made a change that broke production for four-plus hours, and we didn't even notice that production was broken for those four-plus hours. The change I made was that I added a route in the load balancer to redirect traffic to our new website. A fairly small change; nobody thought it would impact production. I still don't understand it, I think it was something I did on Google's load balancer UI, but because of that change, actual production traffic also ended up going to the new website. We didn't notice it until late that evening. It was 11 p.m. and we had a scheduled maintenance; we started the maintenance and then found out we could not really use our production application. It was down, it was returning errors, and everything looked fine from the inside; our servers and our databases were all good. What had actually happened was that our production traffic was being routed to Netlify, because that's where we had our website. I looked through the headers, saw an x-netlify header, and that's when I found out that, shoot, we messed up, and then we fixed it. Even assuming we didn't have that maintenance window, things would have gone on for longer and customers would have been impacted. But yeah, that's how I took down production for four hours and didn't know that production was down.

We also had instances where our monitoring itself went down: during an outage, the monitoring went down and we didn't have anything alerting us. Nobody told us; our customers told us instead of our monitoring systems. That's not a good state to be in. And also, assuming you have regional deployments, you have configuration that applies only to certain regions. Let's say you do a deployment, and your team is sitting in one region, developing and deploying globally, and things are broken only in another region. It would be really hard to know, because from your point of view, from your region, things are all fine, you can use your application, but customers in that region would see your application either broken or with degraded performance.

There was one such incident with the folks at K6.io. Daniel is an SRE at K6.io, and K6 recently joined Grafana Labs, so I was talking to him about K6 and synthetic monitoring, and here is what he said: they misconfigured CloudFront, and because of that they had a regional caching issue. Nothing major, things were just slow in one region, and they wouldn't have found it otherwise, because there was no way to find it; folks in the other regions were seeing it all fine. But because of synthetic monitoring, they saw that response times and latencies were up in that region, and then they hunted it down and fixed it. So it's not just me who breaks things; it's almost all of us.

So another question I want to answer is how to do synthetic monitoring. We talked about wanting to monitor from our users' point of view. That means we have to leave our internal private network, get out of it, and monitor from the actual internet, and we want to travel through the internet just like our users would, going
through different networks and then reaching our production services. Even if you, let's say, use Google, and you set up your infrastructure on Google and you use Google to set up the monitoring as well, you would still be going through Google's network unless you tell Google not to route through its own network. So it's a bit tricky to actually go through the internet if you are not going out of that cloud provider, because Google has its own optimized routes and whatnot; they do their own networking magic to make things fast and try not to leave their own network. That's another important thing when you want to do synthetic monitoring: you want to make sure you're actually coming through the regular internet, the way your users do.

Next is exactly how to do it, and there are two options. One is you can do it yourself. The blackbox exporter is a popular project in the Prometheus ecosystem, so if you have Prometheus, you would configure and deploy the blackbox exporter at the locations you want to monitor your applications from, and you have to be extra careful about the networking magic that the big cloud providers do; you have to actually leave your cloud provider and deploy it on some other cloud provider that you don't use. There is one more project from Google, I think it's called Cloudprober; I have never used it, but it is meant for this kind of black-box or synthetic monitoring. And then there are always cron jobs: you can write cron jobs to ping your websites and monitor them, but that's not really scalable or easy to manage.

There is a trade-off here: there are certain pros of doing it yourself, and there are some downsides. When you do things yourself, you have full control over what data to collect, where to store it, how to format it, and whatnot; you can do whatever you want. You don't have to start from scratch; there are existing open source tools, the blackbox exporter is there, and I'm pretty sure there are others as well, depending on how and what you want to monitor. But the downside is that now you need to monitor and maintain this service. And let's say you have certain assets like marketing websites: then people who are not engineering savvy have to go through engineering to get their services monitored, to get their checks up and running, and to modify those checks as well. That could be seen as a pro or as a downside, because now you have one more hop and engineering has to be involved. And again, the metamonitoring question comes in: you use synthetic monitoring to monitor your overall application, but who monitors your synthetic monitoring? As I said, it's turtles all the way down.
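Before moving on to managed services, here is a hedged sketch of the simplest do-it-yourself option mentioned above, the cron-job-style check: probe a URL from wherever the script runs, record status and latency, and expose them in Prometheus format so an existing stack can scrape and alert on them. It assumes the requests and prometheus_client libraries; the target URLs and port are placeholders, and a real setup would run this from several external locations.

```python
# Minimal synthetic-check sketch: probe endpoints and expose the results as
# Prometheus metrics. Assumes `requests` and `prometheus_client`; the targets
# and port are illustrative placeholders.
import time
import requests
from prometheus_client import Gauge, start_http_server

TARGETS = ["https://example.com/", "https://example.com/api/health"]

probe_success = Gauge("probe_success", "1 if the probe succeeded", ["target"])
probe_duration = Gauge("probe_duration_seconds", "Probe round-trip time", ["target"])

def probe(url):
    start = time.monotonic()
    try:
        resp = requests.get(url, timeout=5)
        ok = 1 if resp.status_code < 400 else 0
    except requests.RequestException:
        ok = 0
    probe_success.labels(target=url).set(ok)
    probe_duration.labels(target=url).set(time.monotonic() - start)

if __name__ == "__main__":
    start_http_server(9115)  # scrape this endpoint from your Prometheus
    while True:
        for url in TARGETS:
            probe(url)
        time.sleep(60)  # probe interval
```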
The other solution is to use a managed service. It's easy to use, there is no engineering overhead, and most managed services have a nice UI where people can just put in a URL or an endpoint and it starts getting monitored. They have global locations, you select where you want to monitor your applications from, and they take care of the rest; you don't have to worry about operating it, monitoring it, and then meta-monitoring it. There are downsides to that as well. They might not have the locations you want: say you want to monitor your application from a location where they don't offer a probe, then there is no way to gather data from that location. Also, most managed services don't mix well with your existing white-box monitoring. You have all of your internal data, but the managed service will have its own UI and its own way of looking at data, so people have to learn how to use that service and how to get data out of it, and it's not easy to mix with your existing data. Say you're using Grafana: you can set things up to pull that data in alongside your other data, but again, it's work. Also, say you have some internal applications that you want to monitor. They are inside your network, you don't want to expose them outside, but you still want to make sure they are up; things like VPNs and other services that are critical for your organization, because almost every organization has internal applications that are not public per se. The only way to reach those applications is to be inside your network, and not every managed service offers that. So that's managed services for you. Now, instead of going over all the managed services and all the open source solutions, I'll just talk about what we have at Grafana Cloud, because that's what I've been building for the past year or so. Grafana Cloud Synthetic Monitoring looks something like this: it lives inside your Grafana instance as a plugin, and you can monitor things from all over the world. People who know Grafana might say, wait, wasn't this called worldPing, didn't you have one more service that did exactly the same thing? And you would be correct. There was such a service; it reached end of life last April, it is in read-only mode right now, and it will be taken offline on August 1st. worldPing was the very first service from Grafana Labs, back when Grafana Labs was still called raintank. It was also a synthetic monitoring product, but it had a bunch of things we wanted to improve upon. The ecosystem has decided to move on to Prometheus, and worldPing was using Graphite as its data format. It also had locations, it would also run your checks all over the world, gather metrics, and store them in Graphite format, but there were a bunch of downsides with that. Prometheus is easy to extend, you can throw in labels and do a bunch of things with it, and PromQL is very powerful. We kept feeling we could make worldPing better, but because of Graphite we couldn't, so we decided to deprecate worldPing and move forward with a Prometheus-based product. Another major requested feature was logs: say you have a check running from a remote location and only that location is failing. Your checks would generate logs, but there was no way in worldPing to surface those logs to you.
So a customer would have to either create a VM or use a VPN to browse their services from that location and debug it themselves, and that was suboptimal and painful; we don't want people to have to do that. Logs were another major missing feature of worldPing. So that's the history: in 2015 worldPing came out, in 2016 it came inside Grafana as an app plugin, and today we have moved on from worldPing, shut it down, and developed Synthetic Monitoring, which is the same kind of product but new and better in various ways. If you want to know more about what's different from worldPing and what's new in Synthetic Monitoring, my colleague Teddy wrote a blog post about it; I have linked it in this slide, so you can go to the blog post and read more about the exact product differences. So why a new product? The world is moving to Prometheus. Almost everybody is adopting Prometheus, and all major vendors support Prometheus ingestion and Prometheus querying, so we also wanted to move with the world and have Prometheus-based metrics for your synthetic monitoring data. Then there are the logs I mentioned. We want logs; we don't want our customers to have to connect through a VPN and try to reproduce a problem that our probe is seeing, because sometimes they can't reproduce it at all, that's the internet for you. Often the only way to know why something happened is to collect the logs and let them tell you exactly why it happened. So the new synthetic monitoring product has logs. Also, because it is built on top of blackbox exporter, we have all the configuration flexibility that you get with blackbox exporter; we expose all the configuration options. Blackbox exporter exports Prometheus metrics, so our synthetic monitoring product also produces Prometheus metrics, and we integrate well with the Grafana Cloud offerings. Metrics are stored as Prometheus metrics in your Grafana Cloud hosted metrics instance and logs are stored in your hosted Loki instance, so if you want to mix and match with the other data you have in your hosted metrics and hosted logs instances, that works out of the box without any pain; you don't have to gather the collected data and move it around and worry about it, it's already there. Also, it is inside Grafana: you don't have to use another tool or teach people another tool, it's all Grafana and Prometheus. And it integrates well with Cloud Alerting. Grafana has a Cloud Alerting product, a hosted Alertmanager offering, so you can use the same alerting infrastructure you already have, the infrastructure you use to manage your alerts, your routes, your rotations, for your synthetic monitoring too. You don't have to go to yet another product, add people there, and configure email alerts separately. And because it is Alertmanager, you get all the great things Alertmanager has to offer, including all the places you want to get alerted: PagerDuty, email, Slack, Telegram, whatnot.
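Since the hosted product reuses Alertmanager, routing those alerts to email, PagerDuty or Slack looks like ordinary Alertmanager configuration; a rough sketch, where the receiver names, keys and webhook URL are placeholders:

```yaml
route:
  receiver: team-email                  # default receiver
  routes:
    - match:
        severity: critical
      receiver: team-pagerduty          # page only on critical alerts
    - match:
        severity: warning
      receiver: team-slack

receivers:
  - name: team-email
    email_configs:
      - to: oncall@example.com          # global SMTP settings omitted here
  - name: team-pagerduty
    pagerduty_configs:
      - service_key: "<pagerduty-integration-key>"
  - name: team-slack
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
        channel: "#synthetic-monitoring"
```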
So yeah, another bit that we have in the new product is private probes. Private probes are probes that you can run in locations where we don't have our own probes, or inside your internal networks if you want to probe your internal applications. You give a private probe a token and bring it up; it then connects to our API, gets all the work it needs to do, starts executing checks, collects metrics and logs from the checks it runs, and pushes them right back to our cloud. You just need to run a process, that's it. If you want to run it as a Docker container, we have that too, and I'm assuming most of your organizations can throw a Docker container somewhere by now; if not, we also have other ways to install it.
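As a rough sketch of what bringing up such a private probe as a container might look like: the image is the open source agent discussed in the talk, but the exact flag and environment-variable names below are assumptions from memory, so treat the repository README (github.com/grafana/synthetic-monitoring-agent) as the authority:

```sh
# Hypothetical invocation; flag/env names assumed, not verified against the README.
docker run -d --name sm-private-probe \
  -e API_TOKEN="<token copied from the private probe setup screen>" \
  grafana/synthetic-monitoring-agent \
    --api-server-address=synthetic-monitoring-grpc.grafana.net:443 \
    --verbose
```

As described in the talk, the agent only needs outbound connectivity to the Grafana Cloud API and pushes its results back; nothing has to be exposed inbound, which is what makes it usable inside private networks.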
So now I want to do a live demo, all hail the demo gods, but here we go. I have Synthetic Monitoring installed in my personal Grafana Cloud instance; you can see it there, that's the icon. Let's go to the home page. I created a bunch of checks for this demo, and this is the last three hours. It supports DNS, HTTP, TCP, and ping checks; I don't have any ping checks, which is why only three types are shown, and it shows all the regions we monitor from. There is an instance for my DNS check: I'm doing a DNS check on my website and it says reachability 99.8 percent with the latency up there, and these are my HTTP check and my TCP check. Let's look at the UI and go through creating a check. This is the UI where you see the checks you have created; it gives a compact view where you can see how many locations a check runs from, what the frequency is, and how many active series it will generate. We don't bill on your check executions; we bill on the data you generate and store, so there is no separate billing for synthetic monitoring. You don't have to worry about how often you are checking: if you generate more data you get billed more, if you generate less data you get billed less, and it's all part of your Grafana Cloud hosted metrics and logs billing, so you don't have to worry about paying for one more service and whatever billing woes come with that. There is also this new visualization view if you want to look at your checks at a glance: I can see these two checks are failing, this one is successful, this one is successful. Okay, so let's go ahead and create a check. These are the four check types I can create. I have to define a job name, and whatever I write there will be sent as the job label on your Prometheus metrics. Let's monitor Google, because why not, so I can put in google.com, and add something like service=... if I want to send query parameters to the service. Then I choose where I want to monitor from: all of these are hosted probes and this one is my private probe. I'll come back to private probes later and show you how to set one up yourself; it's running on my laptop right now, executing checks and sending data. I'll select all of them, run the check every 60 seconds, and time out after three seconds. Then there is this toggle: by default we don't publish the full set of blackbox exporter metrics, because we believe not all of those metrics are useful to everyone. They are useful, but if you run tons of checks and you are not using the extended metrics, you would just end up paying more, so if you only want the basic metrics you can leave this unchecked, and if you care about all the metrics blackbox exporter has to offer, you check it and you get all of them. It's simply there so that if you don't care about all the metrics you don't have to pay a higher bill. Then the HTTP method and request body, the normal stuff; TLS config, where you can give us a certificate; and if your service is behind authentication you can put a bearer token here, or a username and password for basic authorization. You can also do validation: HTTP versions, the status codes to check for, and regular-expression matches on your headers and body. All of this already exists in blackbox exporter; this UI just makes it easy and intuitive to configure so people don't have to write blackbox exporter configuration by hand. You can add labels to your checks and they will show up in your metrics and logs; say you have a team label, team=google, and you can use it however you want. We also have IP version selection, say you only want to monitor over IPv6, or you don't really care. And we have built-in alerts: you can use these to configure default alerts based on latency and reachability, and those alerts will show up in your Grafana Cloud alerting. So let's go ahead and save this check. It's not showing anything yet because it isn't running yet; it will take a little while to start generating data. These are all the probes that we run for you: public means the probe is hosted by us and you can run your checks there, and there is also the private probe I'm running on my laptop. It says online; I'll go ahead and kill it and it will show up as offline... there, it is offline; now let's bring it back online. We have the synthetic monitoring agent on our GitHub, so whatever code I'm running as the private probe is actually open: you can go and look at the code if you want, the repo is grafana/synthetic-monitoring-agent, so I'm not running some random code that you wouldn't want to run on your network. It should show up online... okay, it is online. The success rate here is the share of checks that succeeded on this probe. And then there are the alerts: if you know how to configure Prometheus Alertmanager and you want to write your own alerts, you can leave this alone. But say you don't want to mess around with the synthetic monitoring metrics we generate and you just want basic checks, for example alert me if my service falls below 95 percent or 90 percent for five minutes; then you can just set that and save the alert, and it will generate an Alertmanager rule in your Grafana Cloud alerting service. Same for the other check types.
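A hand-rolled equivalent of that built-in threshold alert, for people who do want to write their own rules. This assumes the blackbox-exporter-style probe_success metric carries over into the hosted metrics instance; the rules the product generates for you may look different:

```yaml
groups:
  - name: synthetic-monitoring-example
    rules:
      - alert: LowReachability
        # probe_success is 1 when a check passes and 0 when it fails, per probe;
        # averaging it over 5 minutes and across probes approximates "reachability".
        expr: avg by (job, instance) (avg_over_time(probe_success{job="google"}[5m])) < 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Reachability for {{ $labels.instance }} dropped below 95%"
```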
And if you remember, in check creation there was an alert sensitivity feature: we add a low, medium, or high label to your checks if you configure that, and then you can target those checks here. These are the three or four default alerts that we add for you, so let me go ahead and show those alerts. I'll go to the Cloud Alerting product. This is the alerting UI for Grafana Cloud Alerting; think of it as a hosted Alertmanager, because it is Alertmanager. This is the Alertmanager configuration: you have silences, you have notifications, and you have your rules. There is one rule that I created, "website is slow": if my website takes greater than five seconds for five minutes, alert and say my website is slow. That's the one I created, and these are the default alerting rules that are part of synthetic monitoring; we also create recording rules for you so you don't have to type the same query again and again. We import five default dashboards as part of synthetic monitoring. We already looked at the summary dashboard, but there is also a dedicated dashboard for each check type, so there is a dashboard for DNS, ping, and HTTP, and they are normal dashboards: you can go and see them along with all of your other dashboards, we just put them in a folder. Let's look at the DNS dashboard and see what's happening with my DNS queries. It says Bangalore 3.57 percent, uptime 100 percent, reachability 99.83 percent. Uptime means that at least one probe was able to reach your service and confirm it was up; reachability means the overall, combined ability of all these probes to reach you. So if one probe is not able to reach you, or is getting errors, your reachability starts decreasing, which means you are not really reachable from certain locations. Then there are the DNS records and so on, and here are the logs. You can see there are some errors coming from the Bangalore probe, and you can see exactly what happened: it says "error while sending a DNS query: i/o timeout", which is really helpful when you want to debug something. It says beginning check, then timeout, then check failed, duration five seconds. Check failed, duration five seconds, so if I go and look at the check, this is our check and I configured a five-second timeout; if it takes more than five seconds the check is considered failed, and I can go and increase or decrease that based on my own threshold. Now let's look at the HTTP dashboard that we have for our HTTP services. I'm monitoring a bunch of websites, so let's look at the Google check that we just added. This is our Google check: it's up, but it's not really reachable, which could be because of our threshold, but you can debug why it's failing because you have all the logs from your checks. These are normal Loki logs, so you can go to Explore and run whatever LogQL you want on them. Say I want to see only errors; I'm not a LogQL expert, so I'll write bad LogQL, but whatever. It says something like "cannot assign requested address", okay, and I can do show context: cannot assign requested address while making the HTTP request, resolve, and so on. So there is some issue there, but you have the logs to debug it.
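For the log exploration shown in the demo, a couple of illustrative LogQL queries. The label names used here (job, probe) are assumptions, so use the label browser in Explore to see what your check logs actually carry:

```logql
# only lines containing "error" from the Bangalore probe of the "google" check
{job="google", probe="Bangalore"} |= "error"

# or a looser regex filter across all probes of that check
{job="google"} |~ "timeout|refused|cannot assign"
```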
It also shows SSL details, so you can alert on SSL as well; I'm assuming almost every service offers that, but I just wanted to show that you have it here too. And in the response you get a breakdown of where the time goes: if you look at this, the majority of the time is going into resolve, and then the check actually fails, which is why the other metrics are zero. But let's look at my own website and see what's going on. You can see that my home network is not as quick as the other probes, so it's slow from my own probe's point of view, but it's fast when we look at the Atlanta and Bangalore probes. This is what I mean when I say the network is not uniform, so you have to actually monitor from all over the world and go through the real internet. You can also just look at your checks in a list like this, and these are the probes, where you can see what version each one is running and so on. Anyway, I guess that's all for the demo. I wanted to tease what's next for synthetic monitoring: we are in the process of planning and building a traceroute feature, to trace how, say, the Atlanta probe is reaching your service. Once it's live you will be able to run traceroutes from our probes, look at the data, visualize it, and see how it travels across the internet, which is particularly useful for folks who want to do network monitoring. And smokeping: the current ping feature only sends one ping packet, which is not great when you want to monitor response latency, so smokeping would send a burst of pings and let you see how things are doing compared to a single packet. No promises on when these will be out, but hopefully soon. Okay, so now let's say you think this looks neat and you want to use it to monitor your home lab or a personal project: how do you get it? Grafana Cloud has a free tier, no credit card or anything, you can just go sign up and use all the Grafana Cloud features including synthetic monitoring. You get metrics, logs, traces, and alerting, and it's more than enough to monitor side projects, home labs, and personal projects. Even if you just want to play around with observability tools and learn, you can use our free tier, learn things, and play around with Grafana and the Prometheus ecosystem. Graphite is also there, but I'm assuming nobody wants to learn Graphite these days. So that's all from my side, thank you everyone. These slides will be available, I'm assuming, on the Hasgeek platform as well, and you can also go to my website and download them. I'm there on the internet: you can reach out to me on Twitter or write an email, my email address is on my website. So I'll stop sharing and we'll go ahead with the questions.
Yeah, of course, the last slide kind of got to me. Is this feature coming to the open source version at some point? The probe is open source already, so is there a way to set up my own open source probes across multiple clouds, say I'm hosted on AWS and I can run a probe on DigitalOcean, GCP, wherever, but Grafana itself stays self-hosted rather than Grafana Cloud? I don't have this feature on my own setup, so is that anywhere in the product pipeline? So, in the last couple of months we were busy deprecating worldPing and migrating those users off, so we were only focused on the cloud as the data storage, because that's what we needed from an organization and product point of view. That's what we have as of now; it's not on the roadmap, because we have a free tier. If people say they want to play around with it, we just say the free tier is more than enough: we allow 15k active series, alert rules and logs are included, and in synthetic monitoring, if you look at the checks, we tell you how many active series each one will generate, so you can monitor hundreds of endpoints with synthetic monitoring on the free tier. There is one thing you can do, though: say you have a local Grafana and you want to install synthetic monitoring. You can configure the cloud as the storage destination, install the synthetic monitoring plugin on your local Grafana, provision those cloud data sources in your local Grafana, and synthetic monitoring will then work on your local Grafana through the cloud. Yeah, so it will collect data and send it to the cloud, but you can use it back in our local Grafana. It still seems a roundabout way, but I'm pretty sure, with Grafana's track record, in the near future we will see this integrated with the open source version; fingers crossed from my side at least. Other than that, if I want to do this at my org right now, and we do have a very distributed customer set, then I'll have to go the blackbox exporter route and set up custom probes for it, right? Yeah, as of now, if you want to store the data on your own and store the logs on your own, then I think blackbox exporter is the way. But I would say even if you have your own internal Cortex and other services, you can still use the free tier just for this, install it in your local Grafana, provision those cloud data sources, and then mix and match that data with your in-house data. I don't have any other questions; the whole thing looks simple enough. I think I've used blackbox exporter once, in one of the orgs I've worked in, but that was just for a simple full-system ping test, not a distributed test across everything; we were only monitoring our external infra from one location, not from all possible locations. But let's wait and see if anyone else from the audience wants to pitch in with a question, because this is definitely a super interesting topic for me personally. So, anyone else has questions on this?
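For the local-Grafana-plus-cloud-data-sources workaround mentioned a moment ago, the data sources can be provisioned the usual Grafana way; a hypothetical sketch, where the names, URLs and instance IDs are placeholders for your own Grafana Cloud stack:

```yaml
# provisioning/datasources/grafana-cloud.yaml
apiVersion: 1
datasources:
  - name: grafanacloud-metrics
    type: prometheus
    url: https://prometheus-us-central1.grafana.net/api/prom   # your hosted metrics endpoint
    basicAuth: true
    basicAuthUser: "123456"            # hosted metrics instance ID
    secureJsonData:
      basicAuthPassword: "<grafana.com API key>"
  - name: grafanacloud-logs
    type: loki
    url: https://logs-prod-us-central1.grafana.net              # your hosted logs endpoint
    basicAuth: true
    basicAuthUser: "78910"             # hosted logs instance ID
    secureJsonData:
      basicAuthPassword: "<grafana.com API key>"
```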
Yes, this is Ashok here. Very interesting. We have used blackbox exporter extensively, all the way from SSH checks to ping and everything else; that's our easiest way of finding out that something is up on the availability front. One question I had: when we do it the cloud way, what are your thoughts on the latencies? Because I have to go to the cloud and come back before I can measure something. No, no, the cloud is only where you store your data; the checks themselves are executed all over the world. I agree, but what I'm trying to say is that since I'm storing it there, the latest data might not actually be the latest; I might not get real-time data. Oh, okay. I think the latency should be fine. We have, I think, two regions in Grafana Cloud as of now; I'm not sure if you can choose your region in the free tier, but I think it will get better over time. I would say give it a go, install it alongside your Grafana, play around with it and see how the latency is; it shouldn't be a problem, from what I can tell. Yeah, in our space, the enterprise space, pushing anything to the cloud is forbidden, so we'll try on our own. I see. And it's not just about enterprises. I think most orgs, even if they are not enterprises, are hosting their own Grafana, and that's a pretty standard pipeline. Cloud only becomes important at a certain scale, either very early or very late, when managing a self-hosted cluster becomes too much overhead; at that point you can rely on a cloud. But for mid-tier orgs, or at super scale, or, as Ashok says, in enterprise scenarios, people already have their own Grafana setups, so there has to be a good way to integrate this pipeline with the Grafana dashboards people already have. As you said yourself, we don't want to see the data in two or three different dashboards, and the same thing would happen if we had our own self-hosted Grafana dashboards plus a Grafana Cloud dashboard just for the synthetics side of things. So some integration would definitely be super awesome and nice, but yeah, let's see how that works out. So if folks are fine with keeping the synthetics data in the cloud, you can install the app on your local Grafana and configure it: give it the data sources, give it the credentials to connect to the cloud, and then you don't have to go to two Grafana instances to see this data, it will all be in a single instance. But again, the data comes with some latency, of course; if you go that route you pay with some latency. Still, from a historical point of view it would be relevant enough: maybe not for a P0/P1 situation, but we can still analyze historical data from certain locations if too many customer complaints are coming from a certain region, and figure that out. Anyone else here doing this with any other toolsets? Or we can start the banter-session discussion, I guess. We still have time until, according to the schedule, 12:15, so two minutes until the live stream ends. You can still ask questions here, or we can take all the questions to the scheduled banter session that we always do, where we go off the record and can talk about everything under the sun. So until then, are there people here who are already using other sorts of toolsets for synthetic monitoring? Telegraf, for instance: Telegraf has an HTTP and a TCP plugin, I think, with which you can do synthetic monitoring; we have talked about it a couple of times. I think a lot of these self-hosted setups would basically be deploying your ping-test monitoring infra across multiple locations and then streaming the data to your central monitoring location, or the meta-monitoring location, something like that, right?
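The Telegraf plugins referred to here are presumably inputs.http_response and inputs.net_response; a rough sketch, with field names written from memory, so double-check them against the docs for your Telegraf version:

```toml
# HTTP "synthetic" check
[[inputs.http_response]]
  urls = ["https://example.com/health"]
  method = "GET"
  response_timeout = "5s"
  response_string_match = "ok"     # optional body assertion

# plain TCP reachability check
[[inputs.net_response]]
  protocol = "tcp"
  address = "example.com:443"
  timeout = "3s"
```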
Yeah, I'd like to add something. So I started consulting just a month back, and the company I consult for right now is using Pingdom, basically for ping tests and so on, and recently they faced an issue where the tests were just timing out, not even able to reach the target. So we're exploring what other options are available apart from Pingdom, and Datadog has been offering synthetic monitoring; they already use Datadog. What do you think about Datadog synthetic monitoring, do you have any views on it? Since they haven't been using Grafana Cloud, it would be difficult to suggest a completely different product, so Datadog seems like a reasonable option more or less. Another question of mine: if a product is region-based, you won't have that many probes to check from; mostly the data centers are either in Mumbai or somewhere else, so it's just one or two probes that you get in a specific country. If you wanted to do more of a regional check, like whether customers from Delhi can connect or customers from Chennai can connect, how would you solve that kind of problem with something like Datadog, or would we have to build something in-house? You're still limited by where the data centers are hosted, exactly, and for our tier-2 and tier-3 cities and those customers there are no data centers out of which you can run your blackbox tests, right? Not really; you can leverage a couple of the bare-metal companies that have data centers in certain other cities, but beyond that, not much. Anyone else want to pitch in on this? I would say using bare-metal offerings and running private probes on them would be one way to go, and if you really want to check from the user's point of view, I would say send a NUC to every employee of your company. Yeah, I mean, if you do have people there, you could hand them a Raspberry Pi, and that seems like a good option. I'm going to the extreme, but you could even give your customers a discount to run a probe from their side, or just build it into the app itself. This is what a lot of telecom companies do, right? If they're putting up a tower on top of your building, you get a flat discount or some payment out of it, so this is already a business practice in certain domains. I was in a project before where we used Datadog for all of this, and because we didn't have a website and such, we didn't run into that many of these problems, so this is relatively new to me. But with Datadog, if you want India-region-based splits, you're still limited: Delhi, Mumbai, maybe Chennai; there's no data center in Bangalore, and Hyderabad has a couple of places, I think, but I'm not sure. You're limited by where Datadog has a presence, because they don't have fine-grained regions; I believe they only have an India region, not state-level granularity. Right, right. Also, I was looking at Cloudflare Workers, but they don't allow Go, so I can't run it there; otherwise I would have deployed it on Cloudflare Workers and suggested that as a solution.
Yeah, that's a good option; maybe you can write something of your own in a language they do support and deploy that, or just write a shell script. Yeah, back in the old Zabbix days all probes were shell scripts; there was no fancy language, you used grep and sed and awk to filter out the data points and send them. At one point every new piece of software was a shell script, or an extension of shell scripts, and I still prefer shell scripts even now. Folks, should we ask the good people from Hasgeek to end our live stream, since we have reached the end of the talk and the official Q&A? The rest of the conversation can continue in the banter session, I would say.