So this is a talk on observability at Cisco EN. EN is Enterprise Networks, a group within Cisco responsible for routers, switches, et cetera. The standard disclaimer: these are my views, not Cisco's. One more thing — whatever I have put here reflects research and exploration we did about a year and a half ago. It might not all be current, because the stats and the state of the tools have moved on, but for the most part it should still hold. So let me proceed.

Who am I? I'm a software engineer working on tools and workflows. I've been at Cisco for almost 13 years now, working with the developer experience team. This team caters to a developer community of around 10,000 developers here at Cisco, so it's a pretty big population, as you can understand. You can stop me at any point and ask about anything.

Just curious — so you serve the internal developer teams in Cisco, is that it?

Cisco has a lot of lateral groups — I tend to end up calling Cisco a bazaar. Because of acquisitions and so many other things, there are a lot of groups, and one such group is Enterprise Networks, which has about 10,000 developers. Our developer experience team works across the whole developer workflow, all the way from the IDE to continuous integration — lots of tools, internal and external facing, testing, reporting, analytics.

Okay, got it.

So, observability is important to us. Collectively we provide some 30 to 35 tools across the developer lifecycle, and these are polyglot — Java, Python, JavaScript, and these days plenty more. Earlier, when we got into this business, the only question was whether my application was available or not. These days quality of service also matters, because everything is mostly a pipeline, and something slowing down causes a ripple effect elsewhere. So quality of service is really important. Open source has a lot of tooling here. Cisco traditionally built a lot of tools on its own, but these days we are exploring and embracing open source, because there is a lot that can be leveraged.

We set up our observability stack around 2019, and these were the high-level goals. Build on a resilient platform — we did not want to reinvent, but at the same time we wanted something rock solid — and pick up open source software on top of it. We wanted to start with telemetry, because observability as a whole was new to all of us; left to themselves, one team would start with logging and another with something else, so we decided it was better to start everyone with telemetry. And because our applications have a lot of history and not all of them are in a CI/CD model, another important requirement was some kind of monitoring where deployments and version rollouts could also be tracked. In our systems that goes by the term synthetic monitoring, though I really don't know if that coincides with how the next speaker will use the word — what I mean is that we wanted to see the traction when newer versions were being deployed. The platform we run on is Mesos-based, because it was proven at scale and there was a lot of in-house expertise.
There were systems already built and explored with Mesos. We used Marathon as the service scheduler and Metronome as the cron-like job scheduler — those were the two schedulers we had. We have a dev and a prod setup — or stage and prod, whatever you want to call it. The dev setup is relatively conservative; we have deployments across three geographies, and the dev hosts are what we here call UCS, Cisco's Unified Computing System — very capable hosts: multi-CPU, multi-core, plenty of RAM, et cetera. Prod also spans three geographies, with 25 to 30 hosts.

When it comes to telemetry, we wanted application monitoring; then we had a separate notion of business telemetry; and then normal infrastructure monitoring. Application monitoring is where we instrument things and get metrics out of them. Business telemetry is a higher-level view — how many users there are, what ROI they are getting — a collective kind of aggregation. For infra, we also ended up building a stack for monitoring the large fleet of hosts that we support: with around 10,000 developers, we monitor more than 3,000 hosts.

So this was the pipeline. Once we have the metrics, they are ingested into Kafka through a small SDK. From there they are picked up by a Telegraf agent, which also does some processing, and then passed on to Prometheus. Prometheus is what we use heavily — it is our brain, so to speak. Then we have another Telegraf at the end, which writes to InfluxDB, and that is ultimately queried in Grafana. I'm sure you will have a lot of questions; I'll try to address each of them through individual slides.

Prometheus is our centerpiece. It acts as a metrics aggregator for us. Its input comes from Telegraf, which does the work of collecting, filtering, and passing on. We use Prometheus heavily for alerts and for the custom metrics that get generated. The blackbox exporter is something we use heavily. We also use its service discovery — Prometheus can work with Mesos through Marathon service discovery, and that is being used. And we use it for federation: we have around 35 tools, and when we want a collective, global view of how things are, that is where federation comes in. A word of caution — we are not using Thanos or Cortex or anything like that, because we don't have object storage yet; at that moment they were not even ready. If you have any questions on this, stop me and I'm happy to answer.

Yeah, Ashok, I had a quick question. Why are you first sending to Telegraf and then to Prometheus? You could send directly to Prometheus, right?

I'll go through the whole stack and come back to that — it's a very interesting challenge that we had. Telegraf is our collector and filter. It collects all the metrics, does some filtering, and then passes them on. We ended up using Telegraf partly because it has tons of plugins, and partly because it supports templating, in the sense that you can play around with environment variables. That was one big thing I found very hard to do with Prometheus — I don't know if that has changed — but in Telegraf you can refer to things through environment variables, and that helps us template our configs.
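To make the first hop of that pipeline concrete: a minimal sketch of what a metrics SDK like the one described might look like — one metric serialized in InfluxDB line protocol and published to a Kafka topic. This is an illustration, not the actual internal SDK; the broker address, topic name, and metric names are hypothetical, and it assumes the kafka-python client library.

```python
# Minimal sketch of a metrics SDK that writes InfluxDB line protocol
# to Kafka (hypothetical broker, topic, and metric names).
import time
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(bootstrap_servers="kafka.example.com:9092")

def emit_metric(measurement, tags, fields):
    """Serialize one sample in InfluxDB line protocol and publish it."""
    tag_str = ",".join(f"{k}={v}" for k, v in tags.items())
    field_str = ",".join(f"{k}={v}" for k, v in fields.items())
    ts = time.time_ns()  # line protocol uses nanosecond timestamps
    line = f"{measurement},{tag_str} {field_str} {ts}"
    producer.send("tool-metrics", line.encode("utf-8"))

# e.g. one request-latency sample from a hypothetical tool
emit_metric("http_request",
            tags={"tool": "ci-dashboard", "site": "blr"},
            fields={"latency_ms": 42})
producer.flush()
```

Downstream, a Telegraf kafka_consumer input can read these lines as-is, since the payload is already in the line protocol format Telegraf and InfluxDB expect.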
And then there is one very useful plugin that we rely on — the exec plugin — where Telegraf executes some shell script at a given interval. The script just needs to produce output in a specific format and you are good. We use it for getting specific metrics from the host: NFS and I/O statistics and other things. Whatever isn't covered by an existing plugin, if you can script it, you can use this. A very useful thing that we found — there's a small sketch of this pattern a little further down. So that is on Telegraf.

Then there is Kafka in between. We wanted a reliable transport. You can always push metrics to the Prometheus Pushgateway, but we did not want to do that, because what happens if the Pushgateway fails? What do we do with the data that was to be sent then? So we thought, okay, we will write to Kafka, and Telegraf has a plugin to read from Kafka. All we need to ensure is that the data is passed in the right format. So, as I said, Kafka is used instead of the gateway. One thing that troubled us: we had to be careful with partitions and consumer groups, because if you fiddle with those, metrics start arriving out of order, and that causes all kinds of havoc.

Then the last part is InfluxDB, our data store. We don't do any downsampling there. I think we were at 1.8 — we still haven't moved to version 2 — and we had a retention of around three weeks. We chose the enterprise edition because it also allows replication across sites. One thing we had trouble with, and which eventually helped us, was tuning the time series index (TSI). That is when our actual data size shrank a bit; otherwise it was going haywire. If anybody is using InfluxDB, that's something to look into.

Grafana is our last piece, used for visualizations and alerts. I think we started around version 6.x; now it's on version 7. It is heavily customized: we have added capabilities such as PostgreSQL integration and SSO integration, and some custom visualizations have been built, because the stock visualizations were not scaling to the level we wanted. As I said, if I have to visualize a thousand hosts, that's a thousand geometric shapes on screen, each with its own state, and that was not working — hence the custom visualizations. Some newer alert capabilities have been added too. One specific thing I brought up previously as well: if you use template variables, alerts don't work on those panels. That was a problem we were facing, and we worked around it. I hope these things are getting fixed in the latest versions.

So now let me go back to that previous slide — I should have put another copy here. What happens is: the metrics come in through a custom SDK we have written and go into Kafka in a specific format that Telegraf can understand, which is nothing but the line protocol recommended by InfluxDB. Telegraf reads all these metrics — imagine metrics coming from all the tools; there is a Telegraf agent per tool. It collects, it aggregates, and does some massaging if required. The massaging is mostly because we generally don't restrict people from adding their own custom attributes, so this is where we filter those off — otherwise the cardinality explodes.
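Before moving on, the exec plugin pattern mentioned above is easy to picture. The script only has to print metrics to stdout in a format Telegraf understands — line protocol being the usual choice — and Telegraf runs it at the configured interval. A hypothetical probe for NFS mount responsiveness might look like this (the mount points and measurement name are illustrative):

```python
#!/usr/bin/env python3
# Hypothetical script for Telegraf's exec input plugin: prints one
# line of InfluxDB line protocol per metric to stdout.
import os
import time

MOUNTS = ["/nfs/builds", "/nfs/artifacts"]  # illustrative mount points

for mount in MOUNTS:
    start = time.monotonic()
    ok = 1
    try:
        os.listdir(mount)            # cheap probe: can we read the mount?
    except OSError:
        ok = 0
    elapsed_ms = (time.monotonic() - start) * 1000
    # measurement,tag field1,field2 -- Telegraf adds the timestamp if omitted
    print(f"nfs_probe,mount={mount} up={ok}i,latency_ms={elapsed_ms:.2f}")
```

On the Telegraf side this hangs off the exec input, with the command and interval set in the agent's configuration — which is exactly the "script instead of a custom exporter" trade-off described above.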
So once we get data out of that first Telegraf, we pass it on to Prometheus. This is where we do further whitelisting of the metrics we are interested in, write rules for the ones we really want KPIs out of, and also feed the federation. Then there is another Telegraf, which is mostly there for converting some tags to fields. I presume you have used InfluxDB: tags are what you group by and aggregate on, while fields are just values. Fields don't really contribute to series cardinality, but tags do. So some tag-to-field conversions — say a user ID or some other attribute that could be boundless — happen in this second Telegraf: those attributes are retained, but converted into fields. And then we write to InfluxDB and query in Grafana. Did I answer that question?

Yeah, I got it. I just had another question: did you also face scaling issues with Kafka? What is the scale you are handling when you're sending metrics?

The scale actually varies, because not all applications send at the same rate — it also depends on the time of day, the day of the week, and many other factors. I would say we were getting around 3,000 metrics per minute at one point. Again, whatever I'm quoting is about a year and a half old.

Sure, yeah.

The scale did not really hit us that much in normal application monitoring, but in host monitoring — the same metrics coming from 3,000 hosts — yes, the scale was there. Our infra monitoring stack was a bit different, even though I have not depicted it here: we had Telegraf agents installed on all those 3,000 hosts, and from there we pushed directly to Prometheus. The Kafka piece was not there in infra monitoring — it was the Telegraf agents pushing to Prometheus, then the other Telegraf converting tags to fields, then InfluxDB.

So you were using Kafka primarily because you wanted to handle higher scale, right?

And also reliability — more than scale, reliability. What if the Pushgateway goes down?

Got it, okay.

So we had faced issues there.

The question from my side, Ashok, would be: did you experiment with or load test the Pushgateway and arrive at the conclusion that it might go down? Or was it more that the team already had expertise in maintaining Kafka, so you went with something you knew was stable enough for you?

Both — it was both. We had applications running all across the globe pushing data into the Pushgateway at very frequent intervals, and we started seeing anomalies where some data points were not captured. We could not pinpoint whether it was a network issue, and at times we also saw that the Pushgateway itself had gone down and had problems. So we said, okay, let us do something else then.

Ashok, one question from me also. Why did you go with InfluxDB as the final data store and not Prometheus?

Retention.

Okay. So retention you did not want to do in Prometheus with Thanos or Cortex or anything, right? Like you said.

At that point of time, yeah, we did not have object storage in our realm — we were not even that experienced with object storage. InfluxDB was available, and whatever experiments we had done with it worked well.
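Since that tag-to-field conversion is what keeps cardinality in check, here is a rough Python picture of what that second Telegraf stage effectively does to a line protocol record. In practice this is done with Telegraf processors rather than custom code, and the user_id tag below is just an example of a boundless tag:

```python
# Rough illustration of tag-to-field conversion on a line protocol
# record. Tags index the series (a boundless tag like user_id explodes
# cardinality); fields are just stored values.
BOUNDLESS_TAGS = {"user_id", "request_id"}   # example "boundless" tags

def demote_tags(line: str) -> str:
    """Move configured tags into the field set of a line protocol line."""
    head, fields, ts = line.split(" ")
    measurement, *tags = head.split(",")
    keep, demoted = [], []
    for tag in tags:
        key, value = tag.split("=")
        if key in BOUNDLESS_TAGS:
            demoted.append(f'{key}="{value}"')   # becomes a string field
        else:
            keep.append(tag)
    head = ",".join([measurement] + keep)
    return f"{head} {','.join([fields] + demoted)} {ts}"

before = "http_request,tool=ci-dashboard,user_id=u123 latency_ms=42 1700000000000000000"
print(demote_tags(before))
# -> http_request,tool=ci-dashboard latency_ms=42,user_id="u123" 1700000000000000000
```

The value is still queryable, but it no longer multiplies the number of series InfluxDB has to index.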
And how does InfluxDB do the retention? You are running it in-house, right — or do you use a cloud version?

No, we use the enterprise version. It's on-premise.

Okay, and how is the retention handled there — is it on disk?

On disk.

Okay. But if it is on disk, then Prometheus would have also done it on disk, right?

InfluxDB we run as an application per se. Prometheus we run inside containers with ephemeral storage, and the retention policy on Prometheus itself is around two hours. So we are okay if Prometheus goes down and comes back.

Okay, so you are relying mostly on InfluxDB for durability, and on the enterprise replication?

Yeah.

Sorry, Ashok, to pile on, since you are inviting the questions — why have Prometheus at all in the pipeline? Why not just go with pure InfluxDB?

PromQL. PromQL and the rules.

Yeah, PromQL is powerful, that's true.

And the federation, which we also use significantly. And one more thing — I should have organized this slide better — if you see here, the data actually reduces as we move through the pipeline. What lands in InfluxDB is actually less, because a lot of filtering and whitelisting happens along the way: some in the first Telegraf, some in Prometheus, and some transformation in the last Telegraf. That is how the pipeline stays optimal for our use.

Ashok, I think just one last question. What sort of post-processing are you doing in Telegraf, if you can share a few things?

In the first Telegraf, there is a plugin which reads from a Kafka topic and gives you the metrics out of it — that was its initial, primary use.

Okay, so that's just transferring the data, right? But you said you also do some post-processing, et cetera — what's that?

That was one. The second thing is that we rename some metrics there; that is also possible in Telegraf. And the third, as I said, is the exec plugin, with which we capture things you cannot otherwise get — instead of writing a custom exporter, you just have a script that runs, gathers the metrics, and pushes them out.

Oh, interesting.

Yeah, and then there is a file plugin and things like that. The second Telegraf we primarily use for tag-to-field conversion, to keep cardinality in check.

Got it, thanks.

Yeah, so I think I was at this: the alternatives we tried initially. We tried Metricbeat — it was very limited, and for us it did not scale; probably we also gave up a bit too early. We also tried VictoriaMetrics. It was very good — in the sense that it was one solution that actually worked — but it was in alpha at the time, and being an enterprise shop, we could not opt for it. We did try Thanos, but those were the very early days of Thanos, and things were not refined: some things worked, some did not. And with our own restrictions on hosts and other things, it did not go forward.
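On the "why Prometheus at all" point above: anything Prometheus holds can be queried with PromQL over its HTTP API, which is a big part of the draw. A small illustration — the server URL and metric name here are hypothetical, not from this setup:

```python
# Minimal sketch of querying Prometheus's HTTP API with PromQL.
# The server URL and metric name are hypothetical.
import requests

PROM = "http://prometheus.example.com:9090"

# p95 request latency per tool over the last 5 minutes -- the kind of
# one-liner that is awkward to express in InfluxQL 1.x.
query = ('histogram_quantile(0.95, '
         'sum(rate(http_request_duration_seconds_bucket[5m])) by (le, tool))')

resp = requests.get(f"{PROM}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"].get("tool"), series["value"][1])
```

Recording rules and alerts are built from exactly this kind of expression, which is why Prometheus stays in the middle of the pipeline even though InfluxDB is the long-term store.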
Before settling on InfluxDB as the backend, we had also thought, okay, maybe we should have Elasticsearch as the backend. So that is one more thing we tried. We were able to capture data with some work, but the problem came when we started querying. Forming an optimal schema was the issue: as I said, we restrict our developers on the format, but we don't restrict too much — if they want to send something extra, they always can — so our schemas were never very strict or very optimal. We also noticed that Grafana would lag heavily, because the queries that came out performed very poorly at that point of time. That is when we switched to InfluxDB. Any questions there? I can take them.

I mean, Elasticsearch is not really a TSDB by any measure, so... all I would say is, it's a bit of an interesting use case. Elasticsearch is based on Lucene, which is a document search engine — kind of difficult to pair with the structure of a TSDB.

Yeah. So what happened there was, as I said, there was a third axis, if you noticed, in that earlier picture: we were also capturing business metrics, and for those, Elasticsearch was the right fit. They are still in Elasticsearch. So we thought, okay, let's have one solution and try to put everything in it. As of today, the business metrics are still in Elasticsearch and the front end is still Grafana — that portion has not changed — it's just that the schema there is very restricted, and it works.

When you say business metrics, what do you mean? Is it like the number of visitors to a site, or something deeper, like revenue numbers? Can you give me a sense of what business metrics are for you?

Business metrics could be, if I take, say, a continuous integration pipeline: how many complete successful runs there were, how many commits, LOC — it could go all the way through. As I said, we support a lot of tools, and for each of them it's different. It's also from the perspective of a customer — for us, the customer is the internal development teams — whatever the metrics are at their org level, at their aggregation.

Got it. So primarily things like the number of pipeline runs would be business metrics.

Yeah.

Got it. And Ashok, those business metrics could easily have been sent through the existing metrics pipeline you showed us, right? Why choose Elasticsearch for that?

Because that part of the solution — the business telemetry — was already there.

Right, it was already there.

It was already there, yeah. Then comes the logging part. We wanted to correlate metrics with events; it was our first step into debugging. We also wanted to see if some basic log analytics could be done — how many errors per day, or if response time crosses a threshold, is it because of some warnings, or because of retries, et cetera. Our pipeline is fairly straightforward: we have an SDK for people to push logs, then we use Fluentd or Fluent Bit, then it is passed on to Logstash, then it goes to Elasticsearch, and then we map it into the Grafana log panels. Any questions on this? I think this is pretty standard for most of us. Fluent Bit is lighter, but it is not available on some of the older OS versions that we had; that is where we used Fluentd.
Filtering is important. For example, we tried hooking this up to Jenkins and there was a huge overflow of logs within a day — so filtering is very important. Data enrichment is easy: we can add multiple things to the logs we are sending, and Fluentd was very good in that sense. I am not an expert in this field, but when it came to aggregation — Fluentd can also do aggregation — there were some specific conditional queries, which I don't remember exactly, where Logstash fared much better than the Fluentd aggregator at that point of time. Our initial plan was to have Fluentd through and through — SDK, Fluentd, Fluentd aggregator, Elasticsearch, and then Grafana — but because of those cases, Logstash came in as the aggregator.

Then, yeah, we wrote two SDKs. One was for telemetry, where instead of the Pushgateway we wrote to Kafka. The other, for logging, was a set of custom log handlers. It's fairly straightforward — you just have another handler and make it write to Kafka, and if that does not work, we can always fall back to Fluentd for the logs.
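That custom-log-handler pattern is a standard one in Python's logging framework. A minimal sketch of a Kafka-backed handler in that spirit — the broker address, topic, and field names are illustrative, not the actual SDK, and it again assumes the kafka-python client:

```python
# Minimal sketch of a custom logging handler that ships JSON log
# records to Kafka (hypothetical broker and topic).
import json
import logging
from kafka import KafkaProducer  # pip install kafka-python

class KafkaLogHandler(logging.Handler):
    def __init__(self, brokers, topic):
        super().__init__()
        self.topic = topic
        self.producer = KafkaProducer(bootstrap_servers=brokers)

    def emit(self, record):
        # Serialize the record into a flat JSON document for the pipeline.
        doc = {
            "ts": record.created,
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        self.producer.send(self.topic, json.dumps(doc).encode("utf-8"))

log = logging.getLogger("ci-dashboard")
log.addHandler(KafkaLogHandler("kafka.example.com:9092", "tool-logs"))
log.error("pipeline stage timed out")  # flows on to Elasticsearch downstream
```

The same shape works for a Fluentd-backed handler — only the transport in emit() changes.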
That was on that. I think that is mostly all I have. Tracing — we are experimenting with OpenTracing and OpenTelemetry. As I was telling Joy some time back, it's moving too rapidly for us to put a milestone on it and adopt it; that is what is delaying us. We are also looking at the ROI of not instrumenting by hand, because OTel also supports auto-instrumentation and some debugging out of that. I would be delighted to hear from you if you have tried any of those. And there is the later realization that with Fluentd we can just hook into our container engine and pick up those logs — that's about the best log source you can get. Also, of late, I'm seeing that if we have a service mesh kind of scenario — I am exploring Linkerd — you readily get all the standard USE- and RED-style metrics in a known format. Those are some things we are looking at going forward.

Some things that have troubled us in the past, and that we should always keep in view: how do we monitor the monitor? We really don't have a standard answer for that. Then alert fatigue — you can build the best solution, but once people lose interest, getting their interest back is very difficult, because people have their own ways of finding out whether their application is down or underperforming, built up through the years. I'm not sure if others have faced this, but in an enterprise company that's decades old, anybody who owns an application will have their own way of detecting issues — it could be a simple cron job or anything. It is one thing to promise something good and fancy, but if it becomes a burden, people just walk away. Then capacity planning — of course there is no formula for it; it's something we've been learning, but it sometimes keeps hitting us. And upgrade loops: in our pipeline we have on the order of hundreds of Prometheus instances. Say they are on version 2.x, an issue is discovered, and we find it is fixed in some future version — we then have to do a rolling upgrade of all of them. And during that time, if something goes wrong, the whole metrics pipeline is stuck, or in the best case we lose some metrics, which start showing up as blanks in Grafana. That again raises the question: was the monitoring system down, or was the application down? With people all over the globe, 24x7, there is a lot of back and forth. Which goes back to the first point I was trying to make — how do we monitor the monitor? That's the biggest challenge we have. I guess that's it; that's all I had.

Thanks, Ashok. That was a great, insightful talk — we don't usually get to see the inner workings of massive enterprise orgs. Glad you could be here. A couple of questions from the attendees. Sudhir asks — Sudhir, I'll unmute you, why don't you ask the question yourself?

Hello, I'm Sudhir. This is my second time at this meetup. I'm from Freshworks. I just want to understand the data volumes: what volume do you handle for logs, and do you have any numbers you can share on the metrics — the cardinality, or the number of samples?

Okay, here is what I have. My colleagues Srinivas and Raghavendra, both from our team, are actually here. Give us a couple of minutes and they'll look up the latest — it's hard because things keep rolling. Raghavendra, Srinivas — hey guys, good to see you. Can you share some numbers on the volume of data and metrics that we have?

I think a ballpark figure would help — when we look at a stack that works in a certain context, the scale and ingestion volume are tied into that context, so a ballpark number helps us understand why this particular setup works.

The last I remember was around 180 GB in InfluxDB for the metrics alone. For the logs in Elasticsearch I'm not really sure, because it is a rolling index that changes every day; I would say we were doing about 40 GB every day, across three or four indices of that sort.

Hi Ashok, this is Raghavendra — hello, everyone. For the InfluxDB data that we monitor, there are two kinds of telemetry. One is host monitoring — monitoring all the machines — and the other is monitoring of applications. For the application telemetry, the average rate we see in our InfluxDB setup is 18K — around 18,000 data points per second from all the different applications we have. For the other one, the host monitoring of the cloud machines, there are around 6,000 hosts that we monitor, and the average there is around 670K points per second into InfluxDB. Because that volume is so huge, the host-monitoring data has a retention policy of only three days, which takes up around 200 GB, and after that we downsample and transform it — roll the data up hour-wise and then day-wise — and put it into another measurement.

Cool. Sudhir, I think that answers your question?

Yep, thanks.

Yeah, cool.
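For readers unfamiliar with that hour-wise/day-wise roll-up: one standard way to implement it on InfluxDB 1.x is a continuous query, issued here through the influxdb-python client. This is a sketch of the pattern, not the actual setup — the database, retention policy, and measurement names are all hypothetical:

```python
# Sketch of an hour-wise roll-up as an InfluxDB 1.x continuous query.
# Database, retention policies, and measurement names are hypothetical.
from influxdb import InfluxDBClient  # pip install influxdb

client = InfluxDBClient(host="influx.example.com", port=8086,
                        database="host_telemetry")

cq = """
CREATE CONTINUOUS QUERY cq_cpu_hourly ON host_telemetry
BEGIN
  SELECT mean(usage) AS usage
  INTO host_telemetry."one_year"."cpu_hourly"
  FROM host_telemetry."three_days"."cpu"
  GROUP BY time(1h), *
END
"""
client.query(cq)
```

A second, analogous query rolls the hourly measurement up day-wise; the short-retention raw data then ages out on its own after three days.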
So Ashok, one request from my side: when we publish your slides for the wider audience after the meetup, it would be really nice to have another slide at the end with some of these ballpark numbers. That would help readers contextualize your whole pipeline whenever they browse it. That's a small request from my side.

I'll do that and share.