…some more monitoring, back to the monitoring part of our track, and we're going to be doing a Q&A session with our monitoring speakers later in the session. So this is Manan Balala; he is coming from Delhi again. He has worked with both not-for-profit and for-profit companies, so he's seen it from both sides. Today he'll be sharing some experience from his for-profit work on tooling and monitoring for performance-critical applications. So please welcome Manan Balala to the stage.

Hi everyone. Just wave if you can hear me at the back. Awesome. Alright, so thank you for staying back. I hope you got enough coffee and energizers to keep you going until lunch. This talk is about monitoring. My name is Manan, and that thing below is my Twitter handle. I'd love some feedback on this talk; I'm trying to constantly improve it, so if you have any feedback, write it there and that'll be great.

So yeah, this talk is about monitoring, but not monitoring in a generic context. It's in a very specific context: the context of a large online retail shop, an e-commerce website, and we'll see how that changes the equation. We'll see what sort of monitoring tooling an e-commerce website needs. Have you begun recording? I hope so. We'll also try to conceptualize the entire monitoring setup into phases, and each phase sort of warrants its own tool. There are certain tools out there which cater to more than one phase, but hopefully after this talk, when you see a tool, you'll easily be able to map which phase of the monitoring setup it goes in.

Right, so first a brief about Otto. Otto is a large online e-commerce retail company based in Germany. It's the second-largest e-commerce company in Germany; Amazon surpassed Otto in 2014, before which Otto used to be the biggest. A little bit of history: it began post World War II, when this guy decided he wanted to sell shoes. He took pictures of shoes, pasted them into a catalog, photocopied 300 copies, hand-delivered the catalog in the local areas, and when the orders came in, he delivered them. Over time he built enough loyalty that today Otto is the largest mail-order company in the world. It's the second-biggest e-commerce company in Germany and operates in more than 20 countries. The net worth of the Otto family is about 18.4 billion, of which about 4 billion comes from retail; a big chunk comes from their real-estate business as well.

That is the Otto premises, where I was working with them. That campus is about 205,000 square meters. For those of you who can't picture how big that is, it's roughly 35 football fields, real football fields, put together. It's pretty big, but when we talk about scale at Otto, that's definitely not what we're talking about. We generally talk about their online shop, otto.de, which records about a million unique visitors every day and about two orders every second. Given those statistics, which are public, we can extrapolate. That, by the way, is the home page of otto.de. If you say it's fairly simplistic compared with the Flipkarts and Snapdeals, that's because it appeals to a very different persona than what Flipkart or Snapdeal appeals to.
They like to keep things simple, and that's basically their selling point. But if it takes these screens, you pick a product, you add it to the cart, and later on you proceed to checkout, it takes about eight screens to get your order placed. And if one in 30 customers converts, we can say that's about 480 page impressions every second: two orders a second, times 30 visitors per order, times eight screens each. Otto also has its own big-billion-day equivalents; they call them boom days, translated to English, in German it's something else. On those days we saw roughly 1,000 page impressions every second.

With those statistics I'm not really trying to sell this as huge traffic; of course there are companies facing much larger visitor inflow than Otto. The thing that makes e-commerce unique is that we're talking money. There is no other context in which it is easier to relate downtime to direct losses. We had numbers being thrown at us like: for a one-minute downtime, you just lost us 100,000 euros. So it's very easy to connect downtime to a loss value. Essentially, we don't want to do anything that affects the money we're making; we want to maximize it. We want to offer a solution which keeps the inflow coming.

We also want to continue building features. There are great features coming in, everyone trying to get one up on the other, but we don't want to build features that don't tie back to the original point: we want to keep building features which add more value, which in turn bring in more money. We need to make better decisions, and we figured that the features we build can give us enough insight to decide, for example, how to prioritize a bug. Say something broke in production, and someone comes to you and, off the top of their head, says: hey, this is a massive failure, we need to fix this now. Unless that's backed by real data, unless you know how many of your users are affected, it's really hard to judge what priority to associate with it. So it's very important that our monitoring setup equips us with an understanding of how big an impact something has.

Lastly, we really want to verify our business assumptions. For this I'll take an example. Say we were a little late to the game and decided we now need to personalize our e-commerce shop. We claim that if we personalize one user out of three, we expect a revenue increment of X. Using this claim we sell the feature to the business and begin development; we build the feature and roll it out. It's great, everybody's happy, except we never go back and verify whether the claim we made holds: are we actually performing at that level? Are we able to personalize one in three users? Without that, it just doesn't make sense. We're developers; it's not our job to make up numbers. We need to base our numbers on hard facts, and that's what our monitoring tooling should give us.

So what do I need from my monitoring setup? I need what every other monitoring setup provides. There's no discounting the fact that you need basic database monitoring: we need to know if our queries are not performing well.
Maybe we need to add an index, maybe some queries are slow, maybe we need to shard; our monitoring system needs to tell us this much. We need to monitor standard server metrics; every monitoring tool does that, and we need it too: what's our current throughput, how many requests are we serving, what's the open connection pool count, and so on. And we want to be alerted on exceptions. This is a no-brainer; we need all of this.

What we also realized is that it's really important to measure the state of the system. That's a big statement, but what it essentially means is that we want to grasp how the system is performing, not just from an infrastructure point of view but from a business point of view. For example, your website may be performing very optimally at one a.m. at night, but there are no users there. So it becomes important to track these other metrics too. For this we turned around and simply asked the product and business teams how they would measure the state of the system. This is what all that agile jargon has done: it's broken through those walls, and now you can communicate. And they were equally surprised; they were like, okay, thank you for coming to us. We'd like to define our e-commerce system using orders per second and the users who are on the website. We said okay, these are important metrics, we need to keep track of them. So we realized there are essentially two sides to monitoring: monitoring of infrastructure, which ensures you're offering stable, optimal performance, and the business side of things, which measures key performance indicators.

Our monitoring tool should also let us narrow down, when something's broken, on the source of the bottleneck, the source of the bug. The thing that helps us here is continuous delivery. By continuously deploying code to production, the marginal difference between the last deployment and this one has become smaller and smaller, so it's become much easier to track down exactly what broke something, and it should ideally become that much easier to fix it. Our monitoring tool should leverage that. And lastly, tying into the earlier point, we want to validate whether we're achieving the business assumptions we set out to achieve.

That brings me to logging. Logging is great, right? We developers love logs: they point out the exact source of the problem, the exact line number, down to the character, and they tell you exactly why the error happened, what sort of error it was, after the error happened. And that's the inherent problem with logging: it's inherently reactive. When I get a log-based alert, I feel like I'm on a planet that's about to explode, and somehow I need to figure out something that fixes the situation. I'm not saying logs are not good; logs are great. If it weren't for logs, I'd probably be having a beer on that planet, not doing anything about it. At least now I know something's wrong. But what about the time leading up to the point when it actually broke? Could I have known this was going to happen?
That's why I feel that if you're building a monitoring system today, you don't want to base it on logging; you want to base it on something more explicit. One more problem with logging is the sheer signal-to-noise ratio. With distributed systems and microservices, the number of systems has grown, and their logging has grown with them. Of course there are great tools, Splunk, Kibana, the ELK stack, all good, and it's very easy to search your logs. But for a newcomer to the system, it's very hard to gauge at a glance what's happening. Even with all these tools, the signal-to-noise ratio of logs is just terrible.

So I like to define it this way: if logging is about scattered incidents, accidents that happen in the system, then explicitly collected metrics about the system help you understand its state better and allow you to gain valuable insights. The biggest advantage of explicitly collected metrics is that they give you insight at any time, not just after something's broken; leading up to that point, they're excellent sources. So we went forward with basing our entire monitoring setup on metrics.

And this is where we actually get to our monitoring system, which can be divided into four phases. The first phase is the collect phase, where you gather the data. I'd just like to begin with a disclaimer: we want to gather as much data as we can, but only if it's going to give us useful insight. It's very easy to collect all the data out there, except you'll be running out of disk, and you'll just be increasing your clutter. So try to be reasonable about it.

When I spoke about metrics, I was talking about the Metrics library. It's a library written by Coda Hale, and there's a talk about it called "Metrics, Metrics, Everywhere". I strongly urge each one of you to watch that talk if you haven't already; it's an excellent introduction to the library and why it makes sense. Metrics is a library you plug into your system; it runs with your code, you instrument it in code, and what you achieve is collecting data while your code is running in production. That's its biggest selling point. Ours was a Clojure project, but ports exist: the core library is Java, there's a Scala port, and I believe similar solutions are available in most languages. We found it was as easy as including the dependency, and then there were three parts to it. First, you create a registry; a registry is nothing but a container for your system's metrics. If you have a microservices-based architecture, each microservice can have its own metrics in its own container. Next, you choose what sort of tool you're going to use; the Metrics toolkit gives you five tools, and we'll discuss each of them in brief. In this case we're using a counter, and a counter is good when you control when something is increased, decreased, opened, or closed, so it serves the purpose here. Finally, it's all about instrumenting your code: just increment the counter where you see something opening, or whatever it may be. You know your code better than any third-party tool, and that's why this makes sense.
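To make those three parts concrete, here's a minimal sketch using the metrics-clojure wrapper around the Metrics library; the metric name and the surrounding functions are made up for illustration, not taken from the project described in the talk.

```clojure
;; A minimal sketch of the three parts, using the metrics-clojure
;; wrapper. The metric and function names are illustrative.
(require '[metrics.core :refer [new-registry]]
         '[metrics.counters :refer [counter inc! dec!]])

;; 1. A registry: one container for one service's metrics.
(def registry (new-registry))

;; 2. Pick a tool from the toolkit; a counter fits things we
;;    open/close or increase/decrease ourselves.
(def open-connections (counter registry "connections.open"))

;; 3. Instrument the code where things actually happen.
(defn acquire-connection []
  (inc! open-connections))  ; a connection was opened

(defn release-connection []
  (dec! open-connections))  ; a connection was closed
```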
This is much more explicit than a third-party tool saying: hey, let me do all your monitoring for you.

So, the first tool is counters. As I said, they're great when you control when something's opened or closed, particularly good for connection pools: how many connections are open right now? One opens, increment; one closes, decrement. Gauges are great if you're relying on a third party or an external source; you gauge its value at every moment and see how that value changes. For example, if you're relying on Redis and want to keep track of how its memory usage changes over time, you can use a gauge. Meters are great for tracking the rate of something; for example, if you want to track how many requests you're receiving per second, meters are the tool.

While those are good infrastructure use cases, we found even better use cases tracking actual business metrics. For example, a counter can track the number of users you're personalizing, and it's as easy as: if I'm giving any personal recommendations at all, increment the counter. Later you can visualize how many users you're personalizing, and whether it's more or fewer than a while ago. Similarly for gauges: if you have a cache of products that you use for lookups and base certain recommendations on, and you see those recommendations under- or over-performing, you can directly correlate: the gauge tells me the cached product count was too small or too large. That gives you an insight into why a thing is happening. Similarly, meters can track how many orders are being placed every second.

The next one is histograms. Histograms are particularly good if you want to measure the size or frequency of something and how that changes across time. For example, if you're serving a response, how does the response size change? Say something goes down, maybe your database query gets messed up; your response size drops drastically, and you can easily tell. But we'll come to a better use case: what quantile of my customers are getting five personal recommendations or fewer? Histograms allow you to measure quantiles: what is the mean number of recommendations you're serving, what is the median, how many recommendations is the 99th percentile of your users getting? So they give you a lot of insight.

Timers are great too. A timer has a meter built right in, and measures how long something took at a given rate. For example, you can measure page rendering time and correlate it with how many requests you were getting. A great insight is: it took me 80 milliseconds to serve recommendations to 99 percent of my users at 200 requests per second, but the moment requests hit 800 per second, I suddenly started taking 200 milliseconds. There's a problem the timer is pointing out to you, and you can take action on it. Again, implementing these in code is fairly straightforward; depending on your language it's usually one line of code each.
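A sketch of all five tools pointed at business metrics, again with metrics-clojure; the metric names and the stubbed recommendation function are illustrative assumptions, not the talk's actual code.

```clojure
;; Five tools, business-metric flavoured. Names are illustrative.
(require '[metrics.core :refer [new-registry]]
         '[metrics.counters :refer [counter inc!]]
         '[metrics.gauges :refer [gauge-fn]]
         '[metrics.meters :refer [meter mark!]]
         '[metrics.histograms :refer [histogram update!]]
         '[metrics.timers :refer [timer time!]])

(def registry (new-registry))

;; Counter: users who received personal recommendations at all.
(def personalized-users (counter registry "users.personalized"))

;; Gauge: sample an external value, e.g. the product-cache size.
(def product-cache (atom {}))
(gauge-fn registry "cache.product-count" #(count @product-cache))

;; Meter: the rate at which orders are placed.
(def orders-placed (meter registry "orders.placed"))

;; Histogram: distribution (quantiles) of recommendations per user.
(def recos-per-user (histogram registry "recommendations.per-user"))

;; Timer: recommendation latency, with a meter built right in.
(def reco-timer (timer registry "recommendations.latency"))

(defn compute-recommendations [user]  ; stub standing in for real logic
  [])

(defn recommend [user]
  (time! reco-timer
    (let [recos (compute-recommendations user)]
      (when (seq recos)
        (inc! personalized-users))
      (update! recos-per-user (count recos))
      recos)))

(defn place-order! [order]
  (mark! orders-placed))
```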
So now that we've collected all this data, we want to make sense of it. First we want to store it somewhere, and the choice of storage solutions has been growing. Luckily, all of these metrics are time-series data, so what makes sense for storing them is a time-series database. A number of solutions were discussed this morning; one of them is Graphite, another is InfluxDB. Mostly they all share an architecture similar to this one. There's some sort of backend storage, which may be a database like SQLite or a dedicated file format altogether. There's an API which gives you a queryable interface; you may get SQL, or you may get the somewhat ugly target strings that Graphite provides. And there's a layer in between, the actual metric store, which takes care of compression and also helps with queries. Metrics arrive at a tremendous rate, there are just a lot of them, so it's not optimal to write them all straight to disk. That's why these solutions usually have a cache built in: data goes there first and is regularly flushed to disk, and on reads you fetch from both the cache and the disk, so you maintain a certain recency.

Some characteristics a storage tool should have: it should have a good query DSL. Here I think InfluxDB does a good job; they picked SQL, which everybody understands. Graphite, not so much: they picked plain strings, and you apply mathematical functions inside those strings, which is pretty prone to mistakes in a query editor. So take your pick. It should give you the ability to apply functions, mathematical functions, because the same metric can give you different information depending on the function you apply. For example, a counter may give you the current count of open HTTP requests, but if you take the slope of that counter (talking statistics here), the rate at which it's growing, you can easily see that requests per second just increased, meaning I'm facing more traffic now than before. The solution needs to be scalable; again, the metrics arrive at a tremendous rate. Graphite claims to work in a distributed setup; I didn't see that work so well, but maybe there are solutions that do better. And the data, the query results, should be up to date; this is very important, otherwise your graphs will just be completely silly.

Also, this is a great read: InfluxDB's journey. They started with LevelDB to store their metrics and then moved to a different data structure altogether; today they use the Time Structured Merge Tree. It's a great blog post; there's a link down there. Hopefully you'll get it when we share the slides, or you can just check the InfluxDB website.

Now that we've stored all this data, the next phase is to visualize it, so we can finally draw insights from the data we've gathered. For this it's very important that I'm able to drill down into why certain things happened, so any visualization tool you use should give you the capability of drilling down. It should be interactive: I should be able to read the value of something at a given instant, what happened at that point. And it should have a good query language and editor that isn't as error-prone; it helps to have an easier syntax.
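To make the apply-functions and query-DSL points above concrete: the same counter read through different functions yields different insights. The metric paths below are made up; `perSecond` and `summarize` are standard Graphite functions, and the InfluxQL query shows the SQL-style equivalent of the slope.

```text
# Graphite's string DSL: same counter, different insight per function.
shop.http.open-requests.count                       # raw open-request count
perSecond(shop.http.open-requests.count)            # its slope: requests/sec
summarize(shop.orders.placed.count, "1min", "max")  # per-minute worst case

-- The same slope in InfluxQL, which reads like plain SQL:
SELECT non_negative_derivative(mean("count"), 1s)
FROM "open_requests"
WHERE time > now() - 1h
GROUP BY time(1m)
```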
And you should be able to correlate changes in metrics to certain events that happen. I think this was also discussed this morning when the annotations feature in Grafana came up; similar features may be offered by other tools as well.

So, given that we need all of this, I present to you the Oscillator. This is a tool we wrote. It's a D3-based tool, written in Clojure. You may like the language or you may not, but you sure can like the tool. It's very easy to implement a chart, it's very declarative, and the charts are highly interactive thanks to D3, which of course is a great library. The advantage you get is that you can define your charts in plain data. For example, you can have a simple map defining what your chart is called, what your page is called, and what charts it has. You define tiles, and you may have multiple charts on one page; we'll see a demo soon. For each of these tiles, each of these charts, you choose what kind of chart to show. In this case we're showing just my user requests, and I'm pointing it to a Graphite target, because that's what we were using when we wrote this. What we also open-sourced is a Graphite DSL: a bunch of Clojure functions that let you work with Graphite target strings without actually writing those strings, which minimizes the chance of making an error.

So I have a small demo. There's a random generator generating one request or another, and I'm plotting the count of these requests. The chart is fairly interactive: at any given point in time you can compare the value of one type of request with the value of another. You can drill down into timelines: you can see what was happening in the past hour, or, say, the past 24 hours. It was an application running on my local system, which is why you see the breaks; hopefully your production system does not suffer the same. You can also summarize: how has the trend behaved if you summarize over a 10-minute scale? It seems to be going down a bit. On the same Oscillator you can have all your environments: your development environment, your pre-production, whatever your environments are, as many as you want, and you can hide one and see only another, however you like. And you don't just have line charts: you can have bar charts, pie charts if you're into that, maybe stacked charts if those make more sense.

Another great feature is, again, annotations: events that significantly change something in your application and may lead to variations in these metrics, so it makes a lot of sense to have them on the same graph as the metrics. Those dots down there point to what happened at the instant the graph changed, so it becomes easier to correlate: there was a code change, a deployment, a provisioning. You can easily tell why the change happened based on that event.
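As a rough illustration of the "charts as plain data" idea, a page definition might look like the map below. The keys and shape are hypothetical, not the Oscillator's actual schema, and the Graphite targets are made up.

```clojure
;; A hypothetical page definition in the spirit described above:
;; plain data, one page with tiles, each tile pointing at a
;; Graphite target. Keys are illustrative, not the real schema.
(def request-dashboard
  {:page  "User Requests"
   :tiles [{:title  "Requests per second"
            :chart  :line
            :target "perSecond(shop.http.requests.count)"}
           {:title  "Orders per second"
            :chart  :bar
            :target "perSecond(shop.orders.placed.count)"}]})
```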
So I had a few learnings from this flow. The first was to choose the right metric. The first thing anyone says about monitoring is: hey, let's just plot the CPU load. I was often left out of that conversation because I honestly did not know what goes into calculating CPU load. So I googled, and I found that the formula to compute CPU load is actually very complex, and you don't want your metrics to be that complex. I googled further to see what spikes the CPU load, and there's this post that says if you're parsing text, oftentimes your CPU load will spike. Which is odd: if my application requires parsing text, I will be parsing text. What is a CPU load plot giving me? So: choose simpler metrics, metrics that point to exactly what's happening, like disk utilization or user logins.

The second learning was that plotting the mean value of something can be absurd at times. For example, we had a contract with a third party that we'd give them recommendations to show, and we'd serve them in under 50 milliseconds. When we looked at our graph we were serving in under 20 milliseconds, and we were happy: well under half the threshold, so that should be okay. Until they came back and said they were running into circuit breakers and we were often going above the limit. That's when we realized that what we were seeing on the chart was the mean, an average value. When we plotted the max instead, the maximum value in every summarization interval, for example the max value in every minute, we saw a graph like this: we were easily peaking over 100 milliseconds every minute, and that's why their side kept running into circuit breakers. So choose how granular you want to be; don't pick the mean just because it sounds cool. Choose wisely, and choose the right tool for the job; histograms are particularly good if you want to measure the size or frequency of something and how it changes.

And the last phase of the monitoring setup is alerting, because we all want to go home and sleep at some point. We don't want to just keep glancing at the charts. When it came to alerting, again, being the team that we were, we didn't like any of the solutions out there and wrote one ourselves. This tool is called x-ray. It's again a tool written in Clojure, in which you define the failure condition for an alert as a simple function: if it returns true, all is cool; if it returns false, alert me. Then you define what sort of alerting strategy you have: do you want one alert every five minutes, do you want to be alerted only if it's continuously breaking, or on each instance of the condition breaking? You can also define what sort of alert you want to receive. You can have a Slack integration, because it's all code, right? That's the advantage you get from an explicitly written solution: you could have Slack, emails, or text messages, though I don't think you want to go that far. In this example we're simply logging, and you can change that; it's pretty flexible.
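A hypothetical check in the spirit of that description: a plain predicate, a strategy, and a notification function. The keys and function names are illustrative assumptions, not x-ray's actual API.

```clojure
;; A hypothetical alert definition; not x-ray's real API.
(defn current-orders-per-second []
  2.0)  ; stub: in reality this would query the metrics store

(def orders-check
  {:name      "orders-per-second"
   ;; The failure condition as a plain function:
   ;; true means all is well, false should alert.
   :check-fn  (fn [] (> (current-orders-per-second) 1.0))
   ;; The strategy: alert only after five minutes of continuous
   ;; breakage, then remind every thirty minutes.
   :strategy  {:alert-after "5m" :repeat-every "30m"}
   ;; Because it's all code, notification is just another function:
   ;; swap in Slack, email, or plain logging here.
   :notify-fn (fn [check] (println "ALERT:" (:name check)))})
```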
In the end, x-ray also gives you a nice little dashboard with three levels of views. This is the middle level; I've had to blur out the names because it was a live system and I can't share that. Essentially you see all your environments and how the checks are performing on each of them, and you can drill down into the ones that are red and check what exactly happened: what was the expected value, what value did we get? There's also a higher-level view which aggregates all of this and tells you whether the whole system is behaving okay or not.

We treated alerts as extremely high importance. Early morning in our standups, the first thing we did was decide what was going to happen in the rest of the day based on these alerts: what are the alerts we're getting telling us, what happened over the night? We added them to the definition of done of our stories: any new feature would not be complete until you added the relevant alerts or changed existing alerts as required. We wanted to make them as visible as possible; in our case we were able to provision a nice little monitor, plus Slack notifications. But at the same time you don't want to spam anybody, so try to choose the right alerting strategy.

So the entire solution looks like this: you capture your metrics, you aggregate them and store them in a nice time-series database, you visualize them using some sort of graphing tool, and finally you alert, choosing an alerting library for that. There are plenty of solutions that offer all four phases, the TICK stack and whatnot; you can take your pick for each of those phases.

Just a last note on continuous delivery. As I mentioned, since the marginal change between deployments has become so small, it's become much easier to find out what broke something. For example, there was this one instance where we noticed one of the graphs shift drastically, and we realized we'd had a deployment; but since the previous deployment, only 10 commits had gone live. If it's just 10 commits, it's very easy to find the culprit: roll back, apply one commit at a time, see which commit broke it. So it's a matter of practice as well. No monitoring tool is going to work wonders without you following some sort of etiquette, and I think it's important to follow the etiquette of continuous delivery, especially in a retail shop where you constantly want to release features and don't want to hold the teams back.

And a note on software that claims to monitor everything for you: it just seems like they claim to know your system better than you know your system. I don't know why you'd want to go for something like this, especially because for some of the stacks we saw there are like a hundred alternatives, all of them free. Why would you ever choose something that doesn't even know your code? The only way I can think of for them to do that monitoring is by hooking into individual function calls, which could be too expensive, both performance-wise and monetarily, because they charge an exorbitant price. A great monitoring setup is available for free; there's really no need to go for such solutions.
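Tying the collect and store phases together concretely: a minimal sketch, assuming the metrics-clojure-graphite reporter module, that flushes a registry to Graphite on a schedule. The host and prefix values are made up.

```clojure
;; Wiring the "capture -> store" hop of the pipeline described above.
(require '[metrics.core :refer [new-registry]]
         '[metrics.reporters.graphite :as graphite])
(import '[java.util.concurrent TimeUnit]
        '[com.codahale.metrics MetricFilter])

(def registry (new-registry))

(def reporter
  (graphite/reporter registry
                     {:host          "graphite.internal"  ; made-up host
                      :prefix        "shop.checkout"      ; made-up prefix
                      :rate-unit     TimeUnit/SECONDS
                      :duration-unit TimeUnit/MILLISECONDS
                      :filter        MetricFilter/ALL}))

;; Flush everything in the registry to Graphite every 10 seconds:
;; Graphite stores it, the dashboards read it, the alerting polls it.
(graphite/start reporter 10)
```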
These are the references I talked about. The first one is a great talk; if you haven't seen it: "Metrics, Metrics, Everywhere". The Oscillator is available at that link on GitHub, and x-ray, the alerting library, is available at the next link. I also urge you all to follow the Otto dev blog, where they publish excellent things about how they're changing their architecture; they're one company that's keeping up with the times and doing great with all of that. That's all I had. Thank you. Does anybody have any questions?

Oh, yeah. I have a question about alerting design and how you cascade alerts, because with too much noise you can't act on the actual calls. So, we fell into that trap as well; honestly, we got enough flak for it. Initially, when we built the solution, we were spamming everyone, in fact more than just our team, we were spamming the whole organization. So we faced that trouble. Essentially, we left the choice fairly flexible, because you may want to categorize based on the priority of the alerts. Since it's all in code, it was a little easier for us, because we could say: for this particular alert, if something happens, I do want to be alerted, and for this other one, I don't. On the other hand, you can also have summarization: alert me if this thing breaks continuously for five minutes; if it's a one-off, don't. But really, I don't have a perfect answer for that; it's a gray area, and we fell into the trap as well.

Where are you? Oh, hey. So, why not Grafana? I'd say we really did not focus on building a product that was different from Grafana; we think both are great products. Honestly, at the time we built the Oscillator, Grafana wasn't as mature as it is now. That being said, we would probably still have done it anyway: we love writing explicit charts in code, and we like having more granular control in the language we're using. I guess those could be the reasons. Writing something yourself is often frowned upon, but for larger enterprises it sometimes makes sense, because it offers a flexibility that no prepackaged tool can.

Hello. Yes, hi. So from the talk, it looks like with Metrics, the library you're using, the idea is to emit a lot of detail from the code rather than from a log file, and push it to a system. Now, in a web application, an endpoint interacts with, say, 10 other systems, collates the results, and finally presents them to the user. As a result, you have two places to capture the information: for example, my web application can measure how much time the recommendation engine took, so I can put the measurement at the web-application level or at the engine itself, and push to the system from there. Which style did you follow, and is there any reason to choose putting everything at the web-application level versus at the final system level? We had metrics on both; we were collecting from both. Essentially the rule of thumb we followed, which Coda Hale also mentions in his talk, is that every microservice should be capturing its own metrics, and you may have up to 30 or 40 metrics from each reasonably sized microservice; of course, we often went above that. I guess it's one of those separation-of-concerns things.
The web application did capture those metrics for one third party that was not written by us, because we had no control over them and we did want to measure how that third party behaved so we could communicate effectively. Otherwise we went ahead with the approach of every service having its own metrics in its own container. Thank you.

Hello. Hi, this is about the "collect only useful data" part. How do you choose what counts as useful? I understand the intent behind that statement, but there's also this fear of missing out. Have you run into situations where you actually missed out on some useful data that might have led to, say, a new metric? Of course, obviously. I guess what I'm trying to say is: yes, you should capture nearly everything you think could be plotted on a chart, but definitely not something like the CPU load at a point in time. That's why I tied it to the end of the presentation as well: capture metrics that are simple, that really offer some insight into what's going on in your system. Of course, at times we missed capturing useful insights. At times it was hard to even keep the charts up to date, because there may have been a change in the feature after which the metric should have changed accordingly but did not; tests helped us in that situation. I guess that's not quite your question, but again, that's a gray area, and we ran into that trouble. You can aspire to be as vigilant as possible when building a new feature: anything that's simple enough to capture and gives some insight worth plotting on a chart, just do it. But there's no black and white.

I had a follow-up question to that. You said you captured business metrics as well as infrastructure metrics. Would you say that looking at the business metrics daily is the only important thing, and that only if something's wrong with the business metrics do you go into the infrastructure metrics, not otherwise? Why would you say so? Take Redis: I want to keep track of how full Redis is, or, for example, whether I've expired too many keys too suddenly. (Am I done? Sorry. Okay.) So yes, we were keeping track of all of these charts. But by the same logic, if your Redis is at, say, 90% capacity but your business is going as usual, there's no problem, right? Oh no, you may run out of capacity, and then your business will not be doing well. So the trigger should be the business-related metrics? That's kind of it: the entire intention is to not let the business metrics be affected, and there may be business metrics telling you something's wrong, or there may be infrastructure metrics telling you first.

Okay, so people who want to ask questions can stay back; we'll have the speakers Manan, Pooja, and Bernd answer your questions on the stage. Those not interested in the Q&A session can proceed for lunch. So we're taking more questions, but combined for all three speakers. Pooja and Bernd, could you please move to the stage? So we're up there? Oh. Hi.

(A partly inaudible question about keeping the metrics instrumentation maintained over the long term.) So, the first part of your question was how we get developers motivated to maintain the code they've instrumented just for capturing metrics. Essentially, this is a culture change, right?
In this case, we were the driving team: we went to the business and said, we think this makes more sense. I guess with agile, the entire point is that everybody contributes and you don't work in silos. But yes, I do understand there are teams, and there will always be teams, with a clear division, and that makes sense for them. In our case it was not so. We chose to be explicit and maintain the code; we took the call that we'd be happy to do this, and the benefit was flexibility: we don't have to monitor anything we don't actually care to monitor. I hope that's some sort of an answer; I'm happy to have this discussion afterwards as well.

Hello. Hey Manan. Hey. Sorry, I can't hear you. I'm Pawan from Goldman Sachs. This is a specific question on process monitoring. I want to know what was done at Otto, or in your space, for critical process monitoring, because if a process is down, your order rate could drop at any time, and likewise if a host is down. (Can you use the microphone? Did you get the question? To some extent. Is it on now? Oh, sure. Okay.) So yeah, in certain cases there's a host outage, you get a barrage of alerts about processes being down, and you have to deal with that. Was there any intelligent solution that was adopted? I guess it's not really related to this talk, but we essentially implemented things like circuit breakers, so that even if some third party or some critical process is down, we still fail gracefully. Though some would argue: if your critical process is that critical, how graceful can you be? To be honest, I didn't fully get the question, I'm sorry. He's asking: if a critical process goes down, some sort of outage, so your entire order processing is down, what can you do to manage or circumvent the problem?

So in my opinion, that process is part of an application. I mean, you can simply monitor whether the process exists, but that doesn't help much; in the end it should have an effect on a downstream service, and from there you get to the root cause. Think of the first talk: going from the top down, thinking about the business impact, then seeing which application is responsible, and then perhaps a process is not working. A process can look fine in the process list; that doesn't mean the process itself is working. Process monitoring itself can be easy, is the process alive?, but how many of them are allowed to be alive, how much memory is it consuming? It's probably hard to answer in a single sentence; it can be a complex topic. It is, yeah, I certainly think so. Thank you.

I think your mic is off. Okay, I'll repeat. So you're saying that it could be expensive. If I understand your question correctly: with these metrics instrumented in code, what if the code itself is unstable in some way, so the metrics would not be gathered? And also, purely with respect to performance: you're collecting metrics within a request's lifecycle in a web application like yours, and that's going to be expensive again.
You would rather push them asynchronously to another system. Okay. So Metrics does not push events directly; it batches, it won't push every event. But then your setup is stateful, right? Yeah, so there's a chance that if your application goes down, you lose certain metrics. Yes, that is true, and it's a trade-off you accept; the alternative, pushing one metric at a time, would not work well, I'm sure, because metrics come at a very high rate and it's better to batch them.

The second question on that: every time you want to collect newer metrics, let's assume for a moment you have a lot of data, you're going to instrument your code every single time, as opposed to collecting that extraneously in some fashion, right? I mean, just a question. Yeah, so we thought of it as part of developing the feature. We did not think of it as something external, as in "I want to collect metrics from a feature that is already there." We said every feature that gets developed has its metric collection in place, and that is the responsible way; it's part of the story, part of the feature. No, I was just trying to understand philosophically what the thought process was behind doing it this way. We just favored flexibility: if we don't want to collect certain metrics, we choose not to collect them. And there's an aspect of testing; it may be frowned upon, but we tested what metrics we were gathering as well, and you can do that because they're in code.

Well, let me clarify: I did not say CPU load is not important. I said you won't get meaningful insights from seeing a change in CPU usage. If your CPU load drops, what is that really an indication of? As I understand it, CPU load is a function of not one thing; multiple things can bring CPU load up or down. So it going up or down may be caused by five things, and you're not pointed to any of those five. That's why we chose simpler metrics, which point to only one thing going up or down. It's not that it's unimportant; it's important. It's just not easily, intuitively trackable.

With this one, yeah. So essentially we have systems in production, and we try to maintain them such that they don't run up to failure. Without pushing them to failure, how do I know what capacity I can get? That's one. And two: is any system available that would help me identify bottlenecks in my whole architecture? Because not everything scales linearly; maybe right now I can see that everything can go to X, but one part of the stack will fail much before that. Is there somewhere I can have dynamic bottleneck identification?

Starting with the last question: are you talking about several services combining into one application, I mean, having a rule for them, that a service consists of others? Or are you talking about the user-request question we had before, tracing a specific user request? Yeah, essentially: even a very simple web application has everything from the load balancer to the application code to the caches to the databases.
Okay, one good way to do it is, if you use a metrics database, for example InfluxDB or Graphite, you put Grafana on top and add different sources from different services. You can create a Grafana panel and add, let's say, the load from your database server, the load from your application server, and the load from your web server, and then you can see how a request impacts your infrastructure across the whole stack. It's just a simple idea, but you can bring all the metrics together at the same time, which is an advantage of Graphite: you can combine different things. And adding to his answer: then perhaps load monitoring also makes sense, load monitoring and process monitoring correlated with annotations. To see: okay, we changed something with Puppet and then my load is increasing; that's valuable information, because without it you just see, okay, we have more load, but what now, shut down, reboot? With an annotation showing that some code changed, or a commit message, depending on your continuous-delivery workflow, you immediately see what happened to a specific part of the infrastructure. You can see that a user request at 10 o'clock was going well, and at 11 o'clock, after deploying some new software, you see an increase of load at the application server; combining these metrics at a specific time, you can see that some performance profile changed. That, I think, could be a way to solve it.

Okay, so a follow-up question. What about any way of finding the current bottleneck? The whole structure keeps changing dynamically, and the code keeps changing; maybe I'm currently making one database call and then I move to making three different database calls. So across the development cycle, can I have an automated system that tells me: in the current scenario, this is the peak load your current setup can serve? It's kind of a baselining, where the monitoring system learns what the load typically is at five o'clock in the morning. No, I just want the peak capacity of my system, and I'm not on a cloud infrastructure.

Okay, so you would like to know what the maximum number of user requests is. That depends on so many factors: whether your system scales in a linear way, or whether garbage collection kills you at some point, so it scales and then it breaks down. It really depends on the language and so many other effects. Databases sometimes scale very well in a linear way, and at some point they can't keep up writing the IOPS, so it's very hard to really promise a specific amount. Testing helps, of course; there are thousands of testing frameworks, of which I probably know only 2%. But I would say it's not possible to promise that an infrastructure can handle three or four times the users you have right now, unless you know exactly that every system involved in the process scales linearly. So I don't have an answer for that; perhaps you have? Sorry, yeah, I think that's as good an answer as I can offer.

Hello. Oh, thanks. Manan, great talk, thanks for that, and a quick question. I didn't know about Metrics, and I'm going to look it up. And I keep hearing about tools like Zipkin and all that.
Do they play in the same space? Are they comparable? And maybe Bernd, you can also weigh in, right? Because I'm confused about the whole aspect of: you have monitoring, and then you have application lifecycle monitoring or something. So just break that down for me. Thanks. So I'll start, and then perhaps someone can help with Zipkin, which you mentioned; I'm honestly not aware of it. But there are alternates to Metrics. Metrics follows a push-based flow for events. On the other hand, there are tools like Prometheus which have a pull-based flow: your Prometheus server knows where your applications are running, on all the servers, and it pulls data from them. But I don't know about that particular tool you were mentioning; I'm not aware of it, sorry. Okay, fine. I think those are a class of tools that allow you to trace a transaction across multiple systems. Zipkin? Zipkin, yes. Whereas Metrics is more for one instance of a system. Yeah, I get it. Okay.

I'll ask a question to Pooja quickly; it's probably not related to monitoring. If I understood the whole bot approach you talked about correctly, you had to rely a lot on regexes to understand what the user typed, right? Had you by any chance looked at natural language processing, things like api.ai, that might make it easier? Is that planned? That's the question. Yeah, thanks, it's a nice question. Even I had thought initially that we should have some kind of natural-language-processing algorithm. But the initial idea was not to go into much detail there; the initial idea was to have control over the code that was messing things up, and we focused on solving that problem first. It isn't planned, at least no immediate plans, because there are only certain questions that are repeatedly asked: for example, when is my release? What is in the release? Which machine should I deploy? There are hardly 50 questions in total, maybe, so NLP is overhead at the moment. My priority is continuous integration first, where we can actually stop the pull-request merge button itself. That's the immediate plan, and NLP is very interesting to me as well, so I'll probably have a look at it. Thank you.

Hello. Manan, you said don't use an arbitrary measure like the mean; I'll just divert here a bit. Whenever you're doing machine learning, you build models, and you'll try a couple of algorithms before you figure out which one best suits your needs or purposes. And if you look at it, most of what we're doing here is collecting data and applying certain statistical analysis to that data. So do we have some heuristic method: apply this statistical model first, see if it works for you, whether it actually pulls the patterns out of the data that you want; okay, not this one, use statistical model B; okay, not this one either. Is there some kind of heuristic flowchart or something like that which you could share with us, if you're using one? Any of you could answer this question.

I'll try to. So essentially, there's a stream of DevOps called anomaly detection which makes heavy use of statistics and applied models. What we showed, on the other hand, is a fairly straightforward use case; there isn't much derivation happening here.
Essentially, in this case I'm not aware of a lot of statistical models per se being applied, at least not standard ones; of course, people might come up with ad hoc statistics on top of something. But there's a stream that's catching a lot of traction, which is anomaly detection, where you learn how a particular statistic has behaved and make assumptions about whether to alert or not, for example when it happens again, because you can now correlate that this tends to happen at this time of this day of this month, something like that. That's the best I can answer; maybe one of you can add to it. I have no answer, but I know somebody who knows it, and his name is Avishai Shalom; you can find a lot of his talks on YouTube. I think he studied math, and he gives great talks about math and metrics. I can give you the name afterwards; I don't know how to write it. I will have to look it up, Avishai Shalom or something; I'll give it to you, or on Twitter, whatever. His talks are really a good point to start thinking about stats and metrics, and he explains it very well. Perhaps a good point to start.

Okay, I'd just like to remind everyone that we're happy to take more questions if there are any, but this is eating into the lunch hour. Lunch finishes at 2 p.m., and the talks resume here at 2:05 p.m. sharp, and we're going to try not to be late. Now, was that it? Oh, food break. I'm out. Yes, I saw your face like you're getting nervous. Story of my life. Oh, sorry. Oh, yeah. Is your mic on? It's hard to, oh, okay. Can you raise the mic a little?

(A partly inaudible question about auto-scaling: so you have 100 nodes instead of 50?) No, I got the question. Yeah, so you want to auto-scale. I'm sure it's possible; I think that is what a lot of providers do: they look at the traffic and scale accordingly. Of course, I have no experience doing this, and it wasn't the subject of my talk, but essentially the tools we wrote made use of monitoring data and alerted, or didn't, based on it; instead of alerting, you could trigger something else. I'm assuming that should be possible, but maybe one of you can add to that. Nothing to add, I'm sorry. That's okay.

Your mic is off. It's on. Now it is. Okay, like this, okay. So my question is more about a distributed model. I have a system of tools which logically comprises one tool: my tool talks to a lot of other systems before it can serve a response, a general distributed architecture. And these systems interact over different protocols, it's not all HTTP; sometimes it's RPC, sometimes TCP, so I cannot really weave HTTP headers through and see how the data is flowing. So right now the solution we implemented, to correlate how a request is served across different systems at a given point in time, is to use overlapping groups, group names: we tag systems with a certain name. This is product P1, this is P1's ActiveMQ, this is P1's Redis, this is P1's MySQL. So when we actually plot a graph, or want to correlate one, we search for the tag P1 and see how all the systems related to P1 are behaving. In that case, we're bound to take care of assigning correct tags at every point in time.
So is there a way I can trace a request, that this request flows from system A to B to C to D over different protocols, and get a holistic view without using tags, so there's no manual dependency? Systems like APMs also do this for me, but they're typically single-platform, say Java-only; if I go from Java to something different, say a console application or Redis, which is written in a different language, those systems are blacked out. And it's also very costly to put an APM into every system to monitor it. So what do you think: is grouping, having overlapping groups, the only solution, or can we do something else about this?

I don't think there's a tool for that; I don't know. The way you can do it is to enrich your code base and assign kind of a session ID, a random number or a serial number. I know Flix is doing it that way, tracking their SMS and recommendation actions; they gave a talk at Elastic{ON} about how they trace a specific request through the infrastructure: assigning a serial number, attaching it to every log entry they produce, manually or from a system, and then a system outside takes all the serial numbers and weaves them together. But the problem with this approach is when we have different protocols to talk over. Say the serial number is in my HTTP header, and now from HTTP I'm going to TCP; I'm losing that serial number, because I do not know how to propagate it along a different protocol altogether. What kind of payload goes through your protocols? Assume a JMS system: I have an ActiveMQ and it has its own protocol, right? So, I don't know; I think if you don't enrich your protocol, how do you transfer the session ID? If you don't enrich the data payload that is transferred with that kind of marker, it's hard to identify. I have no better idea. Perhaps, perhaps you? I don't have one. You seem to have already done a bit of research on this; I would love to pick your brain on this later, but I have no better solution, honestly. Okay, thank you.

That's it, right? Enjoy your lunch.