Awesome. Hi, everyone. My name is Avik Mishra. I work at Intuit, and it's very exciting to be here at Rootconf. We have had such great talks, and I'm really thankful to the Rootconf organizers for having me here. I'll be talking about something that is very close to my heart; the title of my talk is Evolution of Monitoring. I've been working in the monitoring domain for the past seven or eight years, and it's been really exciting and interesting to see some of the developments happening in the monitoring tool landscape, and how they have been necessitated and motivated by changes in the application landscape: how applications are now getting architected, the way they are getting hosted, the way they are getting deployed. Since this is a 20-minute talk, I'll just give you a broad, high-level overview of the different changes that trigger a rethink of how your monitoring is delivered and the kind of insights and information you want from it. Without further ado, let me jump right in.

This is how it started for me around seven or eight years back: we were using Nagios and RRDtool. Mind you, these tools are great; they work fantastically well for hundreds of organizations even today. The thing is, at that point of time we actually wanted to scale out big time, because we had millions of metrics coming in per minute, and tools like RRDtool or Nagios were not able to cope. Again, I say this with a disclaimer, because I know both of these tools are still evolving even today: you now have concepts like clustering, RRDtool has rrdcached, and so on. But some of the things I'll talk about are tool-agnostic, in the sense that they have a very high bearing on what your monitoring looks like. Let's look at some of that. Coming back to the starting point: does your monitoring need to change? Maybe, maybe not.
Perhaps your monitoring is already geared towards some of these changes, and all you need to do is make some improvements to it. Let's look at some of the developments taking place in the industry.

Microservices. I think we have heard a lot about them. Microservices are just loosely coupled, independently deployable services. One consequence of microservices is that you now have hundreds of services out there, and indeed we heard in one of the morning talks that it could be any one of 99 services that is actually errant or problematic; but how do we know which one, in the entire scheme of things? Microservices also mean that application architectures change very frequently. Gone are the days when architecture diagrams were relatively static, changing once every one or two years. Nowadays, with hundreds of services and many more coming into the mix, application architectures change very frequently; but the underlying reason an architecture diagram is required is still very relevant, and that is to understand the connections, dependencies, and interrelationships between the different services. So we still need a visualization of the application topology. This is one of the tools we use at Intuit in our consumer tax group, called AppDynamics, but it need not necessarily be this tool; it could be any of the others. There is the open-source Zipkin from Twitter. The idea is: how do you get a visualization of your service map, and how are you able to trace the relationships and the problems that are taking place? For that, it's very necessary for the monitoring tool itself to give you that kind of information: what the application relationships are, and what the upstream and downstream dependencies are.
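To make the service-map idea concrete, here is a minimal sketch of what tracing tools like Zipkin do under the hood: every traced call yields a caller-to-callee edge, and aggregating those edges produces the topology. The service names and span format here are purely illustrative, not any real tracer's data model.

```python
from collections import defaultdict

# Each entry stands in for a trace span, reduced to a (caller, callee) pair.
# In a real tracer these come from instrumented RPC calls between services.
spans = [
    ("frontend", "cart-service"),
    ("frontend", "catalog-service"),
    ("cart-service", "pricing-service"),
    ("cart-service", "pricing-service"),
    ("pricing-service", "mysql"),
]

def build_service_map(spans):
    """Aggregate caller->callee edges into a topology with call counts."""
    topology = defaultdict(lambda: defaultdict(int))
    for caller, callee in spans:
        topology[caller][callee] += 1
    return {svc: dict(deps) for svc, deps in topology.items()}

service_map = build_service_map(spans)
# cart-service called pricing-service twice in this trace sample
print(service_map["cart-service"])  # {'pricing-service': 2}
```

With real trace data, the same aggregation (plus latency and error counts per edge) is what turns thousands of individual spans into the service-map view the speaker is describing.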
As you can see, this kind of view is very important for a DevOps engineer to understand the sources of bottlenecks in a service-based environment. But is this a silver bullet? Certainly not. When you try to map out the entire application ecosystem, it can easily get out of hand, whatever the tracing technology is: you see a maze of services talking to each other, and it is incomprehensible. So this field is still undergoing innovation; there is a lot to be thought about here. There is a tool by Adrian Cockcroft, who, as many of you know, is the former cloud architect at Netflix; he has been trying to do something in this field. The idea is: how do you get a visualization of your application topology in a way that makes sense, and that helps you understand not only the architecture but where the bottlenecks are?

Visibility into business transactions. Again, something very, very important. We all know the power of instrumentation libraries; I think many of us have used StatsD or collectd. In my previous organization, I was part of a team that built an instrumentation library from scratch. Many of these instrumentation libraries will emit very relevant metrics about critical pieces of your code. But at the end of the day, the degradation of a metric does not necessarily imply a degradation of the user experience; you need to know the exact degradation happening in terms of the business transaction the user is executing. Now, what do I mean by a business transaction? In an e-commerce application it could be viewing all items, adding to cart, paying, or logging in. We need to understand how those business transactions are getting impacted, and for that, what you need is an end-to-end view.
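As a sketch of instrumenting at the business-transaction level rather than at a single code path, here is a StatsD-style timing decorator. The in-memory `timings` store and the `business_transaction` name are illustrative stand-ins for a real metrics backend, not any particular library's API.

```python
import time
from functools import wraps

# In-memory store standing in for a StatsD/metrics backend (illustrative).
timings = {}

def business_transaction(name):
    """Record end-to-end latency of a whole business transaction."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000.0
                timings.setdefault(name, []).append(elapsed_ms)
        return wrapper
    return decorator

@business_transaction("add_to_cart")
def add_to_cart(item_id):
    # ...would call into catalog, pricing, and session services...
    return {"item": item_id, "status": "added"}

add_to_cart(42)
print(len(timings["add_to_cart"]))  # 1 sample recorded
```

The point is that the timer wraps the entire user-facing operation, so a degradation anywhere in the chain of downstream calls shows up in the one metric the business actually cares about.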
Just having the instrumentation on a particular critical piece of code may or may not tell you whether the entire business transaction is healthy or unhealthy.

Monitoring in the cloud. I think we have seen this concept beaten to death today; many people have talked about it. But it's very important to understand that monitoring in the cloud is very different from monitoring in data centers, the reason being that many monitoring tools assume a static infrastructure. We have heard about techniques in which you can tie service discovery into some of these tools and make them dynamic in nature. But the underlying problem remains the same: with, for example, EC2 instances getting torn down in AWS, you still need to understand how your backend hosting infrastructure is changing so that you can keep monitoring it. That's why you have this concept of monitoring cattle versus pets.

Evolution of time series databases. This is again something very close to my heart; I've actually followed the evolution of time series databases from RRDtool onwards. RRDtool works great, but again, it has scaling problems. So we went to something like MySQL, and believe it or not, MySQL works great, because an RDBMS is quite well suited to a metric-and-dimension, star-schema kind of layout. But MySQL sharding also becomes a pain after some point. So then monitoring tools and frameworks started adopting databases that are inherently scalable in nature, which means something like InfluxDB, or OpenTSDB, which is based on HBase, or even in-memory databases for the very recent time periods.

Moving towards unified monitoring. Again, a very important concept. Unified monitoring here does not mean that you are using a single tool; what I mean is that you are covering all the aspects of monitoring.
More often than not, we forget that we have to monitor from the perspective of the end user: what is called real user monitoring. We need mobile monitoring in place, because in a mobile-first world we are actually seeing applications that get more traffic on mobile. So how do you do crash reporting? How do you do analytics? Custom insights are very important. Many monitoring tools expose a kind of query language that helps you get custom insights into your metrics; this query language may be SQL-based, or it can be something very custom. I have mentioned two tools here, but it can be others as well.

Anomaly detection. Again, very, very important. I think many of us know the pain of setting static thresholds, and the fact that if your metrics are cyclic in the patterns they show, we want something to automatically detect those patterns and flag anomalies. Setting static thresholds is not the answer, but anomaly detection is not easy to do; it's not trivial. There is no single algorithm, like Holt-Winters, that will apply to all metrics. So then what do you do? You should have a monitoring framework or tool in place that at least exposes different statistical algorithms, like moving average, moving median, standard deviation, and Holt-Winters, so that you can fit those algorithms to your metrics and see which one works best. Based on that, your anomaly detection is then done automatically. But the question is: does your monitoring have that support in place, or do you have to fetch all the metrics from your monitoring database and do it yourself on a separate kind of infrastructure? You can do that too, but the more important point is that anomaly detection is very important to get to.
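One of the simplest of the algorithms mentioned, a moving average with a standard-deviation band, can be sketched in a few lines. The window size and sigma threshold here are illustrative defaults you would tune per metric, exactly the "fit the algorithm to the metric" exercise described above.

```python
from statistics import mean, stdev

def detect_anomalies(series, window=5, n_sigma=3.0):
    """Flag points deviating more than n_sigma from the trailing window's mean."""
    anomalies = []
    for i in range(window, len(series)):
        trailing = series[i - window:i]
        mu, sigma = mean(trailing), stdev(trailing)
        if sigma > 0 and abs(series[i] - mu) > n_sigma * sigma:
            anomalies.append(i)
    return anomalies

# A flat-ish latency series with one obvious spike at index 10.
latency_ms = [20, 21, 19, 20, 22, 20, 21, 19, 20, 21, 90, 20, 21]
print(detect_anomalies(latency_ms))  # [10]
```

This works for stationary metrics over short windows; for strongly cyclic metrics with rich history, a seasonal method like Holt-Winters is the better fit, which is precisely why no single algorithm suffices.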
And finally, the power of event correlation: how alerts get correlated. We know the pain of getting thousands of alerts; it becomes very difficult to understand what the source of the problem is. For example, if you have alerts coming from 500 app servers and there is a single database alert, most likely the source of all those 500 app server alerts is that single database problem. But when you actually get so many alerts, that one alert probably gets lost. Some tools, like Moogsoft's Incident.MOOG, are able to aggregate alerts in such a way that you see all those 500 alerts from a given app server group as a single situation, and the database alert as a separate one. That immediately tells you something is wrong on the database side, and then you can go ahead and troubleshoot it.

With that, there are many other facets I also wanted to discuss, which are around the speed of deployments. I think Deepvainu spoke about canary deployments and blue-green deployments; how do you monitor the health of such deployments? You need your monitoring systems to do much more granular metric collection than the one-minute resolution most monitoring tools use, because you need a very fast feedback loop for your deployments so that you can roll them back. There are other facets too, but I think I've at least been able to give you an overview of how you need to rethink your monitoring, not from the tools' perspective but in terms of the insights, the data, and the information you need; perhaps then you can think about the tools that satisfy those needs. Thank you so much. Any questions?

What kind of decisions drive the analytics or decision making in your monitoring systems? Do you have any cases which are real-time, or based on trends, anything you can share?

That's a very good question.
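The aggregation step in the 500-app-servers example can be sketched very simply: group the raw alert stream by something that identifies the failure mode, and present one "situation" per group. The alert fields and grouping key here are illustrative; real correlation engines use richer rules, topology, and time windows.

```python
from collections import defaultdict

# Simulated raw alert stream: 500 app-server alerts plus one database alert.
alerts = [{"host": f"app-{i:03d}", "group": "app-servers", "check": "http_5xx"}
          for i in range(500)]
alerts.append({"host": "db-01", "group": "database", "check": "replication_lag"})

def correlate(alerts):
    """Collapse individual alerts into one situation per (group, check) pair."""
    situations = defaultdict(list)
    for alert in alerts:
        situations[(alert["group"], alert["check"])].append(alert["host"])
    return situations

situations = correlate(alerts)
print(len(situations))  # 2 situations instead of 501 raw alerts
```

The operator now sees two things, not 501, and the lone database situation is no longer buried under the app-server noise.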
I think more and more organizations are realizing that a monitoring system is not just the reactive system it is perceived to be. There is a wealth of monitoring data out there; how can you use it to get the deep insight you need for something like trend analysis or capacity planning? And on the analytics side, how do you expose something like the query language I was talking about, which gives you the power to correlate metrics across different applications in any custom way you want? So for that: does the monitoring system have that ability itself, or do you need to build up the entire infrastructure for it? Does that answer your question?

Yes, so with some of the tools, we actually use this for things like trend analysis and capacity planning. We extensively use the query-language part of some of the tools to get deeper insights into the different parts of the system and correlate them; we run statistical functions on top of it, and we also calculate complex expressions. But all of that is possible because the monitoring system exposes that kind of interface.

When using monitoring for analysis, what was your tipping point when you moved to MySQL?

So RRDtool, at that point of time, could do only one data write per second, so it was constrained by how much IOPS a given piece of hardware could do. At this point I think there is support for batched writes with something like rrdcached, but at that time we didn't have that functionality.

But you said that even with MySQL there was a tipping point because of sharding; so what was that tipping point, really?
So it all depends on how much you are willing to invest in the MySQL sharding aspect of it. Typically, when metrics reach the order of millions of metrics per second (I'm not even talking per minute), it becomes very difficult. And it's not only the metrics themselves; there is the aspect of dimensions, basically the tags with which the metrics are associated. If the number of tags and their values increases, the cardinality of the metrics becomes very high, and that's when your star schema in MySQL starts to melt down.

Can you please elaborate a bit more on the anomaly detection side? Which anomaly detection software is open source and free to use, and based on your experience, how effective is it at detecting different kinds of problems?

Yes. As I said, anomaly detection in monitoring is still an up-and-coming area. Anomaly detection itself is not new; it is used for fraud detection and intrusion detection. The only thing is that it is now being used much more in the monitoring area. Some of the tools, even Graphite, expose things like moving average and moving median, and in my experience there is no single algorithm that works for everything. For example, Holt-Winters works when you have a very rich set of historical data for the metric; if you just want to do rolling-deviation alerting based on the last 10 or 15 minutes of data, then you could pick up something like a moving median or a moving average. There are libraries; you all know about R. But how you integrate them with your monitoring to actually detect anomalies is the challenge. Many monitoring tools have that built in; many organizations that want to do it do it separately.
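The cardinality explosion described above is easy to quantify: in the worst case, every distinct combination of tag values creates its own time series. The metric names and tag sets below are purely illustrative numbers to show how quickly the count grows.

```python
def series_cardinality(metric_names, tag_values):
    """Worst-case number of distinct series: one per metric per tag-value combo."""
    combos = 1
    for values in tag_values.values():
        combos *= len(values)
    return len(metric_names) * combos

# Illustrative example: 2 metrics, tagged by host, endpoint, and status class.
metrics = ["http.requests", "http.latency"]
tags = {
    "host": [f"app-{i}" for i in range(100)],  # 100 hosts
    "endpoint": ["/login", "/cart", "/pay"],   # 3 endpoints
    "status": ["2xx", "4xx", "5xx"],           # 3 status classes
}
print(series_cardinality(metrics, tags))  # 2 * 100 * 3 * 3 = 1800
```

Add one high-cardinality tag, say a per-user ID with a million values, and the count multiplies by a million. That multiplicative blow-up in distinct rows and index entries is what melts down a star schema in MySQL, and it is why purpose-built time series databases treat tag cardinality as a first-class design concern.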