Good morning guys, I'm Dhananjay Sathe, a Senior Operations Engineer at Directi, where I work on the platform team, and our flagship project is our meta-monitoring system. Let's start with what I call the consensus slide: we can all agree that monitoring today, in its current form, sucks. It's perfectly broken. It's a multi-faceted monster: every time you chop one head off, a new beast shows up, it's an unending thing, and when you start off like Joan of Arc, you end up like that guy who has just given up.

What we believe the problem is, per se, is how we interpret monitoring, time and dimensions, and why that really matters. Going back to the basis of any monitoring system, it boils down to something like this: every monitoring system has a similar construct of alerting. You have a data structure that gives you just enough information about what's wrong. It tells you a server, a service, possibly a host-and-service pair, and a state field describing its condition. And alongside all of that you have time, where time is just a point object: a timestamp that uniquely identifies when this occurred.

What that has done is that today your mailbox is flooded with monitoring alerts from different sources. It's common to use Nagios, Zabbix, Pingdom, some combination of tools, to manage your monitoring, and this is what a typical Nagios alert looks like: it tells you about some problem. There are efforts going on to improve these interfaces, and you end up with something like Thruk, which is fairly widely used across the industry, where you're essentially playing Dance Dance Revolution: you have green and red lights, and every time you see a red light, you try to kill the red light. It's a lot like Harry and Ron at the Burrow trying to weed out gnomes; every time they spot a gnome, they throw it out.

We believe there are a few fundamental problems with the way this works. If you look at what an event is at the end of the day, it lacks context; it just talks about itself at a particular time instant. It cannot describe a situation, which is the way the human mind processes events. And it cannot hold state; it tells you only about its current state, not the history of state or how state changes occurred. But it does do a couple of really useful things: if you take each attribute in this data structure and model it into n-dimensional space, you can partition that space according to your needs, and that's really cool because you can do classification and clustering.

Now, a very familiar construct in most monitoring systems is that when you configure a check, you configure a particular escalation path for it, along with everything else about it. But that's not the way you think. Think of it the way you would invest in the stock market, or the way you would look for airline ticket prices: you give a context, a field of interest, a query that completely defines what you're really interested in. You're stating an intent. And what we use in the back end to express that intent is Lucene.
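To make that construct concrete, here's a minimal sketch of the kind of flat, point-in-time record most alerting systems share; the field names are illustrative assumptions, not the exact schema of the system described here.

```python
from dataclasses import dataclass

@dataclass
class Event:
    """A flat, point-in-time alert record.

    Field names are illustrative assumptions, not the exact schema of the
    system described in the talk.
    """
    host: str         # which server raised the alert
    service: str      # which check / service on that host
    state: str        # e.g. "OK", "WARNING", "CRITICAL"
    timestamp: float  # a single point in time: no duration, no history

# A lone event like this carries no context, no situation, no state history:
event = Event(host="db-01", service="mysql_replication",
              state="CRITICAL", timestamp=1_700_000_000.0)
```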
Treating each attribute of such an event as a dimension, a Lucene query partitions that space and gets you the intersection of all matching points. It could be as arbitrary as: I'm interested in all events, anything happening on this bunch of servers, if they have a certain field in their info tag. You can come up with any arbitrary Lucene query, or any arbitrary bunch of regexes, to do this. And this is done on the fly; you don't configure your checks. What this means for us is that we configure dummy checks that go into every manifest file, and they just fire events into our system.

Why this matters a lot: have a look at this example, a trip to the beach. The human definition of an event like going to the beach is discrete, fuzzily discrete, yet it's continuous. And it contains state about itself. You have questions like: what did Tim say? Where did we go? What did we do? What did we eat? It encapsulates the conversations around it. It encapsulates the probability of someone showing up: when you make a plan to go somewhere, you probably ask yourself, before calling your friends, which of them will probably show up and which of them probably won't. Humans tend to think of events and time in these situational contexts, and our brains are hardwired to solve problems about situations. You think in situations; there's a direct mapping between time as an entity and situations, and those are the kinds of questions you really should be asking yourself. So a situation is a collection of all the experiences and everything around that state.

What if you could do this to monitoring? What if, when I looked at a bunch of events, I didn't look at them as discrete events but as if I were looking at a situation? This is where we came up with the incident at the center of the universe. The incident interacts with the entire system, and the entire system is aware of the incident. This is a really unique thing, because when you refer to that holiday, or refer to this conference a year later, you are uniquely identifying a period in time. If you look at what we actually do in event systems today, it's pretty funny: you put in a bunch of parameters, server, service, then you take a date and time range, and then you look at a bunch of graphs. There's no unique way to identify it; you end up handing a parameter list to any other user who's trying to figure out what happened.

So what you do is follow this incident-centered model: present your events as timelines and state changes, which make a lot more sense to the human mind, and capture the conversation around them. If you look at what Facebook essentially is, it's an event-driven approach. When something happens in the world around you, people come together, post comments and likes; all of these are events, but what you get out of it is a story. You get a timeline, a context for the occurrence of things in the world. And we think about it so naturally that even your grandmom can use Facebook today; you don't need to think in event streams.
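As an illustration of such an intent, here's what one of these on-the-fly queries might look like; the field names, value ranges and the Elasticsearch `query_string` wrapping are assumptions for the sketch, not our actual configuration.

```python
# An "intent" expressed as an arbitrary Lucene query string, evaluated on the
# fly, instead of configuring each check with its own escalation path.
# Field names (host, info, severity) are illustrative assumptions.
intent_query = (
    'host:(web-01 OR web-02 OR web-03) '
    'AND info:customer_facing '
    'AND severity:[2 TO 4]'
)

# In an Elasticsearch-backed store the same string could be run as a
# query_string query (a sketch of the request body, not our exact calls):
es_request_body = {
    "query": {
        "query_string": {"query": intent_query}
    }
}
```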
Once you start doing that and get feedback and analysis, every query, every click a user makes on the system, every visit to an incident, can be used to do really interesting things: mathematical functions, virality scores and a bunch of other things. The number of alerts that go out for an incident could add to its virality.

So how do we do this? Treat the incident as a finite state machine and describe every other entity as a transformation function onto it. Now you've gone up from eight dimensions of an event to about 28 dimensions, both mathematical and discrete, and you can query them again, which is why I love Lucene. What I have now is continuous time that is fuzzily discrete. You look for patterns of events that occur and, once things reach stability, you close it off, the same way you would describe a period of illness: when you have a fever, you'd say I had a fever three days back and now I'm okay. There's no precise beginning or end; it's slightly fuzzy.

The advantage of this is, as I said, that the incident is aware of all the changes happening across my infrastructure; it's aware of all events. I'm now completely free of having to pick one particular monitoring system. We've actually plugged in sources that are just REST API requests, and arbitrary test strings like "KFC" showed up when we were testing it, and they still follow the same construct. The system is aware of this entity and it's contextually aware.

Diving in further: capture all these streams. If you're trying to process events at scale, your events can be dumped into databases as immutable facts. These are really easy for a system to process and query, and event systems work really well at that. But create materialized views that are good for the human context of it all, and alert based on these incidents and their timelines, not based on events. You can define cool-down periods: if something went down at 7 a.m., came back up at 7:15, went back down at 7:18 and came back up again, I'll still continue my old escalation chain, because it's still the same logical construct. But if that happened again in the evening, I'd probably still have the same context, but it would be a new escalation path, because it's probably a new issue, even though it's related to the same old one.

How do you do this? These are some of the outputs. The fact that you can refer to and search the space means you can refer back to any incident that ever happened in your infrastructure and look at it. More interestingly, this is how we look at an incident today: you have links into your old monitoring systems, all your stuff coming in from Nagios or Pingdom, and any other reference links into your internal intelligence or the wikis in your company.

Something really interesting here is this particular bit about aggregation. The fact that I have come to this conference is an event for me, and the fact that you've come to this conference is also an event for you. But what we tend to do is look at it as a single event: coming to a conference. You've clustered all these discrete things up into one big idea of a conference, or even the party at the beach. That's what our automatic system does, and as people figure out patterns, they can add better rules to improve this filtering and these criteria. And people have conversations about it.
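A minimal sketch of the incident-as-a-state-machine idea with a cool-down window described above; the state names, event fields and the 30-minute cool-down are invented for illustration, not the system's real values.

```python
COOL_DOWN_SECONDS = 30 * 60  # illustrative cool-down; the real value is configurable

class Incident:
    """Sketch of an incident as a finite state machine over events.

    Each event acts as a transformation on the incident; a check that recovers
    and fails again within the cool-down stays on the same incident and keeps
    its escalation chain. State names and fields are illustrative assumptions.
    """
    def __init__(self, first_event):
        self.state = "OPEN"
        self.events = [first_event]          # the immutable facts it aggregates
        self.last_change = first_event["timestamp"]

    def accepts(self, event):
        """A new event belongs to this incident if it lands within the cool-down."""
        return event["timestamp"] - self.last_change <= COOL_DOWN_SECONDS

    def apply(self, event):
        """Fold an event into the incident, updating its state."""
        self.events.append(event)
        self.state = "RESOLVING" if event["state"] == "OK" else "OPEN"
        self.last_change = event["timestamp"]

def route(event, open_incidents):
    """Attach the event to an existing incident, or open a new one,
    which would also start a fresh escalation path."""
    for incident in open_incidents:
        if incident.accepts(event):
            incident.apply(event)
            return incident
    incident = Incident(event)
    open_incidents.append(incident)
    return incident
```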
And people act and comment; that's a standard construct of any monitoring system. But something interesting happens here. For every piece of data that comes into this entity, we can go from qualitative data to a quantitative score, one that combines the number of events, the number of references, the severity levels, how many people are talking about it, and how far up the chain people were notified. Using this, you can choose the right root when you start clustering incidents together. And in a good 80 to 90% of cases, this algorithm actually figures out the root cause, puts everything together and follows the chain to the root cause without involving people at each step. You reduce a lot of noise by doing that, and that is an advantage of this way of perceiving time. And the best part is that you still have all your raw data: if you actually wanted to debug on the raw tab, you would find your event data and could look at it. But expecting a human being to look at a bunch of event data just doesn't make sense any more, because you're now aware of a situation.

What are the other advantages? These are a bunch of things we do. When I look at a particular incident on a particular server, you can assume that in monitoring, other than time, the two most important dimensions are the host name and the particular check you're talking about, or a host group and the cluster-level check. So you find incidents with similar signatures for the same impacted service in your host group. And the host groups, again, were not defined at check time; they were defined by those intent queries, so I can change them on the fly without having to change my check at all. If I'm viewing this incident here and trying to debug it, often the service is impacted across a bunch of hosts, so I'd probably want to click on this one, because it looks like a more important service and it's in the same signature group. That could be the thing I look at, and I could realign all my clusters around that field and then go solve that issue.

Another interesting thing is that you have the freedom to control how these merges happen as well. On the same hosts, if you have a RabbitMQ alert, you'll actually find a bunch of RabbitMQ-related checks going down, and this wheel you see here is also generated from the same scoring algorithm that was used to merge them, so you can assess it qualitatively. The other advantage is statistics: if this incident did affect a product team and you had support tickets coming in against it, then measuring the impact of an incident just by time, or by how many servers were affected or how many sectors were corrupted, isn't enough; it would be far more interesting to see how many users were affected and what other impact you had there. And you still retain the ability to drill down into a particular issue and look at it, and to merge and unmerge these trees.
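A rough sketch of how such a quantitative score could be composed from those qualitative signals; the weights and field names here are invented for illustration and are not the production algorithm.

```python
def incident_score(incident):
    """Collapse qualitative signals about an incident into one number.

    The signals mirror the ones mentioned in the talk (event count, references,
    severity, people talking about it, how far up the chain it escalated);
    the weights and field names are illustrative assumptions.
    """
    return (
        1.0 * len(incident["events"])
        + 2.0 * len(incident["references"])     # links, tickets, wiki pages
        + 5.0 * incident["max_severity"]        # numeric severity (assumed encoding)
        + 1.5 * incident["people_commenting"]
        + 3.0 * incident["escalation_level"]    # how far up the chain it was paged
    )

def pick_root(cluster):
    """When merging related incidents, the highest-scoring one is a reasonable
    candidate for the root of the merged tree."""
    return max(cluster, key=incident_score)
```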
The other thing is the integration of changes. If you have a really large infrastructure where people are making changes, you can use the sum of all these scores to assess how successful a change was: what its description was, what the probable start period was, how efficiently your team is actually executing changes. That also shows up on this incident, on the same page. And if you had a fault in your infrastructure and did something like an RCA on it, then every time there's an issue with a similar signature in that reference frame, all the RCAs show up.

What this ends up doing is that tomorrow, when I wake up in the middle of the night, or I'm in the middle of a party and I get an alert and open this, I immediately get the complete context of the system as a situation. I don't have to look through discrete sources and discrete streams to try and make sense of it; the situation is presented to me and I just need to act on it. That's a very powerful thing for us. Certain teams have actually stopped looking at Nagios because of this.

What else can you do with it? I talked about going from 8 to 28 dimensions and adding mathematical functions. So I can actually draw this issue map using an arbitrary Lucene query again. It could be anything from a server group to a host group, to incidents that had certain people involved in them, to just a simple server and a service group; it's a Lucene query again. The colours of these circles tell me what is currently happening in a certain data centre I'm running, what the hottest issues are that I need to look at, what the most recurring issues are that keep eating people's time on a repetitive basis, the ones you'd probably want to get rid of if you could, and the priority-ignored issues. At times you have issues, S3 and S4 issues, that you want to fix but never really do, because they're not harming your production traffic in any way. Those go down here, and you can just look at this dashboard when you have time and clean them up.

One more important thing that comes out of tracking all this data is that, as a user, when I go to the dashboard I can track things like which issues interrupted my sleep, how many times I was interrupted last week versus this week, and what actually caused me to lose sleep. That probably troubles every person on call more than any other issue in the job, so I think it's really important to get it out of the way.

The next is statistics. I define certain criteria, not business SLA criteria as such, but criteria about how my team should perform: what severity of issue should be solved within what period of time. With this I can detect how many occurrences of a breach happened in a week, and similarly we can do weekly downtimes, MTTR statistics and a host of other things. And that's the interesting bit: when you click on any of these graph lines, you actually get a breach offender list or a downtime offender list with the issues that caused it. They might still be open; in most cases they're usually closed, they've been solved.
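Here's a sketch of how the weekly breach counts and MTTR figures described above might be computed from closed incidents; the per-severity targets and field names are assumptions, not our actual criteria.

```python
from collections import Counter

# Per-severity resolution targets in minutes -- illustrative numbers, not
# business SLAs, in the spirit of "what severity should be solved in what time".
TARGET_MINUTES = {"S1": 30, "S2": 120, "S3": 24 * 60, "S4": 7 * 24 * 60}

def weekly_breaches(closed_incidents):
    """Count, per ISO week, incidents that overran their severity's target.

    Each incident is assumed to carry datetime fields `opened_at`/`closed_at`
    and a `severity` key; the returned offenders list is the "breach offender
    list" shown behind each point on the graph.
    """
    breaches, offenders = Counter(), []
    for inc in closed_incidents:
        duration_min = (inc["closed_at"] - inc["opened_at"]).total_seconds() / 60
        if duration_min > TARGET_MINUTES[inc["severity"]]:
            breaches[inc["opened_at"].isocalendar()[1]] += 1
            offenders.append(inc)
    return breaches, offenders

def mttr_minutes(closed_incidents):
    """Mean time to resolution, in minutes, across a set of closed incidents."""
    durations = [(i["closed_at"] - i["opened_at"]).total_seconds() / 60
                 for i in closed_incidents]
    return sum(durations) / len(durations) if durations else 0.0
```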
But what that helps you do is it tells you what you should probably be looking at in your infrastructure, and that's super important, because those little penalties add up and mess with your quality of service, mess with what you're trying to do. So these things are really important, and there's a host of other statistics alongside.

So yeah, that is how we see monitoring, and I'd like to keep some time for questions, because I think this is a fairly new approach. So go for it.

So it wasn't clear to me: are you offering this as a project, a service, a product, open source, a business model?

It's still an internal tool. The talk was about the idea we built our tool around.

Understood. Okay, cool, thank you, great talk. Just wanted to know: when you say you link your monitoring with issues that were reported earlier, what's the issue-tracking tool you use? Do you have JIRA or something in the back end that you link this up with?

No, we've written the whole platform ourselves, so everything happens in the platform. The issue tracking for our customer-service teams, I think, uses RT or JIRA; they do API calls, I'm not really sure what they use. It's a platform; you can pump anything into it.

Cool. And I understand this is an internal tool, but it would be really helpful, and I came to the talk just to get the idea of how you're looking at monitoring since, as you said, it's a new approach, if you had a demo with dummy data set up somewhere so people can check it out while the tool itself stays internal.

Yeah, that might be interesting.

So you've gathered a lot of data regarding services and all those things, and you're able to visualize the entire situation surrounding an event. Are you planning to put any auto-remediation or auto-healing kind of stuff in place?

Potentially we could. What happens today is we're sending out alerts, pages or messages to endpoints. Since it's a pluggable system, I could plug in an API: at the end of the day, sending an SMS is basically calling an API endpoint, and I could add another API endpoint. But we as a team don't really handle that, because in a company with so many businesses and so many teams doing so many things, there's no good way for us to build a system that auto-heals stuff you've blown up, because we probably don't know what you're running out there. But we could provide an API endpoint service that you could subscribe to and then hook into this system. So that is something that could be done.

Hi, I think we could do with a little more on the internals of the service, because right now you've essentially arrived at this utopian point where you get context-based alerts, and I personally have nothing I can absorb from this that would take my alerting system from point A to point B. Can you talk a little more about the internals?

Okay, let's see what I can say about that. We really love the UNIX philosophy. We're not a monitoring system; that's what I opened my talk with, we built a meta-monitoring system. And I think Bird would agree with me that...
I mean, he would testify to the fact that building a monitoring system is an extremely complex task. There are a bunch of really good monitoring systems out there already, and people can write a simple request into their code that becomes a source of monitoring. We take that data and process it into these constructs. So I expect you to give me an event that has a minimal set of state: a service, a host name and a timestamp. I need those three things to build on it, enrich my data, send it down the pipeline and construct these views.

The other interesting thing is that, because of the way we've implemented it, we still have the immutable facts. So we could re-run our engine from the beginning of time with a completely new algorithm, a completely new bunch of things, and end up with a new state; it's just a materialized view. I'd end up with a new state that I could then query or do whatever with. The back end is Elasticsearch, with a lot of scripting into Elasticsearch. The pipeline is completely async: standard broker queues and processors.

Potentially, yes, that would happen. That would be the case where the network is the thing that failed first, a lot of people get alerted and start looking into it, and your aggregations collapse into that network layer. But you do get it wrong at times; there is room for improvement there, you could define...

The other interesting thing we do with aggregations: in most systems you define an aggregation in very specific terms, such that if this particular bunch of checks fire together, then aggregate them. Since we've taken this approach of arbitrary queries, we can actually write an aggregation that says: an S2 alert in this group, and the count of those alerts is three in the last half hour, and I have two more alerts from this other group in the last hour, and these query windows overlap, so combine them all together. You don't think of the aggregation in terms of the check; you think of the aggregation in terms of the situation.

If you're talking about just the monitoring part of it, if you're using Icinga 2, you have service dependencies that you can create. So if my database goes down, I really don't need to be alerted for my app checks, because my database is down. You say that this service depends on that one, and Icinga 2 and all those tools do that, so you can already have this in place if you're using those monitoring solutions.

As I said, you could fire an arbitrary string into it, like that test example earlier, and it would still work; it's arbitrary strings. It could be a curl request; there are actually people using cron jobs with curl requests as a monitoring source. So that also works.

I think that's it. Cool. Thank you guys.
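To make the situation-level aggregation described in that last answer concrete, here's a minimal sketch; the thresholds and group names are invented, and the predicates are plain Python callables standing in for what would really be arbitrary Lucene queries evaluated by the back end.

```python
import time

def count_matching(events, predicate, window_seconds, now=None):
    """Count events satisfying `predicate` within a trailing time window.

    Predicates are plain callables here for the sketch; in the system
    described they would be arbitrary Lucene queries run against the
    event store.
    """
    now = now if now is not None else time.time()
    return sum(1 for e in events
               if predicate(e) and now - e["timestamp"] <= window_seconds)

def situation_aggregation(events, now=None):
    """Aggregation defined over the situation, not over a specific check:
    'at least three S2 alerts from group A in the last 30 minutes AND at
    least two alerts from group B in the last hour'. Thresholds, group
    names and field names are invented for illustration."""
    group_a_s2 = count_matching(
        events, lambda e: e.get("group") == "A" and e.get("severity") == "S2",
        30 * 60, now)
    group_b_any = count_matching(
        events, lambda e: e.get("group") == "B", 60 * 60, now)
    return group_a_s2 >= 3 and group_b_any >= 2
```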