Okay, guys, so I'm here to bore you for the next 15 minutes. Please bear with me. My name is Cyrus. I've been working with Linux and programming for over a decade now. Feedback. All right. So I'm a senior tech lead at InMobi, and this is usually how I code. Or not.

So why are we here? Most of you guys here either belong to operations, or dev, or DevOps, or engineering, or have something or other to do with production infrastructure. So you've got everything settled down. You have your best monitoring systems, your best monitoring scripts, cron jobs. You have bought a lot of incident management systems. You have a NOC looking after everything. But when it rains, it pours. When things go bad, they go bad so ugly that either you think of quitting your job or you just shut down the damn product. Why? Because we have noise. We have noise in our monitoring systems.

Imagine a service going down. It will have repercussions. The node alarms will come in, the memory checks will fail, anything depending on the service will fail. So you'll have a thousand-odd, a hundred-odd alarms for that particular subsystem, and then for anything which hooks on to it as well. And beneath all that would be your actual alarms from other subsystems. And if that's not enough, you'll always have that one guy who forgot to put the monitoring system in maintenance mode. That kind of sucks. Yeah.

So what leads to this problem? As a company grows, your infrastructure grows. With growing infrastructure, you need to fine-tune your monitoring system. You need to fine-tune how your alerts go in and what the thresholds are. You need to make sure that your team follows good discipline as to when notifications should be disabled and when they have to be re-enabled again. And none of it should be manual. You should make sure that it's all automated, seamless, and does not affect how you scale. But unfortunately, you know, we have your Dockers, your LXCs, your cgroups.
You've got your whatnot. We still use Nagios. Nagios was made by one guy a bazillion years ago. We probably still use cron jobs to check if the status fails or my DB connection is down, and we continue to do that even as we scale.

There are things that help you. There are a lot of things that help you, actually, but most of them are either, what do you call it, a monitoring system, or a small sort of event actioner on top of the monitoring system. Let's say you have CloudWatch, right? CloudWatch says, okay, if your alert fails, you invoke SNS with this. Your monitor says, if my service goes down, do this with it. But we are stuck with the way that application is coded; we have to just live with that application. With CloudWatch, you have to live with CloudWatch. You have to pay for every alert. PagerDuty is good for notifying or waking you guys up in the middle of the night. Other than that, you usually don't go to the PagerDuty dashboards to see your system health status. And then obviously you just call someone and say, restart the server.

So what can really help? Imagine a system that allows you to define an event and take an action against that event. While doing so, you can hook up all your monitoring systems and say, fire your alerts to this event. And you can have the system invoke scripts, or pray to God, or whatever you want, via plugins. If you have that with the utmost simplicity, then you can keep using your Nagios, your CloudWatch, your cron jobs, and you don't need a dedicated, awesome monitoring system which can do it for you. You can just use whatever you are using, plug it in with the system, and invoke your scripts. I have that for you. It's called Cito Engine.

A very brief overview. The engine is extremely, extremely simple. It has an event listener, which listens to REST-based API calls. The events go to a queue.
There's a poller which consumes the queue. There's an engine which takes a decisive action based on the settings configured for that event, and it calls the plugin corresponding to that event. As simple as that. If you look closely, it just makes sense, right? You have an alert, and you want to do something with it only and only if a certain condition is met. And then you can do whatever you want: invoke PagerDuty and wake some guy up, run a script and restart your service, create a JIRA ticket, call up Amazon and scale up your system, scale down your system, attach an EBS volume.

How it works. Define an event. An event definition, if you guys are aware of Nagios, is similar to a service definition, where you say, okay, let me define an event which says my disk is at 75%. Let me give it a severity, assign it to a team, put it in a category, keep it enabled, and let me do something against it. What is this "something against it" part? We would invoke an action. Out here I have a sample event action. This one invokes the tmpwatch plugin on the plugin server if the threshold of that event is met: twice in 60 seconds. These are the parameters which I want to send to the plugin. It's pretty basic.

What is a plugin? A plugin is an extension to Cito Engine where you simply hook up your scripts and let them listen to a RESTful API. As simple as that. Let's say you have a JIRA API script. You say, if I receive a call on the JIRA hook, create a JIRA ticket for me. The plugin server also allows you to bind each plugin to access keys, so you know who can invoke which plugin. You don't want accidental restarts of a service.

Sorry. So this is how your plugin is defined in the system. The plugin server is attached in Cito Engine, and your active plugins show up. An incident comes in.
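The "twice in 60 seconds" threshold from the sample event action can be sketched as a sliding-window counter. This is only a minimal illustration under assumptions: the class name `ThresholdRule` and its `record` method are hypothetical and are not taken from the engine's code.

```python
from collections import deque


class ThresholdRule:
    """Hypothetical rule: fire when `count` occurrences arrive within `window_seconds`."""

    def __init__(self, count, window_seconds):
        self.count = count
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp):
        """Record one occurrence (in seconds); return True if the threshold is met."""
        self._timestamps.append(timestamp)
        # Drop occurrences that have fallen out of the time window.
        while timestamp - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.count


rule = ThresholdRule(count=2, window_seconds=60)
first = rule.record(0)    # one occurrence so far, no action yet
second = rule.record(30)  # second occurrence inside the window, action due
```

Keeping only the timestamps inside the window means the rule needs no history beyond the last `window_seconds`, which fits the "X events in Y seconds" style of rule described in the talk.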
So this is your alert. This is the final thing which comes in when an incident arrives on an event. You have your host name attached to it. It says there are five occurrences of this incident. The event actions that took place are logged to it. These are sample ones: it logged that it invoked this plugin, it invoked the tmpwatch plugin, and there were no return codes, so everything worked hunky-dory. No issues at all. You can acknowledge the incident, so the engine will stop taking actions against it and will just silently append to it as and when required. If you close it, any new incident coming in would create a new action item again, and the workflow starts again.

How do I send events? Three simple JSON members: the event ID, the message, and the element. The element could be a host name, it could be a message, it could be just about anything that gives the incident a unique key combination of the event and the element name.

We give you dashboards. This is the generic dashboard, by the way. It shows you how many active events there are, how many are acknowledged, and how many got cleared in the last 24 hours. You can view the incident details, as I showed you in the previous screen, or you can ack it right on the screen itself. We also offer team-based dashboards. Remember how I defined a team earlier? You can have these team-based dashboards showing up on one of your monitors up in the corner. You can continuously keep looking at them and kind of judge what's going on where. It shows you active versus acknowledged incidents.

But then your boss says, I want reports. We have reports. We've got tons of reports. You've even got CSV reports, for what it's worth.
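The way alerts fold into incidents, keyed by the event ID plus the element and carrying an occurrence count, can be sketched roughly as follows. The field names (`event_id`, `element`, `message`) and the `IncidentStore` class are assumptions for illustration; the talk names the three members but the transcript does not show their exact spelling.

```python
class IncidentStore:
    """Hypothetical store that groups alerts into incidents by (event_id, element)."""

    def __init__(self):
        self._open = {}

    def ingest(self, alert):
        """Add an alert to its open incident, creating the incident if needed."""
        key = (alert["event_id"], alert["element"])
        incident = self._open.setdefault(
            key, {"occurrences": 0, "last_message": None, "acknowledged": False}
        )
        incident["occurrences"] += 1
        incident["last_message"] = alert["message"]
        return incident

    def acknowledge(self, event_id, element):
        """Mark an incident acknowledged; new alerts still append to it."""
        self._open[(event_id, element)]["acknowledged"] = True

    def close(self, event_id, element):
        """Close an incident so the next matching alert opens a fresh one."""
        self._open.pop((event_id, element), None)


store = IncidentStore()
alert = {"event_id": 6, "element": "web01", "message": "Disk usage above 75%"}
store.ingest(alert)
repeat = store.ingest(alert)                       # same key: occurrence count grows
other = store.ingest({**alert, "element": "web02"})  # different element: new incident
store.close(6, "web01")
reopened = store.ingest(alert)                     # after closing, counting starts over
```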
We've got ones which show how many incidents got fired, which event got fired the most, the time to resolution, which team has how many events, and so on.

So who would need it? Preferably anyone who wants to manage a production or a development infrastructure. If you want an infrastructure to be monitored, and you want that monitoring to hook up to a decent centralized place so you can take actions against those incidents. It could be a NOC, it could be ops teams, it could be your QA team, who can hook up their Jenkins calls to this and maybe take actions against them. So the possibilities are literally endless.

This is made primarily using Python. I'm still testing it out with RabbitMQ, and MariaDB and PostgreSQL should be out in a couple of weeks, I'd say. It is completely open source, Apache 2. You guys are welcome to download it and use it. And that's the last slide.

That was a crisp talk. Any questions?

I'm still testing it out. Actually, I released the code last night. Last-minute coding. Yeah, so by the way, this is something which I've been working on in my personal hours and weekends, hence the delay. I know in the Funnel I had promised April first, and then you could throw a shoe at me. And they said, April first is gone, you fooled me. But yeah, it's out. We need guys to work on it. Yes?

Django?

Yeah, Django. Yeah, good.

So the plugins: how exactly do you write one? In a shell script, or do you write it as a Python file, and then...?

So there is no limitation to what you can do. If it's an executable binary, it's a plugin.

So it has a DSL?

Sorry? It has a... no, no, no, absolutely not. You can use tmpwatch, your /usr/bin/tmpwatch, right? You can add that as a plugin.

No, I mean to say, the new plugin which I write, what sort of DSL does it...?

It does not require any DSL. All you have to do is take your executable, dump it in a directory, and link it here. You just say that my executable is here and it takes these parameters. That's it. I don't want a DSL.
I'm not telling you how to write your plugins. You can write them as and how you want. All you need to do, okay, is when you call this plugin, you need to know the parameters that you're going to pass to it. So let's say you're calling uname, and the parameter you're going to pass is -a, right? So when you invoke the action, you say in the parameters, pass -a to this plugin.

So it uses commands.py for doing the command-line execution?

No, no, no, it uses subprocess.

So can you go back to the slide where you were showing the JSON?

The JSON? All right.

So I want to understand. Typically an event requires you to monitor some things, and in the JSON that you are showing you have the concept of an element, right? How is that enough to generate an event? Just take the example that you mentioned. Let's say your disk is an element, right? And you want to generate an event when 70% of your disk is full. How do you define this?

So whom do you generate this alert against? You generate it against a host. The disk belongs to a host; your host is the element. All right? Okay. In the engine you have a unique event ID defined, like event ID 6 over here. Event ID 6 stands for any disk alarm. Then you need to attach something else to it to say that now this is a unique key. That something else is the host name here. For what it's worth, you could say /dev/xvda1 is full, or whatever, and attach that as an element. It does not stop you from creating any correlation with the event ID itself.

What I'm trying to connect is this: you have the concept of this event, which, let's say, is enough to define what an event is, but then there is a concept of, I think what you're calling a plugin, which has the intelligence of generating that event. How are you tying these two things together?

Sorry?
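The plugin model described in the talk (any executable plus the parameters to pass it, run through subprocess) can be sketched like this. The `run_plugin` helper is hypothetical, and the current Python interpreter stands in for a real plugin so the example runs anywhere; in practice the executable might be something like uname with the parameter -a.

```python
import subprocess
import sys


def run_plugin(executable, parameters, timeout=30):
    """Run a plugin executable with its parameters; return (exit code, output).

    Hypothetical helper for illustration, not the engine's own code.
    """
    completed = subprocess.run(
        [executable, *parameters],  # argument list, so no shell quoting issues
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return completed.returncode, completed.stdout.strip()


# Stand-in plugin: a one-line Python program that echoes its first argument.
echo_plugin = ["-c", "import sys; print(sys.argv[1])"]
code, output = run_plugin(sys.executable, echo_plugin + ["-a"])
```

The exit code is what an engine could log against the incident as the result of the action.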
So the concept is slightly off. The plugin is the final action you take against an incident coming in. The definition of what you're going to do when something happens is in the event action. The plugin is the final thing which you call, with a bunch of parameters, when an event reaches a specific threshold. Your disk getting full is an event, right? So it's an incident; you hook it up with this event. The engine then says, okay, I got event ID 6 with this disk alert. It has now come to me twice in 60 seconds. Go ahead and invoke this plugin. Right?

So I think my question now is, how do you define the incident as part of your system?

How do I define that? You just define an event. An incident is something which inherits the event. The event is a definition; it's an alert definition, like the service definition you have in Nagios, and then you have your groups coming in. Then you throw an incident. The incident is actually similar to what an alert is. I call it an incident because it's very much the ITIL thingy.

Thank you.

Hello. Yes. Yeah, so what is the most complex rule that you can put in to take an action?

Good question. Very, very good question. At this point of time it is just X events in Y seconds, okay? But I'm rigorously trying to come up with more complicated if-this-then-that kind of rules, which will come.

And one major thing. How does this scale?
It scales horizontally. The components which I showed you in the architecture slide are all horizontally scalable.

No, how do your rules scale? The rules processor that you have, that actually does this analysis, as you get to more complicated rules and matches...

Mm-hmm.

What would be the processing model that you propose for your...?

I'm not there yet. As I said, you know, my rules engine is extremely simple right now. Okay? I'm going to expand on it. The code is modularized enough that I can compartmentalize the decision-making engine: put it in a different thread, put it on a different box, send an incident for decision-making to a separate grid. So it's possible, design-wise, I'll say. But that's a good question.

Does this run as an agent on all the boxes, or no agents at all?

No agents at all.

So can you go back to the plugins?

Okay. This guy?

Yeah. Here you mentioned one of the plugins is restart Tomcat. How would that happen exactly? How would you reach...?

How do you inform your NOC guys to restart a server? Either you have a jump box, or you tell them to log into a box and restart it. The plugin server sits in, let's say, your particular division's subnet, and you write your script or invoke the binary which can do it remotely for you. The idea is to keep the plugin server isolated enough that every department can have its own plugin server, for what it's worth. So you have isolation there, and further down you have your API keys linked with the plugins. So let's say you are only allowed to run the restart Tomcat plugin. I'll give you a key which can only invoke this plugin, and you just take that key, put it in the engine once, and have all your incidents fire against it.

I think we're just running short of time. It's a very interesting question; I would like to take it otherwise, do a one-on-one. Oh, you have a question, I think.

Hi.

Hi.

So can we push data from Nagios or Sensu to this?

Yes, absolutely.
So I'll give a very simple scenario of how to hook up Nagios. Nagios allows you custom variables in your definitions. You just define an underscore event ID, X, and we have plugins, say hooks to Nagios, which are released. You can just invoke that command when you define a service. The basic example is that the default command which gets called is notify-by-email; it sends you the Nagios notification. So there'll be another command, just an executable, which gets invoked whenever Nagios wants to throw an alert, and that alert will go to the system.

So it can be integrated with any monitoring system, like CloudWatch?

Exactly. That is the point. You have CloudWatch, Sensu, Zenoss, you name it, even your cron jobs. You can integrate anything inbound and anything outbound. You have absolutely no limitations at all. That's why it's so simple. That's why it does not have a complicated if-this-then-that module in it, because what I need at this moment is: if I have X number of alerts, at least do this action. And as and when we get good feedback and good logic around it, we can start making things complicated. Yes, sir?

Yes. So... yes. Things would get complicated, that's true. But I have not designed that yet. I have yet to design that system. Maybe you could put in a Git issue and then we could discuss it on GitHub.

Round of applause for Cyrus.

Thank you, guys. Thank you for listening. And it's available on GitHub starting yesterday.

Okay, awesome. Thanks.
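A notification command or cron job of the kind described above, one that forwards an alert to the event listener over HTTP, might look roughly like the sketch below. The listener URL and the JSON field names are assumptions made for illustration only; the real endpoint and member names should be taken from the project's own documentation. The request is built but deliberately not sent.

```python
import json
import urllib.request

# Hypothetical listener address, for illustration only.
LISTENER_URL = "http://localhost:8080/events"


def build_event_request(event_id, element, message, url=LISTENER_URL):
    """Build (but do not send) an HTTP POST carrying the three event members."""
    body = json.dumps(
        {"event_id": event_id, "element": element, "message": message}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_event_request(6, "web01", "Disk usage above 75%")

# A monitoring hook would then send it, for example:
# urllib.request.urlopen(request, timeout=5)
```

Because the hook is just an executable that takes an event ID, an element, and a message, the same script could be called from Nagios, a cron job, or any other monitoring tool.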