Okay, let's see — who am I? I've been around a bit: I've worked at quite a few reasonably decent companies and contributed to quite a few open source projects; I'm probably best known for Xfce. I currently work at Directi, and I'm going to shamelessly plug the place. Here's the logo — it's a nice logo, come work for us. Well, no, beyond the logo, it is a nice place to work. You get interesting people like me, and probably people more interesting than me.

Shameless plug aside, let's get to the meat of this thing. In the run-up to this talk I caught up with a bunch of people, and they all said the same thing: "I can't find any information about this thing your talk is about." And it's a fair point, because it's vaporware — I haven't released it yet. Varanid is an alerting system that I'm writing. There were certain things I didn't like about the existing tools — I'll get to those — so I started writing this. I was actually hoping to get to a 0.1 release before this conference; this was supposed to be a launch talk. Unfortunately, real life got in the way, so I don't have code to release right now and nothing to demo. I hope you'll forgive me for that, because I think the ideas in it are worth discussing, and I think there's a lot of potential there — naturally, since I'm writing it. What I'm looking for here is a little validation of my ideas, some feedback, and a sense of whether there's any real interest. Obviously, interest from other people would give me motivation beyond just scratching my own itch, solving my own problems. For those of you who don't get the Duke Nukem reference: Duke Nukem Forever is the most famous piece of vaporware ever — they talked about it for a good ten years before it came out. I hope to get it together rather sooner than ten years. Maybe a lot sooner.

As the title slide says: Varanid — distributed, resilient, scalable alerting. I tried to fit in more buzzwords, but the slide was only that big. So, to get to the point: there are already tools in this space, and the tool that owns the space is Nagios. So what's the problem with Nagios? Why not just use that? The fact of the matter is that very serious organizations use Nagios. Yahoo does — that I know for a fact. I believe Amazon uses it. I was just talking to a guy from Flipkart; they use it. Pretty much anybody who has an alerting problem uses Nagios.

So what's the problem with Nagios? A quick search for the hashtag #monitoringsucks will show you — "monitoring sucks" has become a movement of its own over the last year or eighteen months. People are just so sick and tired of the various monitoring tools, and one of the focuses of their irritation is Nagios. Now, I'd be the first to say that Nagios is functional: it does the job, and people use it for very complex things. But it is not a nice tool to work with. To start with, its configuration language is verbose and painful. And it's not oriented towards a lot of things — in particular, it's not oriented towards being generated.
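To make the verbosity concrete, here's roughly what monitoring a single HTTP service on a single host looks like in Nagios's own object syntax (illustrative values; real setups pile on templates, contact groups, escalations, and more):

    define host {
        use        linux-server
        host_name  web01
        alias      Web Server 01
        address    10.0.0.11
    }

    define service {
        use                  generic-service
        host_name            web01
        service_description  HTTP
        check_command        check_http
    }

Multiply that by every host and every service and the volume adds up fast.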
So even a normal-sized deployment can very easily run to hundreds of thousands of lines of configuration. (Ah — I thought I was getting censored there; God doesn't like what I'm talking about.) So, what was I saying? Yes — the syntax in general. Quite often these days, the configuration is actually generated by a configuration management system, because people want their single source of truth to be the configuration management system — some kind of host database, or Puppet, or Chef, or something like that. And honestly, a lot of the structure of the configuration is painful to generate, because of the syntax and the fact that it's centralized. Some time back at Directi, we had a problem where generating the Nagios config out of Puppet was taking half an hour, forty-five minutes, just to compile all the information we had. We were able to fix that by moving to a better backend, but the sheer amount of stuff that needed to be generated was one of the causes.

So what are the other problems, beyond the configuration syntax? Sendmail, for example, has had a horrible syntax for the last thirty years and people are still using it — it's possible to work around that. But that's not the only problem. Redundancy in Nagios is pretty hacky. If you want redundant Nagios, the official mechanism is to set up two identical instances, make sure the config goes to both of them, and fail over using some kind of third-party IP-failover tool. And that's hacky, because a certain amount of Nagios state lives only in memory — so when you fail over like that, that state is lost. There are ways to get around that; people have built mechanisms for it. But it's ugly, it's hacky, it's more complex than it needs to be.

Similarly, load balancing. I'm not going into the details of how to load-balance Nagios, but you can think of it as akin to sharding a database: ahead of time, you divide up your network and assign the parts to your various servers. That's not very flexible, because in a dynamic, cloud-based environment you may not be able to partition ahead of time — the load may grow on one side or the other. So your load balancing is very static. Yes, you can work around that, but again: pain. This is a theme you're going to see again and again — it's possible to do a lot of things with Nagios; it is not always easy.

Finally, Nagios itself. Taking off my sysadmin hat and putting on my developer hat: the architecture of Nagios is all over the place, is all I can say. It was designed in the 90s for the problems of the 90s, and everything since then has been layered on top and patched in. A typical Nagios setup now has stuff written in C and C++, with Perl, shell, and God knows what else mixed in. Every architectural fix that was needed has been layered on in a different way. It is complicated.
And the various elements of the solution have different sets of dependencies. You've got NSCA and NRPE as different ways of getting checks executed; Gearman as a way of distributing them — all of these are add-ons layered on top. Similarly, people have done rewrites for different reasons — new UIs to add usability while trying to stay backwards compatible — and quite often you end up using a combination of those just to get a reasonable setup. It's reached the point where there's an entire Linux distribution — OMD, the Open Monitoring Distribution — whose purpose is just to help you get Nagios set up. For God's sake, this is something that just runs checks and gets the results back. It's not a trivial problem, but it should not be this complicated. That's the problem I have with Nagios.

The motivation for this came from a very specific project. A few months ago I was working — with some members of this audience — on moving an application from a data center onto AWS. For various reasons, we needed the dynamic nature of Amazon. One of the things we learned as we studied Amazon, and the kinds of architectural decisions it forces on you, is that you have to plan for failure. Amazon is not a highly reliable environment; you have to architect your solution to deal with failure as it comes, because you are going to see a lot of it. But doing that with Nagios is not easy, as I said: redundancy is difficult, load balancing is difficult. For example, how do you handle it if the monitoring node itself goes down? We could do the redundancy dance I was just describing — in fact, we were probably planning to do that anyway — but it doesn't leave you feeling very comfortable.

Similarly, on cloud infrastructure you're typically adding and removing application nodes and database nodes dynamically. And every time you do that, you have to send a SIGHUP to Nagios and make it reload. Come on — that is so 90s. I really don't want to do that. Ideally, like the rest of the problem of deploying to the cloud, you want everything to be node-driven: a node comes up, it discovers its function, provisions itself, and starts pushing out metrics and alerts as needed. It shouldn't be driven from a central point — the centralized nature of Nagios is itself something I have issues with.

So those are the reasons I thought of writing this. When I started thinking about the problem — and it's only been a few months; I started writing code fairly recently — I set down some principles. First, simplicity. I'm not that smart a guy: the simpler I keep it, the easier it is to manage. And I think that's true of most of us — we've all got a whole pile of things to deal with, so let's keep our tools as simple as possible, from both the operations and the development perspectives. Second, be extensible. Just look at the history of Nagios and how hacky it's been to add on to it.
From day one, you should be ready to add new functionality; don't hard-limit the functionality early on. Next, be configuration-management friendly. Everybody's using configuration management of some kind — Chef, Puppet, CFEngine, what have you — so be friendly to that: let the config be pushed out from there; don't try to build your own. No web-based configuration of alerting, no way — that's not configuration-management friendly, and it's not scalable at all. Next, provide mechanism, not policy. What does that mean? Don't take too many decisions about how the system should work, because every deployment is different. The people running a deployment will have better ideas about what they need than me sitting up here trying to invent a solution that works for everybody. So make as few decisions as possible; make things as flexible as possible without making them a pain to configure. And finally, peer to peer. Actually — what does that mean? I struggled with that particular line. I think it ties back to simplicity: don't have too many moving parts; try to reduce them as much as possible. And don't have too many code bases out there — ideally, one code base flexible enough to play all the different roles.

I got inspired by a bunch of things. This interplay of being flexible and being simple comes a lot from MCollective, if any of you have used that. I've also been inspired by things like Sensu and Monkly — they're probably not household names, but they've definitely got a lot of buzz, especially Sensu. Why I didn't just go with Sensu is probably out of scope for this talk; we can take it up later, those of you who are familiar with it.

I wanted to draw diagrams, but I'm not very good at drawing diagrams, so I'll go with text. What's the architecture of Varanid? Essentially, on each node we run two main processes: a scheduler and a pipeline. The scheduler is fairly straightforward: from the configuration it figures out which checks need to be executed, runs them on schedule, gets the results back, and sends them on to the pipeline. The pipeline receives these check results and processes them through a series of what I'm calling policy modules. A policy module is a thing that sees a check result, takes a decision, and does something with it.

What kinds of things can a policy module do? It could modify a check result. For example, say I get a critical error from one node of a 500-node cluster. In the bigger scheme of things, that's actually not very critical, so maybe you have a policy module that downgrades the criticality. That's one use case for modification; there could be other reasons — maybe you'd want to increase the criticality of something, maybe you'd want to fold more information into it. Not sure; we'll figure that out in time. That's what I keep saying — I don't want to take too many decisions up front. A policy module could also delete a result. Say you've acknowledged an error: it's known — okay, I've seen this problem, I don't want any more alerts for it, I've acknowledged it.
Then, for as long as that acknowledgement is valid, the module can just delete the same kind of error as it comes through. A policy module could also create further results. Go back to the cluster situation: instead of modifying an existing result, you might want to push it through untouched and generate another kind of notification alongside it. You're getting data in from multiple places, so based on several different alerts you can develop a heuristic that generates a new kind of alert. Say a database is giving me some trouble, and I'm also seeing some disk errors; I might put two and two together and conclude that we have an incipient disk failure coming on. If you can work out a heuristic, shove it in a policy module, and it can create a new check result. And the final thing is running arbitrary code with that data — for example, sending the check result to the pipeline on another box.

Now, my expectation is that on a typical monitored node — a node running some application — you would not have a lot of policy. You're likely to have just a retransmit policy module that collects results and sends them to some central processing node. The pipeline is interesting in that it can be configured to receive check results from anywhere; it can be configured to be completely distributed. Take the analogy of a Git repository: Git does not impose any architecture on you, but in practice, in any normal development setup, there tends to be a central node anyway. And obviously it doesn't make much sense to collect alerts unless you're pushing them to some central location where you can see them. Well — maybe, maybe not: if you don't want a UI, you could just send out notifications directly.

[Inaudible audience question.] Yeah, I'll get to that.

So, as I said, policy modules can run arbitrary code. Let's go on from that. This is the core of the system: anything not concerned with executing checks or routing the results somewhere is either a plugin or an external application. The plugins I'm expecting — which I haven't written yet — are, obviously, transmit, which sends results elsewhere and which I assume will be the most commonly used one; maybe one for flap detection — an alert that keeps going up and down; maybe a logger for logging stuff; things like filtering for acknowledgements or downtimes; maybe email notification — though rather than email, I'd probably just do a PagerDuty integration or something like that.

[Audience: what about job execution?] Job execution — that's something I actually neglected to put in the slides altogether. For the initial release, the plan for job execution is to just execute Nagios plugins. Nagios has a kind of standard for that: a check is just an external executable which returns data in a specific format.
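To illustrate that format: by the Nagios plugin convention, the exit code encodes the status (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) and stdout carries a one-line human-readable message. A check runner could be sketched in Ruby along these lines — the result-hash layout here is my placeholder, not a settled schema:

    require "open3"
    require "time"

    # Nagios plugin convention: the exit code encodes the check status.
    NAGIOS_STATUS = { 0 => "ok", 1 => "warning", 2 => "critical", 3 => "unknown" }

    def run_check(command)
      output, status = Open3.capture2(command)   # run the external plugin
      {
        "command" => command,
        "status"  => NAGIOS_STATUS.fetch(status.exitstatus, "unknown"),
        "output"  => output.strip,               # e.g. "OK - load average: 0.40, 0.32, 0.28"
        "time"    => Time.now.utc.iso8601
      }
    end

    # run_check("/usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6")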
For my initial releases I thought: just use that, so we can reuse the entire existing plugin library. Maybe I'll move away from it later — it has its own icky bits — but we can go with it. It's not a plugin in the strict sense, because these are all external executables, though that will probably become a plugin architecture eventually. So: Nagios-compatible in the first release.

Then, finally, results get pushed to some kind of permanent storage — maybe to a database, or maybe just writing out JSON. Both have their pros and cons; I haven't quite settled it. In fact, we might offer both options for different use cases. Then, obviously, there'll be some kind of UI application, so you can see results and do things like acknowledging problems and setting downtimes. Those are the big elements of the architecture.

Now, a few bits of technology. Configuration is purely JSON. JSON is a format that's pretty easy to generate from scripts, and it's also fairly editable by human beings — if you actually have to go in and edit it, you can. My expectation, of course, is that a lot of the configuration will be auto-generated out of a configuration management system like Puppet or something like that, maybe with some kind of database system at the back. But JSON is a nice common format to meet at.

Communication — internally between processes, as well as between nodes — is over 0MQ. 0MQ is small, it's light, it's fast. It works at a slightly higher level of abstraction, so I don't have to deal with low-level socket nonsense, but it's still reasonably lightweight. For the early releases I'm just going to send a payload of pure JSON over it; I'm still trying to figure out whether that's a problem or not. My thinking at this point is that 0MQ with JSON going over it is sufficiently cross-platform that the various bits and pieces of the system could be written in other languages — if other people want to contribute and have a different preference in tools, or if the problem they want to address is best addressed on a different technology base, they can. I'm currently writing the code in Ruby. I'm comfortable with it, and it's a nice language, but as I said, I'm trying to keep things sufficiently decoupled that later contributors can write stuff in other languages. I don't want to be tied to one.

Okay, so how does this stack up? I mentioned some goals earlier. Granted, right now we're talking about a design that's not fully implemented yet — but how well does the design approach those goals? Simplicity: since the underlying core isn't taking a lot of decisions, it's going to be very, very small. My best estimate right now is that it comes in under a thousand lines of code, probably a lot less. On the vast majority of nodes, there will be just a single agent running, installed with something as simple as a gem install, or perhaps a single RPM install. And it looks fairly simple to manage with some kind of configuration management.
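To give a feel for how small that agent could be, here's a rough sketch of its core loop — read a JSON config that configuration management dropped in place, run each check, and push the result as JSON over a 0MQ socket. Everything here (the config path, the config keys, the ffi-rzmq gem binding) is an illustrative assumption, not a settled decision:

    require "json"
    require "time"
    require "open3"
    require "ffi-rzmq"

    # Example /etc/varanid/agent.json, as Puppet/Chef might generate it:
    #   { "transmit_to": "tcp://head1:5555",
    #     "checks": [ { "name": "load",
    #                   "command": "/usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6",
    #                   "interval": 60 } ] }
    config = JSON.parse(File.read("/etc/varanid/agent.json"))

    ctx    = ZMQ::Context.new
    socket = ctx.socket(ZMQ::PUSH)
    socket.connect(config["transmit_to"])

    STATUS = { 0 => "ok", 1 => "warning", 2 => "critical", 3 => "unknown" }

    loop do
      config["checks"].each do |check|
        out, st = Open3.capture2(check["command"])
        result = { "check"  => check["name"],
                   "status" => STATUS.fetch(st.exitstatus, "unknown"),
                   "output" => out.strip,
                   "time"   => Time.now.utc.iso8601 }
        socket.send_string(result.to_json)
      end
      sleep config["checks"].map { |c| c["interval"] }.min   # crude scheduling
    end

A real scheduler would track per-check intervals properly; this is just the shape of the thing.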
Extensibility, as I said: the aim is defined interfaces on the communication, so that in theory anybody who speaks the same protocol — 0MQ plus JSON — can join in. JSON itself is a very extensible format, and I'm not imposing any expectations on what the contents of the JSON data structure should be. The internal processing is written so that it makes as few assumptions as possible about the contents of the data moving up and down the pipeline — obviously there are certain things that can't be left out, but beyond that, any policy module that needs to add additional stuff can do so. I'm hoping that keeps it fairly extensible, so it can grow as needs change. And as I said, internally almost all serious logic is a plugin, so if after some time there's a completely new kind of architecture we need to live with, I'm hoping we'll be able to adapt and grow along with it.

Configuration-management friendly, as I touched on: it's all JSON text, very easy to manage. As for organization — for example, how does each node know which checks are relevant to it, and how does a notification plugin know which check results to send where? — the organization is tag-based. Tags are, I think, a good balance between flexibility and excessive complexity.
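As a tiny sketch of what tag-based matching could look like — field names assumed, not final — relevance is just the intersection between a result's tags and the tags a module subscribes to:

    # Does this check result concern this module? (illustrative field names)
    def relevant?(result, subscribed_tags)
      (result["tags"].to_a & subscribed_tags).any?
    end

    relevant?({ "tags" => ["db", "dc-east"] }, ["db"])      # => true
    relevant?({ "tags" => ["web"] },           ["db"])      # => false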
Load balancing — let's talk about that problem. In this architecture, the logic is spread out. If you look at a Nagios instance, a lot of its time goes into deciding which checks to execute; here, the scheduling logic is all pushed out to the nodes, and since no single node has that many checks, the per-node overhead is very low. With the work spread out like that, my expectation is that the central nodes should not get heavily loaded. Then there's the matter of analyzing and storing the results: if you're pushing into a database with a web-based frontend, that's a solved problem — we know how to load-balance that. It's not difficult, and I'm assuming it will be fairly trivial.

How do you do redundancy? I figured I needed at least a couple of drawings here, otherwise this would be a complete hand-wave. The very nature of how 0MQ works makes it easy to transmit to multiple nodes, so when you're transmitting data, just push it to multiple collection nodes. Each check result will carry a unique id, so even if duplicates go through the system, they get eliminated when the result is finally persisted. So redundancy shouldn't be too complicated: once the agent itself understands the concept of multiple head nodes — and it's completely peer to peer in this respect — redundancy shouldn't be much of a problem.

The other thing is distributed alerting. Doing multi-data-center monitoring with Nagios — again, there are existing solutions, and they are not very nice. Here, since every node assumes it can send to any other node, this becomes fairly trivial. And since we're talking about a simple JSON data structure, it's easy to start batching updates within the same protocol — there's nothing to stop you sending batch updates across, if, for example, that's the best way to handle the links between data centers. Since the system by its very nature hands data across, multi-DC should not be a problem. We'll have to see how it goes in the real world, of course.

So, now that we've seen what bright ideas I have: when is this crap going to be available for you to use? The core should be done — I was hoping to be done before this conference — in, conservatively, about a month. Then I have to get started on the plugins; since I haven't looked at that problem in depth, I'm not willing to put a timeline on it, but realistically a couple of months. The web view should be fairly trivial — there are lots of frameworks — so that's coming somewhere later on. But what I was trying to do here is validate my ideas, get some feedback, and find out exactly where the big holes in my grand thinking are. I do want to give you stuff to play with, but that will take a little longer. I'll tweet about it as soon as I have a release, and I hope you'll try it out. So: criticisms, feedback, insults — well, not insults, I hope. Yeah, go ahead.

[Audience: how do you generate the unique ids on check results?] Open question. It would have to be done at each node, and I haven't figured out exactly how, but we need something that's actually going to be unique. I think there are existing solutions for that; if not, we'll have to work it out. Then there's the matter of harmonizing them — that's also something I haven't entirely thought through. I'm not sure it becomes a problem until we start doing multiple routing — that is, going through multiple collection nodes and trying to build a fully redundant system — which is probably not something I'll be trying very early on. So yes, it's something that needs to be fixed, but it's probably not my number one problem at this point.
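For what it's worth, a minimal sketch of that duplicate-then-deduplicate idea, with invented endpoints and field names: the node stamps each result with a GUID and pushes an identical copy to every head node, and whatever persists results drops any id it has already seen:

    require "json"
    require "securerandom"
    require "ffi-rzmq"

    HEAD_NODES = ["tcp://head1:5555", "tcp://head2:5555"]

    ctx = ZMQ::Context.new
    sockets = HEAD_NODES.map do |endpoint|
      s = ctx.socket(ZMQ::PUSH)   # one socket per head node, so each gets a copy
      s.connect(endpoint)         # (a single PUSH socket would round-robin instead)
      s
    end

    result = { "id"     => SecureRandom.uuid,   # the deduplication key
               "check"  => "load",
               "status" => "ok" }

    payload = result.to_json
    sockets.each { |s| s.send_string(payload) }

    # On the persistence side, deduplication is then just:
    #   persist(result) unless already_seen?(result["id"])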
[Audience: are these so-called policy modules something you fetch from somewhere?] A policy module, if you look at the actual implementation, is nothing more than a class that implements a particular mixin — a particular interface. The way loading works is: you drop a .rb file with the definition into a particular directory, and it gets loaded automatically. I'm expecting that what people will do is push policy modules out using something like Puppet and drop them in, along with configuration. Since policy modules can modify or add check results, the sequence in which they execute is significant, so there will probably also be configuration for exactly how the modules are stacked — sequenced — on the pipeline. That would be pushed out the same way.

[Audience suggests an existing framework.] It is interesting. I thought it was slightly overkill for what I'm trying to do; if at a later point it looks worthwhile, maybe. I'm trying to keep the core to essentially just calling a callback and getting a result back. I'm not really looking at things like forking — that would be implicit in, say, a transmit module making some decision — and at this point I'm not entirely sure it's needed. Which is why: let's not add more features than I really need. So yes, I looked at it, I thought it was overkill for now, and it might be worth looking at again. I'll check whether it would be okay.

[Audience: a follow-up on the unique ids.] There could be multiple ways of doing this, but yes, you'd probably need something of that kind. As I said, I need to figure out exactly what kind of space I'd need for that GUID.

[Audience: what kind of routing would it be?] The routing — that's why I said I'm keeping the policy side relatively free at this point. The way I'm thinking of it right now, routing is more useful for things like building redundancy. I'm not expecting a situation where certain nodes have certain functions and other nodes have other functions — it's not that node A is going to be the flap detector and node B something else. Now, one of the things the API for policy modules will have is access to whatever persistence system you're using — so it's necessary, obviously, that your persistence system be accessible anywhere a policy module needs it. So no, it's not going to be a completely free-form peer-to-peer thing in that respect, but I think that's part of the API a policy module should have. I don't want to decide too many things in advance — the more I decide in advance, the less flexible the policy model becomes. What I would like is to give as much raw information as possible to the policy module, so it can figure out what it wants, and to give it access to historical information as well. That's something many different policy modules are going to want.
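For concreteness, here's roughly what "nothing more than a class implementing an interface" could look like — two toy policy modules, one that modifies a result and one that deletes it. The method name, the return convention (return the result to pass it on, nil to drop it), and the fields are placeholders, not a final API:

    # Dropped into the policy directory as, say, cluster_noise.rb.

    class DowngradeClusterNoise
      def process(result)
        if result["tags"].to_a.include?("big-cluster") && result["status"] == "critical"
          result["status"] = "warning"   # one node out of 500 is not critical
        end
        result
      end
    end

    class DropAcknowledged
      def initialize(acks)
        @acks = acks                     # e.g. { "web01/load" => expiry_time }
      end

      def process(result)
        key    = "#{result['node']}/#{result['check']}"
        expiry = @acks[key]
        return nil if expiry && Time.now < expiry   # acknowledged: delete it
        result
      end
    end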
History is going to matter. For example, here's a problem when your checks are fully passive, pushed out from the node: if the node suddenly dies, how do you know? You're not going to get any negative results out of it. So there has to be some other way of checking. A couple of ways to do that: have another node monitoring this node — which gets you back to the centralization problem — or have some kind of check on your head node that looks at your persistence layer and asks, when was the last time I got a result from this node? So things like this — being able to look at the history — are going to be important, and history matters for other, analytical reasons too. That's one of the issues I have with Nagios as well: it doesn't persist by default; you have to do extra things to get it to persist.

There is a GitHub repository. It's a little out of sync right now, but I want this to be a fully transparent project — there's no real reason for it not to be; in fact, I've just been lazy, or rather, had a lot of real life interrupting me. But yes, I want to do this in the open. As I said earlier, I've done a fair amount of open source development in the past — I used to describe myself as an open source evangelist — and open is better: I think going in the open will result in a better solution at the end of the day, because people smarter than me will hopefully pitch in.

I think we're out of time. I wanted to put my email address here — I don't know if I did — but anyway: biju.ch at directi.com, send me mail there; biju.chacko at gmail, send me mail there; biju.chacko on Twitter. If all of that fails, catch me in person.

Oh yes — I had a question about the name a little earlier. There is a group of lizards called monitor lizards, and their genus is Varanus. So an instance of monitoring — any one kind of monitor — could, in Latin, be called a varanid. It's a kind of monitoring lizard — a monitor of some sort. A Latin pun; that's what the name means.

[Audience: some species can actually grow to six feet.] Is that so? I didn't know that — I was just punning on the "monitor" part. Maybe we can get them to do some monitoring.