Sorry for the slight delay. Let me give you a heads up: this is not going to be a conventional tech talk. It's my first tech talk. Please come up to the front, because I'm running short of time. If you have any feedback or comments, find me on Facebook, Twitter, or BharatMatrimony, whatever works. So yes, I am Rohit Nayar, and I'll be speaking about Tsuru today. Let's get on with the problem statement.

Think of this: you go into a grocery store, and the store has a rule that you cannot buy groceries without an assistant. You always need an executive with you to help you buy groceries; you cannot do it yourself. So what happens? The executive assists you throughout; he never hands over the responsibility of buying your own stuff. And what if the executive finds some attractive customer and leaves you stranded? He gives his attention to that customer, and you are left waiting for another executive to come by and help you buy your groceries. You are dependent on the executive, and as a consequence, your dinner gets delayed. So what could you possibly do? That's a possibility, but we're not going to go into that. Get rid of the executive and do it yourself.

What I want to stress here is self-service. Developers always need resources, and resources could mean you, the sysadmin, or physical boxes, or time, and so on. What generally happens is that as a sysadmin you are constantly interrupted by a lot of "attractive" work, and the developer is left waiting for some other executive, a sysadmin basically, to come and help him out.
So the developer keeps waiting, and keeps waiting. What if the developer had the ability to gather resources himself, without waiting for the sysadmin to get him set up with everything? We would never have to worry about our attention being elsewhere, and things would stay on schedule.

We had a similar scenario. We have an array of web servers, and our developers are always trying new stuff, so we always have to be with them: assisting, deploying code, sticking around to check that everything is OK. Most of our testing runs on production servers, so we have to be there alongside them to make sure things are properly scheduled and there are no mess-ups. We deploy, we test, and we repeat the procedure every single time.

We realized that for testing and staging we need not invest so much time alongside the developers. We could hand them the controls, let them navigate on their own, and invest our time in something more productive. So we thought: why not Tsuru, an application which helps them take care of their own resources and lets them play around with the stuff they want?

What do we primarily manage? Web servers, CDNs, Redis cache servers, and the like, and our web servers are always scaled horizontally. I'll tell you how Tsuru helps us manage a staging environment.

First, some basics about Tsuru. Tsuru was developed by Globo. It's an open-source PaaS; you can write apps in any programming language of your choice, and you deploy using Git.

This is the architecture. I think the slide is not very clear; is it visible to all of you? Oh, crap. OK, that's fine.
I'll just walk you through it. The laptop you see here is your application developer. The developer contacts the Tsuru server and makes a code deployment; basically, he pushes his code through Git. The Tsuru API, or the Tsuru server rather, then contacts Docker, and Docker spawns as many instances as the application developer requires. So here we are using Docker as the provisioner. You can also run Tsuru with Juju as the provisioner, which manages AWS instances, but we went with Docker because we did not want to use AWS in the first place.

On the top right, beneath the cloud, you can see an HTTP router. That's Hipache, a software load balancer. All requests hit the HTTP router and get load-balanced among all your Docker nodes. We also have a Gandalf server, which manages all your Git repos.

Is everybody well-versed with Docker? OK, let me explain what Docker is about. You can consider Docker a wrapper around LXC, and LXC is a kernel feature available since 2.6.24. It doesn't have a hypervisor. A hypervisor resides between your kernel and your hardware; instead, we have cgroups, a built-in kernel functionality, and we create containers that share characteristics provided by the kernel. There's no hypervisor, and that makes it really fast. That's Docker in a nutshell. In case you still want to know what Docker is, please Google for Mithun Brain Transplant, and you'll probably understand.
So the required entities here are Hipache, the HTTP software load balancer I mentioned; Gandalf, for managing Git repos; Redis, which Hipache uses for its route mapping; and MongoDB, which Tsuru uses for its internal bookkeeping. The thing is, the Docker nodes that come up are ephemeral in nature: as soon as a new Docker instance is spawned, it's allotted a new IP. So there has to be some kind of mapping between the app and those IPs, and Redis provides it.

Now the demo; let me see if I have it. OK, just hold on. We are going to deploy a basic WordPress application via Tsuru. The recording on the right-hand side was made by my friend Kalyan; you can see Facebook and other tabs open, that's his window.

So we create a WordPress application. Then we unzip everything into the root directory, initialize the Git repo, and push all of it to the Tsuru server. As soon as you commit and push, Tsuru contacts Docker, and Docker starts spawning instances. This is when the magic starts. Now we list the applications, and you can see the one application that has been set up. There's an error, because there is no MySQL service running, so we will bind the Docker instances to a MySQL service. We need a requirements.apt file where we mention everything that is required, and then we push again. Tsuru deletes the old Docker instances and respawns new ones. Then we export the database settings to the Docker web servers; this is how we bind a service to the WordPress application. All of this is done via the Docker instances.
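The Redis mapping just described can be sketched in a few lines. This is a minimal illustration of the idea, not Hipache's actual implementation: the hostname, ports, and container IPs are invented for the example, and a plain dict with a round-robin iterator stands in for Redis.

```python
from itertools import cycle

# Hypothetical route table, standing in for the Redis keys Hipache reads.
# Each frontend hostname maps to the ephemeral IPs of its Docker units.
routes = {
    "wordpress.example.com": ["172.17.0.2:8888", "172.17.0.3:8888", "172.17.0.4:8888"],
}

# One round-robin iterator per frontend: a simple load-balancing strategy.
_rr = {host: cycle(backends) for host, backends in routes.items()}

def pick_backend(host):
    """Return the next backend for a request to `host`."""
    return next(_rr[host])

def replace_backends(host, new_backends):
    """On a redeploy, the respawned containers' IPs overwrite the old mapping."""
    routes[host] = new_backends
    _rr[host] = cycle(new_backends)

# Requests cycle over the three units...
print([pick_backend("wordpress.example.com") for _ in range(4)])
# ...and after a redeploy the stale ephemeral IPs are simply replaced.
replace_backends("wordpress.example.com", ["172.17.0.9:8888"])
print(pick_backend("wordpress.example.com"))
```

This is why the ephemeral IPs don't matter to clients: they only ever talk to the router's stable address, and the Redis-backed table absorbs every container churn.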
As we can see, our site is up now, and these are the environment variables that get exported into the web servers. There are times when we want to scale; right now there is only one instance of Docker, so we'll add more units to the setup. There's only one unit, and we are adding two more; now you can see there are three Docker instances running. That's about it.

There are other tools like Flynn which do similar stuff. The difference between Flynn and Tsuru is the importance given to scaling services. We spoke to a couple of engineers at Globo, and they said it comes down to a design principle: they wanted to scale web apps, not services, because there is no one way services scale; they can go either horizontal or vertical. So the only way they could add services was by binding proprietary services and the like to Tsuru. That's the basic difference between Flynn and Tsuru.

Now, the challenges. There is little documentation: the only place you can find Tsuru documentation is its website, nowhere else. There are many dependencies: as we saw, there's a Hipache server, there's Redis, there's Circus, and whatnot. It's only built for web apps, though that could be considered a merit too, since it focuses on scaling things horizontally. And there are buggy scripts: with Git there are a lot of pre-receive hooks that have to go in, and they were all buggy. My friend Kalyan actually fixed all of them before we could go on with this.

So, do we have any questions?

Q: Why do you think this is better than Heat and OpenStack? With Heat you can define your templates and run all of this over OpenStack. Why do you think that is not as good?

A: I think this is more easily manageable. It's very easily manageable.
You can just spawn instances; you can just create an application, hand it to the developers, and they'll manage it. You don't have to interfere in anything. And we thought Docker was the way to go.

Q: How do you handle multiple servers in this case? You're currently talking about one server; say I want hundreds or 10,000 instances running. How do you handle that?

A: As of now, we haven't had a scenario where we had to look at 1,000 instances, but we can easily manage 20 or 30, and that works for us. It's a staging setup, less on production.

Q: So you committed the code, and after that a trigger went to Tsuru to start Docker and initialize it. How do you set up that workflow? You'd have to have a predefined installation of Tsuru on that system, and some post-commit hook carrying the Tsuru-based configuration?

A: There are a lot of files, basically. You mention where you want to deploy your code; all of that goes into your Tsuru conf. You have a requirements.apt file, and there is a deploy script as well. Once that is read, your deploy script deploys, and everything mentioned in requirements.apt gets created. You really don't have to look into anything else.

Q: Does it work at a branch level, or across multiple Git repos?

A: No, it doesn't, I guess. I'll have to check, because it's a post-receive hook.

Q: I mean, staging and development are different branches, right? If I want to test one, how do Docker and Tsuru handle the different configurations?

A: Sorry, I didn't get the second part.

Q: How does the Tsuru-based configuration on different branches reach Docker, exactly?

A: Let's take that offline.
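To make that answer a bit more concrete: the demo relies on a requirements.apt file listing everything the app needs. The format is one apt package name per line; the package names below are illustrative assumptions for a WordPress-style PHP app, not the exact file from the setup shown.

```
# requirements.apt -- one apt package name per line; these are installed
# into each Docker unit before the deploy script runs.
# (Package names are illustrative for a WordPress-style PHP app.)
php5
php5-mysql
mysql-client
```

Alongside this, the deploy script referenced in the answer is what actually starts the app once the packages are in place.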
Yeah, I'll take that offline, I guess.

Hello, am I audible at the back? So it's a Friday evening, you are sitting with your friends having your favorite beverage, whatever you like, and suddenly this happens: a server goes down. And you were not even responsible for it. You should not even have received that alert, but you did, because a new guy misconfigured the notification system. That ruins your whole mood, and believe me, it has happened to every one of you, right?

As you build your next multi-million-dollar company, you add more people, more hardware, more software, and these problems go from bad to worse. At Directi, we have faced the same issue. We have over 1,000 physical servers and 100 VMs distributed across 10 colocation centers, and over 25,000 checks, and growing, every day. You can imagine the amount of noise being created and the number of people being disturbed when they should not be.

So this is what happens. The monitoring system explodes with alerts, and there's one single guy, the L1 engineer, staring at the dashboard and thinking: oh my god, now what do I do? Just like him, he's surrounded by alerts; it's a flood. And there's a team lead on fire who doesn't know what to do next. He has no information, just a million events streaming down at him.

So my friend Sathe, as we call him, wrote this. We have made an app suite that meets the needs of every guy in the chain: the L1 engineer, the head of operations, the sysadmin, the new joinee, the team lead, and the end customer, the customer to whom you're serving your services. One app to rule them all, one app to find them, one app to bring them all, and in the darkness bind them.

So what do we have? We have events. We have alerts generated from monitoring systems: Nagios, Pingdom, any kind of monitoring systems you might have.
You might have hooks from Logstash raising events when there are anomalies. You might have metric events. You just have events; that's it.

And what do we want? We want SLAs. We want to make sure our product is meeting its uptime; we have promised our customer 99.99% uptime, we are better than the other company, but how do we verify that on our own? We want statuses, status dashboards, progress reports on an incident, correlations, RCAs. We want a lot of things, and all we have is events, alerts generated from the system.

Coming back to the L1 engineer: when an alert occurs, what would have helped him? If there was a knowledge base where he could just search for, say, Apache and the server name with tags, he would get the recent history: OK, this is an issue that has occurred before. Say a new guy joined today, and a similar issue occurred 10 days ago; he has no idea what happened 10 days ago. So there should be a system where he has knowledge bases, and a way to escalate alerts automatically. He shouldn't have to call up folks manually: hey dude, it's been 10 minutes, I'm unable to fix it, can you look at it? And then half an hour later it's still not fixed, and again he has to call someone else. That's what an L1 engineer wants.

A team lead wants dashboards. He wants product health reports; he wants to see whether the uptime numbers are met, which is the most problematic endpoint, what the frequently occurring outages are, which servers go down frequently. He has dashboards where he can see trends: what are my S1 events, how many S1 incidents are there, which is the most affected colo, which is the most affected product, and at what times am I getting these incidents?
And he can see the list of all servers and services that are currently affected, or the uptime numbers. Say you have 10 different endpoints for a particular product; you can see which is the most trouble-making endpoint.

An end customer wants to know the status of his services proactively, not reactively; it's always good to be proactive for a customer. If he can know right now whether your services are up, whether they are facing any issue, what actions are being taken, and what the ETA is, that's when a customer will be happy.

So we built this system called Slant. It's an app suite with a number of apps. It manages automatic escalations, SLAs, incidents, your contacts, your on-call, your calendar, your scheduled maintenances, and your product definitions; it knows which server belongs to which product.

And with it, we are changing the culture at our company. Now, when an L1 engineer joins, the first time he logs into the system his contact is automatically created in the app. He just has to make sure he's in the correct groups and his on-call calendaring is correct; that's it, all the escalations are already set. And he has a knowledge base of RCAs: when we have outages, we make post-mortem tickets, we write RCAs, we add tags, and all of that goes into the knowledge base. Whenever something goes wrong, he just searches it and finds the answer.

For a team lead, and the other members of the team, there are dashboards: apps which show the troublesome endpoints, the incidents, the current actions, who is working on what, who has acted, who has commented. You can type out a comment and make it public so that the customer can see it. For a customer, he has everything proactively.
He doesn't have to call your helpline and ask: hey, what's the status of my SMTP, why is my mail not going out? So this is what we have built. Any questions?

Q: Could you clarify how this bunch of events from your monitoring systems flows into the various dashboards you've shown?

A: Going to the architecture of the application, it's very simple. We have built a system in Ruby, and you can make API calls to it from any of your monitoring systems. We are agnostic: we don't care whether you're on Pingdom, Nagios, or any other monitoring system. You deploy your monitoring system once and configure it to send events to Slant. You can make API calls, or send emails, or write wrappers over it, and those events get processed. Once events are received, we have a rules-processing engine where you can write rules. It's simple Ruby code where you write if-conditions saying: if this is the server and this is the service, make it an S1 or an S2. Once a processed event is formed, S1 events are taken into the computation for outages, SLAs, and notifications according to the escalation policies. That's a rough architecture of Slant.

Q: How do I plug my custom events into it? Say a simple cron is running, looking at basic parameters, CPU, I/O, et cetera, and it decides that CPU above 40% is a warning and above 60% is critical. So it's not a Nagios or a Sensu...

A: You can basically make API calls to the system, right? You just create a hash of...

Q: So it's a REST API?

A: Yeah, a REST API.

Q: You just mentioned that your tool takes API calls from multiple monitoring tools and then helps this whole group of people.
I'm just curious how many hours your team spends constantly building these REST integrations. Say I build a service or an app; when I build my app, I'm going to have a bunch of templates, because I want certain conditions about my app to be monitored, right? In my org, a developer builds their own template for monitoring. Then you have this tool which probably has to monitor the same thing, right?

A: So you make an API call to us.

Q: Your tool is basically the front end for your NOC person, your L1 person, right? For the API calls, there are all these different monitoring tools, and there are going to be templates lying in there. How are you connecting them? Aren't you constantly spending cycles...

A: No. What we have done is, the system doesn't need to be aware of your monitoring setup beforehand. As soon as an event comes in, it is processed into a common category, into the common data pool. And if you want, you can add your own service configurations, where you say: if I get an event where this particular condition is met, process it in certain ways, with certain associations.

Q: So you are exposing...

A: We are exposing the API, yeah. It all runs on a simple web server, and it's all on the UI; you don't need to do anything else.

Q: Will it notify before the event happens, or after?

A: Sorry?

Q: You said it handles events, so how do we configure an event? Will it notify that the event is happening before it occurs, or after?

Q: Suppose my API server, or whatever server, is hitting around a thousand requests or more. Will it notify before hitting that threshold?

A: No, it's when the server sends an event to us.
So when the server sends an event, then according to your escalation policy and your settings, email or mobile phone or whatever, you get an SMS or an email.

Q: So the escalation logic I provide will take care of that?

A: Yes, according to the policy I've set over there.
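The rules-processing engine described earlier in the Q&A is, per the talk, simple Ruby if-conditions on Slant's side. As a language-neutral sketch of the same idea, here is roughly what that severity-classification step looks like. Every detail here, the field names, the thresholds, the severity labels, and the rules themselves, is assumed for illustration, not Slant's actual schema.

```python
# A raw event, shaped the way a monitoring system might POST it to the
# REST API. Field names are illustrative, not Slant's actual schema.
def classify(event, rules):
    """Run an event through user-written rules; the first match sets severity."""
    for rule in rules:
        severity = rule(event)
        if severity:
            event["severity"] = severity
            return event
    event["severity"] = "S3"  # default: informational, no escalation
    return event

# User-written rules as plain functions (in Slant these are Ruby if-conditions).
rules = [
    lambda e: "S1" if e["service"] == "smtp" and e["state"] == "down" else None,
    lambda e: "S2" if e["metric"] == "cpu" and e["value"] > 60 else None,
]

processed = classify(
    {"server": "mx1", "service": "smtp", "state": "down", "metric": None, "value": 0},
    rules,
)

# Only S1 events feed the outage/SLA computation and escalation notifications.
outage_queue = [processed] if processed["severity"] == "S1" else []
print(processed["severity"], len(outage_queue))
```

The key design point from the talk survives even in this toy version: the ingestion side stays monitoring-tool agnostic, and all product-specific judgment lives in the user-editable rules.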