I'm Dushan and I manage the systems team at Directlype. I'm going to give a short talk about the tools and applications we have to manage our entire application stack and infrastructure. To give a brief introduction about myself: I've been with Directlype for the past eight years, and I've been doing system administration on Linux for about ten years now. Most of the infrastructure at Directlype is on Linux, though some products do use Windows; I'm only covering the Linux infrastructure and the automation surrounding it, not the Windows side. The teams we have in place are mostly engineering teams, which consist of system architects, software engineers, and some sysadmins, and the operations teams, which are the 24x7 teams: systems operations, data center operations for remote hands, and network operations. The corporate IT team handles only our internal infrastructure. The systems team, the former systems operations group, has various skill sets and focuses on different areas of the infrastructure. As for products, we have a lot of products in ad monetization, which belong to the media and business units, then a lot of hosting products, a messaging platform, and so forth. They're all hosted mostly in colocated data centers in the US. We buy our own hardware and our own networking equipment, and we work with ISPs, so the whole infrastructure is managed by us, although we are moving some things to AWS at some stage. On the application side, most of the application stacks of the products use open-source applications; some of them are listed here. So, coming to the infrastructure tools and automation.
So the core part of any infrastructure automation effort is the inventory, right? What do you have? What hardware do you have? What networking equipment do you have? What applications are installed, and so on. When we started out, we used to be the traditional folks with scripts that we ran on each box to do things, and over a period of time we realized that it doesn't scale. So we started automating the entire infrastructure from the ground up and realized that the first part we needed to tackle was the inventory. We started using an application called RackMonkey, which is written in Perl, and collected our inventory bits there; later on we made our own fork of it, which is what is in place right now. This application basically holds all of our equipment details, starting from racks, locations, and data centers, to the servers in them, their hardware specifics, and their networking information, such as which switch and port they are connected to. We basically provision racks per product: we provision a rack, stack it up with an entire load of servers, deploy applications on them, and then forget about them; they get used whenever the capacity is required. When the rack is loaded, all the barcoded information gets read into the inventory tool. Then we have an integration with Cobbler for assigning roles to the fresh hardware that was stacked into the rack, so when the machines boot, they get the appropriate Cobbler profiles. For Windows boxes it goes to WDS, but that's not integrated as of now. There are also integrations with other devices, like scripts to get the power consumption of a rack or a server, and integration with the networking equipment for some things, plus some other device management like that.
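To illustrate the inventory-to-Cobbler handoff described above, here is a minimal sketch in Python. The field names (`role`, `hostname`, `rack`) and the profile names are illustrative assumptions, not the actual schema of the inventory tool; the point is just how an inventory role maps to a Cobbler profile when a freshly racked machine is provisioned.

```python
# Hypothetical mapping from the role assigned in the inventory tool to a
# Cobbler profile name (profile names here are made up for illustration).
ROLE_PROFILES = {
    "web": "centos6-x86_64-web",
    "db": "centos6-x86_64-db",
    "mail": "centos6-x86_64-mail",
}

def cobbler_profile(server):
    """Pick the Cobbler profile for a server record read from the inventory."""
    try:
        return ROLE_PROFILES[server["role"]]
    except KeyError:
        raise ValueError("no Cobbler profile for role %r" % server.get("role"))

# A record as it might come out of the inventory after the rack's barcoded
# information is scanned in.
server = {"hostname": "web042.dc1", "rack": "R12", "role": "web"}
print(cobbler_profile(server))  # -> centos6-x86_64-web
```

In practice the profile assignment would be pushed to Cobbler over its API rather than printed, but the lookup is the part the inventory tool owns.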
So that is, again, the automation around the inventory tool itself. We have mostly Dell hardware, so all the automation is written against Dell hardware, where things like RAID creation happen. After a machine is racked, it boots from a live CD, and the RAID creation and related settings are done there; the metadata is collected and uploaded to the inventory tool, and then the OS provisioning is kicked off. Once the OS is provisioned, the configuration management takes over, which happens via Puppet. We wrote a lot of modules for each of the applications that we use, and many of them are custom according to the needs that we have. We also build a lot of custom RPMs, and we use Koji for that purpose: all we need to do is upload a spec file and a source RPM, and it gives us RPMs for all the architectures and OS flavors that we use. We use Git for revision control, with a shared repository of all the Puppet manifests that every project uses; all deployment changes are tracked via Puppet. We have written a lot of custom facts in Puppet which give us information about each of the boxes. For example, LLDP tells you which switch a machine is connected to, and when LLDP is enabled on the networking switch, it also updates the port aliases, so when you look at the switch configuration, you will know this machine is connected to that switch on this particular port; the hostname shows up on the switch. This custom fact helps make the Puppet manifests easier, because we can decide what configuration to put on a box based on that fact. We do some application deployments through that as well, so that the developers have control to deploy directly to their stages.
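The LLDP custom fact mentioned above boils down to parsing the neighbor information the switch advertises. Puppet facts are normally written in Ruby; this Python sketch just shows the parsing idea, and the sample text approximates `lldpctl`-style output rather than reproducing it verbatim.

```python
# Sketch: extract which switch and port a machine is connected to from
# lldpctl-style output. The exact output format is an assumption here.

def parse_lldp(output):
    """Return {'switch': ..., 'port': ...} from lldpctl-style text."""
    info = {}
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("SysName:"):
            info["switch"] = line.split(":", 1)[1].strip()
        elif line.startswith("PortID:"):
            # e.g. "PortID: ifname Gi1/0/12" -> keep the last token
            info["port"] = line.split()[-1]
    return info

sample = """\
Interface: eth0, via: LLDP
  Chassis:
    SysName:  sw-r12-top
  Port:
    PortID:   ifname Gi1/0/12
"""
print(parse_lldp(sample))  # -> {'switch': 'sw-r12-top', 'port': 'Gi1/0/12'}
```

Exposed as a fact, a manifest can then branch on the switch or port, which is exactly how the fact makes per-box configuration decisions easier.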
We also use a tool called MCollective, which integrates with Puppet as well, and with it you can run commands on a group of nodes, say, run this command on all machines in this particular rack. So you can power down the entire rack if you are doing maintenance, and boot the entire rack back up afterwards. So those are the provisioning and configuration management tools. For monitoring we use Nagios; it's the standard approach for most infrastructures like this. We did investigate a lot of tools like Hyperic and Zenoss and the commercial tools that are available out there, which are like spin-offs of Nagios with a good UI, but they all seem to be for regular IT admins who need a UI to do things like add a box. For us, the flexibility that Nagios's plain configuration files provide is invaluable. So we wrote our own Nagios Puppet module, with a type and a provider, which helps us configure Nagios from the plain configuration files. It's a simple module which looks up all the host information and the host groups they belong to from CSVs, using Puppet's extlookup, and adds the host to Nagios, assigning it to a particular host group or host role. The checks are associated with these host groups, so when a new host is added, we just have to assign it to a host group and it inherits all the checks automatically; we don't have to configure every check on the new host. And we have network service dependencies: a lot of times you will have some network issue and get flooded with thousands of alerts in your inbox, and these dependencies help minimize that. We use check_by_ssh, not NRPE, so that we don't have to maintain client-side check configurations. Most of them are active checks, and distributed monitoring is being worked on. We also use Pingdom for external monitoring, to track our SLAs and so on.
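The CSV-driven Nagios configuration described above can be sketched as follows. This is a minimal Python illustration of the idea (the real version is a Puppet type and provider using extlookup); the CSV columns and the hostgroup names are assumptions for the example.

```python
# Sketch: generate plain Nagios host definitions from CSV rows, so that a
# new host only needs a hostgroup assignment to inherit all its checks.

import csv
import io

CSV_DATA = """\
host,address,hostgroup
web01,10.0.12.11,schenzo-web
db01,10.0.12.21,schenzo-db
"""

TEMPLATE = """\
define host {{
    use         generic-host
    host_name   {host}
    address     {address}
    hostgroups  {hostgroup}
}}
"""

def nagios_hosts(csv_text):
    """Emit a Nagios config snippet for every host row in the CSV."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(TEMPLATE.format(**row) for row in rows)

print(nagios_hosts(CSV_DATA))
```

Because the service checks hang off the hostgroup (via Nagios `hostgroup_name` service definitions, not shown), adding a row to the CSV is all it takes to fully monitor a new box.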
So for all our main services which are accessed by our customers, we have a service check in Pingdom which tells us if that service is down from a particular location. Pingdom checks from 14 different locations, so we know if it's down in any of the geographic areas because of routing issues and the like. For incident management, we classify alerts into different severity categories, with sev-1 being the most critical, then sev-2, and all the others. The sev-1 alerts are handled by the operations team, which helps us with runbooks and escalations. Both the Pingdom and Nagios alerts are integrated with Request Tracker (RT), which helps us track every event or alert that happens on our infrastructure and do post-incident analysis as well, and we have some automation surrounding Pingdom to add services and checks to their system. For trending, we use Ganglia, which is one of the most widely used monitoring solutions for collecting metrics from your systems. It's scalable, and metrics are easy to collect: you can write a quick script to collect whatever metrics you want from the system and it pushes them via gmetric. Some things are provided by default by Ganglia itself, like load and CPU metrics, but Ganglia makes it easy to collect other metrics that you want to collect. Say on your website you want to track the number of registrations per day on a graph: you can write a simple script that gets it from the database and sends it to Ganglia for graphing. And we use rrdcached, which is an enhancement for any RRD-based solution to speed it up. If you have used rrdtool, or any solution based on it like Cacti, with thousands of hosts, you know what happens.
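A custom metric like the registrations-per-day example above is a small script around `gmetric`. Here is a hedged sketch: the database query is stubbed out, and the metric name is made up, but the `gmetric` flags used (`-n` name, `-v` value, `-t` type, `-u` units) are the standard ones.

```python
# Sketch: push a custom business metric to Ganglia via the gmetric CLI.

def count_registrations_today():
    # Stub for the real database query (illustrative value).
    return 1342

def gmetric_cmd(name, value, mtype="uint32", unit=""):
    """Build the gmetric command line for one metric sample."""
    return ["gmetric", "-n", name, "-v", str(value), "-t", mtype, "-u", unit]

cmd = gmetric_cmd("registrations_today", count_registrations_today(),
                  unit="registrations")
print(" ".join(cmd))
# A cron job would actually execute it, e.g. subprocess.call(cmd)
```

Run every few minutes from cron, this gives you a trendable graph of a business metric alongside the system metrics Ganglia collects by default.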
The disk gets saturated because of a lot of random IO, since RRDs are lots of very small files. rrdcached acts as a caching layer between a tool like Ganglia or Cacti and the disk: Ganglia talks to rrdcached to send all updates, and rrdcached caches everything and writes it out every few seconds as sequential IO. We group clusters based on project roles; for example, for one particular project, Schenzo, the Schenzo web servers will be a cluster. All the servers having that role appear in that cluster, and you can do cluster comparisons and see how each box is performing or how many web requests each box is getting. gmond is the client utility for Ganglia which collects the statistics and runs on each server; it contains built-in modules, and you can also use gmetric to collect metrics from servers. We have written some custom modules, and we use sar, which provides most of the statistics on a system, like IO, swap, paging, and network statistics. We simply collect them via sar and send them all through gmetric so that we have trendable metrics; you don't want to look at a metric only when something goes down, you trend it over a period of time to see patterns and figure out issues. We also have a lot of metric modules for Apache and different applications, and custom scripts. Each project gets its own view with important metrics as a dashboard, so that we can see the performance of a particular product at any point in time. We also use Graphite for custom application metrics, business metrics, et cetera, like the things I spoke about: number of registrations, number of connected clients, and so on for certain projects. From Java, the statistics are sent directly to Graphite using the Coda Hale metrics modules. We also have some Python code too.
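The sar-to-gmetric path above is just parsing sar's tabular output and forwarding the numbers. A hedged sketch, where the sample text approximates a `sar -W` paging report (the exact sysstat column layout varies by version):

```python
# Sketch: turn sar tabular output into name/value pairs that a wrapper
# script could then push via gmetric for trending in Ganglia.

def parse_sar(text, fields):
    """Parse sar output: the header row names the columns, and the
    'Average:' row carries the values we trend."""
    lines = [l.split() for l in text.strip().splitlines()]
    header = lines[0][1:]  # drop the timestamp / 'Average:' column
    avg = next(l for l in lines if l[0] == "Average:")[1:]
    row = dict(zip(header, avg))
    return {f: float(row[f]) for f in fields}

sample = """\
12:00:01 pswpin/s pswpout/s
12:10:01 0.00 0.12
Average: 0.00 0.06
"""
print(parse_sar(sample, ["pswpin/s", "pswpout/s"]))
# -> {'pswpin/s': 0.0, 'pswpout/s': 0.06}
```

The same pattern covers the IO, paging, and network reports: one parser, one gmetric call per column, and every system statistic becomes a trendable graph.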
So if some application exposes a metric as JSON, we can fetch it via this Python daemon and send it to Graphite. For network monitoring, we use Cacti and Observium, which is rather new and pretty, and a tool for sFlow and jFlow analysis and such. For log management, we use Splunk for the critical logs, with the operations team working on it, and Logstash and Elasticsearch for the large data sets, the large volumes of logs. For example, our mail platforms send and receive a lot of mail, so we index all those logs via Logstash and Elasticsearch so that we can search the email delivery logs; say somebody complains that they have not received an email, we can quickly search and see what happened. We also use Graylog for per-project application logs: with log4j, the data can be sent to Graylog via GELF, so if you want to see what your application is doing, and your application has enough logging, you can use Graylog to have an indexed view of the logs. So that's it. Those are all the tools and things we use to run our operations, and many of them are very mature; it's easy, and there's a lot of documentation out there to set all of this up. I can take any questions you have related to the OS provisioning or the infrastructure itself for your applications.
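The JSON-fetching Python daemon mentioned at the start boils down to translating a JSON blob into Graphite's plaintext protocol, one "metric.path value timestamp" line per metric, sent over TCP to the carbon port (2003 by default). A minimal sketch, where the metric prefix and the JSON shape are illustrative assumptions:

```python
# Sketch: convert an application's JSON metrics into Graphite plaintext
# protocol lines, ready to send to carbon over TCP.

import json

def to_graphite(prefix, payload, timestamp):
    """Return one 'path value timestamp' line per key in the JSON payload."""
    blob = json.loads(payload)
    return ["%s.%s %s %d" % (prefix, key, value, timestamp)
            for key, value in sorted(blob.items())]

# As the application might expose it over HTTP (values are made up).
payload = '{"connected_clients": 5120, "registrations": 87}'
for line in to_graphite("messaging.app01", payload, 1350000000):
    print(line)
# messaging.app01.connected_clients 5120 1350000000
# messaging.app01.registrations 87 1350000000
```

In the daemon, these lines would be joined with newlines and written to a socket connected to the Graphite host; the polling loop and error handling are the only other moving parts.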