So, hi, I'm Ayush, and I'm going to talk about Plutus, one of the cost-control tools we have built in-house at Indix on top of the AWS cloud. Before we get started, a little about myself. I work as part of the DevOps team at Indix. Apart from my daily development job and the DevOps activities at Indix, I try to be a contributor to the in-house open source projects we have opened up for the community; you can find them on Indix's open source site. You can find me under the same username pretty much everywhere. And here is the agenda for the talk: in the next 15 minutes, I'll be walking you through the cost-management work we have done at Indix, and of course the tool called Plutus that we built for monitoring and tracking the untracked activities which cause fluctuations in our AWS bills. How do we track them and how do we tackle them? That's the kind of tool I will be talking about.

Okay, before we jump any further, let's have a brief look at what our AWS infrastructure looks like in its current state. At any given time, we run approximately 1,000 to 1,500 machines; that includes on-demand and spot nodes as well as reserved instances. We have around 50 to 60 S3 buckets with terabytes of data in them, and they are very active, since we are handling data processing jobs at a very large scale. From a deployment perspective, we have rapid multi-AZ deployments happening at every hour of the day. Now, with all of these things together, and since we are still a mid-level startup, every penny we spend on AWS really matters to us.
Following from that last point: we incur a significant cost on the AWS cloud, so we have to take fairly strict measures to control it. So how have we done at controlling AWS costs so far? These are a few success stories that I would like to share with you all. Our major cost on AWS actually goes into our Hadoop clusters: we run a lot of data processing jobs on them, and both our staging and production Hadoop clusters run 24x7, week after week, month after month. What we wanted was for our Hadoop clusters to be scaled up and scaled down based on application metrics, so that it should never be the case that a cluster is not in significant use and we are still running it on on-demand instances, paying the price for it. So we built a tool called Vamana; just to point out, Ashwant, who is going to be the next speaker, worked on this with us. Vamana takes the application metrics from our Hadoop cluster and automatically scales the cluster up and down based on those metrics. So we typically no longer worry that our clusters incur cost while sitting idle, because they never stay in an idle state for long. The next thing is that we use spot instances heavily: around 50% of our infrastructure is on spot.
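To make the idea concrete, here is a minimal sketch of the kind of metric-driven sizing decision a Vamana-style autoscaler makes. This is an illustration under assumptions, not Vamana's actual code: the metric names (pending and running containers) and the ceiling-division sizing rule are placeholders for whatever application metrics the real tool reads.

```python
def desired_nodes(pending_containers: int, running_containers: int,
                  containers_per_node: int, min_nodes: int, max_nodes: int) -> int:
    """Size the cluster from YARN-style application metrics.

    Scale to fit the current demand, but never below min_nodes
    (so the cluster stays responsive) or above max_nodes (cost cap).
    """
    demand = pending_containers + running_containers
    needed = -(-demand // containers_per_node)  # ceiling division
    return max(min_nodes, min(max_nodes, needed))
```

With a rule like this, an idle cluster shrinks to its floor instead of burning on-demand hours, which is exactly the behavior described above.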
Spot comes with a catch, though: we need to pick the right availability zone and the right spot price. We wanted a resilient, reliable infrastructure on spot, one where we could trust that instances would not go down very frequently. So we wrote a tool called Matsya. What this tool does is identify the right spot market in which to launch our cluster, based on the spot price and the availability of that particular instance type in that particular availability zone. The added capability is that if no market matches our criteria for spot price and availability, we switch back to on-demand instances, and the cluster is fully functional again. When we approximately calculated how much this impacted us over a year, one of the really striking findings was that it saved us almost a million dollars.

Moving forward, as you can see in the following graph, this is our cost pattern for roughly the past two years, the monthly AWS bill we incurred. Before February 2016, in some months we incurred close to $200,000 a month. We really wanted to control that, and in February 2016 we came up with Matsya and Vamana, and they pretty much stabilized our cost. Everything was going fine for us until somewhere around September 2016, when we realized that our cost had spiked to another level of escalation, and the worst part was that we didn't know what was actually causing it.
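The market-selection-with-fallback behavior described above can be sketched roughly as follows. This is an illustrative stand-in, not Matsya's implementation; the function shape and field names are assumptions, and real spot bidding also weighs price history and interruption rates.

```python
def choose_market(spot_prices: dict, max_bid: float, on_demand_price: float):
    """Pick the cheapest availability zone whose spot price fits our bid.

    spot_prices maps availability zone -> current spot price.
    Falls back to on-demand when no AZ meets the criteria, so the
    cluster stays fully functional either way.
    """
    viable = {az: price for az, price in spot_prices.items() if price <= max_bid}
    if not viable:
        return ("on-demand", on_demand_price)   # no acceptable spot market
    best_az = min(viable, key=viable.get)       # cheapest qualifying AZ
    return (best_az, viable[best_az])
```

The fallback branch is what makes the setup reliable: a bad spot market costs you the on-demand premium for a while, not an outage.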
So that became the real challenge for us: how do we keep track of all the events that lead to these fluctuating costs, and in turn to a hefty bill at the end of the month? We took this up as a challenge, and we had the idea of an automated tool that would help us identify the resources we want to closely monitor on the AWS cloud, the volatile resources that might have some untracked activity and might lead to fluctuating costs. We thought of building an automated tool around it, and that gave birth to what we're going to talk about next: Plutus.

Before building Plutus, we definitely explored whether some open source solution was already available. Netflix Ice is certainly out there, but we wanted a deeper level of customization for our application needs, so we didn't go with any of the solutions available on the internet; we didn't feel they fit the aspirations we had for the challenges this product should solve.

So now let me introduce Plutus. It is a notification tool which reads the AWS Cost and Usage Report on a daily basis and sends alerts based on defined threshold limits: say a threshold limit for a particular resource is crossed, then we process that and immediately send an alert to people in whatever form possible. That's how Plutus works at a very high level. What follows is a sample Slack config; Slack is the tool we use for communication within the office. It's a simple JSON config that we specify.
The config takes arguments like the service we want to monitor, which is AWS S3 in the first example; the Slack channel we want to alert in; and the person who should get alerted. In our case, our engineering managers are typically very interested in how much of the AWS bill their systems are causing, so we specify the engineering manager's name. Then we specify the granularity, the rate at which you want the report fed to you: whether the checks should happen on a daily, weekly, or monthly basis. Plutus also supports user-defined tags, which is very important for us; we follow a tagging culture by heart, in that whatever resource we spawn on AWS, we definitely have to tag. We leverage this in Plutus: it takes user-defined tags as input, say the Name tag with the value platform-analytics, identifies the resources carrying that particular tag, computes a cost for them, and then compares it against the specified threshold, say $0.03. So that's the sample Slack config, and this is how you get alerted on Slack: if at any point of the day the cost crosses the threshold you specified in the JSON config, you immediately get alerted on Slack, even if the difference is only 0.7 dollars. That is how we kept control of our AWS cost, and how we could track all the unknown events on the cloud which we otherwise wouldn't have been able to. So now let's take a deep dive and see how we went about building this tool.
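A minimal sketch of what such a per-resource alert config and threshold check could look like. The field names mirror the ones mentioned in the talk (service, channel, owner, granularity, tags, threshold), but the exact schema and values here are illustrative assumptions, not Plutus's actual config format.

```python
# Hypothetical alert config, one entry per resource group to watch.
ALERT_CONFIG = {
    "service": "AmazonS3",                     # AWS service to monitor
    "channel": "#aws-costs",                   # Slack channel to alert in
    "owner": "engineering-manager",            # person to notify
    "granularity": "daily",                    # daily | weekly | monthly checks
    "tags": {"Name": "platform-analytics"},    # user-defined tags to match
    "threshold_usd": 0.03,                     # alert above this computed cost
}

def should_alert(computed_cost: float, config: dict) -> bool:
    """Fire on any breach, however small the overshoot."""
    return computed_cost > config["threshold_usd"]
```

In practice there would be one such config per team or system, which is what lets each engineering manager watch their own slice of the bill.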
From the AWS Billing console, we leverage the AWS Cost and Usage Report, or CUR for short. CUR gives us the advantage that it can be delivered daily into an S3 bucket; AWS gives us an option for that, so we configured the Billing console so that our Cost and Usage Report is supplied daily into the S3 bucket. The beautiful thing AWS provides as part of this is a schema for the CSV file of the Cost and Usage Report it delivers. So once we have the CSV file and the schema associated with it, we straight away import that CSV file, along with its schema, into AWS Redshift, which is AWS's large-scale data warehouse service. Once imported into Redshift, the entire CSV file becomes a SQL store, a database on which we can run queries in a much easier manner; otherwise we would have been parsing the CSV file ourselves, which would have been a very tedious job. Once we have the report in Redshift, we read the JSON configs I showed earlier, the alert configurations we have set. Based on those parameters, Plutus forms the queries which have to be run on Redshift. The output of those queries is the cost for those particular resources. Once we have that cost, we compare it against the set thresholds, and if a threshold is crossed, the alert has to go out. Now, how do we automate this complete process? We use the serverless approach, since serverless has been pretty much dominant across this whole session.
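The S3-to-Redshift step and the per-tag cost query could be assembled roughly like this. The table, bucket, and IAM role names are placeholders; the COPY options (CSV, IGNOREHEADER, GZIP) are standard Redshift COPY syntax, and the flattened column names (lineitem_unblendedcost, resourcetags_user_name) are assumptions based on the usual CUR column-naming convention, not necessarily Plutus's exact schema.

```python
def build_copy_statement(table: str, s3_path: str, iam_role: str) -> str:
    """Assemble a Redshift COPY for the gzipped CUR CSV dropped in S3."""
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        "CSV IGNOREHEADER 1 GZIP;"
    )

def build_cost_query(table: str, service: str, tag_column: str, tag_value: str) -> str:
    """Total unblended cost for resources carrying a given user tag."""
    return (
        f"SELECT SUM(lineitem_unblendedcost) FROM {table} "
        f"WHERE product_servicecode = '{service}' "
        f"AND {tag_column} = '{tag_value}';"
    )
```

The number that query returns is the "computed cost" that gets compared against the threshold in the JSON config.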
So, we use AWS Lambda to automate this completely. We have two or three Lambda functions which together do the whole job: pulling the CUR reports delivered by the AWS Billing console into the S3 bucket, importing them into Redshift with the given schema, running queries on top of that, and finally sending the alerts. All of this is done by AWS Lambda for us, set up as a kind of cron that runs on a daily basis.

And the kind of impact it has had? Let's see. Remember the graph we saw earlier, with the fluctuation in our AWS bill until September 2016? Post September 2016, this is the graph we have followed. It's been about a year since we deployed Plutus in production, and in that year we haven't encountered a single event where we couldn't track an unknown activity on the resources we use on AWS. It's been very stable for us. We are actively working on making this project open source; it has scope for a lot of improvements, a lot of features still have to be added, and we have a lot of plans for it. So that's pretty much all from my side. Just to mention, we will soon be releasing this as an open source project, and it would be great if you could have a look at the code and help us contribute, to make this a better thing for the community as a whole, not just for Indix. Thanks.

[Host] We have two minutes for Q&A. If you have any quick question you want answered right now, you can use this opportunity, or else the speakers will be out there during lunch. You've noted down how he looks, so you can find him over lunch and speak to him if you have any questions.
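Putting the pieces together, the daily Lambda run could look conceptually like this. Everything here is an illustrative stand-in: the message format, function names, and config fields are assumptions, and in the real setup the cost would come back from the Redshift queries rather than a lookup function.

```python
def format_slack_alert(cfg: dict, cost: float) -> str:
    """Build the Slack message posted when a threshold is breached."""
    return (
        f"{cfg['service']} cost is ${cost:.2f}, above the "
        f"${cfg['threshold_usd']:.2f} {cfg['granularity']} threshold; "
        f"alerting {cfg['owner']} in {cfg['channel']}"
    )

def evaluate_configs(configs, cost_lookup):
    """Run every alert config; cost_lookup(cfg) stands in for the Redshift query.

    Returns the list of Slack messages to post, one per breached threshold.
    """
    return [
        format_slack_alert(cfg, cost_lookup(cfg))
        for cfg in configs
        if cost_lookup(cfg) > cfg["threshold_usd"]
    ]
```

In production, a function like evaluate_configs would run inside the Lambda handler on the daily cron schedule mentioned above, with the resulting messages posted to Slack via a webhook.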
[Audience] There's one question over there. In your JSON config, you've specified a service, say S3, and a tag, right? In case a resource is not tagged, and a particular object has been created because somebody made a mistake, it still incurs cost. How do you figure that out?

[Ayush] Yes, I completely agree, that is definitely a bottleneck for now. In the current setup that I explained to you, if a particular resource is not tagged, Plutus won't be able to identify that particular resource. That is actually the next use case I'm solving in the Plutus code base; to be honest, we already have it as a feature request from one of our own developer teams: what if we want to track all the resources which are untagged on AWS, and how do we identify those resources?