Hello everyone, very good afternoon. I am Sarjan from Intuit; I work as an engineer in the operations team and lead the operations initiatives for Mint. I have Linu with me, another engineer, who leads the development initiatives for Mint. Today we are talking about moving Mint to AWS and some of the important memories we have from that move.

For people who don't know Mint: Mint is the largest personal finance SaaS, serving customers in the US and Canada. We are on almost all the major platforms — mobile devices, web and desktop. We have around 22 million users, about 70 terabytes of financial data, and around 20,000 financial institutions supported as of now.

While talking about Mint, I would like to walk you through a high-level view of it. We have a typical three-tier architecture: web, app and data. Communication between the components happens through a central messaging system, and that is about the only difference between a typical three-tier architecture and Mint. We call the entire architecture a pod, or a swim lane. We have several swim lanes which together constitute Mint, all sitting behind a load balancer.

What motivated us to move to AWS is nothing really different from what motivates everyone else. A couple of things made particular sense for us: the high availability and disaster recovery that we wanted but did not have in our physical data center; the ability to scale as much as you want within minutes; and the automation possibilities you get when you move to a virtual infrastructure like AWS.

I would like to talk about some of the experiences we had when we moved Mint to AWS — some of the failures, some of the surprises and some of the learnings. Intuit as a company has, over 30 years, earned a lot of trust from its customers, which is something we cannot compromise, and that is what we kept in mind when we were about to move Mint. We decided that security is the prime concern while moving financial data to a public cloud platform.

We have done a couple of things beyond what AWS gives every customer as standard. The first is network encryption: any data that travels over the network is encrypted. The communication between instances is encrypted, which prevents network snooping. The second is application encryption: data is encrypted at the application layer before it is stored in the database. All the sensitive fields in the database are encrypted, which means even a database engineer cannot get sensitive data in plain text from the database. The third is storage encryption: whatever the storage is, ephemeral storage or an EBS volume, we encrypt it before we use it. We use dm-crypt for that, which means the encryption happens at the block level. Taken together, these three mean that the only place plain-text data exists on an instance is in memory. And to make sure we are not compromised there either, we wipe the memory before we give the instances back to the AWS pool, which means there is nothing left in memory when an instance goes back to the pool.
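To make the application-encryption idea concrete, here is a minimal sketch (not Mint's actual code) of encrypting a sensitive field before it is written to the database, assuming a symmetric key that would in practice come from a KMS-backed secret service rather than being generated in place:

```python
# Minimal sketch of application-layer field encryption.
# Assumption: the key is fetched from a secret service at runtime;
# it is generated inline here only to keep the example self-contained.
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # in practice: retrieved from the secret service
fernet = Fernet(key)

def encrypt_field(plaintext: str) -> bytes:
    """Encrypt a sensitive field before it is stored in the database."""
    return fernet.encrypt(plaintext.encode("utf-8"))

def decrypt_field(ciphertext: bytes) -> str:
    """Decrypt a field after reading it back; plaintext only ever lives in memory."""
    return fernet.decrypt(ciphertext).decode("utf-8")

# The database column only ever holds the ciphertext.
stored = encrypt_field("123-45-6789")
print(decrypt_field(stored))
```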
All this encryption created another big problem for us, which is managing the secrets. We use a lot of secrets, which differ from system to system and change over time, so we had a lot of secrets and we wanted to store them securely. That was another concern, and we came up with a service called the Intuit Secret Service, which is a combination of the key management system that AWS offers and some Intuit-specific services that make sure the secrets are stored securely. It lets us retrieve secrets in plain text in a secure way, after verification of the instances asking for them.

The other thing we found during the journey is that all this encryption overhead introduced a lot of latency when we moved Mint to AWS. One reason for the high latency in AWS was the encryption we added. The second was cross-availability-zone communication: for HA purposes our instances span multiple availability zones, so if you have ten web instances, they are spread roughly equally across three availability zones — three, three and four, say. When communication happens from component to component, it crosses availability zones that may be something like 40 miles apart, and that added latency. What we did was rework the caching mechanism in our application and the front-end latency; we did a lot of work there and reduced latency significantly, to make sure that even after moving to AWS with all this encryption we did not compromise the operational metrics in our SLAs.

The other thing that caused high latency was the wrong selection of instance types and application configuration. We moved Mint from a conventional physical data center, where it ran on beasts with huge amounts of RAM, to smaller, horizontally scalable instances in AWS, and that introduced a lot of latency because the applications did not behave the way we wanted under a different configuration. We had to trial-and-error the configuration and the instance types to make sure Mint was hosted on AWS properly.

Let's talk about some of the other challenges we had, which may be a challenge for anyone with a big amount of data. We had around 60 terabytes of data when we planned to move Mint to AWS, and it was a question how to move that much financial data. We tried multiple things, but a lot of them did not work well, and then we used the AWS Import/Export job to move data physically from our data center in Sunnyvale to the AWS data centers. We moved the 60 terabytes in different sets, to make sure our databases came up and replicated with the masters in the physical data center before the data got stale — you have to have the binlogs (we are using MySQL) — and we wanted to make sure the data did not go stale before the databases in AWS came up.
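As a rough illustration of that last point — not Mint's actual tooling — a check like the following could watch each AWS replica's lag against the on-premise master so the cutover happens before the binlogs age out; the host names and credentials are purely illustrative:

```python
# Hedged sketch: poll MySQL replication lag on each newly restored AWS replica.
import pymysql

def replication_lag_seconds(host, user, password):
    """Return Seconds_Behind_Master for one replica (None if replication is broken)."""
    conn = pymysql.connect(host=host, user=user, password=password,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
    finally:
        conn.close()
    return status["Seconds_Behind_Master"] if status else None

# Illustrative usage: flag shards that are falling behind the master.
for shard_host in ["shard-01.example.internal", "shard-02.example.internal"]:
    lag = replication_lag_seconds(shard_host, "repl_monitor", "example-password")
    print(shard_host, "replication lag (s):", lag)
```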
The other problem we had then was that we still failed to get some of the data onto AWS properly with the Import/Export job, because a couple of shards went stale, so we had to use a different mechanism and move them over the wire. We tried different things like netcat and rsync, and then Tsunami, which works over UDP, worked well for us. Can I take questions on that afterwards?

Then the cost surprises. We took the approach of not caring about cost at first; we moved Mint and got it running on AWS. After that we decided we should optimize on cost, because our director was falling out of his seat every time he saw what it could cost. What we tried initially was playing with instance types and so on, but that did not work well for us, because we had already done a lot of optimization there by the time we moved. The biggest chunk of the bill came from EBS and the IOPS we were using, which we mostly had not paid attention to. We were heavily over-allocating for our databases when we moved to AWS — we were using around 1 terabyte of EBS volume for every shard — so we optimized there to keep database storage allocation just above actual utilization, and we automated it so that storage is dynamically allocated when it is required.

The other factor was IOPS optimization, because we were over-allocating there as well. We now dynamically allocate, or adjust, the IOPS by looking at the historical IOPS utilization of each EBS volume. We did one more thing there, which was separating which reads and writes go to the EBS volumes: we identified the reads and writes that actually need to be on an EBS volume, and everything else was moved to ephemeral storage. For example, the log files were moved to ephemeral storage, which does not consume the IOPS of the EBS volume.

The other thing, which everyone does, is to clean up the infrastructure aggressively and make sure nothing unused is lying around: EBS snapshots, EBS volumes, instances, test stacks — everything gets deleted per our retention policy. And plan for reserved instances: look at the usage you have and use reserved instances wherever you can. For security reasons we are not using plain reserved instances but dedicated reserved instances, which means in any availability zone you have a set of hypervisors dedicated to you. That is a risk as well, because multiple of your instances can land on the same hypervisor, and if that hypervisor has a hardware issue then all the instances on it are affected. So you have to take care of that when you decide to go for dedicated reserved instances.
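As a small illustration of the aggressive clean-up mentioned above — not the exact automation we run — a job like the following could delete EBS snapshots older than a retention window; the 30-day window and the assumption that nothing else (an AMI, say) still references the snapshots are both illustrative:

```python
# Hedged sketch: delete EBS snapshots that have aged past a retention window.
import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client("ec2", region_name="us-west-2")
RETENTION = timedelta(days=30)  # assumed retention policy

def delete_stale_snapshots():
    cutoff = datetime.now(timezone.utc) - RETENTION
    paginator = ec2.get_paginator("describe_snapshots")
    for page in paginator.paginate(OwnerIds=["self"]):
        for snap in page["Snapshots"]:
            if snap["StartTime"] < cutoff:
                # Snapshots still referenced by AMIs would need to be filtered out first.
                ec2.delete_snapshot(SnapshotId=snap["SnapshotId"])

if __name__ == "__main__":
    delete_stale_snapshots()
```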
Let's talk about some of the things we implemented, maybe differently, while we moved to AWS. One of them is how we use the AWS services. We use a lot of AWS services, like everyone does, but a couple of things we did, as I said, maybe differently — one is our usage of the AWS API. What we were doing initially was this: whenever an instance comes up, we want some metadata from the AWS API, which is required to make sure our instances come up and talk to each other properly. We were calling the AWS APIs directly, and what ended up happening is that we got throttled big time, because whenever instances came up in large numbers AWS started throttling us. The other problem is that AWS API calls are really expensive in terms of time, so when every instance started calling the AWS APIs at boot, the bootstrap time for each instance became really high and we could not get our infrastructure up the way we wanted. So we created an internal API that exposes that data to all the instances under it. It calls the AWS API on an interval and caches the data — actually a superset of the data, which is the only data our instances need to come up. When our instances come up, they call that internal API, get the data and start booting, which means all the data arrives in one call; it is cheap because it is an internal API, and our instances ended up with a much shorter bootstrap time (a rough sketch of this pattern appears at the end of this section). That is one thing we have done.

The other thing we use is AWS tagging, in a different way. We tag our instances such that we can have multiple completely independent Mint stacks in the same VPC, which we call endpoints. Endpoints are nothing but completely independent Mint stacks, from the data tier to the presentation tier; the components within an endpoint talk to each other, but communication from one endpoint to another endpoint is not possible. So, for example, as an engineer I want a test stack: I can create a test stack as Sarjan, and another person can create another test stack, B or whatever, which is independent of the stack I created. That is one thing we have done differently.

Let me say something about the bootstrap and stack creation we follow at Intuit, especially for Mint. We use CloudFormation templates to describe our instances, and we have an infrastructure-code creation mechanism that is completely config-driven, which means we can create any type of stack by feeding different configuration inputs that generate the CloudFormation templates. We are not hard-coding or keeping the CloudFormation templates as such in our infrastructure code, which made the infrastructure code scalable and reusable for any other platform with a couple of changes. We also use AWS tagging as metadata, as I said on the last slide, and that is used while we generate the CloudFormation templates. And then — thanks to Mike, who came and talked — we use SaltStack extensively to create our infrastructure.

One thing we found is that everything in the cloud is dynamic, and, especially for security reasons, Intuit has a very dynamic secrets strategy as well. So there is no hard-coding in CloudFormation or in any bootstrap script or configuration file.
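As a rough sketch of the metadata-caching pattern mentioned above — the framework, refresh interval and field selection are all assumptions, not Mint's actual service:

```python
# Hedged sketch: an internal API that polls AWS on an interval and serves the
# cached results, so booting instances make one cheap internal call instead of
# many throttled AWS API calls.
import threading
import boto3
from flask import Flask, jsonify

app = Flask(__name__)
ec2 = boto3.client("ec2", region_name="us-west-2")
_cache = {"instances": []}

def refresh_cache():
    """Refresh the cached superset of instance metadata (pagination omitted for brevity)."""
    reservations = ec2.describe_instances()["Reservations"]
    _cache["instances"] = [
        {
            "InstanceId": i["InstanceId"],
            "PrivateIpAddress": i.get("PrivateIpAddress"),
            "Tags": i.get("Tags", []),
        }
        for r in reservations
        for i in r["Instances"]
    ]
    threading.Timer(60, refresh_cache).start()  # assumed 60-second polling interval

@app.route("/metadata")
def metadata():
    # Booting instances hit this endpoint once and get everything they need.
    return jsonify(_cache)

if __name__ == "__main__":
    refresh_cache()
    app.run(host="0.0.0.0", port=8080)
```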
For example, if you want to talk to your databases, you do not have the password right away. When an instance comes up, it calls the Intuit Secret Service and gets the passwords — and if you have 100 databases, that is 100 passwords — so you have to create all the configuration files on the fly, which is what I mean by everything being dynamic. Dynamic secrets work the same way: we have 100 shards, so 100 passwords. For example, this is how we set passwords for our databases: when a database comes up, it calls the Intuit Secret Service for all the secrets it needs, we expose them as Salt pillar data, and we use that data to render all the bootstrap scripts; the bootstrap scripts then dynamically go and set the passwords for each database (a rough sketch of what such a pillar could look like follows at the end of this section). We do the same thing for our application to talk to the databases, which means there is no password that we know — nobody knows the passwords.

Something else we have done with infrastructure code is how we test it, which may be a little different from how other people do it. The way we run Salt in our infrastructure is not the master-and-minion method: we check out the code onto the instances and call Salt locally, which means the configuration is locally available when we call Salt and bootstrap the instance. So as an engineer, if I want to test my branch, I create the infrastructure from my branch; the infrastructure-creation code tells the bootstrap script to check out the branch I am using to create the stack, at the same git SHA, and it checks that out and starts creating. That means everybody can create their own stacks and test whatever infrastructure code they have written before they check it into master.

For deployment, we identify a deployment candidate after we deploy it to the end-to-end environment, which is a testing environment we have. If we decide everything is fine with the infrastructure code, we branch it and tag it — we use a git tag — and then we use that git tag to create the deployment. For any hotfix within that deployment we create another tag: the hotfix for the infrastructure goes into that branch itself and is then merged back to master, so we can move back and forth between deployments as well.

Now the last one is the high-availability strategy that we have. We are using multi-availability-zone instead of multi-region. The reason we went for multi-AZ instead of multi-region is nothing but the operational hassles you have when you go multi-region; the main one is making your data available in different regions, which is really difficult and operationally has a lot of problems. What we tried during this journey was basically to balance operability, high availability and reliability, and that is one reason we did not go multi-region — and of course it costs less when compared with multi-region. The other reason is that sometimes multi-AZ within the same region is more reliable than multi-region, for different reasons, and Amazon itself hosts on AWS the same way, multi-AZ rather than multi-region. So there are a couple of reasons why it can be more highly available than multi-region hosting, while multi-AZ is lower cost, easier to handle and highly available.

That is the end of the talk, and I am opening it up for Q&A. If you have any dev-specific questions or anything like that, I have Linu here who can give you better details.
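Going back to the dynamic secrets and Salt pillar data described above, a minimal sketch of an external pillar might look like the following; the secret-service client and the key naming are hypothetical placeholders, since the Intuit Secret Service API is internal:

```python
# Hedged sketch of a Salt external pillar that fetches per-shard database
# passwords at render time, so nothing is ever hard-coded in the repo.
def ext_pillar(minion_id, pillar, *args, **kwargs):
    """Expose dynamically fetched secrets as pillar data for this minion."""
    from secret_client import get_secret  # hypothetical internal client library

    shards = pillar.get("db_shards", [])  # shard list assumed to come from config
    return {
        "db_passwords": {
            shard: get_secret("mint/db/{}/password".format(shard))
            for shard in shards
        }
    }
```

States and bootstrap scripts would then read `pillar['db_passwords']` when setting each database's password, so the plaintext only ever exists in memory while the states render.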
Okay, so Tsunami actually works over UDP; it is an over-the-wire transfer protocol. It has an aggregator at one end; it just emits the data and does not wait for a response back the way TCP does. The data gets there and the aggregator makes sure the data is complete, so it is faster. And we didn't find any data loss, actually, so that worked well for us.

Hello, hi. You mentioned that for testing your infrastructure there is a branch that anyone can check out and then bootstrap the whole infrastructure and test it. For that testing, did you have any automation around it, or was the testing manual?

Well, there are two things, right? One is the build testing and the other is the infrastructure-code testing. We have automation for testing both the build and the infrastructure code. The build side is completely automated; an infrastructure-code change is completely developer-specific. For example, I am an engineer and I want to test my infrastructure code, which is on my branch: I create the stack and then test it. Creating the stack is completely automated, but what you want to test is the question — what are the changes you made to the infrastructure code? I don't have a unit test for that, so you have to test it the way you want. We do two types of testing. One is infrastructure-code-level testing, which is on the ops side, and we are still improving it to see where we can fill the gaps. For a production release it is similar to the traditional way: ops creates the stack with the infrastructure code, and we execute all our regression and functional tests to make sure it is worthy to serve the customer — in that way we indirectly test it right now. We also want infrastructure-level testing, like checking whether security groups are open, but that is still in progress.

For functional testing it is the web, so it is Selenium. Infrastructure code doesn't really have that — I mean, what are you changing in infrastructure code? Maybe you have a security-group change, maybe you are changing something at the AZ level, instance types, some bootstrap scripts — a lot of things like that change, so what you want to test is the question. If I want to test my change — for example, I completely changed the secrets-management system recently — my test is that I have to create a stack with the new secrets-management system and everything should come up properly. I do a REST API call, or whatever automation we have, which makes sure the infrastructure code has brought up our entire stack completely without any failure. That is what we test. One more thing we are experimenting with is something called Serverspec, a Ruby-based tool, which we are trying out to test the infrastructure.
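As a rough sketch of that kind of stack-level smoke test — the stack name, template and health URL are illustrative assumptions, not Mint's actual automation:

```python
# Hedged sketch: bring up a stack from a branch's generated template, wait for
# CloudFormation to finish, then hit a health endpoint to confirm it came up.
import boto3
import requests

cfn = boto3.client("cloudformation", region_name="us-west-2")

def smoke_test_stack(stack_name, template_body, health_url):
    cfn.create_stack(
        StackName=stack_name,
        TemplateBody=template_body,
        Capabilities=["CAPABILITY_IAM"],
    )
    # The waiter raises if the stack rolls back or times out.
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)
    return requests.get(health_url, timeout=30).status_code == 200
```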
Hey, you spoke about increasing and decreasing the size of your EBS volumes, and about some automation that was done to do it. Could you speak a little more about that — any specific libraries you used, and why — and also how often do you need to do something like this?

Okay, so first, coming to the cost optimization we did: we came down to around one-fourth of the cost we were spending on AWS, and what we do is very simple — we are not using anything big there. Every night we create a snapshot of our database EBS volume, which is also encrypted, and we add another tag, which is the utilization of the disk — how much of the disk is used. Because everything is dynamic, every deployment is actually a re-creation of the stack for us, so whenever we recreate a stack and create a database, the infrastructure code looks at the utilization tag and expands the volume if it has reached the threshold. The same thing happens with IOPS: we look at the IOPS from CloudWatch, and when we create the snapshot we record what IOPS are required — we actually take the 95th percentile of the historical data and provision the IOPS according to that.
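As a rough illustration of that sizing step — not the exact automation — a sketch like the following could pull historical EBS throughput from CloudWatch and take a 95th percentile; the two-week window and hourly aggregation are assumptions, and hourly averages will smooth out short bursts:

```python
# Hedged sketch: derive a provisioned-IOPS figure from the 95th percentile of
# historical read+write operations per second on an EBS volume.
import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client("cloudwatch", region_name="us-west-2")
PERIOD = 3600  # aggregate per hour to stay under CloudWatch datapoint limits

def p95_iops(volume_id, days=14):
    end = datetime.now(timezone.utc)
    start = end - timedelta(days=days)
    totals = {}
    for metric in ("VolumeReadOps", "VolumeWriteOps"):
        resp = cw.get_metric_statistics(
            Namespace="AWS/EBS",
            MetricName=metric,
            Dimensions=[{"Name": "VolumeId", "Value": volume_id}],
            StartTime=start,
            EndTime=end,
            Period=PERIOD,
            Statistics=["Sum"],
        )
        for dp in resp["Datapoints"]:
            totals[dp["Timestamp"]] = totals.get(dp["Timestamp"], 0) + dp["Sum"]
    rates = sorted(total / PERIOD for total in totals.values())
    if not rates:
        return 0
    return int(rates[int(0.95 * (len(rates) - 1))])

# Illustrative usage: the result could be written back as a tag on the nightly snapshot.
print(p95_iops("vol-0123456789abcdef0"))
```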
Hi, so the question is: when you chose SaltStack, did you evaluate the Amazon service called OpsWorks? At Intuit the standard tools being recommended were Opsware and Chef and so on, but Mint is a Python shop, and that is the reason we went for SaltStack: it is really easy to hook up your Python scripts, and it is a really good framework — you can do anything with it. Frankly speaking, we haven't tried OpsWorks, but we are very happy with SaltStack.

The second question is about RDS: you mentioned multi-AZ, so does that mean it will scale? We are not using RDS, by the way. On multi-AZ — you had the slide where you said you should not do multi-AZ? I didn't say don't do multi-region; I said we are using multi-AZ, and I'll tell you what happens. If you are spread across multiple regions, the Amazon deployment model is to pick one availability zone from each region and deploy there, so there is a higher chance you hit buggy code if you are multi-region, whereas with multi-AZ there is only a 33% chance that you hit buggy code, and if you do, 33% of the users are impacted — which we can live with, because within our SLAs, in 5 minutes the database comes up and the applications take up the traffic in the other availability zone.

And the last question is about spot instances: have you played around with them, and what is your experience for cost saving? We didn't try them.

Hello, I had the same question with respect to multi-zone and multi-region. Since you are only in a single region — you are in multiple zones, but suppose that region goes down — then you don't have HA, right? I'll come to that. We are in US West, so if California goes under water, we are down, and the SLA we have is 8 days to get Mint up in the other region; if California goes down, people have much bigger problems to solve, and by that time we will get our stack up in the other region. But on a serious note: Mint is using multi-AZ, while Intuit as a company is focusing its high availability and disaster recovery on multi-region. The operational hassle there is really high, though, because whether it is RDS or our own databases, we do not right now have an option to replicate encrypted data between regions. The other problem is with snapshots: a snapshot in one region is not accessible in the other region, so you have to ship the snapshot across — a region-to-region copy — which takes a long time. So cost-wise and operationally it is really difficult as of now. But as a practical scenario, one region completely going down is really, really remote, and our SLA for that is 8 days.

On the time front, I think they have now launched cross-region replication on S3, so if you store the snapshot there, that copy actually happens really fast. Well, we have databases of around 800 gigabytes, and that takes quite a lot of time — last time I tried, about 3 months back, it was around 6 hours for me, which is not acceptable for high availability. Alright, thank you.

Sticking to the AZ-and-region question again: I work with SAP Labs, and we are in a similar boat when it comes to handling data — encrypted data is a must, but there are also legal requirements that come into the picture, especially for financial data. A customer from the EU, for example, will say, "No, I don't want my data to go and sit in the US." So this is more a product issue than a technical issue. I got it. As for where we are on that: we are not global yet, but we have that in our pipeline, and what we have decided is to serve customers from the region they are from. That is one reason — if you have seen my motivation slide — we wanted to go global, and to spread your data centers like that you have to have something like AWS, because it is really difficult to set up physical data centers wherever you want. I think I am almost out of time. Thank you. Thank you.