Hi, everyone. Welcome to this tiny-audience session on migrating and running a service on AWS. We're going to talk about how we migrated Mollom to Amazon from a classic data center solution, from touchable hardware to non-touchable hardware. I'm Nick Veenhof, and this is Ricardo Amaro.

Just a little bit about me. I currently live in Ghent, and I see one other person here who also lives in Ghent. I also lived here in Barcelona for two years, in Lisbon, and more recently in Boston. I've been with Drupal for eight years. Drupal was a little awkward in the beginning, but then I moved into the search space, and more recently into Mollom and the Acquia Lift products. I'm a principal software engineer at Acquia, so it's very good to be back here in Barcelona. Before that I spent a lot of time doing a little bit of Drupal with a company called ATACY Stamos; I don't know if anyone has heard of that. Well, it was fun.

But I'm not going to talk about me; I'm going to talk about Mollom, and I suppose most of you know Mollom. Is there anyone who doesn't know Mollom? So that's good, right? It's a spam protection service. We try to offer a fully managed software-as-a-service, which is free and has a paid version as well, and which tries to come back with a result, ham or spam, within about 50 milliseconds. That's what we try to offer. And it's built in Java. So I'm going to let Ricardo introduce himself.

Hello. My name is Ricardo. I live in Lisbon, Portugal, with my family, and I also try to facilitate the local Drupal community. I've been using Drupal for seven years now, I've been an early Linux adopter since the 90s, and I've worked for years now at Acquia, where I'm a senior tier-two ops engineer.

Before we go into the actual migration, let's give you the big picture of the state of Mollom when it was acquired. Operations is a very important part of Acquia engineering. Basically every product that needs love arrives at operations, along with some unknowns, and Mollom was no different. Here's the text we had for some time in our wiki explaining how we were going to handle things for Mollom in terms of operations. You can see that we were only responsible for whether the Mollom service was up or down, and for the basic services being available, such as SSH, MySQL, Apache, and NGINX; if further problems arrived, we were to escalate directly to the Mollom engineer.

The whole infrastructure looked like this, the result of a lot of engineering work during the four years before us. We're not focusing today on the diagram itself but on how complex it was, especially because it ran in a classic data center across multiple regions; so, you know, not exactly a normal service to manage. Mollom had just one Java engineer, plus some freelancers helping out with building the product, and this person was really not hired to be available 24/7.
However, most of the problems did not reside in the product itself but in third-party software such as MySQL, Cassandra, NGINX, and so on. So let's suppose this Java engineer was a cat. After Acquia acquired Mollom, Ops supported the infrastructure, and the Java engineer was the expert on the Mollom codebase. However, 95% of everybody's time went to dealing with alerts and outages caused by infrastructure issues: power lines that were cut, disk failures, lack of automation. And did I mention this is a non-cloud environment? So you can imagine.

This is a graph of the alerts generated by the Mollom infrastructure, and it shows how this was working. We were receiving about 20 million HTTP requests per day and 8 million spam requests per day, and on the worst day, that point there, we had around 300 alerts to deal with. You can see the point where we switched to AWS and how we were able to reduce the number of alerts on the Ops team; so fewer wake-ups, of course.

Some of the problems were also simply not automated, due to the lack of expertise in the third-party software. Here's one really funny example: this one was continuously waking up operations at 3 a.m., which is really not very nice. Reading these instructions at that time of night was really bizarre. Who in the audience has tried to run rm -rf on something in the middle of the night, with a star in it? Yeah, bad things can happen there, that's for sure. So we got rid of that. We're just painting the bad picture here, of course; now for the good picture.

All right, we've prepared an exercise for you. You're all DevOps people, or engineers, or something else; you build HA applications. Who doesn't build HA applications? Okay, I'll skip you, that's fine. What we're going to do here is build a highly available audience application, and we're going to do it with cards. One row is one component, and I'll show you the components later. I need to be able to take down any one human in the audience. The order is important, so I want you to start from the front to the back, and these are the components I want you to order yourselves by. I'll answer questions if you have them, but it should be pretty self-explanatory. We just hand out the cards; you figure out how to do it. One more thing: you can all read what's on the cards, but don't mix them up, because you can only have two components for the whole row. Each row is one region: your US East, for example, or your Europe.
So I want you to start with the first row. What service are you? You're a Varnish. So when I type in google.com... hold it up high. Right, but that's not HA yet, so I want you to do this exercise right now: make sure that the first row is HA, the second row is HA, and so on. First thing is DNS, second thing is... okay, and then the third one. I expect at least two cards in every row, because if I take one down and everything is down, that's a single point of failure, and that's not what we want. Keep them up high enough so that people can see. The third row, behind that, is just a load balancer, you know. Well, I see web, and then I don't see anything anymore. So, DB... I only see one cache. You have multiple caches? Three caches, that's good. Four caches. All right, so if I shoot down one cache, you're dead. So how does that work with the web server? How does the web server know which cache to go to? It's a difficult question, I know. Actually, you need another load balancer there with the caches. You should probably have answered that; yeah, I did answer it myself, but you need another load balancer in front of the caches. So you can see it already gets complex, and we only have six components and a pretty small audience. Remember that diagram from before, from Mollom? So this really wasn't an easy exercise, moving all these single components from one infrastructure to another without taking it down. All right, you can keep the cards if you want.

Now I'm going to talk about ephemeralism. This whole cloud thing we've all heard about: it's really not that physical server you own anymore; it's a thing that can always disappear. Ephemeralism is a word I had a really hard time understanding. Peter Wolanin is here in the audience, and he first told me, "Let's make this server ephemeral." What is he saying? I had never heard of that word. But in Amazon, servers can just disappear. This book was an eye-opener to me, and I really recommend you go and get it if you're into building infrastructures, cloud systems, or distributed systems. I won't go into detail about what it actually covers, but it tries to tell the most optimal story of a distributed system, one that does not exist today; it doesn't talk about specific technologies.

This is one of the theorems that the book goes into. Maybe some of you are familiar with it, but it was important for me to understand the CAP theorem. For example, if you have MySQL, you need consistency: right after you write, you need to be able to read that write back. Mollom is not such a system; we have eventual consistency. It could be that a spam comment or two goes into the system, and we don't care about the order in which they're actually processed. That's as much as I'll say about the CAP theorem, but look it up if you're interested.

So, fast forward: this is where we want to go, and I'll explain what it all means as we continue. Is there anyone who knows all these Amazon services, or who works with Amazon on a daily basis, except for the Acquia guys? So no one aside from those three has ever worked with Amazon. Maybe you know EC2, the virtual machines, but there's more, and I'll try to explain it through Mollom and this whole migration.

It all starts with CloudFormation. CloudFormation is kind of a language, built in JSON, or rather it uses JSON as a convention, and we use it to spin up all the resources you've seen in that diagram. As an example, you can see here how you define, I think, a load balancer. Here you define the load balancer, and CloudFormation will spin it up for you.
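The slide itself isn't reproduced in this transcript, but a CloudFormation resource of that kind looks roughly like the following sketch. This is a minimal, illustrative fragment, not Mollom's actual template; the resource and parameter names are invented.

```json
{
  "ApiLoadBalancer": {
    "Type": "AWS::ElasticLoadBalancing::LoadBalancer",
    "Properties": {
      "Subnets": [{ "Ref": "PublicSubnet" }],
      "Listeners": [{
        "LoadBalancerPort": "443",
        "InstancePort": "8080",
        "Protocol": "HTTPS",
        "SSLCertificateId": { "Ref": "ApiCertificateArn" }
      }],
      "HealthCheck": {
        "Target": "HTTP:8080/ping",
        "Interval": "30",
        "Timeout": "5",
        "HealthyThreshold": "3",
        "UnhealthyThreshold": "2"
      }
    }
  }
}
```

You feed a template full of resources like this to CloudFormation, and it creates, updates, or tears down everything in it as one unit.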
The good thing is that it's versioned, so whenever you make a change to your infrastructure, you actually know what change you made. Remember, in Amazon you don't have that single guy you can call to change a configuration, or to replace a plug, or to do anything like that. So CloudFormation was a critical key component for us, to make sure we went down the right route and to make sure we could repeat what we did: we could throw everything away and start again within 5 to 10 minutes.

The first part of the migration was to switch the API nodes, and they consist of Auto Scaling groups (for example, if you have more load, you add more servers), an Elastic Load Balancer (basically the load balancer you had in front of you), Elastic Compute Cloud (EC2, which is just a VM on Amazon), and then the Java application on top of it.

So, continuing: the first thing we actually do is devise our network infrastructure. Amazon has a service called VPC, where you can define subnets, external IPs, and so on; it's the one networking infrastructure where you define your whole architecture, your databases, everything. It's very important, if you're starting with Amazon, to start with VPC, because some of the newer regions, like Frankfurt, only have VPC. It might not make sense to you now, but maybe you start in two weeks and then you realize you should have listened to the guy. This is more what it looks like (there's a small sketch of such a template after this part). You see there are internal IP addresses; some of the servers within Mollom don't need exposure to the public internet, so they don't need a public IP. For example, that load balancer in front of the caches that we had in the back: there's no reason it should have a public IP. You shouldn't be able to get there without first going into your instance. And no, the ELB is not something you log into, so it can be internal as well. Again, you can define all of this with CloudFormation.

Then we get into the database part of things. When you do a migration with data, it's always hard. Mollom stores some of its data in MySQL and some of it in Cassandra, and we couldn't have writes going to both MySQL databases at the same time, to the old data center and to the new one, because at some point we might end up with a mismatch, and then it gets really hard. So the first part of our migration was to move MySQL to Amazon, and we used the service called RDS. It's purely MySQL on Amazon; there's nothing difficult about it. It's MySQL. You can also define it as being HA, and there's a whole bunch of metrics and other configuration you can put in there.
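Here's the VPC sketch promised a moment ago: a VPC plus a private subnet, in the same hedged, illustrative style (names and CIDR ranges invented, not Mollom's real network).

```json
{
  "MollomVpc": {
    "Type": "AWS::EC2::VPC",
    "Properties": { "CidrBlock": "10.0.0.0/16" }
  },
  "PrivateSubnet": {
    "Type": "AWS::EC2::Subnet",
    "Properties": {
      "VpcId": { "Ref": "MollomVpc" },
      "CidrBlock": "10.0.1.0/24",
      "AvailabilityZone": "us-east-1a"
    }
  }
}
```

Servers launched into a private subnet like this simply never get a public IP, which is exactly what you want for something like that internal cache load balancer.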
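And for the RDS part, making the MySQL instance highly available is mostly a matter of flipping MultiAZ on. Another invented fragment, just to show the shape of it:

```json
{
  "MollomDatabase": {
    "Type": "AWS::RDS::DBInstance",
    "Properties": {
      "Engine": "MySQL",
      "DBInstanceClass": "db.m3.large",
      "AllocatedStorage": "100",
      "MultiAZ": true,
      "MasterUsername": { "Ref": "DbUser" },
      "MasterUserPassword": { "Ref": "DbPassword" }
    }
  }
}
```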
To continue: the biggest part of Mollom is storing those comments, analyzing them, making sure you get the right CAPTCHA, all this instant data retrieval. Have you worked with Cassandra before? You've heard of it? So it's kind of a key-value store, NoSQL, all those buzzwords. DynamoDB is exactly the same idea, but the Amazon version of it. The good part for us: we are no experts in Cassandra; we are experts in our software. Cassandra was the biggest pain point for us; we had no clue how to configure it optimally. So going towards a service that in the end has the same price tag, while you get the added benefit of people behind it who actually know what they're doing, is great. The way we read it: this is Cassandra without the alerts.

So once we had our MySQL database moved to Amazon, we decided on our key-value storage solution. We don't really care if there's some inconsistency, so we wrote to both places at the same time. Does that make sense, with the CAP theorem in mind? Some things really have to be consistent, like a bank. For some of these comments, we don't really care: maybe it's there, maybe it's not. That allowed us to do load testing on a service we didn't know.

The tricky thing about DynamoDB, if you ever get to it, is that it has fixed limits. You set, for example, 100 writes per second, and once you go over those 100 writes per second, the API will tell you that you're not allowed to do this. If you're used to running your own software, that's not something you think about; the limit would be your CPU or anything else, not the API telling you no, you've hit your limit. And it gets trickier when you get into the details of DynamoDB. DynamoDB is this massive service, and you don't know if it's one server or four servers or even more; you also don't know how many partitions it has. If you have 100 writes per second and you have four partitions, each partition gets 25. If you write to one specific item more than 25 times a second, you reach your limit, but you don't know how many partitions you have. So you have to be very careful in designing your system so that you don't write to the same key many times. It's a different way of thinking.

As we moved on, we also had an issue with DynamoDB: it didn't have a time-to-live. People who work with, for example, Redis probably know TTLs: you store a key, and maybe 20 minutes later it's gone. If you have 20 million requests a day, you don't want these things to live forever, because you pay for your data, and DynamoDB doesn't support that; it doesn't auto-delete these things. So we came up with a solution called Dynamic DynamoDB Manager, and we have rotational tables: every day a new table gets created, and after a week that table gets deleted, because iterating over every item to delete it would eat up our limits. So there's a lot of thinking that comes with using a service like this. The software is open source and available if you ever need it.
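To make the fixed-limit behavior concrete, here's a small Python (boto3) sketch; this is not Mollom's code, and the table and attribute names are invented. The point is that the API throws while your CPU sits idle:

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

def store_comment(table, comment_id, body):
    """Write one analyzed comment to DynamoDB."""
    try:
        dynamodb.put_item(
            TableName=table,
            Item={"id": {"S": comment_id}, "body": {"S": body}},
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ProvisionedThroughputExceededException":
            # 100 provisioned writes/s spread over, say, 4 internal
            # partitions leaves only ~25 writes/s per partition, so a
            # hot key throttles long before the table-wide limit.
            raise  # a real client would retry with exponential backoff
        raise
```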
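And the rotating-tables idea, for which the real implementation is their open-source Dynamic DynamoDB Manager, reduces to something like this toy sketch (table prefix and throughput numbers invented):

```python
import boto3
from datetime import date, timedelta

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

def rotate_tables(prefix="comments"):
    """Create today's table and drop last week's, so old data expires
    without iterating over (and paying for) every single item."""
    today = date.today()
    dynamodb.create_table(
        TableName="{}_{}".format(prefix, today.strftime("%Y%m%d")),
        AttributeDefinitions=[{"AttributeName": "id", "AttributeType": "S"}],
        KeySchema=[{"AttributeName": "id", "KeyType": "HASH"}],
        ProvisionedThroughput={"ReadCapacityUnits": 100, "WriteCapacityUnits": 100},
    )
    week_old = today - timedelta(days=7)
    # a real rotation job would tolerate the old table already being gone
    dynamodb.delete_table(TableName="{}_{}".format(prefix, week_old.strftime("%Y%m%d")))
```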
Moving on: we have the hard part done, the data is done, so we can move on to the stateless API servers. This is what I think you might be familiar with, the regular EC2 instances. It's very easy: you spin one up, it has Linux, you start messing around, and then you don't remember what you did, right? Let's try not to do that. Our vision for EC2 is that only in emergencies do you log in and make a change. If you spin up the server, it should configure everything from the start.

If you want to make a change, you delete the server and start a new one. If you want to do an update, you delete the server and make a new one. If you want to revert a change, you delete the server and spin up a new one with the previous version. Remember, your infrastructure is versioned, because you have CloudFormation, so you can always go back to the previous version. Except for your data; that's the wild card. You cannot really go back to a previous state of your data unless you have backups. So it's the easiest and most well-known system, and we treat all these EC2 instances as disposable: we can kill a single one and a new one will spin up automatically. It can go away, and Amazon can tell us, "Sorry, we don't know where it is," and it's no problem; it will automatically spin up a new one, because in CloudFormation we said we need a minimum of four servers. This is also how it looks in CloudFormation (a sketch follows after this part); if you want to mess around with it at home, you can copy this, and I can also give you some of the actual templates, filtered for secrets.

Using these things, the load balancer together with EC2, gives you added benefits, like access logs. It could be that you need to adhere to certain audit formats, so it stores them in S3, Amazon's storage system. You get health checks: in Mollom, if one API server goes down for some reason, as in the software fails, Amazon takes down the server and automatically spins up a new one. We don't care if Tomcat crashed; we are not restarting Tomcat, we're killing the server. Too bad for Tomcat. It also has connection draining, so when it kills a server, it actually waits until the connections are completed. And some more stuff.

So you think, okay, these servers are configured with Puppet or Chef, right? Because that's what we all use. That's not true: we're not using Puppet or Chef. It's all in CloudFormation; we don't do any orchestration of the software. When the server starts, it installs the software, starts the software, and that's where it stops. It's a little different from what I think you're used to, or maybe not, depending on how you work, but for us this was an eye-opener. We don't have to mess around with a Puppet server anymore, because, I don't know how you experience it, but to me the Puppet server is horrible: it needs that SSL communication all the time, and if your server dies, you have to redo all your clients. Chef is probably a little better, but you still have to manage that Chef server. With this, we actually removed that part of the infrastructure.

Metrics are also important. Amazon comes with a service called CloudWatch, and it's great but expensive. On the servers themselves we run a piece of software called Diamond, and Diamond can collect the metrics and send them to either CloudWatch or StatsD; you can get pretty detailed. This Diamond tooling actually creates the alerts for us on the fly. So remember: in the middle of the night, a server goes away; a new server starts, it starts the service, it adds the alarms, and it just does everything automatically, so that if something ever happens that needs to reach operations, it can (there's a sketch of that alarm-registration idea below). And then for alarms we use PagerDuty; maybe that's known to some of you. I'm not going into much detail.
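The Auto Scaling piece promised above, "minimum four servers, each one installs its own software at boot", would look roughly like the following CloudFormation fragment. Again, this is a hedged sketch: the AMI parameter, package names, and S3 bucket are invented, and a real bootstrap obviously does more.

```json
{
  "ApiLaunchConfig": {
    "Type": "AWS::AutoScaling::LaunchConfiguration",
    "Properties": {
      "ImageId": { "Ref": "BaseAmi" },
      "InstanceType": "m3.large",
      "UserData": { "Fn::Base64": { "Fn::Join": ["\n", [
        "#!/bin/bash",
        "yum install -y java-1.8.0-openjdk tomcat",
        "aws s3 cp s3://example-bucket/api.war /var/lib/tomcat/webapps/",
        "service tomcat start"
      ]]}}
    }
  },
  "ApiAutoScalingGroup": {
    "Type": "AWS::AutoScaling::AutoScalingGroup",
    "Properties": {
      "LaunchConfigurationName": { "Ref": "ApiLaunchConfig" },
      "MinSize": "4",
      "MaxSize": "12",
      "LoadBalancerNames": [{ "Ref": "ApiLoadBalancer" }],
      "VPCZoneIdentifier": [{ "Ref": "PrivateSubnet" }]
    }
  }
}
```

Everything the server needs happens in UserData at boot, which is why there's no Puppet or Chef server left to manage.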
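And the "server registers its own alarm at boot" trick reduces to something like this minimal Python (boto3) sketch, with an invented threshold and an SNS topic standing in for the PagerDuty hookup:

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def register_own_alarm(instance_id, alert_topic_arn):
    """Called from the boot script: every freshly spawned server wires
    up its own CPU alarm, so a replacement instance is monitored the
    moment it exists, with no human in the loop."""
    cloudwatch.put_metric_alarm(
        AlarmName="cpu-high-{}".format(instance_id),
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        Statistic="Average",
        Period=60,
        EvaluationPeriods=5,
        Threshold=90.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[alert_topic_arn],  # e.g. an SNS topic feeding PagerDuty
    )
```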
How much time do I have left? Probably not much. Well, so, these alarms. Who of you here are in operations, or actually receive alerts? Have you ever received an alert bomb? One server or service goes down and takes down everything else, and then it's up to you to find out what the actual problem is. If you haven't had that before, bless you; it's really horrible to get that storm. Alerts would be useful if things weren't connected to each other, but in our world everything is connected to everything. For instance, we had another piece of infrastructure, the search part. Imagine this: we had something like 300 cores in Java, and we had a check for each core. If that Java server went down, you got 300 alerts just because one server went down, which was really painful, because you just needed to fix that one server, and you'd go searching for the right alert among 300. So we fixed that; that's an example of an alert bomb. What we built to help operations with this is ordering of the alerts: when those alerts occur, the top one is the most important one to actually look at at 3 a.m. That's very useful.

So that's more or less the whole infrastructure, and then there's the actual migration. How do you migrate a service if you only have one endpoint? We had api.mollom.com, and we could have switched the IP on the DNS, and it could have exploded. It probably would have exploded, because sending over 20 million requests from a running service to a cold service is asking for trouble; I hope some of you can relate. But there's this service from Dyn (DynECT), and Amazon Route 53 can do it too, that returns a different IP based on your geolocation. So if you're in US East and you go to api.mollom.com, you get one IP address, which is the old data center; if you're in Australia, you get the new data center. This is how we switched Mollom over bit by bit without using a different domain name. It really helped us move a small segment of our traffic onto the new infrastructure, do some testing, do some bug fixing, and not take the service down. I highly recommend this if you're actually migrating services, because it's very cheap, and for some reason I haven't seen it suggested much on Stack Overflow as a solution, but I think it's awesome.
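Geolocation routing like that is easy to express in Route 53. A hedged sketch (the zone, set identifiers, and RFC 5737 example IPs are all invented) of serving North America from the old data center while everyone else gets AWS:

```json
{
  "ApiDnsOldDatacenter": {
    "Type": "AWS::Route53::RecordSet",
    "Properties": {
      "HostedZoneName": "mollom.com.",
      "Name": "api.mollom.com.",
      "Type": "A",
      "TTL": "60",
      "SetIdentifier": "old-datacenter-north-america",
      "GeoLocation": { "ContinentCode": "NA" },
      "ResourceRecords": ["192.0.2.10"]
    }
  },
  "ApiDnsAwsDefault": {
    "Type": "AWS::Route53::RecordSet",
    "Properties": {
      "HostedZoneName": "mollom.com.",
      "Name": "api.mollom.com.",
      "Type": "A",
      "TTL": "60",
      "SetIdentifier": "aws-everyone-else",
      "GeoLocation": { "CountryCode": "*" },
      "ResourceRecords": ["198.51.100.20"]
    }
  }
}
```

Shrinking the old data center's territory one record at a time is what lets you move traffic over bit by bit.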
So as a result, we now have happy dev-ing and happy ops-ing, and operations only jumps in for actual operations that cannot be automated, like disk sizes: if you spin up an instance with 100 gigs and it gets full, you still need someone to actually say, "I need 150." There's no way to automate that really easily at the moment; you probably could, but we haven't done it yet. And your auto scaling is not always fast enough to react by adding more servers, so you can have an alert ask ops, "Please add more servers." In the ideal world, all operations are automated, and developer and operations are not separated but combined in one role. You can call it DevOps; I call it an engineer. With that, I'd like to thank you for listening, and I hope there are maybe some questions. You have to go to that mic on the other end.

"I haven't used CloudFormation, but can you go into some detail about how you do the provisioning?"

The provisioning, sure. With CloudFormation, you can store the template in S3, and in the Amazon console you say, "I want to launch this specific template now." It also allows you to do updates: once you make a change, you upload it to the same S3 bucket, you go into your stack in the console, and it's all in the UI, or with scripts, you can choose, and you say, "I want to update this stack with this template," and Amazon reads out that template. You can actually refer to sub-templates, so you can split up your CloudFormation files per service. Maybe I can give you an example; I have to stop this first... for some reason it's not very cooperative... all right. I'm not going to show much of the CloudFormation itself, because there could be some confidential stuff in there, but on the left you can see the structure. Our base template is the file where we define the security groups and certain things we really need in advance. The first thing that happens is that it refers to the VPC template to create the network, and you can actually rely on outputs of certain sub-stacks; CloudFormation is smart enough to know that it has to provision that first, to then provision the second thing, and then the third thing. And the only script we actually have here, and this is not porn but POM, just telling you, is the upload-to-S3 script. There are no other bash scripts in there, no Puppet, no Chef, nothing else. There's nothing on your local computer that you need to rely on. If a new employee comes in, he clones the repository, he follows the steps, and he's set. I remember when I started with some products within Acquia, it took much more time to do certain things. So I hope that answers your question a bit. (It did, thank you.)
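For readers following along at home: the sub-template wiring described in that answer might look roughly like this invented fragment. CloudFormation sees the Fn::GetAtt on the VPC stack's output and knows it has to build that stack first:

```json
{
  "VpcStack": {
    "Type": "AWS::CloudFormation::Stack",
    "Properties": {
      "TemplateURL": "https://s3.amazonaws.com/example-bucket/vpc.template.json"
    }
  },
  "ApiStack": {
    "Type": "AWS::CloudFormation::Stack",
    "Properties": {
      "TemplateURL": "https://s3.amazonaws.com/example-bucket/api.template.json",
      "Parameters": {
        "SubnetId": { "Fn::GetAtt": ["VpcStack", "Outputs.PrivateSubnetId"] }
      }
    }
  }
}
```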
Any other questions? "You talked about MySQL. I suppose MySQL should be online all the time. Did you move it all in a single step, or did you suffer from latency issues while MySQL was being migrated? In which order did you do the MySQL migration?"

So luckily, it was all in US East. Our old data center was more or less in the same region, I think Philadelphia, I don't remember exactly where, and the new database was in Virginia, Amazon's US East. It was the hardest problem. We actually had to arrange some downtime with Mollom to say, "This is where it stops." We had a backup, and then we replayed the logs for the last 15 minutes, for example, and then we were back up to speed. But we had to stop it, and until we had actually migrated everything to Amazon, we had added delay, latency, towards that database. Luckily, the service is built in such a way that we don't rely on MySQL for your spam request or your comment analysis; we only add results to MySQL afterwards, asynchronously. It's only Cassandra, or DynamoDB, that really needs to be available, and we were able to run those two at the same time; with the DNS switch, some people were on Cassandra and some people were on DynamoDB, depending on where they were.

"One question from my side: did you run into any issues with the ELBs when you were switching traffic over initially, or did it just work? What kind of issues did you have, too much traffic on the ELBs?"

We warned Amazon in advance. We told them we were adding this amount of requests, because we knew exactly how many requests to expect. There's a process, I don't know if you all know it, called pre-warming. It's not very well documented, but you can just call them and say, "Hey, this is going to happen; please warm up and scale to this range, more or less," and they will ask you what kind of connections you're going to have. Thanks.

Any other question? A last question? No? So, tomorrow you all start with Amazon. I just wanted to remind you that I'm not working for Amazon or preaching for them; some of these ideas you can also take to other data centers that offer cloud services. So I hope that from now on you'll assume that any part of a cloud system can go away, and build your system around that, with fewer alerts. All right, thank you for your attention.