I think we're ready to get started. Welcome to Lessons Learned from AWS: how ops and devs can work together to improve. A lot of this is about my experiences running websites on Amazon Web Services, so this is probably a good time for a few quick things about me. I have about a decade of experience as a full-stack developer before I moved on to doing operations and DevOps, over seven years using virtual servers, and five of those were with Amazon Web Services. I've worked for companies ranging in size from a two-person startup, where I was pretty much doing all the tech work, to enterprise corporations. For this talk, we're gonna walk through an example project called the Super Awesome Content Builder 3000. My experience has been in advertising, content marketing, and user-generated content, so it seemed like a perfect fit. It's gonna be a combination of a bunch of different things I've worked on over the years, the problems I've run into, and what I learned from them. What we want in the end is something that allows users to quickly and easily build super awesome content their customers will want to share, and our goal is to test the market as soon as possible. As in the talk earlier today, we're gonna start with everything on one server, because that works really well with Vagrant. And so we have the monolith. As you can see here, because I was more interested in testing my idea first, what I'm really gonna do is a WordPress install with some custom plugins. So I have my web app, Nginx or Apache, and my MySQL server; those are the core components. I realized I wasn't quite getting the performance I wanted on some stuff, so I added two cache layers: that's my Memcached and Squid. And as always, there are gonna be some tasks that I need to run, cron jobs and whatnot. So as I was building this in Vagrant, I finally decided that I needed to get some feedback on how everything was going.
And I didn't really feel like trying to figure out how to expose my local computer to the internet, because I think that has some security issues; I'm not quite sure. So I wanted to put everything in Amazon. And because it was already set up so that I could easily build this Vagrant server, I said, hey, there's this EC2 plugin. Let me just start up an EC2 server, install everything I need, and put it on there. And then I started getting alerts. I had put Pingdom and New Relic on there, and every once in a while they'd be like, hey, your site's not up right now, did you know that? You might wanna get on that. As I started looking through the logs, trying to figure out what was causing those things, I found a lot of it was resource contention. I had assumed people would only access my web app during the day, so nighttime was a great time to run my tasks. But then a web spider came along and would go through every page on my site, which circumvented some of the protection I had in the cache layer, because those pages weren't in the cache anymore; they'd been evicted, since nobody had been accessing them. So the first thing I had to do was break it up into separate components. I have two EC2 servers, one running my web app and another one for the tasks, because I'd already seen problems with resource contention there. And I used managed products for parts of the site: it didn't really seem necessary to me to install MySQL on an EC2 server and still be responsible for my own backups, for figuring out failover, and all those things, when the Relational Database Service figures that out for me. And the same thing with Memcached.
ElastiCache is fully Memcached-compatible: you just say, hey, I want a cluster, it gives you an endpoint, and you put that into your code. And CloudFront was much better than Squid, in part because it's a content distribution network; it's worldwide, so it'll be closer to my end users. I ran that way for a while, but I was still losing a little sleep. Before, a single hardware failure on the one machine took everything out. Now I'd spread out the possibility: if any one of these were to have a hardware failure, or Amazon says, hey, guess what, you're running on our older hardware and we really wanna move you, so you're gonna have to reboot, my end users are not gonna be able to access the site. I'd incrementally improved it, but I hadn't improved it enough. What I do there is build in redundancy. CloudFront already has redundancy built in, but for my web apps I need more than one server, and in Amazon the best way to do that is to put them behind Elastic Load Balancers, so they're still one endpoint and you can quickly and easily swap out the machines behind them. ElastiCache runs as a cluster, so you can put multiple nodes in there. And RDS has a common setup where you just say, I wanna run in multiple availability zones, and you'll get two MySQL servers, one active and one standby. The other thing you'll notice here is that I put everything in two different availability zones whenever I built anything. That's because in Amazon, a whole zone can become unavailable, and that's the equivalent of a full data center outage if everything is in that one zone, so you wanna spread things out a bit. So I built that new structure out with multiple servers, and during my quality testing I was getting some random, intermittent 404 errors.
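To make the Multi-AZ setup concrete, here's a minimal sketch of the parameters you'd hand to boto3's RDS client to get that active/standby pair, assuming boto3 is available; the identifier, instance class, and sizes are illustrative, not from the talk.

```python
# Sketch: parameters for a Multi-AZ MySQL instance on RDS.
# Identifier, instance class, and storage size are hypothetical.

def multi_az_mysql_params(instance_id, password):
    """Build the kwargs you'd pass to boto3's rds.create_db_instance()."""
    return {
        "DBInstanceIdentifier": instance_id,
        "Engine": "mysql",
        "DBInstanceClass": "db.t3.medium",
        "AllocatedStorage": 20,          # GiB
        "MasterUsername": "admin",
        "MasterUserPassword": password,
        "MultiAZ": True,                 # active + standby in two AZs
    }

params = multi_az_mysql_params("sacb3000-db", "change-me")
# rds = boto3.client("rds"); rds.create_db_instance(**params)
```

With `MultiAZ=True`, RDS handles placing the standby in a second availability zone and failing over to it; your code only ever sees the one endpoint.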
Files just weren't available, and as I looked more at the logs, I realized it was always on one app server and not the other. There were two common cases where I found this happening. One is when I had something automate the front-end building of CSS or JavaScript, much like some of the Grunt stuff that was talked about earlier today: those assets would be built on one server but not the other, and the Elastic Load Balancer doesn't know to always send traffic to the server that built the thing. The other is that when users upload content, that content goes to the server the user happened to hit, not both of them. So unless you have a way of putting those files in a central location, you're gonna get 404 errors. The way we decided to solve that when we ran into it was to take anything that was shared static content, images, CSS, JavaScript, put it in an S3 bucket, and then make sure the web app always pointed people there, whether that was at the Apache or Nginx level or in the code itself. One of the interesting things I'm looking forward to playing with in the future is that Amazon has a newer service called EFS, Elastic File System, which can act like a shared mount: you can keep those files together in one place and solve the same problem without having to make a lot of changes to your code or your configurations. So the next thing we ran into is that our traffic grew, and everyone likes that.
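The "point everything at the bucket" fix can be as small as a URL helper in the app code; this is a sketch, with a hypothetical bucket name and path prefixes:

```python
# Sketch: send all shared static assets to one S3 bucket so every app
# server serves the same files. Bucket name and prefixes are made up.

S3_BASE = "https://sacb3000-assets.s3.amazonaws.com"
STATIC_PREFIXES = ("/css/", "/js/", "/images/", "/uploads/")

def asset_url(path):
    """Rewrite a local static path to its S3 URL; leave app routes alone."""
    if path.startswith(STATIC_PREFIXES):
        return S3_BASE + path
    return path

assert asset_url("/uploads/gallery1.jpg") == S3_BASE + "/uploads/gallery1.jpg"
assert asset_url("/index.php") == "/index.php"   # dynamic pages untouched
```

Uploads would also be written to the bucket instead of local disk, so it no longer matters which app server the user happened to hit.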
We noticed, though, that it was a variable pattern, so we had to keep increasing servers. At the time that was a very manual process, and we would just keep the number of servers we needed to serve the peak traffic even when we were down in the valleys. This is actually one of the problems Amazon had, and part of why they created AWS: they too saw that they had all these servers sitting around during late spring and early summer, when nobody was really shopping. So what we decided to do is build in elasticity for when we need more servers. We build an Auto Scaling group, which points to the AMI image that we want to use for our servers, and we tell it: if you see certain types of events, increase the number of servers available, and when those events are over, decrease again. You can see that during the day we can go from two to three to four servers and back down to two at night, so we can save some money and match our pattern better. On top of that, I no longer have to start up the servers myself. The problem we ran into is that at first we saw increased error rates, because we hadn't really thought through what we were doing when we made this change.
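The scale-on-events rule can be sketched as a tiny decision function. In AWS you'd actually express this as CloudWatch alarms wired to scaling policies on the Auto Scaling group; the CPU thresholds and the two-to-four range here are illustrative.

```python
# Sketch: the kind of rule an Auto Scaling group applies. In practice
# this lives in CloudWatch alarms + scaling policies, not app code.

def desired_capacity(current, avg_cpu, minimum=2, maximum=4):
    """Scale out on high CPU, back in when load drops, within bounds."""
    if avg_cpu > 70 and current < maximum:
        return current + 1
    if avg_cpu < 30 and current > minimum:
        return current - 1
    return current

assert desired_capacity(2, avg_cpu=85) == 3   # daytime spike: add a server
assert desired_capacity(3, avg_cpu=20) == 2   # night: scale back down
assert desired_capacity(2, avg_cpu=20) == 2   # never below the minimum
```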
So you see here: at the low of two servers things were fine, then as we scaled to three or four we saw errors, and when we went back down to two the errors went away. What was happening is that we kept using the same AMI: we had done a release, but we never updated the AMI. So whenever we were at two servers, we were running the code we knew worked, and when we scaled to three or four, the new servers were running old code that no longer worked correctly. We fixed that by having new servers pull the latest release tag on boot, and that worked great for a while, until we decided to create a release tag before we actually did the release. Then we'd get an increase in servers, and they'd be running newer code than our original servers. If it was an impactful change, the new servers would produce errors just the same. That's when we really started having to think about our code pipeline and deploys. Every time we created a new release, whether because of code or configuration changes, we had to create a new AMI and update our Auto Scaling groups in order for it to work correctly. Then we realized what we really wanted to do was, instead of pushing changes out to existing servers and creating an AMI based on that, create the AMI first and use that same image through our different release steps. In our first level of test environment, where we're integrating all the different changes, not just our own on our local machine, we wanted to create an AMI there and test it with the other changes other developers had been making. Then, at the next level, where we were doing more builds, unit tests, and automated acceptance tests, we wanted to make sure we created the environment fresh, so that
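The discipline described here is often called "bake once, promote everywhere": a release maps to exactly one AMI, and every environment's Auto Scaling group points at that same image, so a scale-out can never pull in older or newer code than the servers already running. A toy sketch of the invariant, with all names hypothetical:

```python
# Sketch: one release tag -> one AMI, promoted unchanged through every
# environment. Tag, AMI naming, and environment names are made up.

def bake_ami(release_tag):
    """Stand-in for a Packer/CodePipeline image-bake step."""
    return f"ami-{release_tag}"

def promote(release_tag, environments):
    """Return the AMI each environment's Auto Scaling group should use."""
    ami = bake_ami(release_tag)          # baked once, from the release tag
    return {env: ami for env in environments}

rollout = promote("v1.4.2", ["integration", "test", "stage", "prod"])
assert len(set(rollout.values())) == 1   # every stage runs the same image
```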
existing servers and infrastructure wouldn't hide problems, and so that all the different transformations we made when we built a brand new server were still true and everything would work correctly. When we get to stage and prod, with QA people actually looking at the new servers and something closer to production data, there's some argument to be made about whether to update in place or create fresh, depending on how long it takes for all of that data to come in and whether data is part of your release process. But what we're doing here is making sure the AMI and the configuration work correctly at each of these levels, so we feel more confident when it comes time to release to prod. So then we ran into another problem. Our product had actually been improving really well. People liked it, it's out there, people are really enjoying it. Before, we saw these variations, and you can see most of the time traffic may be twice what it was before, but occasionally we'd get these really big spikes. At my previous job, one of the times we got this was when a cable network decided to use our product to run a sweepstakes during primetime television. We'd never run into that sort of thing before; it was a unique learning experience. What we found was that people were complaining they couldn't access the site. We've all seen this during the Super Bowl or other events: you just can't get through. I don't want to get there yet. That makes me sad. What's really happening here is that we're using Elastic Load Balancers in front of our servers, and Elastic Load Balancers are provisioned at a proportion above your normal traffic. They're expected to grow with you, but when you have the great opportunity of this kind of exponential growth, the ELB just dies.
We ran into this before they let you monitor your surge queue, how many requests are waiting to get through the ELB. Back then they didn't have any of that, and all we knew was what we saw in the logs: was load up, anything like that? Nothing. We couldn't see anything, and that's because the load balancer had basically just stopped listening. So there was no way for us to know, other than Pingdom telling us the site was really down, and all we could ask Amazon was, can you give us any more information? When you suffer that type of event, you just have to communicate with Amazon. There's nothing you as an individual can do, because what they need to know up front is that you're gonna have the event, and they need an idea of how much traffic there'll be and what it's gonna look like. Every time I had to contact them, the question list just kept growing and growing. What they will do is pre-warm your ELBs. Part of the reason you don't wanna do that yourself is that you'd have to send that sort of traffic pattern to your existing load balancer in a stepped-up way, you'd have to pay for all of it, and you'd risk basically DDoSing yourself. So that's when you just have to start talking to them. Actually, what I found with Amazon overall is that anytime you're going outside of a very normal usage pattern, you really wanna talk to them up front. They prefer to know; they prefer to work with you to make it all work. This could be anything from wanting to do security testing that may look a little weird, where they might be like, hey, what are you doing? They prefer to keep things safe, so if you're doing something really weird, they may just turn you off first and ask questions later. So finally, we ran into a rather big problem.
One thing I'd like to point out is that we started from this big monolith app that we didn't quite understand, and once we put real traffic on it, we just kept responding to what we saw. We did get more and more proactive, but we were quickly trying to release new features, and we didn't always think through what they would look like with live traffic. One of the features we released was API calls, and we kind of just made them part of the product. We didn't separate them out onto a separate subdomain, so there was no way, other than the request path, to know what type of request it was. And these requests all took different amounts of time. The D requests were the ones we really liked, because they were nice and short and we could serve them quickly. A and C were okay, but the Bs took way more time than anything else. As long as the type D requests, those short ones, were the majority of our traffic, we could sustain our growth, we could be happy, we wouldn't get paged at odd hours. But the moment we released that new feature, we had a sudden change in the shape of our request mix, and it became unsustainable. Suddenly we found that request B, the longest type, was the most common one, because people were trying out the new feature, or we'd actually really solved the problem the market was having. The way we saw that was in the logs: we were finding timeouts, and we were seeing that all of the child processes in Nginx or Apache were running full tilt. We were under load, but a lot of the requests just weren't coming through. So, the first thing to know is that Elastic Load Balancers are really simple. They give you an endpoint, you tell them which servers on the backend you want that endpoint to point to, and that's pretty much all they do.
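The reason the slow B requests were so devastating is simple capacity arithmetic: the number of busy workers is roughly request rate times request duration (Little's law), so a 40x jump in duration needs 40x the workers at the same traffic. A back-of-envelope sketch, with made-up numbers:

```python
# Why the long "B" requests saturated the app servers:
# busy workers ~= request rate x request duration (Little's law).
# The rates and durations below are illustrative, not from the talk.

def workers_needed(rate_per_sec, ms_per_request):
    return rate_per_sec * ms_per_request / 1000

# Mostly short D requests: 100 req/s at 50 ms each.
assert workers_needed(100, 50) == 5.0      # a handful of workers suffices

# Feature launch flips the mix to slow B requests: 100 req/s at 2 s each.
assert workers_needed(100, 2000) == 200.0  # far more than the pool has
```

Once the required worker count exceeds the configured Apache or Nginx process pool, requests queue up and time out, which is exactly the log signature described above.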
So the fact that we were distinguishing by request path meant we couldn't divide the traffic up at that level. We couldn't say, hey, if the request path looks light, send it to those servers, and if it looks heavy, send it to the other ones. We had to add another load balancing layer in, and we did that with HAProxy. It solved the immediate problem, which was great, and it was actually a really quick solution to implement: you just start up some new EC2 servers, put the configuration in place, put them into the load balancer, and everything works. I think we were even able to put them in alongside the app servers still in the load balancer, so there was no downtime. But the problem is that we lost the elasticity we had fought so hard to get before. We could no longer spin up more app servers, because we didn't have an API to register them with the proxy anymore. We could build all the app servers we wanted in response to traffic needs, but the traffic wouldn't end up going to them. So while this was a great temporary solution that meant our site didn't die a horrible death with lots of articles on the web about what happened to us, this is what we really had to do. If you went to the talk earlier on microservices, this will probably look familiar: the speaker talked about how the monolith app makes a lot of sense at the very beginning because it's quick and easy to set up, but eventually, if it gets popular enough, you will need to start pulling pieces out. That's what we did. We worked with our customers and told them, hey, I know we told you to use this endpoint before for your API calls, but could you switch over? We could do redirects, whatever we needed, to get that off of our main bread and butter, which is the www site.
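The HAProxy layer described above boils down to a path-based routing rule that classic ELBs couldn't express. A minimal sketch of that configuration; the backend names, paths, and addresses are all hypothetical:

```haproxy
# Sketch: route by request path, which the ELB could not do.
# Names, paths, and addresses are illustrative.

frontend www
    bind *:80
    acl is_api path_beg /api
    use_backend api_servers if is_api
    default_backend web_servers

backend web_servers
    balance roundrobin
    server web1 10.0.1.10:80 check
    server web2 10.0.2.10:80 check

backend api_servers
    balance roundrobin
    server api1 10.0.1.20:80 check
    server api2 10.0.2.20:80 check
```

The catch the talk describes is visible here: the `server` lines are static, so new instances launched by an Auto Scaling group receive no traffic until someone rewrites this file and reloads HAProxy.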
Because at the end of the day, customers cared less about how they accessed the API than about their users seeing the content they worked so hard to put out there. And what's great about this setup is that our problem wasn't at the ElastiCache level and it wasn't at the database level, so we could just add those load balancers and keep the rest of our infrastructure the same. We can even keep the same code base, and maybe down the road we'll decide to segment out the code base too; it depends. For now, everything else is the same, other than using two different endpoints to access it. We can put Auto Scaling groups on those Elastic Load Balancers, and depending on what resource contention we're seeing, we can change which events trigger the change in the number of instances in each Auto Scaling group. So I know I went through a lot there quickly, and that's actually kind of on purpose. One of the biggest lessons I've found in running on Amazon is how much faster we're able to change how things look. I know I've skipped over how Virtual Private Cloud works, how security groups work, all that networking stuff, but overall we can make quick changes in response to what we're seeing from our market. In addition, I didn't worry about trying to figure out which database server was going to be best up front, because I didn't know what was gonna be best; we hadn't seen any live traffic yet. But I knew MySQL on RDS was gonna fit the bill for now and give me a lot of growth options. Maybe down the road, because this is basically a content site, DynamoDB is the better option, but I can switch it out because it's just an endpoint. We'd change the code, and hopefully we've encapsulated the calls so we're not changing MySQL calls all throughout the code base, but that's a topic for another day.
Also, the cost structure and the ease of creating new infrastructure make horizontal scaling easier. When I first started doing a lot of this stuff, we were in a colo, and part of the reason the monolith model became so prevalent is that we only had so many spaces on the rack and it was really hard to grow out. So what we'd do is put in more RAM or bigger hard drives, and you'd create these monstrosities where, if any particular component started to fail or you needed to do more with it, you didn't have the physical room. Amazon cuts that out, because you can just spin up more virtual servers; they figured that part out for us. And there's the pay-as-you-go cost structure. When I ran two servers in the morning, three in the afternoon, four at dinner time, and then back to two, I'm only paying for the number of servers I'm using throughout the day; I'm not paying for four for the entire day. So I don't have to justify adding more, because in the end the cost should be less than if we just kept vertically scaling, particularly since the vertical scaling cost often doubles each time you go up in size, in memory or in CPUs. It also lets team members work together to solve the more difficult problems rather than simply throwing more server at it, in large part because it doesn't really make sense to vertically scale, but also because we're no longer waiting weeks, if not longer, for a new database server to become available, to order the hardware, to modify it. And a lot of times, with the product set Amazon has, I feel the argument should be why not to use something, rather than which technology we should use. So if Amazon offers a queue service, I personally would say, hey, does that queue service meet your needs for now?
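The pay-as-you-go arithmetic above is easy to check: two servers in the morning, three in the afternoon, four at dinner, two overnight, versus keeping four running all day. The hourly rate is hypothetical; only the server counts come from the talk.

```python
# Sketch of the day's cost under elastic vs. fixed capacity.
# RATE is an illustrative price, not a real EC2 rate.

RATE = 0.10  # dollars per server-hour, hypothetical

schedule = [
    (2, 10),  # overnight: 2 servers for 10 hours
    (2, 6),   # morning:   2 servers for 6 hours
    (3, 4),   # afternoon: 3 servers for 4 hours
    (4, 4),   # dinner:    4 servers for 4 hours
]

server_hours = sum(n * h for n, h in schedule)   # 60 server-hours
elastic_cost = server_hours * RATE
fixed_cost = 4 * 24 * RATE                       # 96 server-hours

assert elastic_cost < fixed_cost                 # elasticity wins
```

At these made-up numbers, matching capacity to the daily pattern uses 60 server-hours instead of 96, a bit under two-thirds of the always-on cost.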
And if so, try it out. Write your code in a way that you can pull it out later, but now you're getting right to market, because in the end we're creating products here, right? People don't care what queue service we use; all they care about is whether the photo gallery they created is accessible to their end users. And we're doing everything in response to need. We don't have to plan out how we're gonna get more rack space or more servers. And it allows us to adopt more mature design patterns, but doesn't force you to. We had a really immature design pattern at the beginning, that monolith server, which pretty much everyone knows isn't going to serve you well once you start putting real traffic at it, but we could do it; there's no reason not to at the beginning. And we eventually ended up in a pattern where we were setting up a continuous deployment pipeline and potentially even doing continuous integration. They've offered more services that help with that sort of stuff; CodePipeline itself and Lambda, I believe, both work toward that. So, next steps. Most of this could have been done with a WordPress site, and that's what I suggest the next steps for you are. One of the first things you can do: Amazon offers a free tier, 21 products or services within certain usage limits. I actually ran my personal site at that level for over a year; obviously, eventually I had to start paying. It was great. I used Drupal instead, but I definitely had that web layer, the cache, RDS, and CloudFront in front of it all. And they have great white papers and other information for getting started. You can use Elastic Beanstalk, which we didn't talk about here, but it's one of those things: Amazon often offers a variety of ways of doing things depending on what your needs are.
So if your needs are really simple, if you're doing a PHP app or a Java app or something and you just wanna get it up on a web server, Elastic Beanstalk can get you there right away, because it has already solved most of the other problems you'll face. Or you can use CloudFormation, which is a template where you describe what services and infrastructure you need, and then you run it and it builds them for you. So those are two ways to get up and running right away, and then you can just start playing with those parameters and see what happens. If you'd like to contact me, my email's toherlyatconstantcontact.com, Twitter is Tracy I'm Hurley, and my website is tracyhurley.com. So, are there any questions?

Thanks, Tracy. I was wondering if you could explain how you do performance testing in AWS, and how does Amazon handle that?

Okay. A lot of it's gonna vary based on which tools you want. I'm trying to remember her name; there was a presentation right before this where they talked about WebPageTest and how you can use it to test how long it takes for your URL to load. I've gotten pretty far with ab, Apache Bench, at the very beginning. The other nice thing is that you can also use Amazon's templates and everything to start up groups of servers for your load testing.

What about data on the backend? Like SQL performance data, or the end-to-end picture, to figure out where the slow link is?

Right. I don't have quite as much experience with that stuff. A lot of times I just did log analysis to try to figure out where things were at, but there's gonna be plenty of information online to help with that.

Does Amazon supply any of that information?

Yeah, there's CloudWatch, which will definitely give you CPU and memory metrics and such, to see how your servers are performing.
But also, I know at the Elastic Load Balancer level it will tell you how long requests are taking to be served.

Great, thank you. What's the timeframe on all of these changes? Obviously you presented it in about half an hour; it took a lot longer to do.

Yeah, definitely. The question's a little hard to answer, only because this is a combination of things we've done. But at the company I worked at before, including coding everything, we put a product to market in under six months. The product let people build surveys, galleries, stuff like that, and publicize them out, and it had much of the same infrastructure. That was under six months.

You mentioned that when you had a very high peak, you had to call Amazon. Is it still the case that it's impossible to automate everything, because the load balancer needs babysitting?

The load balancer itself is kind of a tricky one, because they set your upper capacity limit for you based on your current traffic levels, and they don't want to make it easy to move that, because you could be taking capacity away from other people; at the end of the day, they have a finite amount of ELB capacity. You can run into the same problem when you first turn a new one on. If you do a deploy strategy like blue-green, where you have ELBs in front of each set and you're gonna put a newly created load balancer in front of your current production traffic, going from zero to your current traffic can itself be a hockey stick issue. Some of the things you can do are contact them, or use the DNS level to slowly ramp it up. The only other way is to try to warm it up yourself, but then you could end up paying for all those requests going through your load balancer, and Amazon might get mad at you if you send obviously fake traffic through.
So they might think you're DDoSing yourself.

For a personal experimental project, do you prefer infrastructure as a service, like AWS, or platform as a service, like Heroku?

Okay, so personally, I tend to like Amazon. A large part of that is just that it's what I do; I'm a systems engineer, so understanding how the infrastructure works is pretty important to me. And a lot of things like Heroku didn't exist when I was first putting my website out there, so I would have to do a lot of new learning. But everyone's gonna be different, and I think for some projects, like I said, if you're just running WordPress, there's no reason not to use something more like Heroku if you don't wanna learn the infrastructure part.

Do EC2 instances go to sleep like Heroku dynos, do you know?

They only do that if you set it up that way: if you put your servers into an Auto Scaling group, it will scale up and down depending on your traffic. That's the closest equivalent.

Thank you. Thank you.