My name's Neil Amtige. I'm a senior consultant at Ensono Digital, which didn't actually exist when I wrote this slide. We were still part of Amido; we got absorbed by Ensono Digital in August. So some of the slides are going to have the word Amido in them, some Ensono Digital, and we still haven't come up with a good acronym for Ensono Digital. I've been around some time. Most recently I've been working on government projects for Amido. Before that I was in charge of the cloud operations team at Skyscanner, and I did the Kubernetes implementation there. Before that, VMware, which wasn't by choice, we got acquired. We were working on their vCloud Air platform, which sort of disappeared soon after we arrived. And before that it continues; I go a long way back, to the 1980s, so I'm fairly old, and I did mainframe work.

A small disclaimer on this talk: these views are my own. It's mainly AWS focused, but that's because I've worked on AWS for the last 11 years; most of it will apply to GCP or Azure or Oracle Cloud or IBM Cloud. I wrote a lot of these slides in the summer, so things may have moved on since then. And obviously I've hidden some of the identities, and I'm a pretty rubbish presenter. I haven't presented in the real world since before COVID, and it's quite strange looking at a group of people without a blurred background, which is how I've been living for the last three years. I live in the Scottish Highlands, I don't see people anymore, and I think this is the first time I've been on a plane since 2019.

Today what I'm going to cover: a quick intro on the cloud. Cost management is the main focus, because that is where things get quite expensive in the cloud. Then limits within the cloud, security, running Kubernetes in the cloud, and architecting for failure. I've done a lot of cloud projects over the years, and I've seen a lot of things go horribly, horribly wrong. I've been on a lot of incident calls, and I've learned things the hard way. And I know it's lunch next, so I will not overrun, because I'm hungry as well.

I'm certainly not a cloud expert. Anyone who claims to be a cloud expert is either lying to themselves or lying to you. In the last year, AWS made over 2,000 posts to their what's new feed. I certainly didn't read them all, and anyone who claims to have read them all is not an expert. I don't think even AWS has someone who's an expert on the whole of their cloud. Having all the certifications does not make you an expert; it means you can pass exams. I have worked with people with all those exams, and they are pretty useless at what they do. I currently don't hold any of them because I haven't got around to renewing my own.

Right. What is the cloud? This is the Wikipedia definition of cloud. I'm not going to read it because I'm dyslexic, and I will get the words wrong. But the main thing is the last sentence: it can lead to unexpected operating costs for unaware users. Essentially, we all know what the cloud is. It's someone else's data centre. You get the cost savings of scale, you get added services, and you don't have to employ any hardware people. There's CAPEX versus OPEX, which I could explain, but it gets rather boring quite quickly. Essentially, you're moving from that spaghetti mess, which is something I actually used to look after, to some whizzy Google facility, or AWS facility. I couldn't use an AWS photo because they don't seem to supply them. There are the hyperscale providers. I hate the name, but it seems to have stuck around. Amazon nowadays is the big boy in this market.
Azure is creeping up behind them, Google seems to have sat at the bottom for some time, and there are a few other Chinese suppliers. There are also a number of other cloud providers; there were 294 when I looked in the summer. There are people claiming to be cloud providers that probably aren't, but the cloud can mean anything, and you can also obviously run your own cloud. OpenStack still exists. I was going to put the word VMware on there, but I just couldn't make myself do it, but there are an awful lot of private clouds running on VMware stacks.

Why move? You get elastic capacity; you can scale up and down with load. In theory there are no limits, but as we'll go through in a while, there are. This was a graph from an advert that was run at Skyscanner. They didn't tell us that it was going to run in South Korea, and that was the sudden peak in load we got. This was a real incident. If we'd still been running in our DC, we'd have spiked out around then: we'd have hit that line and lost all that traffic. Luckily at that point we were in the cloud for the majority of our workload, and we actually scaled up and absorbed that level of traffic. The next morning we asked them to maybe tell us next time before they run an advert, so we could have pre-prepared a little bit, but we absorbed that load by scaling up.

Essentially, in the cloud you get instant compute resources if you've got a credit card. Preferably someone else's, not your own. You don't have to wait for provisioning of services. If you want a test environment, you press a button, you grab a coffee, and you've got your test environment. And then, of course, you forget to clean up after yourself, and this is where the money starts rolling. Cost management is probably 50% of my time in the cloud nowadays when I'm doing a customer deployment. Controlling cost is very hard in a cloud environment, especially AWS; it can just spread. AWS do not provide really good tooling for controlling cost, and quite often it's not available without switching it on. I think Azure's is slightly better, but AWS is quite bad. I've been doing this 11 years, and I make mistakes; I'm quite happy to admit that.

I quite liked this graphic that I saw the other day on Twitter. It's not my own, but it sort of makes sense to me. Basically, you start with a simple deployment. Developers want a simple environment to test an application in. You create a little IaC pipeline and a Git branch, which will deploy when you commit to Git. So you do everything right. It's going to cost about $400 a month. Everyone's happy with that, management are happy, you get sign off, that's fine. And I still remember sitting in the meeting when we agreed that we would actually skip the automation of cleaning up environments and let the developers tidy up after themselves. Because they don't. They never will. So we found we had 50 environments running. Suddenly those 50 environments are racking up on the order of $200,000 a year, and people start to get slightly more annoyed when you're spending that sort of money by accident. Then, of course, they want more services. So you put in some SQS queues, some NAT gateways, some Kinesis, some DynamoDB, and all of a sudden you're looking at a bill of $1.2 million a year. I was going to say pounds, and then I realised they're about the same now anyway, so it doesn't matter.
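A budget alert won't stop the spend, but it does tell you early that an account is drifting the way this one did. A minimal Terraform sketch of that, with a made-up limit and a hypothetical email address:

```hcl
# Hypothetical monthly cost budget with an email alert at 80% of the limit.
# The name, amount, and address are illustrative, not from the talk.
resource "aws_budgets_budget" "monthly" {
  name         = "monthly-spend"
  budget_type  = "COST"
  limit_amount = "1000"   # USD per month; set it at "finance starts asking questions" level
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["cloud-costs@example.com"]  # hypothetical address
  }
}
```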
And, all of a sudden, you're not the most popular person in the world when finance sees that bill, or you're potentially looking at this situation, which I've seen happen. So you want to build in cleaning up your resources from the beginning. Never trust developers to clean up after themselves, because they won't. AWS Nuke I really like; it will just wipe an account for you, it cleans up automatically. Cloud Custodian is another open source product which will do much the same thing, and it will give you an idea of what's running. And every month I normally have a quick skim through the AWS bill to see what's going on and look for things I'm not expecting to be there.

AWS are really good at providing automatic monitoring tools. They look really sexy. Or, if you've got a lot of money, you go and buy Datadog, which does much the same thing. We switched this on recently on an EKS cluster. It was really easy: a few lines of Terraform, whacked it in, git commit. Then I noticed within about a week we'd blown $3,000 on CloudWatch metrics, because we hadn't bothered checking how many metrics it would emit. This was a small test environment. Unfortunately, it was also a customer's environment, so I had a discussion with them, wondering why we'd spent $3,000 on virtually nothing.

And the old favourite: data transfer costs. If this picture comes up... this is not my picture, I stole it off the internet, but I don't understand how you would work out data transfer costs in the cloud from it. You pay for data transit all over the place. I sat down with an AWS TAM to try and work out how we were going to get charged for data transfer, and he couldn't tell us, which was quite worrying because he didn't understand it either. So you need to keep a very close eye on moving data around. Essentially, you pay for data moving between regions in your own account, quite often you pay for data moving between availability zones in the same region, or between different cloud services. At Skyscanner, our data transfer model was very strange because we were pulling a lot of data in, and we were spending $200,000 a month just on data coming into our accounts. And of course, Bezos has a yacht to pay for, so this is probably where the data transfer money is going.

Let's have a look at scalability. The common perception is that the cloud is infinitely scalable. It's a magic place, and it's not. Resources are finite, and my clicker has stopped working. There it goes. Quite often, when you operate at large scale, you find EC2 instances are not available at any one time. This was, if I remember rightly, when I was trying to scale a Kubernetes cluster and there were just no C5s in that region. Because at the end of the day, a cloud is not a magical place, it's just racks of servers, and they do run out. Especially if there's a big incident going on in another region, capacity just disappears. This is my favourite quote. I sat with a QA engineer at VMware and he wanted as many IPs as he liked, routed to his laptop, and I said you can't; you can't break the rules of IP networking. We only had two subnets or something at the time, and we couldn't route as many IPs as he wanted. You need to plan for unavailability. Expect that you won't get what you want at any one time. Generally, if we run Kubernetes, I'll run it on at least five different instance families of compute, which allows any one instance family to just disappear.
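Roughly what that looks like in Terraform is an autoscaling group with a mixed instances policy spread across several families and AZs. This is a sketch only: the names, sizes, and the AMI variable are made up, and the spot settings here are the same mechanism as the spot instances that come up later in the talk.

```hcl
variable "node_ami_id" { type = string }                 # assumed: an existing node AMI
variable "private_subnet_ids" { type = list(string) }    # assumed: subnets spread across AZs

resource "aws_launch_template" "nodes" {
  name_prefix   = "k8s-nodes-"
  image_id      = var.node_ami_id
  instance_type = "m5.large"   # default, overridden below
}

resource "aws_autoscaling_group" "nodes" {
  name                = "k8s-nodes"
  min_size            = 3
  max_size            = 30
  vpc_zone_identifier = var.private_subnet_ids

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 2
      on_demand_percentage_above_base_capacity = 0   # everything above the base is spot
      spot_allocation_strategy                 = "capacity-optimized"
    }
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.nodes.id
        version            = "$Latest"
      }
      # Several instance families, so losing any one of them isn't a problem.
      override { instance_type = "m5.large" }
      override { instance_type = "m5a.large" }
      override { instance_type = "m6i.large" }
      override { instance_type = "c5.large" }
      override { instance_type = "r5.large" }
    }
  }
}
```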
Don't assume you'll get what you want, essentially. Have some backup plans. At Skyscanner at the time we were running five different instance families, and quite often we ran on very old instances as well, because no one wanted them and they were cheaper. But then you get to the point where AWS start ripping out those instance families, so you need to think about that as well.

The cloud essentially functions via API calls. Any time you use the CLI or an SDK, it's calling an API somewhere underneath. They're quite inconsistent as well; certainly AWS's are. And every API has a throttle or limit to protect everyone, so you don't get one person killing the API servers. Quite often you get throttled. Every account seems to have a limit on its API usage, and they don't publish those limits. I sat with an AWS guy a few years ago and he explained it to me, and I glazed over within minutes. There's some mathematical way of doing it. And they seem to change as well, so they're not consistent: what you got yesterday doesn't guarantee you'll get those API calls today. So if you run a bad script, you can suddenly run out of API calls at the account level. Kubernetes used to be really bad at this; the cluster autoscaler used to hammer the API. I know the SIG lead of the cluster autoscaler, so I can slag him off, but we did make it a lot better than it used to be. It's also quite often very difficult to find the script that's hammering the API. The TAM, sorry, I should have changed that, I mean the technical account manager at AWS, has got access to tools which can help you if you ask him nicely, or bribe him or something. You can drill into CloudWatch logs, but it's not easy to find. The best thing is to educate your developers not to do stupid things like hammering the API until they get the response they want. Some backoff and jitter in their code would be good, or just go and slap them if not.

Accounts do have limits to stop you hurting yourself, so if you get a brand new account, quite often you can only have a certain number of EC2 instances, and then you need to raise those limits. It's a good idea, but it's really annoying if you spin up a load of production accounts, because it means you have to raise loads of support tickets. Cloud Custodian has a really nice feature to raise these limits automatically. Essentially it raises AWS support tickets on your behalf, but it had a small problem, certainly it used to: it didn't actually check whether it had already raised that ticket, so every hour it would raise a new ticket, then another one, for AWS to action. Eventually the support team get really annoyed because they get a lot of duplicate tickets. I need to check; I'm not sure if they've fixed that code. It can take some time to get account limits raised, and I found that if you ask for a big raise, they won't do it automatically, but if you ask for lots of small raises, they tend to go through automatically. So do little jumps, 10 or 15%, and the limit gets raised; ask for one big jump and it won't be. There is now also a way, certainly for enterprise customers, to get new accounts with limits raised automatically.
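You can also manage quota raises as code rather than clicking through support tickets. A hedged Terraform sketch of one quota; the quota code shown is what I believe is the EC2 "Running On-Demand Standard instances" vCPU quota, but treat it as an assumption and look your codes up before using this.

```hcl
# Hypothetical example: manage an EC2 service quota through Terraform, so new
# accounts come up with a sensible limit instead of a pile of manual tickets.
# Applying a value above the current quota opens an increase request with AWS.
resource "aws_servicequotas_service_quota" "ec2_standard_vcpus" {
  service_code = "ec2"
  quota_code   = "L-1216C47A"   # assumed quota code; verify with `aws service-quotas list-service-quotas`
  value        = 256
}
```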
Spot instances. This is AWS focused again, but I know it applies to Azure and GCP; they've just got slightly different names. Essentially, AWS has a huge pool of unused capacity at any time, and spot instances allow you to use that resource that no one else is willing to pay for, at a much reduced price. But it can disappear with two minutes' warning; they actually tell you two minutes before they take it. So you need to allow for the fact that that capacity may just suddenly disappear. Kubernetes runs really nicely in this sort of environment because it will auto-heal. You can't monitor and predict it. Some people claim you can, and you used to be able to in the past, but you're just guessing. So if you're running on spot, ask for lots of different instance types and availability zones; give yourself a nice spread. There was an incident in London last month when they lost the air conditioning, and all of a sudden all the spot instances in London disappeared, because they got consumed by people who were willing to pay for them. Unfortunately, at the time I had Kubernetes clusters running on spot, which just disappeared. But you can save tons of money. We did this at Skyscanner, and we were saving in the order of millions of dollars a year by running on spot, which is why we did it. We took that risk. It was at the point where it actually materially changed our share price because we moved to spot.

Security. This is a common trend on Reddit. I looked last night and there were two more people claiming their accounts had been hacked; there are just thousands of people claiming their accounts have been hacked every day. Normally it means they haven't set a password, or a decent password, or MFA, so people are just hammering their accounts to log on. If it was as big a problem as that, I think AWS would have solved it; I think it's people getting in with poor passwords. If you create a new AWS account, put an MFA token on it. People will actively try to get into that account to abuse your resources. I've had it happen to me. I had an account, I'm not going to use the word hacked, I had a Jenkins server which had a flaw in it and someone got in that way. But within an hour they had instances running in every region, mining Bitcoin, and within hours we'd racked up a huge bill. It was an embarrassing situation because it was on my boss's credit card. Or, as I have also done, you push your credentials to GitHub. I did it in a demo at a conference; I did a git push by accident. Luckily, AWS are pretty good: they scan the same feeds that people are watching on GitHub and actually tell you you've just compromised your own keys. But again, the account got abused quite badly.

The general trend, if you look at Reddit, is "it's not my problem, it's AWS's fault". It's not. You signed up for a service and didn't follow common sense, essentially. AWS could help by enforcing MFA on root accounts, but as someone who deploys lots of AWS accounts automatically, I know that would really be a pain in the ass for me, having to put in an MFA every time I created a new account. When we do automatic deployments, we tend not to set root passwords on the account at all, so they can't be exploited that way, and we only set a root password when we need one, because you need access to the email account. You are responsible for the bill. I saw one guy last night who had a $50,000 bill on his credit card, and he was slightly concerned at that and assumed AWS would just write the bill off. They generally do help, but he still had about $20,000 to pay. People seem to assume that if they do something stupid, AWS will just absorb the cost. They're not going to keep writing that off.
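On the MFA point a moment ago: one common pattern is an IAM policy that denies nearly everything unless the caller has MFA. This is a simplified sketch of that idea, not the full policy from the AWS docs; real versions carve out exactly the calls a user needs to enrol their own MFA device.

```hcl
# Hypothetical "no MFA, no access" policy for human IAM users.
data "aws_iam_policy_document" "require_mfa" {
  statement {
    sid         = "DenyAllWithoutMFA"
    effect      = "Deny"
    not_actions = ["iam:*", "sts:GetSessionToken"]   # leave room to bootstrap MFA
    resources   = ["*"]

    condition {
      test     = "BoolIfExists"
      variable = "aws:MultiFactorAuthPresent"
      values   = ["false"]
    }
  }
}

resource "aws_iam_policy" "require_mfa" {
  name   = "require-mfa"
  policy = data.aws_iam_policy_document.require_mfa.json
}
```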
When you sign up to an AWS account, you have the shared responsibility model, and the key thing there is access management. You are responsible for access to your own account. If someone got into the hardware, that's AWS's problem, and I'm pretty certain that would never happen; I'd be very surprised if it did.

The AWS free tier is not free. Another common mistake: it doesn't mean there is no cost involved. Only certain services are free, and you can run up huge bills. This is where, to me, Azure is better. You can have free accounts on Azure which stop you spending more money; they just stop working. AWS won't stop you spending money. I suppose he has a yacht to pay for, but it would be useful if AWS had the same facility of a free account with hard limits on it, because there are no limits on a free tier account. You can run a simple CloudFormation demo, leave it running, and it will rack up $50,000 or $100,000 in a year if you just forget about it, and all of a sudden you've got a huge bill that you are responsible for. I personally don't run AWS accounts of my own for demos at the moment. I've got an A Cloud Guru subscription which gives me access to their sandbox AWS accounts, which just disappear after six hours, I think. I pay $400 a year for that service and it makes my life a lot easier. There are some limits on those accounts and what you can do, you can't spin up lots of instances, but for testing they're great. You just spin them up and forget about them after six hours, or clean them up if you want to be helpful. Or regularly use AWS Nuke, which is a nice tool: it will go through all the regions in your account, tell you what's running, and destroy it all if you run it in that mode. If I'm running an AWS account, when I finish for the day I'll just run AWS Nuke and kill everything to make sure it's clean. Because there's always something lurking in a corner that you've forgotten about, racking up a bill. An EBS snapshot left lying around will cost you money. If it's two or three cents you don't think it's a lot, but it accrues over a year and all of a sudden it is a lot of money.

Be super careful with access keys. With the new SSO and short-lived credentials, you tend not to need access keys, so try and get rid of them. There's a new feature, IAM Roles Anywhere, which actually allows you to authenticate with certificates. We've just started a new project, and from day one we decided we will not have any keys anywhere, and it's making life a lot easier. GitHub Actions allows you to authenticate without keys and get short-lived credentials. And put MFA on everything. A password on its own is not enough to protect an account that someone can spend hundreds of thousands of dollars on very quickly. Again, AWS is poor in this area; the MFA story is not great, you can only have one device per root account and things like that. And don't commit keys to GitHub. You can subscribe to feeds listing keys as they arrive on GitHub; as someone commits them, you can pick them up and start exploiting them. With AWS keys you don't even need to know which AWS account they relate to, they just work, so you can log on to a random person's account and start abusing it in seconds, as happened to me.
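The keyless GitHub Actions setup just mentioned is usually done with an OIDC provider plus a role the workflow assumes, so no long-lived keys exist at all. A minimal sketch under assumptions: the org/repo and role name are made up, and you should verify the thumbprint yourself rather than trusting this value.

```hcl
# Hypothetical keyless CI auth: GitHub Actions assumes a role via OIDC and
# gets short-lived credentials instead of stored access keys.
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"]  # commonly published value; verify it
}

data "aws_iam_policy_document" "github_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    # Only workflows from this (hypothetical) repo may assume the role.
    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:my-org/my-repo:*"]
    }
  }
}

resource "aws_iam_role" "deploy" {
  name               = "github-actions-deploy"
  assume_role_policy = data.aws_iam_policy_document.github_trust.json
}
```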
A bit about architecting. This was a diagram someone actually provided to me for a technical interview, showing me how clever they were, using just about every AWS service in the world. AWS has got about 200 services, GCP and Azure about 100 each. You don't actually have to use all of them at once; it's not a game, but it amazes me. Keep things simple. You don't have to over-architect things, and it's easier to maintain. A good rule I always follow: if you can't fix something at 2am with a hangover, it's probably too complicated, because it gets very difficult, especially if you're still drunk.

Servers... certainly I haven't used Azure for a while, but in AWS servers will just die. They just disappear, or they become unresponsive. If you're lucky AWS will give you some warning, but quite often they don't; it just happens randomly. Every host you deploy should be replaceable with no manual effort. If you need to log on to a server to do something, you're doing it wrong. Again, if you need to SSH onto something, I'd rather it be disabled. Quite often nowadays you don't even run servers in the cloud anyway: you've got Lambdas, and Fargate, which is quite a nice thing for running EKS and ECS containers, where essentially AWS maintain the hosts for you. This is something I found from a talk I gave about nine years ago. We used to treat servers as pets; they're now cattle. You don't look after a server anymore, you just shoot it if you don't want it anymore, and as I actually live on a farm now, I'm quite used to the shooting cattle bit. You don't worry about a cow or a sheep; you shouldn't worry about your servers.

This is something which really annoys me, and I keep getting caught by it. When you create a certificate in AWS, you get the option of validating it by email or DNS, and you always click email because you can't be bothered to set up the DNS record. But 12 months later you get an email to say your SSL cert needs renewing, and you quite often forget to click the link in the email, or it goes to an email account which you no longer have access to, or it goes to a group account and everyone thinks someone else clicked the link. It takes a bit of extra effort to do DNS validation, but once you do it, you don't need to do it next year and the year after (there's a Terraform sketch of this a little further down). I just put this in because yet again I got caught by it about a month ago; someone didn't click the link in the email. I think AWS have made it easier now; they even give you the commands to set up the DNS record automatically.

All regions in AWS are not equal. us-east-1 is a dumpster fire; anyone that runs stuff in us-east-1 is stupid, essentially. But a lot of the core services run out of us-east-1, so you can't block it: IAM runs out of us-east-1, so if you block it, you can't authenticate. us-east-2 is now also showing the same signs, so I wouldn't use that one either. I think Netflix is still quite heavily in some of those US regions, and they get capacity and things that you're never going to get, essentially.

Burstable instances. Really cool, really cheap: the T-series in AWS, and I think Azure's B-series. Essentially, they allow you to pay less for an instance and have some burstable capacity when you need it. You earn credits when you're not using the CPU, and the instances are cheaper. But you can run out of credits and they flatline: they essentially sit there and do nothing until you start accruing more credits. So you need to be a bit careful about using burstable instances.
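Going back to the certificate point: the DNS-validated version in Terraform looks roughly like this, so renewal is automatic and nobody has to click a link in an email next year. The domain name is made up, and it assumes a Route 53 hosted zone you control.

```hcl
# Hypothetical ACM certificate with DNS validation records in Route 53.
data "aws_route53_zone" "main" {
  name = "example.com."   # assumed hosted zone
}

resource "aws_acm_certificate" "site" {
  domain_name       = "app.example.com"   # hypothetical domain
  validation_method = "DNS"
}

# One validation CNAME per domain on the certificate.
resource "aws_route53_record" "cert_validation" {
  for_each = {
    for dvo in aws_acm_certificate.site.domain_validation_options :
    dvo.domain_name => {
      name   = dvo.resource_record_name
      type   = dvo.resource_record_type
      record = dvo.resource_record_value
    }
  }

  zone_id = data.aws_route53_zone.main.zone_id
  name    = each.value.name
  type    = each.value.type
  ttl     = 60
  records = [each.value.record]
}

# Waits until ACM sees the DNS records and issues the certificate.
resource "aws_acm_certificate_validation" "site" {
  certificate_arn         = aws_acm_certificate.site.arn
  validation_record_fqdns = [for r in aws_route53_record.cert_validation : r.fqdn]
}
```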
ClickOps. Don't run things through the console. Don't create stuff in the console if you can avoid it, because you won't remember what you did the next time someone asks you to do it. Spend a bit more time scripting things with the CLI, or Terraform, or CDK. Crossplane is something I'm starting to look at, which is really nice because it enforces your infrastructure: if you make a change outside it, Crossplane will put it back, a bit like Puppet used to do for hosts. Terraform is quite good and quite popular, but it's only when you run Terraform that things get put back; Crossplane will enforce it continuously. I haven't managed to convince a client to use it yet, but it's on my list of things I want to do. And do everything through a GitOps workflow, if only to be lazy: commit to Git, do a PR, and on that PR run an action to deploy the code. Everyone's got audit traceability, and life's easier that way. And you get to blame the person that approved the PR for missing the mistake as well.

I should keep going as fast as I can; I'm not going to reach the end, I don't think. Don't expect resources to be available. AWS is quite good at releasing new features that are not available in all regions, so you need to check that what you want is in the region you want to run in. And they do give some customers pre-release access to things; I know when we were at Skyscanner we got EKS before it was generally available. I said it was rubbish and they went away and sorted it. At the moment, I think it may be getting better. There's a GPU shortage, so GPU nodes are quite restricted. I'm going to skip that. This is my favourite thing: if you've got a consultant in to do a lift and shift, check they remove the VMware agents from the hosts, because it looks really bad. I'm going to skip this entire thing because I've got two minutes left.

If you're running Kubernetes, the default should be to use the managed service unless you want to run something really specialised, because it's cheaper. AWS EKS is like $60 a month; rather than running my own cluster and maintaining it, I'd just give AWS $60 a month. That's lunch, probably. Skip that. The Kubernetes cluster autoscaler will fight the cloud's own autoscaling quite regularly. The cluster autoscaler will be scaling things up, and the AWS autoscaling group will be trying to rebalance nodes across availability zones, so the two end up fighting each other. It's happened, I've seen it: the nodes just go up and down, and it's a mess. You either create an ASG per AZ, or use Karpenter, which is the new AWS-native cluster autoscaler. Same with PVCs, which are stateful storage in Kubernetes. By default on AWS they're created as EBS volumes, and those EBS volumes are stuck in one availability zone, so if the AZ goes down, your data is staying there. EFS works quite well for that, but it needs a bit of extra work.

So, in summary: wasting money is bad. If you save money for the company, hopefully they're going to spend it on you; bear that in mind. I was quite pleased when I saved millions of dollars for a large company and they gave me a £100 Amazon voucher for it. I'd rather they hadn't. Learn from your mistakes and share them; try not to make the same mistake twice, because it's genuinely embarrassing. There are my contact details, and I will post these slides. I have a red sign at the back that says stop. I will be around for a while, and feel free to ask me any questions.
I'll be hanging around for the next few days or contact me on any of these things. Right, thank you.