Hey, good morning. Yeah, thanks for coming by. This conference is all about sharing, and we already heard great keynotes where a common theme was sharing. Some people shared code; kudos to Solomon for pulling that off live on stage. Some people shared experience. Amazon shares code as well, even though not everybody I talk to is familiar with that: we contribute to the Linux kernel and the Xen project, for example, and we have our own open source projects. Did you know that the entire control plane of ECS, the Elastic Container Service, is available on GitHub? You can go and check that out if you want.

But my real focus today is not on code. I actually want to share experience. Our CEO Jeff Bezos once said there is no compression algorithm for experience. That means you really need to make all these mistakes to learn from them. Now, you don't have to personally make all these mistakes; you can watch others fail and hopefully learn from them. As Dan Coombe pointed out already, Amazon has been in this business for 10 years now. About 10 years ago we started with S3, the Simple Storage Service, which provided blob storage on the internet, and shortly after came EC2, our Elastic Compute Cloud. I'm from EC2 myself, as Angela pointed out. We learned a lot of lessons over those 10 years about what you can do and what you shouldn't do. So today I want to share 10 lessons that we learned the hard way. Can I get a quick count here: how many of you have tried to build a cloud yourselves? Yeah, it's hard, isn't it? You learn a bunch of lessons. So for those of you, and everyone else, let's dive in a little bit and see what mistakes you can make and what we can all learn from them.

Number one: build evolvable systems. The cloud is all about sheer, infinite scale. Infrastructure as a service is the promise that when you need compute power, you get it. You just order it, use it as long as you need it, and when you don't need it anymore, you get rid of it. On the hardware side, while there is certainly some challenge there, it is something you can do; essentially it's all about adding more racks to the data center. But to do the same in software is much harder. Most people who design complex software systems, when asked "does it scale?", think about some kind of worst-case scenario. You want to scale along some dimension. And you know what? You get it wrong most of the time. Either you over-design and put in stuff you never need, or, as we've seen recently with some bigger software launches, you expect a certain load, a certain number of users, and you end up with 70 times that, and it just brings down your system.

So the lesson we learned is: don't design for a certain scale. You always get it wrong. What you want to do instead is design your system so you can evolve it, evolve it over time without the customers or users knowing. We see ourselves as somebody who provides a service to customers who really don't care about all the intricacies of IT. They want their service to just be up. That means we need to build our systems in a way that lets us make those changes without the customers noticing. Essentially, it comes down to having a plane, flying it, and over time turning it from a Cessna into an Airbus A380 in midair without the passengers noticing. That's tricky. But that's really what this business is about.
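To make that "turn the Cessna into an A380 in midair" idea a bit more concrete, here is a minimal sketch, entirely an illustration rather than anything from AWS, of the pattern it implies: keep the customer-facing contract stable and swap the implementation behind it.

```python
from abc import ABC, abstractmethod

class Storage(ABC):
    """Stable, customer-facing contract: this is all callers ever depend on."""
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...
    @abstractmethod
    def get(self, key: str) -> bytes: ...

class SingleNodeStorage(Storage):
    """The 'Cessna': the simple backend you launch with."""
    def __init__(self):
        self._data = {}
    def put(self, key, data):
        self._data[key] = data
    def get(self, key):
        return self._data[key]

class ReplicatedStorage(Storage):
    """The 'A380': a later backend that replicates writes across nodes."""
    def __init__(self, nodes):
        self._nodes = nodes  # each node is itself a Storage
    def put(self, key, data):
        for node in self._nodes:
            node.put(key, data)
    def get(self, key):
        return self._nodes[0].get(key)

# Callers only ever see `Storage`, so the backend can be replaced
# in flight without customers noticing anything changed.
store: Storage = ReplicatedStorage([SingleNodeStorage(), SingleNodeStorage()])
store.put("greeting", b"hello")
assert store.get("greeting") == b"hello"
```

The names here are invented; the point is only that evolvability comes from keeping the contract fixed while everything behind it stays replaceable.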
Let's look at the next one. Number two: expect the unexpected. No matter how much money you pay for your hardware, it will eventually fail. And I'm not even talking about software, which can obviously fail in many, many interesting ways. This isn't a game you can win with money. At Amazon, we try to buy reasonably expensive hardware that is very reliable, but the mean time between failures is not infinite. And the biggest challenge is that what on paper is an error that shows up every few years becomes, with enough scale, a failure pattern you see multiple times an hour. At that point, it really hurts.

So you need to design your systems in ways that let you deal with failure. You need to check for failure and handle it: build components that isolate failures and then react to them, control the blast radius, and embrace failure as a natural, everyday occurrence in your software and hardware. In EC2, one example I can give you is that we gather crash dumps. Sometimes a machine crashes, and that's a bad thing, because one customer, or many, just lost their instances. But if you at least take crash dumps, you can analyze them and react. That way, when you roll out a new kernel, for example, you very quickly realize that something is wrong, and you know what action to take to control the blast radius.
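One common way to implement that "isolate failures and control the blast radius" advice is a circuit breaker in front of each dependency. This is a minimal sketch of the general pattern, an illustration rather than anything from EC2's internals:

```python
import time

class CircuitBreaker:
    """Isolates a failing dependency so its errors don't cascade."""
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def call(self, fn, *args, **kwargs):
        # While open, fail fast instead of hammering the broken component.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: dependency isolated")
            self.opened_at = None  # half-open: let one probe call through
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip: contain the blast radius
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

The point is not this particular class but the behavior: once a component is known to be bad, callers fail fast and the rest of the system keeps flying.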
Lesson number three: primitives. This may be a bit of a surprise to some of you, but we don't see AWS as a framework. It is a collection of primitives, tools, building blocks that we encourage our customers to use. You can use them, but you don't have to. In fact, if you think a partner or some other service provider offers a better building block for your need, by all means use it. Let me give you an example from the EC2 world. We have our own operating system, Amazon Linux. It's the default every time you launch an instance; if you don't make another choice, you get Amazon Linux. We work really hard to make Amazon Linux the best operating system you can run on EC2 instances, but at the same time we spend a lot of effort helping the greater Linux ecosystem and other operating system vendors optimize their OS for our cloud as well. We don't want Amazon Linux to be the only option. Certainly, if you don't have another requirement, fine, use it; it's a great OS. But if for some reason you prefer something else, we have no problem with that. If it suits your need better, use it. The reason is that we don't know, to the full extent, what the customer wants to do, and we don't want to tell you how you need to work. Also, when we start a new service or feature, we often start small. We want feedback from customers, and if we see them take it in new, creative directions we hadn't thought about, we want to keep up and evolve quickly to follow their needs. That kind of agility is much easier with primitives than with frameworks.

And obviously, automation is key. If you want to scale up, you need some form of automation in place, and automation also helps you avoid anti-patterns. If you SSH into a server to do operational support, you're not doing it right. Think of how you develop software: you have source code control systems, you have automated test and build systems, you deploy automatically, or at least you should. But on the operations side, a lot of people fall short. Operational automation is still in its infancy with most people. Change that; it's important. If somebody logs into your server with a root account and makes a mutating change, that person literally destroys all previous knowledge that everybody else has about that machine, and sure enough, the next time you have a problem, nobody knows what happened to the box. It simply doesn't scale, so get rid of it. How? Very simple: you need APIs, so you can automate the interaction. At Amazon, we're great fans of APIs. We don't use them only internally: pretty much every service in AWS has not only a web console you can interact with but also an equivalent API you can use to interact with the service programmatically. That helps you enormously. Essentially, you can build a whole virtual data center from JSON files if you want. That is the level of automation you really need to reach.

But APIs have drawbacks, because they're essentially forever, at least if you want to keep your customers happy. Obviously, you can keep changing APIs, but that's usually a bad idea, because every time you change an API, you give your customers a reason to reconsider whether this is still the best solution for them. And if there's an alternative, they may use the opportunity, since they have to make changes anyway, and go somewhere else. So APIs are good, but APIs can be really tricky. On the EC2 side, many of you know API calls like RunInstances, DescribeInstances, or AttachVolume. Those even made it into competing cloud offerings, which is great, by the way. But what about the XenStore API, the QEMU device model, the x86 architecture? Do you realize those are essentially APIs as well? They are hard to change. Think of QEMU: if you change the device model, certain operating systems will decide, oh great, I'm on a new computer that I haven't been registered on, I need to re-register. If this is an automated service that just wants to boot and take customer calls, it's really not great if that machine stops and asks you to re-register the OS. That's just bad. So you need to be very, very cautious and conscious about the APIs you have, and make sure you don't change them. Even those. In the QEMU case, we were essentially forced to fork QEMU, because upstream QEMU made changes to the device model that we didn't want in our case, because they would have suddenly exposed customers to a new API and caused pain. That's the kind of price you have to pay, even though I really didn't like that decision.
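As a flavor of what "a whole virtual data center from JSON files" can look like in practice, here is a minimal sketch driving the EC2 API through the boto3 SDK. The JSON layout, the names, and the AMI ID are all made up for illustration:

```python
import json
import boto3  # the official AWS SDK for Python

# A hypothetical JSON description of a tiny "data center".
SPEC = json.loads("""
{
  "instances": [
    {"name": "web-1", "image": "ami-12345678", "type": "t2.micro"},
    {"name": "web-2", "image": "ami-12345678", "type": "t2.micro"}
  ]
}
""")

ec2 = boto3.client("ec2")

for spec in SPEC["instances"]:
    # run_instances is the same call the web console makes under the hood.
    resp = ec2.run_instances(
        ImageId=spec["image"],
        InstanceType=spec["type"],
        MinCount=1,
        MaxCount=1,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "Name", "Value": spec["name"]}],
        }],
    )
    print(spec["name"], "->", resp["Instances"][0]["InstanceId"])
```

Kept in source control, a file like this becomes the authoritative description of the environment, and nobody ever needs to SSH in as root to "fix" a box by hand.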
Know your resource usage. This is another fairly important lesson, and it may sound trivial: you want to offer your service at the lowest possible cost to your customers, but you still want to be operationally profitable. So understanding all the costs you generate is essential to setting your pricing model. With S3, the first service I mentioned earlier, we made one of those mistakes when we launched. S3 is a great service: you can store any kind of binary blob there and retrieve it again. So we said, okay, we charge you for the storage space you need and for the bandwidth you use. It turned out that some people found S3 very, very useful for hosting little thumbnail images for things they sold on eBay. Now, thumbnails are relatively small. What we found was that the traffic and storage costs were comparatively cheap, but the frequent calls to our APIs generated a significant cost that we hadn't taken into account. So we had to change our pricing model to also include request rates. That is one of those lessons, and by the way, customers hate it when you change your pricing model later on. We really learned our lesson there. If you change your pricing, lower your prices; that always goes over well. Don't try to raise them.

Okay, the next topic is obviously a very important one: security. The cloud business is all about trust. If you want customers to do serious things on your cloud, you need to be able to establish that trust relationship. If they don't trust you with their data, and often with their company, because their services rely on you being available all the time, then you have a problem. So how do you earn the customers' trust? You can make all kinds of promises, and they may or may not hold in the end. What helps a lot more is giving customers the tools to build the confidence they need. Security is obviously a key part of that. You have to design your systems from the ground up with security in mind, and that's a process we have: get security involved in the design, have them watch the implementation, do the sign-off, and then do regular check-ins as the service evolves to make sure it is still secure. And security is often a hard trade-off. In the EC2 world, we learned that the hard way. I pointed out earlier that we really want to avoid scheduled downtime; customers really don't like it, it's very disruptive, so we don't have it. Yet the hypervisor and the kernel sit between the instance and the physical hardware, and those are very hard to change. Think of the picture here: it's like changing the ground underneath this guy, without suspension, cranes, or anything else, and without him falling over. It's that hard. If you want to change the kernel underneath a running instance, it's tough. We developed hot-patching technology that allows us to do this, but it is a non-trivial thing. So keep that in mind: if you want to run continuously and securely, you need to come up with very creative ways to enable that.

Another thing to consider in the security world is encryption. The best way to prove to your customers that their data is safe from parties they don't want to have access is to have it encrypted. Inside AWS, you can encrypt all your data at rest at any time, and with the key management systems we have, we can guarantee that only you have access to the keys used to encrypt and decrypt the data. That gets you a long way: even if somebody gets access to your bits, it's still just gibberish, because it's all encrypted. And encryption is pretty important for us not only at rest but also in transit. I don't know if you've seen the announcement: in June 2015, I think, we released a very small and auditable TLS implementation as open source. It's also on GitHub. So if you haven't taken a look and you want a very small and auditable TLS library, check out s2n from Amazon; it's a pretty good option.
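As one concrete flavor of "encrypt everything at rest," here is a minimal sketch of storing an object in S3 with server-side encryption under a KMS-managed key, using boto3. The bucket name, object key, and key alias are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Ask S3 to encrypt this object at rest with a KMS-managed key.
# SSEKMSKeyId is optional; without it, the account's default
# aws/s3 key is used. All names below are hypothetical.
s3.put_object(
    Bucket="example-bucket",
    Key="reports/q3.txt",
    Body=b"just gibberish to anyone without the key",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/my-app-key",
)

# Reads are transparent: S3 decrypts only for callers that KMS
# authorizes to use the key.
obj = s3.get_object(Bucket="example-bucket", Key="reports/q3.txt")
print(obj["Body"].read())
```

The design point is that the key policy, not the storage service, decides who can turn the bits back into data, which is exactly what lets a customer build their own confidence.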
Good. The importance of the network. This is probably the hardest part to get right, and the reason is that the network is pretty much a shared resource for everybody, for all use cases. While you can have different instance types backed by different servers, which easily scale to the various applications you host there, the network is for everybody. And you have things like video transcoding, HPC, and real-time conferencing, which all have very unique and often contradictory requirements, especially once you throw cost into the mix as well. Getting that right is really hard. When we started in 2006, there was really nothing out there that helped us solve these problems. That forced us to go our own way, and the networking infrastructure Amazon has developed, hardware and software, is pretty much a unique development. You see that on the EC2 end of things. We had a problem there with the NICs: the server's NIC is a shared resource, and depending on what the instances on the server are doing, you will see latency jitter, because the network is obviously shared. So we developed our own NIC for the EC2 servers, which exposes SR-IOV functionality, essentially a virtual NIC, to each instance. That has allowed us to reduce latency by a factor of two compared to traditional NICs, and it reduced latency jitter by a factor of 10, which is very important if you think about customers doing things like real-time audio.

Okay, so finally, last slide: no gatekeepers. This may be a bit of a surprise, but we really don't want to limit what our customers do. There are terms of service, of course, you can't do just anything, but their purpose is primarily to protect other customers' instances and services from bad impact. Beyond that, we really don't want to apply many restrictions, especially not business restrictions or what you might consider political restrictions. If you are a building-block provider, you need to be able to deal with the fact that somebody comes up with a building block that competes with your own building block, or service for that matter, and is maybe even better. At best, this is a signal to your developers that they need to do a better job, but you really shouldn't hinder others from providing the same service, or a better one. We have good examples: Netflix makes good use of AWS, and as you all know, it's an alternative to Amazon Instant Video. Yet we're totally fine with that. This is the way it should be. The more open you are with your platform, and I guess all of you in the room really know that, the more success you will have.

With that, let me close. If you're interested in learning more, we have a booth down on the show floor; feel free to catch me there, I'm happy to have further discussions. And obviously, we're hiring, so if you want to actually build clouds yourself, feel free to pop by. We're looking not only for kernel and hypervisor developers but also for operations folks, all here in Germany, because the kernel and the hypervisor are developed here in Germany. With that, thanks a lot. It was a pleasure talking to you.