Good afternoon. Hey everybody, nice to be here in Texas, where everything is big, including the Expo Hall. Sorry I'm a little late; I think I should have taken an Uber across this place. So welcome, welcome. I'm from Texas. I grew up in California, but I was born here, so it's really cool to be presenting here today at the OpenStack Summit.

If you're wondering who I am, if you don't know me yet, I'm Clint Byrum. I work for IBM, on IBM Cloud. I'm a cloud architect, which ten years ago would have been called a developer lead, but now we get to call me an architect, which is cool. I also think it's funny that I architect amorphous blobs of mist, right? I'm also a huge open source aficionado. I used to work for Canonical, I've been a Debian user almost since Debian existed, I love Ubuntu as well, and I haven't run Windows in at least 12 years. So get ready for some hippie talk, right?

So the question now is, why am I here? I love OpenStack. I really do. I've been involved for a couple of years, I've watched it grow, and it really has turned into this amazing movement which we're all a part of now. I also hate complexity. I really, really hate complexity. I think some people do love it, whether they'll admit it or not. It's fun, you know, when you get to the top of the Jenga tower and it sort of resembles some sort of sculpture, but I really don't like it. I want things to be as simple as possible, and the reason is because I love to scale. Who else loves scaling things? Awesome. Who has absolutely failed at scaling things? The hands actually went higher, which I think is indicative of the fact that we've all tried.

And finally, the thing I want to talk about, which I may spend more time on than you might expect even though I love to scale things: the main thing that I think is hard to scale is people. I have personally experienced the magic and the power of continuous delivery, and I want to do whatever I can to help you and save your team from the stress of the traditional release cycle. So hopefully we can talk a little bit about that and learn.

But the question also might be, why are you here? Who else hates complexity? All right, great. Well, then we can agree. Anybody who didn't put their hand up can just go, because we're going to talk about keeping things simple today. I imagine all of us have failed with OpenStack. I know I have. Maybe the stable release hit end of life before you could even get it out the door to your users. Or maybe it didn't, but then the next release came along and now you have to upgrade, and you spent at least three months in panic. And we've all been scaled to death; it's just a part of running clouds.

So the question is, why am I qualified to tell you how to do any of these things? The answer is: I'm not. But I'm willing to stand up here and at least put my word on it that I have some ideas you can take home with you, and hopefully you and your team can leave here happier.

But what am I trying to do? I'm building a big cloud. We want to build a big public cloud for IBM. Right now we just have a few nodes; we're trying to prove a few things. But we believe we have the right mix to really build something big without destroying our team, to have a good time doing it, and hopefully to make a lot of money for IBM as well. So we're aiming at a thousand nodes in a region soon. We have some hardware. We haven't tried it yet, but we have good ideas for it.
And eventually we'd like to do things like run 10,000 computes in one region. That's something I know very few orgs have done, and we want to join that very short list. And the biggest thing is, I'm working with a really amazing distributed team. So it's not just me. Most of these ideas are a combination of ideas from you out there in the community, from my team, and from books and other things. So it's important to collaborate.

But how are we going to do it? Right? That's a big cloud. Is there anybody who's willing to say that they believe I can just install Mitaka right now and it's going to do it? No confidence. That's okay. I don't believe it either. But we're not running Mitaka. We're running Newton. And when Newton is finished, we're going to be running Ocata. The idea is that we're actually going to be running whatever's coming from the community. We're going to run from master.

The way we're going to do it: we're going to measure things. I've run clouds in various ways in the past, but my favorite times were when I actually had real numbers, something I could look at and say, here's an experiment that we ran, and here's something we can do to actually improve it.

And most importantly, we work upstream. You know, you don't see any documentaries about the story of the salmon going downstream. Why? Because it's really easy. You just go. But getting upstream from downstream is really hard. And I see it all the time. You're running a giant set of patches. You're running whatever it takes to get live on the stable release. And then it's time to influence the next release, and you kick and fight and scream just to get here to the summit, maybe get your ATC badge, get into that one session, and then they run out of time. We don't want you to feel that way here in OpenStack. We want everybody to be able to operate upstream, and that's a big goal of ours. So not only do we want to run upstream, but we want to improve the infrastructure so that everybody has a chance to run their cloud right off of upstream, have their tests influence the developers, and actually make a difference without the stress of that lag.

Before I go on: if you have questions, there are a couple of mics, so please just run up to them, shout at me, throw things if I don't see you, and we can have a conversation.

So, everybody we told we were going to run from master either said, "Oh, okay," or, "That's crazy pants." And I get that, because it does sound a little bit crazy. We're so used to upstream making a release, and then we start to test it, and we see if it's going to work, and we put our changes on top of it. That cycle is built into IT; we don't even feel it. But when you're trying to push the limits, especially with an open source product where you can absolutely influence it at every single level, you're missing something. And so I put this quote here: software delivers no revenue until it's in the hands of the users. This seems obvious, but the stress of a release is something I would like to say goodbye to. It doesn't mean we're not going to have stressful jobs, but could we maybe spread it out, so that every day we spend 20 minutes stressed out, instead of every three months burning three people out and having to go recruit more because we all stressed out over the release? So the idea is you need to get stronger. You need to start small and move small changes.
And the difference with continuous delivery, right, is that the change is going to come in from upstream, and you're going to try to move it into your testing. And you need to have a lot of testing. But the reality is that most commits will improve things. We hope. And the ones that don't, you should be able to test for and filter out. The amount of effort you would put into a stable release process is not small, right? We think it's the same amount, maybe even less, to go ahead and build automated testing.

You're not alone in this, though. OpenStack wants this to happen, and we have infrastructure that's trying to prevent bad changes from landing. Of course it doesn't catch everything; the gate is broken quite often. Poor Mr. Dague, sorry, Sean Dague, is probably fixing the gate right now, honestly. But with more contribution it's gotten more and more stable, and there's more tooling around it. And now we're getting to a point where you can consume what comes from upstream and absolutely run it on your servers. But you will have to have your own extensive testing. It's still a lower-risk proposition, though, especially if you factor in the cost of stress for your team. One of the things about stress is that you can decompress from it, to a point. We want you to have a chance to do that. If you release every day, you have the whole rest of the day to decompress, versus building it all up into a big bang release.

So your question might be, how do we get started? Well, if you read the book that I refer to in the quote, you need to start with transparency. Can all your developers see what your ops teams are doing to all of your servers? Do you agree with me that that's a good idea? If you don't, I understand. It's controversial, okay? But the idea is that your developers should know what's going on, and your ops teams should know what's going on. Some people call this DevOps. I call it sanity. We should just watch what's going on in our product, watch what we're pushing out to our users, and from that we can actually collaborate.

You need to cut back on manual changes for that. Manual changes are like Twinkies: if you eat too many of them, you won't be able to get on a plane and come to the summit. So the first step, once you've got transparency, is simple: go set up Jenkins and never again run Puppet or Ansible or Chef (I don't care what tool you use) by yourself. Run it through Jenkins. Let everybody see the log of every run. (I'll show a rough sketch of what I mean in a minute.) Obviously you have to SSH in sometimes, but the idea is to minimize it. Get to a point where the robots do the work so that you can collaborate. You can actually say: look at this log, look what happened in the last Ansible run, look what exploded, here's your commit, let's fix this, let's work together. Nobody has to work on it alone.

And a big thing I throw in here is: be here. And don't just be here to hear talks. Be in the sessions if you can be. There's a big split going on; be at both events if you have to, but be here to influence this open source product. Because if you're not, this happens quite often: things get done in a way that may not support your use case.
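To make the Jenkins idea from a minute ago a little more concrete, here's a minimal sketch of the kind of wrapper a Jenkins job (or any CI job) might call instead of a person running Ansible from their laptop. The playbook name and log directory are hypothetical; the point is just that every run goes through automation and leaves a log the whole team can read.

```python
# Minimal sketch of "robots run it, humans read the log": a CI job invokes
# this instead of a person running Ansible by hand. Paths are hypothetical.
import datetime
import subprocess
import sys

PLAYBOOK = "site.yml"            # hypothetical playbook
LOG_DIR = "/var/log/deploys"     # somewhere everyone on the team can read


def run_playbook():
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    log_path = "{}/ansible-{}.log".format(LOG_DIR, stamp)
    with open(log_path, "w") as log:
        # stdout and stderr both land in a log file archived with the job
        result = subprocess.run(
            ["ansible-playbook", PLAYBOOK],
            stdout=log,
            stderr=subprocess.STDOUT,
        )
    # Fail the job loudly so a broken run is visible to everyone,
    # not quietly retried from somebody's shell history.
    sys.exit(result.returncode)


if __name__ == "__main__":
    run_playbook()
```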
But what are we doing again? This is all nice forward thinking. What am I actually doing? I'm building a big cloud, and I said we're going to do some science. We need to do experiments. We need some data. The first problem: I don't have a thousand nodes. I don't. Well, I might soon, but I don't right now. So we came up with a cool idea: let's make a few nodes pretend to be a thousand.

There's something really cool in Nova called the fake driver. It's basically a no-op driver. Let's just test and see if the control plane can even handle a thousand computes. What did we expect? Quite frankly, we expected epic failure. Things should explode. RabbitMQ? That's toast. MySQL honestly isn't very busy, but we expected all of these things to fall over.

So we set up an emulation test bed, and I do want to thank some members of the team, Jose Castanos and Xu Sang (their names are on another slide), who set up this environment for me and replicated my results. What we did was run a bunch of fake compute nodes. So instead of having a thousand servers tied up for a week, we had three. In fact, eventually we can boil this down into a couple of VMs and put it into our automated test suite, and then if anything breaks it, we have some visibility into why. So we ran Docker and filled it with emulated nodes. If you see the word Quasar, that's just a code name for our OpenStack deployment.

And we got graphs. Gotta have graphs, right? And they told us things. What you see here is that when we ran with four hosts carrying a thousand Docker containers, or with eight hosts, things were mostly the same. But there's a weird dip in the middle. When we tried to run with 400, 800, or 1,600 fake compute hosts, it actually got slower at 800 and faster again at 1,600. That's really kind of weird, and it took us a while to debug.

So we started looking in, and we measured more. What we found was that RabbitMQ was eating up gigs and gigs and gigs of memory, because we had the management interface turned on. Turn it off, and memory usage goes through the floor. Everything smooths out. And we actually proved that we can run a thousand compute nodes on a single RabbitMQ server that isn't all that big. Maybe 25 cores. I know, that sounds huge, doesn't it? Thank you for laughing at that. But it actually is possible. In fact, the only reason not to do that is failure domain isolation. It's not because Rabbit can't handle it. Rabbit is a fantastically scalable scale-up product. The problem is, when it explodes, does OpenStack handle that well? Probably not. But this is just one example of something we were able to address with our configuration.
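The scripts we actually use for this live in our Ansible tooling and aren't upstream yet (more on that in the Q&A), but to give you a feel for the shape of the trick, here's a rough sketch using the Docker SDK for Python. The image name and environment variables are made up for illustration; the one real detail is that each container runs nova-compute with Nova's fake driver instead of libvirt, so a handful of physical hosts can register as a thousand hypervisors.

```python
# Illustrative only: spin up N containers that each run nova-compute with the
# fake (no-op) virt driver, so they register as separate hypervisors and pound
# on the control plane without needing real hardware.
import docker

NUM_FAKE_COMPUTES = 1000
IMAGE = "quasar/nova-compute-fake"   # hypothetical image with nova-compute installed

client = docker.from_env()

for i in range(NUM_FAKE_COMPUTES):
    hostname = "fake-compute-{:04d}".format(i)
    client.containers.run(
        IMAGE,
        detach=True,
        name=hostname,
        hostname=hostname,   # each container reports in as its own compute host
        environment={
            # Assumes the image's entrypoint templates these into nova.conf.
            # The only real change from a production node is the driver:
            "NOVA_COMPUTE_DRIVER": "fake.FakeDriver",
            "TRANSPORT_URL": "rabbit://openstack:secret@rabbit.example.com:5672/",
        },
    )
```

In our real test bed we also told the containers not to touch iptables and took Neutron out of the loop, so that what we measured was purely nova-compute hammering the message bus.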
And then we started going deeper. What else is breaking? And we found, okay, we want to have that failure domain isolation. Is anybody running Cells Version 1 in here? I am not surprised to see few, if any, hands. Do people know what Cells Version 1 is? Okay, well, let's talk about it. This is Cells Version 1: you group compute nodes, and you have a master control plane that cascades down into multiple cells. This was developed by Rackspace, which runs one of the larger OpenStack clouds in existence. The problem is that it's not tested in the gate, it doesn't work right with Neutron, and it's generally considered to be "you're on your own" when you go to Cells, right? But I mean, who bought into OpenStack thinking, well, you know, I don't need it to scale? We absolutely need it to scale. And we're getting to the point now where it's mature enough that we should expect that scale to be built in. But unfortunately, it's not. So we're sad pandas because of Cells Version 1.

But there are developers on the case, and they've come up with something called Cells Version 2. The way Cells Version 2 works is simple: the master control plane just talks directly to those back-end cells. That's not the interesting part of Cells Version 2, though. That's cool, and it's under development. What's really interesting, what makes me a happy panda, is that it's in the gate. So now, with every commit into my CD cloud, as Cells Version 2 gets better upstream, my Cells Version 2 gets better. I would expect everybody to eventually be running Cells Version 2. It should be built in; the scale is built in. But at this point it's not clear whether it will even fully land in Newton. Some of it will. By uncoupling from the release cycle, though, we're able to say it lands when we put the effort in to land it, and we get to take advantage of the fact that other community members are in the same boat. They want built-in scale. And we get it.

Going back a bit, there are a whole bunch of other things that will help you with scale if you start to run off master. Has anybody heard of OVN? That's Open Virtual Network. If you're running ML2 and DVR, which is the distributed virtual router, you may be finding that your Python agents are very, very, very busy, and so is your RabbitMQ. Some other very smart people, way smarter than me, discovered this as well. And they went back to an old axiom, one I'm trying to find printed on a shirt, which is: nobody argues with C. They rewrote it all in C. They wrote a tiny little database server and some nice tiny little agents that interface directly with OVS. No more Python code doing any of the agent work. Instantly better scale. Now, they also went a little crazy and started using the Linux kernel for a lot of things, so until you're running, I think, Linux 4.6, you won't be able to take advantage of silly things like floating IPs or load balancer as a service. However, you can run a master Linux kernel as well, right? Once you embrace the idea that we're going to automate and test everything, that we're going to send our robots out into our testing cluster and trust them to tell us when things are broken, running off of the latest Linux kernel doesn't sound so crazy. That's the kind of thing I want people to be able to think about, and I want us to collaborate on. So while we're here at the summit, you know, let's talk more about that.

And there's more. Why should we even do cells, right? There are a number of technologies out there, not developed in OpenStack, that we could take advantage of. And because we have a cloud that runs off master, we can develop these features, test them in our cloud, use them, and prove them at scale immediately. I'd like to try that. This is all future thinking, so join me if you're interested in trying any of these things. You know Vitess? Every YouTube video you see is coming off of a MySQL query that ran through something called Vitess. This is something YouTube developed and gave to the world, right? Just like OpenStack, it's an open source project. Why doesn't OpenStack just use it? Kafka is a scalable queue, so it should be much more scalable than Rabbit. There's a whole host of ways for nodes to start talking to each other: gRPC, Thrift, ZeroMQ. Some of these are being tried. There is actually a ZeroMQ driver that's getting very, very good that might eliminate the need for cells entirely. And there's also work on using things like ZooKeeper, Consul, and etcd to reduce the amount of traffic on the message bus.

One of the biggest things, if you're watching those 25 cores of Rabbit disappear before your very eyes: about eight of those cores, when you're running 1,000 compute nodes, is just all the compute nodes saying, "I'm here and I have six gigs of RAM. I'm here and I have four gigs of RAM." When you get 1,000 nodes doing that constantly over an inefficient RPC bus, that gets slow. But if they were able to just push that into service discovery, there's no polling, there's no message bus, it just happens. So these are some ideas that we think we can push at scale.
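To illustrate what I mean by pushing that into service discovery, here's a rough sketch of the pattern using etcd. None of this is an existing Nova code path; the key layout, values, and TTL are made up. The idea is just that each compute node keeps a small record alive under a lease, and anything that wants the current inventory reads the keys instead of listening to a fan-out of RPC heartbeats.

```python
# Illustrative pattern only (not Nova code): a compute node publishes its
# resources into etcd under a lease, instead of broadcasting them over RPC.
import json
import socket
import time

import etcd3  # python-etcd3 client

etcd = etcd3.client(host="etcd.example.com", port=2379)
lease = etcd.lease(ttl=30)   # if the node dies, its record expires on its own

resources = {"vcpus": 48, "memory_mb": 6 * 1024, "local_gb": 2000}
key = "/compute/{}".format(socket.gethostname())
etcd.put(key, json.dumps(resources), lease=lease)

# A cheap heartbeat: refresh the lease instead of re-sending state on a bus.
while True:
    lease.refresh()
    time.sleep(10)
```

Whether that store is etcd, Consul, or ZooKeeper matters less than the shape of it: the state lives in one place with a time-to-live, and nothing has to poll for it.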
And what about Neutron? We talked about OVN. There's also some fun stuff we can do because we have a cloud on master, such as VTEP: doing the VTEP on the switch so that we can actually have networking between bare metal nodes and VMs. How amazing would that be? Right? Those are things that are coming down the pipe. But if you're not on master, you're going to have to wait.

So I just want to thank Mike Dorman, who did a great post on cells so that I could understand it; go check that out at this link. And also Jose Castanos, Hui Kang, and Shu Tao, or as they like to be called, the X team at IBM, for helping me with those experiments and for letting me steal their slides so I could share them with you. That's all I have today, but I definitely want to answer your questions if you have comments or you're interested in talking about this more. We have a little more time here, but thank you very much for your time.

[Audience question.] We contribute everything we do to OpenStack or to the appropriate upstream repository. We use Ursula to deploy, which is a set of Ansible playbooks developed by Blue Box. You might have heard that IBM purchased Blue Box last year, and we're building on top of that for our deployment. Our automation is mostly just Tempest, Rally, CloudCafe, all the things, and we try to do things as close as possible to how OpenStack's current testing infrastructure works. If it doesn't work that way, we try to make people explain to us why it doesn't, and if they think it should, then we change upstream.

[Audience question:] Can you say a little bit more about what you used to fake all the compute nodes for the scale testing with the control plane?

Yeah. So again, we were copying something that's actually in the gate of OpenStack, which is the fake driver in Nova. At a detailed level, what we did was build a container, tell Ansible it's a real server, and have it configure the container exactly like a Nova compute node that we would run. Then we told Docker to make 1,000 of them. The only change was that we switched the configuration from the libvirt driver to the fake driver. We also told it not to talk to iptables, because we didn't want to test the scale of things like that, and not to wait for Neutron, because we took Neutron out of the loop, just so we could test the pounding of nova-compute on that message bus. Those scripts actually are not upstream, so your question reminds me that I need to get them pushed into Ursula.

Do we have any other questions? [Audience question.] Oh, was this all in the future? Sorry, maybe I didn't make that clear. No, we've solved some of these things, but this is all stuff that we're working on, and it is in flight right now. To be perfectly frank, the talk I submitted six months ago was planning to talk about what we might do, and now I'm talking more about what we absolutely are doing right now.
So you might notice a little skew between the session description and the talk, but I think this is more interesting because it's actually happening. All right. Thank you very much, everyone. Enjoy the rest of your fine Texas day.